CenterForOpenScience / osf.io

Facilitating Open Science
https://osf.io
Apache License 2.0

Feature request: Bulk download of SHARE data #5658

Open max-mapper opened 8 years ago

max-mapper commented 8 years ago

The page here recommends "mining" the SHARE data (http://www.share-research.org/kb/mining/), but it would be a lot more convenient if there were a bulk download option available. The JSON API is great for web applications, but for people wanting to do research and/or analysis on the SHARE dataset, a bulk download is essential.

Is it possible to get CSV or ZIP of the entire SHARE dataset? Thanks!

JeffSpies commented 8 years ago

Hi, Max, we currently don't have the whole dataset hosted in an easy-to-download format, but after some upcoming modeling changes, it will be made available. If you're actively using the dataset, shoot me an email, and we'll get you the dataset: jeff at cos dot io.

max-mapper commented 8 years ago

@JeffSpies thanks, is there a GitHub issue I can track for the modeling changes you mentioned? Or something more technical I can dig into and maybe contribute a data export feature to?

emetsger commented 8 years ago

@JeffSpies @maxogden I'd be interested in tracking the modeling changes as well!

max-mapper commented 8 years ago

Looks like this is totally doable with Elasticsearch today; it just isn't exposed through the SHARE API. ES has a feature called Scroll: https://www.elastic.co/guide/en/elasticsearch/guide/current/scroll.html

It's demoed here: https://gist.github.com/drorata/146ce50807d16fd4a6aa#file-gistfile1-py-L17

The current API only allows the 'from' parameter, which is not suitable for paging through millions of items because every request has to collect and sort all results up to the requested offset: https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html. So the Scroll feature is the correct way to achieve this.
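
For illustration, a minimal sketch of what a full export over Scroll could look like on the client side. It assumes an elasticsearch-py style client whose `search` and `scroll` methods return the usual response dicts; the function name `scroll_all`, the page size, and the keep-alive value are hypothetical and not part of the SHARE API.

```python
from typing import Any, Dict, Iterator


def scroll_all(es: Any, index: str, query: Dict[str, Any],
               page_size: int = 1000,
               keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query`, one scroll page at a time.

    `es` is assumed to behave like an elasticsearch-py client.
    """
    # The initial search opens the scroll context and returns the first page.
    page = es.search(index=index, body=query, scroll=keep_alive, size=page_size)
    while True:
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        for hit in hits:
            yield hit
        # Each response carries the scroll id to use for the next request.
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)
```

Because this is a generator, a caller could write each hit to a file as it arrives instead of holding the whole dataset in memory.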

To initialize a scroll, a new HTTP argument would need to be accepted here: https://github.com/CenterForOpenScience/osf.io/blob/707cd4fac2209aa181e18e9ac7e6e385f0cd06f9/website/search/share_search.py#L35

And then, to query a scroll, a new HTTP endpoint would need to be added that allows you to call es.scroll and pass in the scroll id you got when you initialized it.
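
As a rough, framework-free sketch of those two server-side pieces: the handler names `start_search` and `continue_scroll`, the `scroll` and `scroll_id` argument names, and the default keep-alive are all assumptions made for illustration, not the actual osf.io code or any agreed SHARE API. `es` is again assumed to be an elasticsearch-py style client, and `args` stands in for the parsed HTTP query string.

```python
from typing import Any, Dict, Mapping


def start_search(es: Any, index: str, query: Dict[str, Any],
                 args: Mapping[str, str]) -> Dict[str, Any]:
    """Run a search; open a scroll only when the caller asks for one."""
    kwargs: Dict[str, Any] = {"index": index, "body": query}
    if "scroll" in args:
        # Hypothetical 'scroll' query-string argument, e.g. ?scroll=1m
        kwargs["scroll"] = args["scroll"]
    return es.search(**kwargs)


def continue_scroll(es: Any, args: Mapping[str, str]) -> Dict[str, Any]:
    """Fetch the next page for a scroll id returned by an earlier call."""
    # Hypothetical 'scroll_id' argument; the keep-alive defaults to one minute.
    return es.scroll(scroll_id=args["scroll_id"],
                     scroll=args.get("scroll", "1m"))
```

The first response would include a `_scroll_id`, which the client passes back to the second handler until a page comes back with no hits.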

A proposal for the HTTP API is:

I don't know Python, or else I'd try for a PR. I'll try to find someone who can do it.

max-mapper commented 7 years ago

Any update on this? Is it possible to download the SHARE data as of 2017?

JeffSpies commented 7 years ago

Hi, Max, I was hoping to get you more concrete info, but since you're inquiring on multiple channels, I'll update you now and then again later: sprint planning is happening on Wednesday to see if we can get ES Scroll in the queue--it came up again recently with another user, so we thought now might be a good time.