max-mapper opened this issue 8 years ago (status: Open)
Hi, Max, we currently don't have the whole dataset hosted in an easy-to-download format, but it will be made available after some upcoming modeling changes. If you're actively using the dataset, shoot me an email, and we'll get you the dataset: jeff at cos dot io.
@JeffSpies thanks, is there a github issue I can track for the modeling changes you mentioned? Or something more technical I can dig into and maybe contribute a data export feature to?
@JeffSpies @maxogden I'd be interested in tracking the modeling changes as well!
Looks like this is totally doable with Elasticsearch today; it just isn't exposed through the SHARE API. ES has a feature called Scroll: https://www.elastic.co/guide/en/elasticsearch/guide/current/scroll.html
It's demoed here: https://gist.github.com/drorata/146ce50807d16fd4a6aa#file-gistfile1-py-L17
The current API only allows the 'from' parameter, which is not suitable for paging through millions of items: https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html So the Scroll feature is the correct way to achieve this.
To initialize a scroll, a new HTTP query argument would need to be accepted here: https://github.com/CenterForOpenScience/osf.io/blob/707cd4fac2209aa181e18e9ac7e6e385f0cd06f9/website/search/share_search.py#L35
And then to query a scroll, a new HTTP endpoint would need to be added that allows you to call es.scroll and pass in the scroll ID you got when you initialized it.
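For anyone picking this up, the two steps above (initialize a scroll, then keep calling es.scroll with the returned ID) can be sketched roughly as follows. This is a minimal sketch, not the OSF code: it assumes `es` is an elasticsearch-py client object, and the function name `iter_scroll`, the index name, and the page size are made up for illustration. Depending on the client version, the query may need to be passed differently than via `body`.

```python
def iter_scroll(es, index, query, scroll="1m", size=1000):
    """Yield every hit matching `query`, fetching one scroll page at a time.

    `es` is assumed to be an elasticsearch-py client (or anything exposing
    the same search / scroll / clear_scroll methods).
    """
    # Step 1: initialize the scroll by passing `scroll` to a normal search.
    page = es.search(index=index, body={"query": query}, scroll=scroll, size=size)
    scroll_id = page.get("_scroll_id")
    try:
        # Step 2: keep asking for the next page until one comes back empty.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            # The scroll ID may change between pages, so always use the latest.
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Free the server-side scroll context once we are done.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

Usage would look something like `for hit in iter_scroll(es, "share", {"match_all": {}}): ...`, where "share" is a placeholder index name.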
A proposal for the HTTP API is:
If ?scroll=1m is included in a request to https://osf.io/api/v1/share/search, it should return the new scroll ID.
A new endpoint, https://osf.io/api/v1/share/search/scroll, calls es.scroll internally.
I don't know Python, or else I'd try for a PR. I'll try to find someone who can do it.
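To make the proposal concrete, here is roughly how a client might consume it. This is only a sketch of a proposed API that does not exist yet: the /search/scroll endpoint, the `q` and `scroll_id` parameter names, passing `scroll` again on the scroll endpoint, and the `results` / `scroll_id` response fields are all assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://osf.io/api/v1/share/search"
SCROLL_URL = SEARCH_URL + "/scroll"  # proposed endpoint, hypothetical


def http_get_json(url, params):
    """GET `url` with query `params` and decode the JSON response."""
    with urlopen(url + "?" + urlencode(params)) as resp:
        return json.load(resp)


def download_all(q, keep_alive="1m", get_json=http_get_json):
    """Yield every record matching `q` by following the proposed scroll API.

    Assumes each response looks like {"scroll_id": ..., "results": [...]};
    those field names are guesses, not part of any existing API.
    """
    # First request: a normal search plus ?scroll=1m to open the scroll.
    page = get_json(SEARCH_URL, {"q": q, "scroll": keep_alive})
    # An empty (or missing) results list means the scroll is exhausted.
    while page.get("results"):
        yield from page["results"]
        # Follow-up requests: hand the scroll ID back to the scroll endpoint.
        page = get_json(
            SCROLL_URL, {"scroll_id": page["scroll_id"], "scroll": keep_alive}
        )
```

The `get_json` argument is there so the loop can be exercised against a stub without hitting the network.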
Any update on this? Is it possible to download the SHARE data as of 2017?
Hi, Max, I was hoping to get you more concrete info, but since you're inquiring on multiple channels, I'll update you now and then again later: sprint planning is happening on Wednesday to see if we can get ES Scroll in the queue--it came up again recently with another user, so we thought now might be a good time.
The page here recommends "mining" the SHARE data: http://www.share-research.org/kb/mining/ But it would be a lot more convenient if there were a bulk download option available. The JSON API is great for web applications, but for people wanting to do research and/or analysis on the SHARE dataset, a bulk download is essential.
Is it possible to get CSV or ZIP of the entire SHARE dataset? Thanks!