strasser opened this issue 10 years ago
Response from Perry Willett, CDL:
A Merritt account isn’t going to help with this issue. The Data Use Agreement challenge will occur for anyone (with a guest or regular account) trying to download an object from the UCSF Datashare collection. That’s why I pointed you to a different collection. Trying it with this collection on our development server will work:
```
parseFeed14.py http://merritt-dev.cdlib.org/object/recent.atom?collection=ark:/99999/fk4dz07dd ucb
```
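For reference, here is a rough sketch of what a feed-driven pull like this might look like, assuming the recent.atom feed is standard Atom; this is not the actual parseFeed14.py, and only the feed URL comes from the example above:

```python
# Minimal sketch: list object links from a Merritt Atom feed.
# Assumes a standard Atom feed; NOT the real parseFeed14.py.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = ("http://merritt-dev.cdlib.org/object/recent.atom"
            "?collection=ark:/99999/fk4dz07dd")
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(FEED_URL) as resp:
    feed = ET.parse(resp)

# Each Atom entry carries one or more <link> elements; collect their hrefs
# so they can be fetched (or written out as target_link files) later.
for entry in feed.getroot().findall(ATOM + "entry"):
    title = entry.findtext(ATOM + "title", default="(no title)")
    for link in entry.findall(ATOM + "link"):
        print(title, link.get("href"))
```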
Reply from Jonathan:
Thank you, this example worked more smoothly with the target_link pointed at the dev server (example: http://merritt-dev.cdlib.org/d/ark%3a%2f99999%2ffk4d249hj/2).
I was able to download the DATA files without the DUA challenge.
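For the record, that dev-server fetch is just a plain HTTP GET, so something like the following reproduces it (the URL is the example above; the output filename is arbitrary):

```python
# Fetch one object version directly from the dev server (no DUA challenge).
import urllib.request

url = "http://merritt-dev.cdlib.org/d/ark%3a%2f99999%2ffk4d249hj/2"
urllib.request.urlretrieve(url, "ark_fk4d249hj_version_2.zip")
```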
Although I believe we are all on the same page, we are running through these exercises to determine whether, at some point and for some reason, we could migrate UCD deposits out of Merritt into another storage system down the road (on a scale of years).
Given that UCD DASH deposits would most likely be on a production system, would downloading without the DUA challenge still be possible? With adequate planning, could we disable the DUA challenge to enable a batch pull, or are there other mechanisms in place?
Response from Perry:
The current thinking is that we won’t have data use agreements in Dash, and will instead use a CC-BY license for all the datasets, so this shouldn’t be an issue. But if we do use DUAs, we have an undocumented workaround that will let you bypass the DUA and get to the content.
For more information, see our Agreements page.
This question came from Jonathan Cachat, UC Davis. He was interested in downloading all data from the UCSF DataShare Merritt collection.
I was able to test the Python script; running it was very easy (given basic Python and terminal knowledge). In little time, a “data” folder was downloaded into my target directory (attached as data-DASHscript.zip).
Drilling through the directories, we arrive at folders named by ARK identifier. Within each of these, a DOI, a data use agreement (mrt-datacite.xml), objectSize, and target_link are provided; this is the METADATA.
Following the target_link in a browser leads to a DataShare Data Use Agreement FORM requiring Name, Affiliation, Email, and a checkbox. Once the Data Use Agreement is accepted, the DATA is downloaded (attached as ark+=b7272=q6bg2kwf_version_5.zip). It appears that the “producer” folder contains the actual DATA, while the “system” folder contains more METADATA.
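To make that layout concrete, here is a rough sketch of walking the downloaded “data” folder and collecting the target_link URLs; it assumes target_link is a small text file holding one URL per ARK folder, which is a guess based on the listing above:

```python
# Walk the downloaded "data" directory and collect one target_link per ARK folder.
# Assumption: each ARK folder contains a plain-text file named "target_link"
# whose contents are the download URL described above.
import os

def collect_target_links(data_root="data"):
    links = {}
    for dirpath, dirnames, filenames in os.walk(data_root):
        if "target_link" in filenames:
            with open(os.path.join(dirpath, "target_link")) as fh:
                links[dirpath] = fh.read().strip()
    return links

if __name__ == "__main__":
    for ark_dir, url in collect_target_links().items():
        print(ark_dir, "->", url)
```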
From what I can tell, batch downloading of the DATA would not be possible because of the Data Use Agreement form, since the form needs to be filled out every time. I am sure this could be automated by scripting the entry of the form information, but this would be very hacky.
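For illustration only, that hack might look roughly like the following; the form field names, the placeholder values, and the idea of POSTing to the target_link itself are all assumptions, since I have not inspected the form’s HTML:

```python
# Hypothetical automation of the DataShare Data Use Agreement form.
# The field names and POST target below are guesses, not the real form.
import requests

def download_with_dua(target_link, out_path):
    session = requests.Session()
    # Submit the (assumed) form fields; in practice the real form's action URL
    # and input names would have to be read from its HTML first.
    session.post(target_link, data={
        "name": "Your Name",            # placeholder
        "affiliation": "Your Institution",
        "email": "you@example.edu",
        "accept": "on",                 # the agreement checkbox
    })
    # Then fetch the object itself on the same session.
    resp = session.get(target_link, stream=True)
    with open(out_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)
```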
Perry, this is why I requested a username/password to log into Merritt: I don’t want to have to enter my information on the Data Use Agreement each time. If having a login removes this barrier, then I think a script that pulls the metadata and then follows the target_link to download the actual data would be possible. It would not require implementing API data export mechanisms.
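Putting the pieces together, the script I have in mind would look roughly like this; it assumes the target_link files hold the download URLs (as in the walking sketch above) and that Merritt credentials sent over HTTP Basic auth are enough to skip the DUA, neither of which is confirmed:

```python
# Sketch of the proposed batch pull: take the target_link URLs collected from
# the metadata folders, then fetch each object with Merritt credentials.
# Assumes Basic auth bypasses the DUA form, which is unconfirmed.
import os
import requests

MERRITT_USER = "username"   # placeholder credentials
MERRITT_PASS = "password"

def pull_all(links, out_dir="objects"):
    """links: dict of {ark_folder_path: target_link_url}, e.g. from collect_target_links()."""
    os.makedirs(out_dir, exist_ok=True)
    for ark_dir, url in links.items():
        out_path = os.path.join(out_dir, os.path.basename(ark_dir) + ".zip")
        resp = requests.get(url, auth=(MERRITT_USER, MERRITT_PASS), stream=True)
        resp.raise_for_status()
        with open(out_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
```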