NCEAS / metajam

Bringing data and metadata togetheR
https://nceas.github.io/metajam/
Apache License 2.0

Handle dryad URLs #50

Open isteves opened 6 years ago

isteves commented 6 years ago

Currently, because of the way the Dryad data URL is constructed, it doesn't work with our function. check_version ends up searching for nonsensical identifiers because it keeps chunking the URL and eventually looks for anything that matches 1. I've changed the breaking point to nchar(pid) > 5 (instead of 0) to account for this to some extent (4163fb99f6a04d1e9ef37b33030da1e9d1c52690).

Not sure what the logic of Dryad URLs is, so more investigation is needed!
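To illustrate the chunking problem, here is a minimal sketch. It is not the actual check_version code: candidate_pids is a hypothetical helper, and it assumes that "chunking" means dropping the leading piece of the URL up to the next `/` or `=`. It only lists the strings that such a loop would end up searching for.

```r
# Hypothetical helper (not part of metajam): list the candidate identifiers
# a chunking loop would try, stopping once the remainder is too short.
candidate_pids <- function(pid, min_chars = 0) {
  candidates <- character(0)
  while (nchar(pid) > min_chars) {
    candidates <- c(candidates, pid)
    # Assumption: a chunk ends at the next "/" or "="
    pid <- sub("^[^/=]*[/=]*", "", pid)
  }
  candidates
}

dryad_url <- "https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt?sequence=1"

# With the old breaking point (0), the last candidate is just "1"
tail(candidate_pids(dryad_url, min_chars = 0), 1)

# With the new breaking point (5), the loop stops before reaching "1"
tail(candidate_pids(dryad_url, min_chars = 5), 1)
```

Under these assumptions, raising the breaking point only avoids the final search for "1"; the earlier candidates are still not DataONE identifiers.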

download_d1_data("https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt?sequence=1", ".")

mbjones commented 6 years ago

For some related issues on the structure of Dryad identifiers in DataONE, see https://redmine.dataone.org/issues/7896

gothub commented 6 years ago

@brunj7 what is the origin of the URL in the above example from @isteves ? It doesn't look like a DataONE Dryad identifier or a DataONE URL. The changes that we discussed to make check_version more efficient would only work for DataONE identifiers or DataONE URLs.

brunj7 commented 6 years ago

@gothub sorry for the confusion. The idea is that scientists could also go to each data repository and get the URL from there. The KNB call check_version("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/msleckman.40.1") seems to conform to what we discussed, but we should also handle PASTA URLs such as check_version("https://pasta.lternet.edu/package/data/eml/edi/195/2/51abf1c7a36a33a2a8bb05ccbf8c81c6").

The Dryad URL comes from this package https://datadryad.org/resource/doi:10.5061/dryad.7ns4pk2 for the dataset experiment_1.txt. It seems that https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt will also resolve, and if I search for dryad.181477 on their repo I find the corresponding data package, so it is most likely their internal identifier?

Side note: when I search on DataONE for this DOI (10.5061/dryad.7ns4pk2) I get 5 hits, most likely related to the problem Matt mentioned, but if I search for the Dryad dataset identifier (dryad.181477) I get 0 hits.

So we might have to understand the logic behind Dryad URLs if we want to support them.

gothub commented 6 years ago

Here is the corresponding DataONE URL for the above Dryad id: https://cn.dataone.org/cn/v2/resolve/https://doi.org/10.5061/dryad.7ns4pk2/1/bitstream

brunj7 commented 6 years ago

@gothub following our discussion, I think it would make sense to add a rule that prioritizes DataONE URLs and falls back to the current system if that fails, to make the function more efficient.

That said, this does not solve the mapping problem between Dryad URLs and the corresponding DataONE ones.
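A rough sketch of what such a rule could look like, not a tested implementation. extract_dataone_pid is a hypothetical name, and the pattern assumes DataONE URLs follow the `/v<N>/object/`, `/v<N>/resolve/` or `/v<N>/meta/` layout seen in the examples in this thread. Anything that does not match returns NA, so the caller can fall back to the current chunking logic.

```r
# Hypothetical helper (not part of metajam): pull the identifier out of a
# DataONE REST URL, or return NA so the caller can fall back.
extract_dataone_pid <- function(url) {
  # Assumption: host, optional path segments, API version, then endpoint
  pattern <- "^https?://[^/]+/(?:[^/]+/)*?v[0-9]+/(?:object|resolve|meta)/(.+)$"
  if (!grepl(pattern, url, perl = TRUE)) {
    return(NA_character_)
  }
  # Identifiers may be percent-encoded in the URL, so decode them
  utils::URLdecode(sub(pattern, "\\1", url, perl = TRUE))
}

extract_dataone_pid("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/msleckman.40.1")
extract_dataone_pid("https://cn.dataone.org/cn/v2/resolve/https://doi.org/10.5061/dryad.7ns4pk2/1/bitstream")

# Neither a PASTA nor a datadryad.org URL matches, so both would fall back
extract_dataone_pid("https://pasta.lternet.edu/package/data/eml/edi/195/2/51abf1c7a36a33a2a8bb05ccbf8c81c6")
```

With this in place, check_version would only enter the chunking loop when the result is NA.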