I proposed a simple Python script to download the data using multiprocessing in https://github.com/GEUS-PROMICE/Sentinel-1_Greenland_Ice_Velocity/pull/6.
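(For the curious: a rough bash analogue of that parallel approach, using `xargs -P` in place of Python's multiprocessing. This is a sketch, not the PR's script; the four-worker count is an arbitrary choice, and `jq` is required.)

```bash
# List every file's persistent ID, then run four wget downloads in parallel.
curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" \
    | jq -r '.data.latestVersion.files[] | .dataFile.persistentId' \
    | xargs -P 4 -I {} wget --content-disposition --continue \
        "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId={}"
```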
The following bash script requires `jq`, but downloads everything to the current folder:
```bash
# List every file's persistent ID in the dataset, then fetch each file.
for PID in $(curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | jq -r '.data.latestVersion.files[] | .dataFile.persistentId'); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
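(In these loops `--content-disposition` names each downloaded file from the server's Content-Disposition header, and `--continue` makes the loop safe to re-run after an interrupted download.)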
Or, creating a temporary file on disk with the list of file persistent IDs:
```bash
# Save the persistent IDs to disk, then loop over them.
curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | jq -r '.data.latestVersion.files[] | .dataFile.persistentId' > urls.txt
for PID in $(cat urls.txt); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
@robertfausto
Or, not relying on `jq`, the following works with default bash tools:
```bash
# Crude JSON parsing: split on commas, keep lines mentioning persistentId,
# and take the fourth double-quoted field (the ID itself).
for PID in $(curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | tr ',' '\n' | grep persistentId | cut -d'"' -f4); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
Warning: the above script works for the IV dataset because there are no sub-folders. This simple bash script does not work for datasets with files in sub-folders (and possibly files with the same name in different sub-folders). The script Adrien is writing should support this. It can be tested on this dataset: https://doi.org/10.22008/FK2/XKQVL7
This correctly creates sub-folders and places files in them:
```bash
export SERVER=https://dataverse01.geus.dk
export DOI=10.22008/....  # Fill in dataset DOI
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" > dv.json
# Keep the quoted values of persistentId, directoryLabel, and filename;
# per file these appear in the order: directory, file DOI, filename.
tr ',' '\n' < dv.json | grep -E "persistentId|directoryLabel|filename" | cut -d'"' -f4 > urls.txt
# Consume three lines per file.
while read -r dir; do
    mkdir -p "${dir}"
    read -r FILE_DOI
    read -r fname
    wget --continue "${SERVER}/api/access/datafile/:persistentId?persistentId=${FILE_DOI}" -O "${dir}/${fname}"
done < urls.txt
rm dv.json urls.txt  # cleanup
```
Well... Ugh. The above script expects the `directoryLabel` field to exist. If it does not, things don't work well, so for the IV dataset the simpler scripts should be used. The longer script immediately above this may be useful elsewhere...
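A sketch of a more tolerant variant, assuming `jq` is available (with `SERVER` and `DOI` set as above): jq's `//` operator defaults a missing `directoryLabel` to the current directory, so the same loop should handle both flat and nested datasets.

```bash
# For each file emit: directory ("." when directoryLabel is absent),
# persistent ID, and filename, one tab-separated record per file.
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" \
    | jq -r '.data.latestVersion.files[] | [(.directoryLabel // "."), .dataFile.persistentId, .dataFile.filename] | @tsv' \
    | while IFS=$'\t' read -r dir pid fname; do
        mkdir -p "${dir}"
        wget --continue "${SERVER}/api/access/datafile/:persistentId?persistentId=${pid}" -O "${dir}/${fname}"
    done
```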
OK, I've provided a download script in the dataverse. In the Notes section, I have:
```bash
export SERVER=https://dataverse01.geus.dk
export DOI=10.22008/promice/data/sentinel1icevelocity/greenlandicesheet
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" > dv.json
# Extract the quoted persistentId values.
tr ',' '\n' < dv.json | grep -E '"persistentId"' | cut -d'"' -f4 > urls.txt
while read -r PID; do
    curl -O -J "${SERVER}/api/access/datafile/:persistentId?persistentId=${PID}"
done < urls.txt
rm dv.json urls.txt  # cleanup
```
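(`curl -O -J` saves each file under the name given in the server's Content-Disposition header, the curl analogue of the `wget --content-disposition` flag used above.)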
And the DV is updated to 3.1 with this (metadata-only) change; see https://dataverse01.geus.dk/dataset.xhtml?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet&version=3.1 . I suggest closing #6.
The upgraded dataverse provides a `dirindex` view of the data files. This makes it easy to download everything with `wget` or a web-browser downloader such as DownThemAll, and reduces the need for more complicated bash or Python scripts to fetch data.
Please see http://doi.org/10.22008/promice/data/ice_discharge/d/v02 for the text I'm currently putting in the Notes section on all of the datasets. I include the HTML version of the text below. Choose "Edit" from the "..." menu on this comment to see the raw HTML.
Direct link to most recent files: https://dataverse.geus.dk/api/datasets/:persistentId/dirindex?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02
`wget` download command:

```bash
wget -r -e robots=off -nH --cut-dirs=3 --content-disposition "https://dataverse.geus.dk/api/datasets/:persistentId/dirindex?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02"
```
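Here `-r` recurses through the dirindex listing, `-e robots=off` tells wget to ignore robots.txt, `-nH` and `--cut-dirs=3` drop the hostname and the first three path components (api/datasets/:persistentId) so files land in a clean local tree, and `--content-disposition` names each file from the server's header.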
Provide simple download script or command.
Not so simple currently due to Dataverse limitations. This will be simple as soon as we upgrade to Dataverse 5.0.
Once we're running 5.0 (here is the history behind the new behavior: https://github.com/IQSS/dataverse/issues/7084), bulk download should be done with something like:

```bash
wget --recursive -nH --cut-dirs=3 --content-disposition http://dataverse_URL/api/datasets/NNNN/fileaccess
```

(exact URL and API string still TBD). Until then, use https://guides.dataverse.org/en/latest/api/native-api.html#accessing-downloading-files and a non-trivial shell script, or Python with pyDataverse as a dependency: https://pydataverse.readthedocs.io/en/latest/ - Note, this uses an API key which should not be shared publicly.
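For completeness, a minimal sketch of an authenticated single-file download via that API; FILE_ID and the token value are placeholders, and Dataverse accepts the token in the `X-Dataverse-key` header:

```bash
# Download one (possibly restricted) datafile with an API token.
# FILE_ID and the token value are placeholders; never publish a real token.
export API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
curl -L -O -J -H "X-Dataverse-key: ${API_TOKEN}" \
    "https://dataverse_URL/api/access/datafile/FILE_ID"
```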