I proposed a simple Python script to download the data using multiprocessing in https://github.com/GEUS-PROMICE/Sentinel-1_Greenland_Ice_Velocity/pull/6.
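(For the curious: a rough bash analogue of that parallel approach, using `xargs -P` in place of Python's multiprocessing. This is a sketch, not the PR's script; the four-worker count is an arbitrary choice, and `jq` is required.)

```bash
# List every file's persistent ID, then run four wget downloads in parallel.
curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" \
    | jq -r '.data.latestVersion.files[] | .dataFile.persistentId' \
    | xargs -P 4 -I {} wget --content-disposition --continue \
        "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId={}"
```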
The following bash script requires `jq`, but downloads everything to the current folder:
```bash
# List every file's persistent ID in the dataset, then fetch each file.
for PID in $(curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | jq -r '.data.latestVersion.files[] | .dataFile.persistentId'); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
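(In these loops `--content-disposition` names each downloaded file from the server's Content-Disposition header, and `--continue` makes the loop safe to re-run after an interrupted download.)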
Or, creating a temporary file on disk with the list of file persistent IDs:
```bash
# Save the persistent IDs to disk, then loop over them.
curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | jq -r '.data.latestVersion.files[] | .dataFile.persistentId' > urls.txt
for PID in $(cat urls.txt); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
@robertfausto
Or, not relying on `jq`, the following works with default bash tools:
```bash
# Crude JSON parsing: split on commas, keep lines mentioning persistentId,
# and take the fourth double-quoted field (the ID itself).
for PID in $(curl -s "https://dataverse01.geus.dk/api/datasets/:persistentId?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet" | tr ',' '\n' | grep persistentId | cut -d'"' -f4); do
    wget --content-disposition --continue "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=${PID}"
done
```
Warning: the above script works for the IV dataset because there are no sub-folders. This simple bash script does not work for datasets with files in sub-folders (and possibly files with the same name in different sub-folders). The script Adrien is writing should support this. It can be tested on this dataset: https://doi.org/10.22008/FK2/XKQVL7
This correctly creates sub-folders and places files in them:
```bash
export SERVER=https://dataverse01.geus.dk
export DOI=10.22008/....  # Fill in dataset DOI
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" > dv.json
# Keep the quoted values of persistentId, directoryLabel, and filename;
# per file these appear in the order: directory, file DOI, filename.
tr ',' '\n' < dv.json | grep -E "persistentId|directoryLabel|filename" | cut -d'"' -f4 > urls.txt
# Consume three lines per file.
while read -r dir; do
    mkdir -p "${dir}"
    read -r FILE_DOI
    read -r fname
    wget --continue "${SERVER}/api/access/datafile/:persistentId?persistentId=${FILE_DOI}" -O "${dir}/${fname}"
done < urls.txt
rm dv.json urls.txt  # cleanup
```
Well... Ugh. The above script expects the `directoryLabel` field to exist. If it does not, things don't work well, so for the IV dataset the simpler scripts should be used. The longer script immediately above this may be useful elsewhere...
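A sketch of a more tolerant variant, assuming `jq` is available (with `SERVER` and `DOI` set as above): jq's `//` operator defaults a missing `directoryLabel` to the current directory, so the same loop should handle both flat and nested datasets.

```bash
# For each file emit: directory ("." when directoryLabel is absent),
# persistent ID, and filename, one tab-separated record per file.
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" \
    | jq -r '.data.latestVersion.files[] | [(.directoryLabel // "."), .dataFile.persistentId, .dataFile.filename] | @tsv' \
    | while IFS=$'\t' read -r dir pid fname; do
        mkdir -p "${dir}"
        wget --continue "${SERVER}/api/access/datafile/:persistentId?persistentId=${pid}" -O "${dir}/${fname}"
    done
```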
OK, I've provided a download script in the dataverse. In the Notes section, I have:
```bash
export SERVER=https://dataverse01.geus.dk
export DOI=10.22008/promice/data/sentinel1icevelocity/greenlandicesheet
curl -s "${SERVER}/api/datasets/:persistentId?persistentId=doi:${DOI}" > dv.json
# Extract the quoted persistentId values.
tr ',' '\n' < dv.json | grep -E '"persistentId"' | cut -d'"' -f4 > urls.txt
while read -r PID; do
    curl -O -J "${SERVER}/api/access/datafile/:persistentId?persistentId=${PID}"
done < urls.txt
rm dv.json urls.txt  # cleanup
```
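(`curl -O -J` saves each file under the name given in the server's Content-Disposition header, the curl analogue of the `wget --content-disposition` flag used above.)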
And the DV is updated to 3.1 with this (metadata-only) change; see https://dataverse01.geus.dk/dataset.xhtml?persistentId=doi:10.22008/promice/data/sentinel1icevelocity/greenlandicesheet&version=3.1 . I suggest closing #6.
The upgraded dataverse provides a `dirindex` view of the data files. This makes it easy to download everything with `wget` or a web-browser downloader such as DownThemAll, and reduces the need for more complicated bash or Python scripts to fetch data.
Please see http://doi.org/10.22008/promice/data/ice_discharge/d/v02 for the text I'm currently putting in the Notes section on all of the datasets. I include the HTML version of the text below. Choose "Edit" from the "..." menu on this comment to see the raw HTML.
Direct link to most recent files: https://dataverse.geus.dk/api/datasets/:persistentId/dirindex?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02
`wget` download command:

```bash
wget -r -e robots=off -nH --cut-dirs=3 --content-disposition "https://dataverse.geus.dk/api/datasets/:persistentId/dirindex?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02"
```
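Here `-r` recurses through the dirindex listing, `-e robots=off` tells wget to ignore robots.txt, `-nH` and `--cut-dirs=3` drop the hostname and the first three path components (api/datasets/:persistentId) so files land in a clean local tree, and `--content-disposition` names each file from the server's header.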
Provide simple download script or command.
Not so simple currently due to Dataverse limitations. This will be simple as soon as we upgrade to Dataverse 5.0.
Once we're running 5.0 (here is the history behind the new behavior: https://github.com/IQSS/dataverse/issues/7084), bulk download should be done with something like:

```bash
wget --recursive -nH --cut-dirs=3 --content-disposition http://dataverse_URL/api/datasets/NNNN/fileaccess
```

(exact URL and API string still TBD). Until then, use https://guides.dataverse.org/en/latest/api/native-api.html#accessing-downloading-files and a non-trivial shell script, or Python with pyDataverse as a dependency: https://pydataverse.readthedocs.io/en/latest/ - Note, this uses an API key which should not be shared publicly.
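For completeness, a minimal sketch of an authenticated single-file download via that API; FILE_ID and the token value are placeholders, and Dataverse accepts the token in the `X-Dataverse-key` header:

```bash
# Download one (possibly restricted) datafile with an API token.
# FILE_ID and the token value are placeholders; never publish a real token.
export API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
curl -L -O -J -H "X-Dataverse-key: ${API_TOKEN}" \
    "https://dataverse_URL/api/access/datafile/FILE_ID"
```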