cernopendata / cernopendata-client

CERN Open Data command-line client
http://cernopendata-client.readthedocs.io/
GNU General Public License v3.0

download-files: actual data files vs index files #25

Closed: tiborsimko closed 4 years ago

tiborsimko commented 4 years ago

One complication for the download-files command is that some records, such as recid 1, list only index files. These index files contain the locations of the actual data files. Other records, such as recid 5500, have the actual data files attached directly.

This difference exists because large experimental AOD/AODSIM datasets can consist of 10,000 files, and it was not possible to store these in Invenio 3 JSON with reasonable performance; see https://github.com/cernopendata/opendata.cern.ch/issues/1562

The same nuance already exists for the get-file-locations command, where it was solved as follows: the command returns the list of actual data file locations unless the --no-expand option is specified, in which case it returns only the index files. Compare:

$ cernopendata-client get-file-locations --recid 1 --protocol http | wc -l
2916
$ cernopendata-client get-file-locations --recid 1 --protocol http --no-expand | wc -l
12
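
The expansion behaviour described above can be sketched in Python roughly as follows. This is only an illustration, not the client's actual code: the function name `expand_locations`, the `read_index` callable, the `_file_index.json` / `_file_index.txt` suffix checks, and the assumption that a JSON index is a list of objects with a `uri` key are all hypothetical.

```python
import json


def expand_locations(files, read_index, expand=True):
    """Return file locations for a record (illustrative sketch).

    files      -- list of file URIs attached to the record (hypothetical shape)
    read_index -- callable returning the text content of an index file URI
    expand     -- if False, mimic --no-expand and return the attached files as-is
    """
    if not expand:
        return list(files)

    locations = []
    for uri in files:
        if uri.endswith("_file_index.json"):
            # Assumed index format: a JSON list of objects with a "uri" key.
            entries = json.loads(read_index(uri))
            locations.extend(entry["uri"] for entry in entries)
        elif uri.endswith("_file_index.txt"):
            # The text index is assumed to list the same files as its JSON
            # twin, so it is skipped to avoid duplicates.
            continue
        else:
            # A data file attached directly to the record.
            locations.append(uri)
    return locations


if __name__ == "__main__":
    # In-memory stand-in for fetching index files; the URIs are made up.
    fake_indexes = {
        "a_file_index.json": '[{"uri": "root://example//d1.root"},'
                             ' {"uri": "root://example//d2.root"}]',
    }
    attached = ["a_file_index.json", "a_file_index.txt"]
    print(expand_locations(attached, fake_indexes.get))
    print(expand_locations(attached, fake_indexes.get, expand=False))
```

Passing the index reader in as a callable keeps the sketch independent of any network access, so the expand and no-expand paths can be compared on toy data.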

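The goal of this issue is to make sure the download-files command behaves the same way: it should download the actual data files by default, and only the index files when --no-expand is given.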

tiborsimko commented 4 years ago

Example record to support in this issue: 1.

tiborsimko commented 4 years ago

The task was already addressed as part of #22 and works well on the latest master branch. Closing the issue.

$ cernopendata-client download-files --recid 1
==> Downloading file 1 of 2916
==> Downloading file: ./1/00E16FBB-9071-E011-83D3-003048673F12.root
^C
$ cernopendata-client download-files --recid 1 --no-expand
==> Downloading file 1 of 12
==> Downloading file: ./1/CMS_Run2010B_BTau_AOD_Apr21ReReco-v1_0000_file_index.json
==> Downloading file 2 of 12
==> Downloading file: ./1/CMS_Run2010B_BTau_AOD_Apr21ReReco-v1_0000_file_index.txt
==> Downloading file 3 of 12
^C
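
The transcript above suggests that each file is saved under a directory named after the record ID and that two progress lines are printed per file. A minimal sketch of that naming scheme, assuming hypothetical helpers `local_path` and `progress_lines` (these are not taken from the client's source, and the example URL is made up):

```python
import posixpath
from urllib.parse import urlparse


def local_path(recid, location):
    """Build a destination path like ./<recid>/<file name> for a file location."""
    filename = posixpath.basename(urlparse(location).path)
    return "./{}/{}".format(recid, filename)


def progress_lines(recid, locations):
    """Yield the two progress messages shown for each file to download."""
    total = len(locations)
    for number, location in enumerate(locations, start=1):
        yield "==> Downloading file {} of {}".format(number, total)
        yield "==> Downloading file: {}".format(local_path(recid, location))


if __name__ == "__main__":
    # Made-up location; only the file name is taken from the transcript.
    urls = ["http://example.org/data/00E16FBB-9071-E011-83D3-003048673F12.root"]
    for line in progress_lines(1, urls):
        print(line)
```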