cernopendata / cernopendata-client

CERN Open Data command-line client
http://cernopendata-client.readthedocs.io/
GNU General Public License v3.0

downloader: resume interrupted downloads #99

Open tiborsimko opened 3 years ago

tiborsimko commented 3 years ago

When downloading a big 20 GB file, something can go wrong along the way, or the user's machine can go to sleep or be restarted. When this happens and the user re-issues the download-files command, the download starts again from zero, even though the local directory already contains part of the file.

The goal of this issue is to resume interrupted downloads from the last good state.

How to reproduce:

  1. Run a local CERN Open Data instance as follows:
$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/atlas-2020-exactly2lep.json
$ firefox http://localhost/record/15007

The test record contains a big 20 GB file.

Now start a download and interrupt it after a while:

$ cernopendata-client download-files --recid 15007 --server http://localhost
==> Downloading file 1 of 1
  -> File: ./15007/exactly2lep.zip

^C
$ ls -lh 15007/exactly2lep.zip
-rw-r--r-- 1 simko simko 15M Nov 19 11:40 15007/exactly2lep.zip

and run the same command again; the download restarts from scratch:

$ cernopendata-client download-files --recid 15007 --server http://localhost

The downloader should recognise the already available 15007/exactly2lep.zip file and continue from where it stopped.

Note that the user can interrupt the download any number of times before it completes successfully.
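For HTTP downloads, "continue from there" could be done with a Range request. Below is a minimal sketch, not the client's actual code: `build_range_header` and `resume_download` are hypothetical helper names, it uses only the Python standard library, and it assumes the server honours Range requests.

```python
import os
import urllib.request


def build_range_header(local_path):
    """Return (headers, offset) asking only for the bytes we do not have yet."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    # "bytes=N-" means "from byte N to the end of the file".
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    return headers, offset


def resume_download(url, local_path, chunk_size=1024 * 1024):
    """Append the missing tail of `url` to `local_path` (hypothetical helper)."""
    headers, _offset = build_range_header(local_path)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content: the server honoured the range, so append.
        # Anything else (e.g. 200 OK): the whole file is sent, so overwrite.
        mode = "ab" if response.status == 206 else "wb"
        with open(local_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Because the offset is recomputed from the local file size on every call, the download can be interrupted and re-run any number of times. If the local file is already complete, a server would typically answer 416, which `urlopen` raises as an `HTTPError`; a real implementation would check the remote size first.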

The behaviour could be configurable by a new --resume option, for example:

CC @katilp

ParthS007 commented 3 years ago

@tiborsimko

If the file is partial, the downloader would check whether the user passed the --resume option. If yes, it would continue from that point; if no, it would ask the user whether to resume or re-download.
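The decision described above can be sketched as a small pure function. This is only an illustration: `choose_action`, its return values, and the prompt text are hypothetical and not part of the client.

```python
def choose_action(local_size, remote_size, resume=False, ask=input):
    """Return 'download', 'skip', 'resume' or 'redownload'.

    local_size is None when no local file exists yet.
    `ask` is injectable so that the prompt can be replaced in tests.
    """
    if local_size is None:
        return "download"
    if local_size == remote_size:
        # Sizes match; a checksum comparison would still be needed to be sure.
        return "skip"
    if local_size > remote_size:
        # The local file cannot be a prefix of the remote one.
        return "redownload"
    if resume:
        return "resume"
    answer = ask("Partial file found. Resume? [Y/n] ").strip().lower()
    return "resume" if answer in ("", "y", "yes") else "redownload"
```

For example, `choose_action(15, 100, resume=True)` returns `"resume"` without prompting, while the same call without `resume=True` asks the user.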

I have a couple of musings.

  1. We can tell that a file is only partially downloaded when its size and checksum do not match those of the remote file.

  2. How do we plan to handle resuming a download? Should we go in the order requests -> pycurl -> xrootd and use the built-in functionality of each library, or do you have a different approach in mind?
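The check in point 1 could look roughly like the sketch below. It assumes the remote size and an Adler-32 checksum (as an 8-digit hex string) are available from the record metadata; `adler32_of` and `file_state` are hypothetical names.

```python
import os
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def file_state(path, remote_size, remote_adler32):
    """Classify a local file as 'missing', 'partial', 'complete' or 'corrupt'."""
    if not os.path.exists(path):
        return "missing"
    local_size = os.path.getsize(path)
    if local_size < remote_size:
        # Cheap size check first: no need to checksum a 15 MB stub of a 20 GB file.
        return "partial"
    if local_size == remote_size and adler32_of(path) == remote_adler32:
        return "complete"
    return "corrupt"
```

A "partial" result only says the file is shorter than expected; it does not prove the existing bytes are correct, so the checksum should still be verified once the resumed download finishes.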

ParthS007 commented 3 years ago

@katilp comment link on root forum: https://root-forum.cern.ch/t/running-time-dependence-on-cluster-distance-from-cern-for-jobs-on-gke-cluster-open-data/42447

ParthS007 commented 3 years ago

Resuming of downloads

  1. [x] Requests - https://github.com/cernopendata/cernopendata-client/pull/109
  2. [x] Pycurl - #117
  3. [ ] Xrootd -