downloader: resume interrupted downloads

tiborsimko commented 3 years ago

When one downloads a big 20 GB file, something can go wrong along the way, or the user can put their machine to sleep or restart it. When this happens and the user re-issues the download-files command, the download starts again from zero, even though the local directory already has parts of the file.

The goal of this issue is to resume interrupted downloads from the last good state.

How to reproduce:

Run a local CERN Open Data instance as follows:

$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/atlas-2020-exactly2lep.json
$ firefox http://localhost/record/15007

The test record contains a big 20 GB file.

Now start a download and interrupt it after a while:

$ cernopendata-client download-files --recid 15007 --server http://localhost
==> Downloading file 1 of 1
  -> File: ./15007/exactly2lep.zip

^C
$ ls -lh 15007/exactly2lep.zip
-rw-r--r-- 1 simko simko 15M Nov 19 11:40 15007/exactly2lep.zip

and re-issue the command, which currently starts the download from scratch again:

$ cernopendata-client download-files --recid 15007 --server http://localhost

The downloader should recognise the already available 15007/exactly2lep.zip file and continue from there.

Note that the user can interrupt the download any number of times until successful completion.

The behaviour could be configurable by a new --resume option, for example:

- when the downloader sees there is no target file yet, it proceeds to download as usual;
- when the downloader sees the file, it checks its size and checksum, and
  - if the file is complete, it says that there is nothing to download, because the file is already here and verified;
  - if the file is partial, it checks whether the user used the --resume option; if yes, it continues from that point, and if no, it asks the user whether to resume or redownload.

Since we have only big files, I guess the resume behaviour could be the default though, and we could ask the user to do rm ... on the given file if they would rather redownload than resume.
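
To make the proposed decision flow concrete, here is a rough sketch, not the client's actual code. It assumes the record metadata gives us the expected size and an Adler-32 checksum as a hex string; the function names and the returned action strings are made up for illustration, and the "redownload" branch for a full-size file with a wrong checksum is an extra case not discussed above.

```python
import os
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def plan_download(path, expected_size, expected_checksum, resume=True):
    """Decide what to do with a target file (hypothetical helper).

    Returns one of: "download", "skip", "resume", "ask", "redownload".
    """
    if not os.path.exists(path):
        return "download"  # no target file yet: download as usual

    size = os.path.getsize(path)
    if size < expected_size:
        # Partial file: continue if --resume was given, otherwise ask the user.
        return "resume" if resume else "ask"

    if size == expected_size and adler32_of(path) == expected_checksum:
        return "skip"  # complete and verified: nothing to download

    # Assumption: a full-size (or oversized) file with a bad checksum
    # cannot be resumed and has to be fetched again.
    return "redownload"
```

With resume as the default, a user who wants a fresh copy would remove the partial file first, which turns the next run into the plain "download" case.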

CC @katilp

ParthS007 commented 3 years ago

@tiborsimko

if the file is partial, it would check whether the user used --resume option, and if yes, then would continue from that point, and if no, it would ask the user whether the resume or redownload is wanted.

I have a couple of musings.

We will know that a file is only partially downloaded when its size and checksum do not match those of the remote file.
How do we plan to handle resuming the download? Will we go requests -> pycurl -> xrootd and use the built-in functionality of the respective library, or do you have a different approach in mind?
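
For the requests case, one possible approach is a plain HTTP Range request that appends to the existing file. The sketch below is only an illustration under assumptions: it presumes the server honours Range headers, the function name resume_download and its get parameter are hypothetical, and the get hook exists only so the helper can be exercised with a stub instead of a real server.

```python
import os


def resume_download(url, path, get=None, chunk_size=1024 * 1024):
    """Fetch the missing tail of `url` into `path`; return the final size."""
    if get is None:
        import requests  # imported lazily so a stub can be injected instead

        get = requests.get

    # Resume from however many bytes are already on disk.
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}

    with get(url, headers=headers, stream=True, timeout=60) as response:
        if response.status_code == 416:
            # Requested range not satisfiable: nothing left to fetch.
            return offset
        response.raise_for_status()
        # 206 means the server sent only the requested range, so append;
        # 200 means it ignored the Range header and sent the whole file.
        mode = "ab" if response.status_code == 206 else "wb"
        with open(path, mode) as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)

    return os.path.getsize(path)
```

After the transfer finishes, the size and checksum should still be verified against the remote values, since a resumed file can be corrupt even when its length matches.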

ParthS007 commented 3 years ago

@katilp comment link on root forum: https://root-forum.cern.ch/t/running-time-dependence-on-cluster-distance-from-cern-for-jobs-on-gke-cluster-open-data/42447

ParthS007 commented 3 years ago

Resuming of downloads

[x] Requests - https://github.com/cernopendata/cernopendata-client/pull/109
[x] Pycurl - #117
[ ] Xrootd -

cernopendata / cernopendata-client

downloader: resume interrupted downloads #99