Duke-GCB / DukeDSClient

Command line program to allow uploading, downloading, and managing projects in the duke-data-service.
MIT License

When downloading files validate their checksums #242

Closed: johnbradley closed this 5 years ago

johnbradley commented 5 years ago

Changes the download command to verify the contents of downloaded files. This is done by calculating an MD5 sum for each file and comparing it against the value provided by DukeDS.

Also prints messages informing the user about the progress of downloading and checking files. Example of the messaging for a successful download:

Fetching list of files/folders.
Downloading 1 files.
Done: 100%                         <- this line is where the download progress bar displays                         
Verifying contents of 1 downloaded files using file hashes.
All downloaded files have been verified successfully.
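
For illustration, the check described above (compute an MD5 sum for a local file and compare it with the value reported by DukeDS) could be sketched as follows. This is a minimal sketch and not the actual ddsclient implementation; the function names `md5_of_file` and `file_matches_hash` and the block size are assumptions made for this example.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it block by block.

    Reading in blocks keeps memory use flat for large downloads.
    (Hypothetical helper name; block size is an arbitrary choice.)
    """
    digest = hashlib.md5()
    with open(path, "rb") as infile:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for block in iter(lambda: infile.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def file_matches_hash(path, expected_md5):
    """Return True if the local file's MD5 equals the expected hex value.

    The comparison ignores case and surrounding whitespace in the
    expected value (an assumption about how the remote value is formatted).
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A caller would pass the local path of a downloaded file together with the hash value it received for that file, and treat a `False` result as a failed verification.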

Fixes #240

johnbradley commented 5 years ago

Notes on how invalid checksums play out on the DukeDS side.

non-chunked

A non-chunked (single-part) file that has an incorrect hash fails to create a download URL within DukeDS.

Error 400
Reason:reported hash value does not match size computed by StorageProvider
Suggestion:You must begin a new upload process

These files error with a 400 in the DukeDS portal.

chunked

For a chunked (multi-part) file that has an incorrect hash, the invalid file will download without error.

dleehr commented 5 years ago

A non-chunked (single-part) file that has an incorrect hash fails to create a download URL within DukeDS.

Just so I understand this case - you provided an incorrect hash to the DukeDS API for the file (but the correct hash for the single chunk). The upload succeeded, but an attempt to get a download URL for that file results in the above error?

johnbradley commented 5 years ago

For both of the examples above I provided a bogus checksum. In all cases DukeDS ingests the invalid checksum value and stores the file. A single-chunk file then cannot be downloaded (error 400). Multi-chunk files download just fine, even though they are invalid.

johnbradley commented 5 years ago

Because downloading is optimized to fetch URLs and file details together via /projects/{project_id}/files, any invalid single-chunk file prevents the whole project from downloading when using ddsclient.

johnbradley commented 5 years ago

I agree that showing the files and their checksums/status would be an improvement. Printing this information per file also has parity with the upload command that prints out the checksums of uploaded files.

johnbradley commented 5 years ago

Updated example successful download messaging:

Fetching list of files/folders.
Downloading 1 files.
Done: 100%                              
Verifying contents of 1 downloaded files using file hashes.
data/sample2.txt ba1f2511fc30423bdbb183fe33f3dd0f md5 OK
All downloaded files have been verified successfully.
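
A per-file report in the format shown above (path, hash value, hash algorithm, status) could be produced along these lines. This is a rough sketch, not the code in this PR: the function name `verify_downloads`, the input dictionary, and the `FAILED` status label are assumptions, since the thread only shows the `OK` case.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as infile:
        for block in iter(lambda: infile.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_downloads(expected_hashes):
    """Check each downloaded file and build one report line per file.

    expected_hashes: dict mapping local path -> expected MD5 hex string
    (a hypothetical input shape chosen for this example).
    Returns (lines, all_ok) where each line is "<path> <md5> md5 <status>".
    """
    lines = []
    all_ok = True
    for path, expected in sorted(expected_hashes.items()):
        actual = md5_of_file(path)
        ok = actual == expected.strip().lower()
        # "FAILED" is an assumed label; the thread only shows "OK".
        status = "OK" if ok else "FAILED"
        lines.append("{} {} md5 {}".format(path, actual, status))
        all_ok = all_ok and ok
    return lines, all_ok
```

The caller can print each returned line and then use `all_ok` to decide whether to print the final success message or report a problem.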