galaxyproject / cargo-port

cache packages permanently
https://depot.galaxyproject.org/software/
MIT License
12 stars 34 forks source link

Support verifying md5 hashes in addition to sha256 #169

Closed natefoo closed 3 years ago

natefoo commented 3 years ago

This change makes it possible to verify against an MD5 hash instead of a SHA256. It also will automatically use MD5 hashes from bioconda recipes if a sha256 hash is not specified in the recipe. It will also refuse to download an archive if a hash is not provided in the TSV.

Automatic mirroring via Bioconda recipes is broken because the stored hash (an empty string) does not match the actual download's hash. At first I thought this was a problem only for Bioconductor packages because the current Bioconductor package recipes only include an md5 and not a sha256 in their meta.yamls but then I discovered that the conda-to-cargo-port scripts never actually write the sha256 hash to check against. So I can't figure out how the bioconda mirroring ever worked for any package. If someone knows what is going on that would be nice since I'm not confident my changes here aren't going to break anything.

The hash verification failure is the cause of the issue reported in galaxyproject/galaxy-hub#721. Right now the daily Jenkins job that runs to update cargo-port pulls all of the new versions of Bioconductor packages but fails to store them because of the hash verification failure, then repeats again the next day:

INFO:root:URL missing, downloading https://bioconductor.org/packages/3.12/bioc/src/contrib/annotate_1.68.0.tar.gz to bioconductor-annotate/bioconductor-annotate_1.68.0_src_all.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1828k  100 1828k    0     0  15.9M      0 --:--:-- --:--:-- --:--:-- 15.9M
INFO:root:File hash d4fc1309f27b560a7438e2b8ec60f99aad09a79bad2b625115a398d2432a14e7
ERROR:root:File has bad hash! d4fc1309f27b560a7438e2b8ec60f99aad09a79bad2b625115a398d2432a14e7 !=  in bioconductor-annotate/bioconductor-annotate_1.68.0_src_all.tar.gz

This change should fix that error. Note that non-Bioconductor pacakges and anything else that does not fetch from Git (the process for which does not do hash verification) is failing similarly. So again, I have no idea how this ever worked for Bioconda packages.

natefoo commented 3 years ago

But Maybe we disable the 'delete' step of the verifier just for a few days to make sure it works? that's the only unsafe portion and would be sufficient risk mitigation.

Do you mean the call to cleanup_file in bin/process_urls.py? Or somewhere else?

Thanks for the review!

natefoo commented 3 years ago

I forgot to push a commit I made yesterday that re-adds requiring a hash and skips downloading if a hash isn't specified.

hexylena commented 3 years ago

yes, precisely. specifically this one: https://github.com/galaxyproject/cargo-port/blob/master/bin/process_urls.py#L82

(or maybe we hide it behind an environment variable CARGO_PORT_SAFE_MODE=true or so.)

natefoo commented 3 years ago

My only concern there is it may leave behind stuff I will have to find and clean up (although I am not clear on whether that fetch is to a staging area rather than the actual live depot since I didn't dig into it past this step). Also, if it's fetched but fails hash verification, the file will continue to exist and subsequent runs won't attempt to be fetched anymore due to the conditional above.

hexylena commented 3 years ago

that's fair. I'm not sure either.

mtmorgan commented 3 years ago

Thanks for looking into this; what is the time frame for deployment?

bgruening commented 3 years ago

@natefoo is that enough to get it running?

natefoo commented 3 years ago

I believe the Jenkins job that runs this will use the merged version so this should hopefully take effect on tomorrow's run.