INCATools / ontology-development-kit

Bootstrap an OBO Library ontology
http://incatools.github.io/ontology-development-kit/
BSD 3-Clause "New" or "Revised" License

Goals that download: Check for MD5 hashes first #840

Open joeflack4 opened 1 year ago

joeflack4 commented 1 year ago

Overview

Some make goals download the latest release of an artefact from a stable URL. Often, though, the release artefact has not changed since the last run, and the goal downloads it again anyway, which slows down builds.

One solution: add a way to publish and fetch file hashes (e.g. MD5). Before downloading, compare the hash of the local copy against the hash published at the remote. If they are the same, skip the download.

Implementation

  1. Artefact publishing parties would need to store the hash somewhere easily accessible. The best way is probably at a URL following the pattern [NORMAL_ARTEFACT_URL].md5, for example https://github.com/monarch-initiative/omim/releases/latest/omim.ttl.md5
  2. Create a shell script or Python file (or simply insert/update a code block at the beginning of the make goal) that takes the normal artefact URL and the path to the local copy of the file as parameters. It would construct the MD5 URL, fetch the remote MD5, compute the MD5 of the local file (if it exists), compare the two, and return a boolean (say, TRUE if the hashes match); see the sketch after this list.
  3. If the hashes match, skip the download.
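A minimal sketch of step 2, assuming the remote hash lives at [NORMAL_ARTEFACT_URL].md5 as proposed in step 1. The script name, function names, and the md5sum-style hash file format are illustrative assumptions, not existing ODK code:

```python
#!/usr/bin/env python3
"""check_md5.py -- hypothetical helper for the hash-check step sketched above."""
import hashlib
import sys
import urllib.request


def local_md5(path):
    """Compute the MD5 hex digest of a local file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def up_to_date(artefact_url, local_path):
    """Return True if the local copy's MD5 matches the remote .md5 file.

    Assumes the publishing party stores the hash at [NORMAL_ARTEFACT_URL].md5.
    Any failure (missing local file, missing .md5 at the remote, network error)
    falls back to False so the download proceeds as it does today.
    """
    try:
        with urllib.request.urlopen(artefact_url + ".md5") as resp:
            # Tolerate an md5sum-style "HASH  filename" line as well as a bare hash.
            remote_hash = resp.read().decode("utf-8").strip().split()[0]
        return remote_hash == local_md5(local_path)
    except (OSError, IndexError):
        return False


if __name__ == "__main__":
    # Usage: check_md5.py ARTEFACT_URL LOCAL_PATH
    # Exit 0 if the hashes match (download can be skipped), 1 otherwise.
    sys.exit(0 if up_to_date(sys.argv[1], sys.argv[2]) else 1)
```

A make goal could then guard its download with something like `python check_md5.py $(URI) $(FILE) || wget -O $(FILE) $(URI)` (shown for illustration only).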

If for some reason it's impossible for the artefact publishing party to store the hash at a URL with the pattern described above, perhaps they could store it somewhere else (ideally another file at a URL with no contents other than the hash) instead, and we could have some kind of registry for those URLs.

gouttegd commented 1 year ago

The prepare_release goal in the ODK should also take care of computing the hashes for the release products; otherwise I fear that nobody is ever going to take the time to make the .md5 files "manually".
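As a rough illustration only (not existing ODK code), such a step could write an md5sum-style hash file next to each release product:

```python
import hashlib
import sys


def write_md5(product_path):
    """Write PRODUCT.md5 next to the release product, in md5sum style."""
    h = hashlib.md5()
    with open(product_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    with open(product_path + ".md5", "w") as out:
        out.write(f"{h.hexdigest()}  {product_path}\n")


if __name__ == "__main__":
    # e.g. python write_md5.py ont.owl ont.obo ont.json
    for product in sys.argv[1:]:
        write_md5(product)
```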

> If for some reason it's impossible for the artefact publishing party to store the hash at a URL with the pattern described above, perhaps they could store it somewhere else (ideally another file at a URL with no contents other than the hash) instead, and we could have some kind of registry for those URLs.

Very reluctant to do that. We are already dependent on one centralised system (purl.obolibrary.org) to distribute the ontologies themselves; let's not add another one just to distribute the hashes.

If the "publishing party" can host artefacts that easily run to dozens if not hundreds of MB, surely they can also host hash files. If they can't, too bad, but then let's just download the artefacts systematically.