TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Improve PubMed download #352

Open gaurav opened 1 month ago

gaurav commented 1 month ago

Our PubMed download is currently somewhat constrained:

This means that when a couple of files fail or are corrupted during transfer, we either need to re-download all the files or come up with some hacky way to re-download just the broken ones. However, the recursive download also fetches the MD5 checksum for every file, which we could use to build in a mechanism for detecting and working around this case:

  1. If files already exist in the PubMed download directories, verify each one by checking its MD5 checksum against the expected value.
  2. Somehow signal to the recursive download system that we don't want to re-download verified files.
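
A rough sketch of what step 1 could look like, under some assumptions that are not confirmed in this issue: each data file `X.xml.gz` sits next to a checksum file `X.xml.gz.md5`, and that checksum file contains a 32-character hexadecimal digest somewhere in its text. The function names (`md5_of_file`, `expected_md5`, `files_needing_download`) are hypothetical and not existing Babel code.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the .md5 file contains the digest as 32 contiguous hex digits.
MD5_HEX = re.compile(r"\b[0-9a-fA-F]{32}\b")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Return the first 32-hex-digit string in the checksum file, or None."""
    match = MD5_HEX.search(Path(md5_path).read_text())
    return match.group(0).lower() if match else None


def files_needing_download(directory, pattern="*.xml.gz"):
    """List names of local files that are unverified or fail verification.

    A file is reported if its .md5 companion is missing or unparseable,
    or if the computed digest does not match the expected one.
    """
    to_fetch = []
    for data_file in sorted(Path(directory).glob(pattern)):
        md5_file = data_file.with_name(data_file.name + ".md5")
        expected = expected_md5(md5_file) if md5_file.exists() else None
        if expected is None or md5_of_file(data_file) != expected:
            to_fetch.append(data_file.name)
    return to_fetch
```

This only inspects files that are already on disk; files that were never downloaded would still have to be picked up by the downloader itself. How the resulting list is passed to the recursive download (step 2) is left open here, since it depends on the download tool in use.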