Our PubMed download is currently somewhat constrained:
The PubMed site's robots.txt prevents us from downloading all of PubMed via HTTP, so we have to use FTP.
FTP doesn't have a reliable (or at least working) way to check file last-changed dates, so our recursive download currently starts at the beginning and re-downloads all PubMed files.
This means we can sometimes end up in a situation where a couple of files have failed or become corrupted during transfer, and we either need to re-download all the files or hack together a way to re-download just the broken ones. However, the recursive download also fetches the MD5 checksum for every file, which we could use to build a mechanism for detecting and working around this case (a sketch follows the list below):
If a file already exists in the PubMed download directory, verify it by checking its MD5 checksum against the expected value.
Somehow signal to the recursive download system that we don't want to re-download verified files.
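A minimal sketch of the verification step, assuming each downloaded data file (e.g. a `pubmed*.xml.gz`) sits next to a sidecar checksum file named `<file>.md5` in the local download directory; the exact naming, layout, and checksum-file format are assumptions here, not something the downloader guarantees:

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path: Path) -> str:
    """Parse the expected digest out of a sidecar .md5 file."""
    text = md5_path.read_text().strip()
    # NCBI-style checksum files typically look like
    # "MD5(pubmed24n0001.xml.gz)= <hex>"; take the token after '='.
    if "=" in text:
        return text.split("=", 1)[1].strip().lower()
    # Fall back to md5sum-style "<hex>  <filename>": take the first token.
    return text.split()[0].lower()


def is_verified(data_path: Path) -> bool:
    """Return True if the local file exists and matches its .md5 sidecar."""
    md5_path = data_path.with_name(data_path.name + ".md5")
    if not data_path.exists() or not md5_path.exists():
        return False
    return md5_of_file(data_path) == expected_md5(md5_path)
```

A skip list built from `is_verified()` could then feed whatever mechanism the recursive downloader exposes for excluding files; how that signal is actually passed depends on the tool driving the FTP mirror, so it is left out of the sketch.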