Improve efficiency of md5 checks

As mentioned in #46, the current md5 workflow in check-md5.nf downloads all files for a given run to disk before checking their md5 hash values. This is fairly inefficient: if a run has many files or very large files, it requires a lot of disk space. In general, this isn't too big a deal, because the disk space is only in use for a short time and so the cost stays low, but it does still take a while just to download the files.

A different approach would be to check the md5 values on S3 without downloading, or potentially to check them while streaming the download.

The former is sometimes possible, as mentioned earlier in #25:

S3 stores an etag which is often (but not always) the md5 hash, so we can use that as a first pass for integrity checking with no additional calculations and only retrieving the file header. In the case where this fails, we should download the S3 object to check against the md5 we have stored.
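
As a rough illustration of that first pass, here is a minimal Python sketch of the comparison step only. The function name `etag_confirms_md5` is hypothetical, and the sketch assumes the ETag string has already been fetched from the object's headers (it usually arrives wrapped in double quotes). ETags from multipart uploads carry a `-<part count>` suffix and are not a plain md5 of the object, and some encrypted objects also have non-md5 ETags, so anything other than an exact match is treated as "not confirmed" rather than as a failure.

```python
import re

# A plain md5 digest is exactly 32 hexadecimal characters.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def etag_confirms_md5(etag, expected_md5):
    """Return True only if the ETag is a plain md5 equal to expected_md5.

    False means "not confirmed", not "corrupt": the caller should fall back
    to downloading the object and hashing its contents.
    """
    # S3 returns the ETag wrapped in double quotes; strip them first.
    cleaned = etag.strip().strip('"').lower()
    # Multipart ETags look like "<hex>-<parts>" and never match this pattern.
    if not _PLAIN_MD5.fullmatch(cleaned):
        return False
    return cleaned == expected_md5.strip().lower()


if __name__ == "__main__":
    stored = "d41d8cd98f00b204e9800998ecf8427e"  # md5 of an empty file
    print(etag_confirms_md5('"d41d8cd98f00b204e9800998ecf8427e"', stored))
    print(etag_confirms_md5('"d41d8cd98f00b204e9800998ecf8427e-3"', stored))
```

Only objects for which this returns False would then need the full download and hash, which keeps the slow path for the cases where the ETag is not usable.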

The simplicity of the current approach was appealing, but it might be worth revisiting before too long.
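
For the other option, checking while streaming the download, a minimal Python sketch might look like the following. It assumes only a file-like object with a `read(size)` method; the names `md5_of_stream` and `stream_matches_md5` and the 8 MiB chunk size are illustrative choices, not anything from the current workflow. The point is that the hash is updated chunk by chunk, so nothing has to be written to disk.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=8 * 1024 * 1024):
    """Compute the md5 hex digest of a file-like object without storing it."""
    digest = hashlib.md5()
    # Read fixed-size chunks until read() returns empty bytes (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def stream_matches_md5(stream, expected_md5, chunk_size=8 * 1024 * 1024):
    """Return True if the streamed bytes hash to expected_md5."""
    return md5_of_stream(stream, chunk_size) == expected_md5.strip().lower()


if __name__ == "__main__":
    # An in-memory stream stands in for the body of a download here.
    payload = b"example file contents"
    expected = hashlib.md5(payload).hexdigest()
    print(stream_matches_md5(io.BytesIO(payload), expected, chunk_size=4))
```

Any object that supports `read(size)` should work in place of the `io.BytesIO` stand-in, which would include the streaming response body of an S3 download if the client library exposes one that way. This still transfers every byte, so it saves disk space rather than download time.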

AlexsLemonade / alsf-scpca

Improve efficiency of md5 checks #47