AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Improve efficiency of md5 checks #47

Closed jashapiro closed 3 years ago

jashapiro commented 3 years ago

As mentioned in #46, the current md5 workflow in check-md5.nf downloads all files for a given run to disk before checking md5 hash values. This is pretty inefficient, and if a run has many files or very large files, it requires a lot of disk space. In general this isn't too big a deal: the disk space is only in use for a short time, so the cost is not too high. But it does still take a while just to download the files.

A different approach would be to try to check the md5 values on S3 without downloading, or potentially check while streaming the download?
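As a rough illustration of the "check while streaming" idea, here is a minimal Python sketch. It is not code from check-md5.nf, and the function name `md5_of_stream` and the chunk size are assumptions made for the example. It hashes any binary file-like object chunk by chunk, so nothing is written to disk and only one chunk is held in memory at a time.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=8 * 1024 * 1024):
    """Return the hex md5 of a binary file-like object, read in chunks.

    `stream` only needs a `read(n)` method that returns b"" at end of data.
    (Assumption: an S3 response body would be passed in here; boto3's
    StreamingBody exposes such a `read(amt)` method.)
    """
    digest = hashlib.md5()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Stand-in for a download stream; a tiny chunk size forces several reads.
    fake_download = io.BytesIO(b"example file contents")
    print(md5_of_stream(fake_download, chunk_size=4))
```

This still transfers every byte, so it would save disk space rather than download time.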

The former is sometimes possible, as mentioned earlier in #25:

S3 stores an etag which is often (but not always) the md5 hash, so we can use that as a first pass for integrity checking with no additional calculations and only retrieving the file header. In the case where this fails, we should download the S3 object to check against the md5 we have stored.
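The first-pass logic described in that quote could look something like the sketch below. The function name `etag_confirms_md5` is hypothetical, and the ETag string is assumed to have been fetched already (for example from a HEAD request on the object). A `True` result means the ETag matched the stored md5; anything else means "fall back to downloading and hashing", since a non-matching ETag does not by itself show the file is corrupt.

```python
import re

# A plain md5 hex digest is exactly 32 hexadecimal characters.
PLAIN_MD5 = re.compile(r"^[0-9a-f]{32}$")


def etag_confirms_md5(etag, expected_md5):
    """Return True only if the ETag is a plain md5 that equals expected_md5.

    False means the ETag cannot confirm integrity (it may not be an md5 at
    all), so the caller should download the object and hash it instead.
    """
    # ETags are returned wrapped in double quotes; strip them before comparing.
    cleaned = etag.strip().strip('"').lower()
    if not PLAIN_MD5.match(cleaned):
        return False
    return cleaned == expected_md5.strip().lower()


if __name__ == "__main__":
    stored = "0123456789abcdef0123456789abcdef"  # made-up md5 for illustration
    print(etag_confirms_md5('"0123456789abcdef0123456789abcdef"', stored))
    print(etag_confirms_md5('"0123456789abcdef0123456789abcdef-4"', stored))
```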

The simplicity of the current approach was appealing, but it might be worth looking at again before too long.

jashapiro commented 3 years ago

After some investigation of ETags, it seems that they are quite likely not to be md5 for most cases we are looking at. Even worse, they don't seem to be stable when copying around S3... so sticking with downloading and checking md5 may remain the best solution.
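For context on why this happens, the sketch below imitates the commonly observed ETag scheme for multipart uploads: the md5 of the concatenated per-part md5 digests, followed by a dash and the part count. This is an assumption based on widely reported behavior, not a documented guarantee, and `multipart_style_etag` is a hypothetical helper. It shows that the result is not the md5 of the file and changes with the part size, which would be consistent with ETags shifting when objects are copied with different settings.

```python
import hashlib


def multipart_style_etag(data, part_size):
    """Imitate the commonly observed multipart ETag for `data`.

    Assumption: the ETag is md5(concatenated binary md5 of each part),
    hex-encoded, plus "-<number of parts>". Not an official contract.
    """
    part_digests = [
        hashlib.md5(data[start:start + part_size]).digest()
        for start in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


if __name__ == "__main__":
    data = b"a" * 25
    print(hashlib.md5(data).hexdigest())   # plain md5 of the content
    print(multipart_style_etag(data, 10))  # 3 parts
    print(multipart_style_etag(data, 5))   # 5 parts, different value
```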