Closed jashapiro closed 3 years ago
After some investigation of ETags, it seems that in most of the cases we are looking at they are unlikely to be md5 values. Even worse, they don't seem to be stable when objects are copied around S3, so downloading the files and checking md5 may remain the best solution.
As mentioned in #46, the current md5 workflow in check-md5.nf downloads all files for a given run to disk before checking md5 hash values. This is pretty inefficient, and if a run has a large number of files or very large files, it requires a lot of disk space. In general, this isn't too big a deal: the disk space is only used for a fairly short time, so the cost is not too high, but it does still take a while just to download the files.
A different approach would be to try to check the md5 values on S3 without downloading, or potentially to check them while streaming the download.
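As a rough illustration of the "check while streaming" idea, the sketch below computes an md5 digest from any readable binary stream in fixed-size chunks, so the whole file never has to sit on disk or in memory. The function name `md5_of_stream` and the chunk size are assumptions made for this example, not part of the existing workflow. In principle the same loop should work on the streaming body returned by an S3 client (for example the `Body` of a boto3 `get_object` response, which exposes `read(amt)`), but that wiring is not shown or tested here.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Return the hex md5 digest of a binary stream, read chunk by chunk.

    `md5_of_stream` is a hypothetical helper for this sketch. Only one
    chunk (8 MiB by default) is held in memory at a time.
    """
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Stand-in for a download stream; a real run would pass the S3 body.
    fake_download = io.BytesIO(b"example file contents")
    print(md5_of_stream(fake_download))
```

The resulting hex string can then be compared against the expected md5 value for the file, in the same way the current workflow compares values after download.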
The former is sometimes possible, as mentioned earlier in #25:
The simplicity of the current approach was appealing, but it might be worth looking at again before too long.
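For the option of checking on S3 without downloading, a first filter could be to test whether an object's ETag even has the shape of an md5 digest. The sketch below is a minimal version of that test. It relies on two assumptions: that ETags for multipart uploads carry a `-N` suffix (commonly observed behaviour), and that a plain 32-character hex ETag is only a candidate md5, since AWS notes that some encrypted objects have ETags that are not md5 digests. The helper name `etag_could_be_md5` is hypothetical.

```python
import re

# A plain md5 digest is exactly 32 hexadecimal characters.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def etag_could_be_md5(etag: str) -> bool:
    """Return True if an ETag has the shape of a plain md5 digest.

    Hypothetical helper: a True result means the ETag is only a
    candidate md5, not a guaranteed one. ETags with a "-N" suffix
    (multipart uploads) return False.
    """
    # S3 returns ETags wrapped in double quotes; strip them first.
    cleaned = etag.strip().strip('"').lower()
    return _PLAIN_MD5.fullmatch(cleaned) is not None


if __name__ == "__main__":
    print(etag_could_be_md5('"5d41402abc4b2a76b9719d911017c592"'))
    print(etag_could_be_md5('"5d41402abc4b2a76b9719d911017c592-3"'))
```

Objects that fail this test would still need to be downloaded (or streamed) and hashed, so a check like this could at best reduce the number of downloads rather than replace them.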