AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Automate checking of md5 hashes #25

Closed jashapiro closed 4 years ago

jashapiro commented 4 years ago

We ask submitters to provide md5 checksums for all files, but we do not yet have a system for systematically checking them after upload, or at other times.

S3 stores an ETag for each object, which is often (but not always) the md5 hash of its contents, so we can use it as a first pass for integrity checking with no additional calculations, retrieving only the object's headers. If the ETag does not match, we should download the S3 object and check it against the md5 we have stored.
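A rough sketch of that two-step check in Python, using only the standard library. The function names `etag_confirms_md5` and `file_md5` are hypothetical, and the ETag string is assumed to have been fetched already (for example with a HEAD request on the object). It also assumes that ETags which are not a plain 32-character hex digest, such as the `<hex>-<parts>` form produced by multipart uploads, cannot be compared directly, so those fall through to the download path.

```python
import hashlib
import re

# A plain MD5 digest is exactly 32 hex characters. Multipart uploads
# produce ETags with a "-<part count>" suffix, which will not match this.
_MD5_HEX = re.compile(r"^[0-9a-f]{32}$")


def etag_confirms_md5(etag: str, expected_md5: str) -> bool:
    """Return True only if the ETag is md5-shaped and equals the expected md5.

    False means "not confirmed", not "corrupt": the caller should then
    download the object and hash it (hypothetical helper: file_md5).
    """
    # S3 usually returns the ETag wrapped in double quotes; strip them.
    cleaned = etag.strip().strip('"').lower()
    expected = expected_md5.strip().lower()
    if not _MD5_HEX.match(cleaned):
        return False
    return cleaned == expected


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the md5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With this split, only objects where `etag_confirms_md5` returns `False` would need to be downloaded and passed through `file_md5`.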

It may also be useful to add a separate metadata tag with the md5 hash to each object, but I have not explored this yet.

jashapiro commented 4 years ago

A script to perform this check via a Nextflow workflow (ignoring ETags) was added in #30.

It is not "automatic", and it requires downloading every file, but it gets much of the work done. That seems sufficient, since the check should only really be needed to verify that the files are as expected upon receipt and perhaps occasionally afterward (and I think that the AWS tools perform hash checks with download/sync).

I am going to close this for now; if we decide we need to update the current workflow, it can be reopened later.