HumanCellAtlas / ingest-central

Ingest Central is the hub repository for the ingest service
Apache License 2.0

Wranglers should be able to confirm that data files haven't been truncated. #252

Open justincc opened 5 years ago

justincc commented 5 years ago

From https://docs.google.com/spreadsheets/d/1TuTNj6CrBMnMAVwtMA0rj2zA1-wTKDsTFlEijpF4gQ8/edit#gid=533186563

claymfischer commented 5 years ago

Files move around a lot - perhaps the data generator copies the files from a server to a laptop, uses the HCA-CLI to copy to an upload area, and from there copies to ingest / data store, where the file is then downloaded by a data consumer. Checksumming is available at some of these five transfers, but not all.

Though not particularly common, issues do occur during these transfers. The most common one we have seen is a truncation of a fastq file, which is very easy to spot and would indicate there was a problem at one of the file transfers.

justincc commented 5 years ago

@claymfischer Thanks for the info.

lauraclarke commented 5 years ago

@justincc If we get a client-side file size and checksum (md5 is most common for other archives) it would be easier to spot. Fastq files don't have any end-of-file marker, so I don't think there is any other way to check.

claymfischer commented 5 years ago

@justincc We can determine truncation with a very basic fastq validator - the number of lines should be divisible by four, and there should be a quality score for every base called.
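
A minimal sketch of such a validator, assuming the common layout of exactly four lines per record (header, bases, `+` separator, quality string) with no wrapped sequences. The function name `check_fastq` is hypothetical and not part of any existing HCA tool. It only catches truncation that leaves an incomplete or inconsistent record behind.

```python
import io


def check_fastq(handle):
    """Return a list of problems found in a FASTQ text stream (empty if none).

    Assumes four lines per record: '@' header, bases, '+' separator, qualities.
    """
    problems = []
    record = []       # lines collected for the record being read
    record_no = 0     # number of complete records seen so far
    for line in handle:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        record_no += 1
        header, bases, separator, qualities = record
        if not header.startswith("@"):
            problems.append(f"record {record_no}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record_no}: separator does not start with '+'")
        # There should be one quality score for every base called.
        if len(bases) != len(qualities):
            problems.append(
                f"record {record_no}: {len(bases)} bases but "
                f"{len(qualities)} quality scores"
            )
        record = []
    # Leftover lines mean the line count was not divisible by four.
    if record:
        problems.append(
            f"file ends with an incomplete record ({len(record)} of 4 lines)"
        )
    return problems


# Example: a file cut off partway through the quality line of its only record.
truncated = io.StringIO("@read1\nACGT\n+\nII")
print(check_fastq(truncated))
```

The stream is read record by record rather than loaded whole, since fastq files can be very large. For a gzipped file, the handle could come from `gzip.open(path, "rt")`.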

I am not sure where checksums are being performed at this time, but I am curious if the initial upload does any. I'd reckon @willrockout and @hewgreen might know.

lauraclarke commented 5 years ago

@claymfischer That will only tell us if the file has been truncated mid fastq block, and won't tell us if it has been truncated at a fastq block boundary.

I am fairly certain client-side checksums or file sizes are not being collected at the moment, so there would need to be changes to the upload tool to make this work.

willrockout commented 5 years ago

@lauraclarke Agreed! I think for the future we should have labs transfer checksums over with files so we can compare across our process.

@claymfischer I believe checksums are being created when the files are uploaded to the UI.

lauraclarke commented 5 years ago

@willrockout Maybe we should find out how much work it would be for those to be auto-generated on the client side by the upload tool. That would save the contributor from having to calculate and provide them.

willrockout commented 5 years ago

@lauraclarke I can't imagine it would be that difficult, as it's just running a one-liner that creates a file with two columns (checksums, files) and another one-liner to check files against that file.
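
As a rough sketch of those two steps in Python, assuming MD5 and a manifest with one `checksum  filename` line per file: the function names and the manifest layout are illustrative assumptions, not what the upload tool currently does.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Step 1: write a two-column file of checksums and file names."""
    with open(manifest_path, "w") as out:
        for path in paths:
            out.write(f"{md5_of(path)}  {path}\n")


def verify_manifest(manifest_path):
    """Step 2: return the files that are missing or no longer match."""
    failures = []
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue  # ignore blank lines
            expected, name = line.rstrip("\n").split("  ", 1)
            if not Path(name).is_file() or md5_of(name) != expected:
                failures.append(name)
    return failures
```

The contributor side would call `write_manifest` before upload, and any later stage could call `verify_manifest` on its copy. An empty list means every file still matches, which also covers truncation at a record boundary.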

justincc commented 5 years ago

This sounds like a CLI ticket rather than ingest per se. Can we move it to the CLI repository for further consideration?