Open justincc opened 5 years ago
Files move around a lot - perhaps the data generator copies the files from a server to a laptop, uses the HCA-CLI to copy to an upload area, and from there copies to ingest / data store, where the file is then downloaded by a data consumer. Checksumming is available at some of these five transfers, but not all.
Though not particularly common, issues do occur during these transfers. The most common one we have seen is a truncation of a fastq file, which is very easy to spot and would indicate there was a problem at one of the file transfers.
@claymfischer Thanks for the info.
@justincc If we get a client side file size and checksum (md5 is most common for other archives) it would be easier to spot. Fastq files don't have any end of file marker so I don't think there is any other way to check.
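To illustrate the size-plus-checksum idea, here is a minimal Python sketch of what could be recorded on the client before upload and recomputed after each transfer. The function name `size_and_md5` and the chunk size are assumptions for illustration only; nothing here reflects what the upload tool currently does.

```python
import hashlib
import os


def size_and_md5(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, md5_hex) for a file, reading it in chunks.

    Hypothetical helper: the name and chunk size are illustrative only.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large fastq files are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

If the pair computed at the source differs from the pair computed at the destination, the file changed in transit. A size mismatch alone is already enough to flag a truncation without rehashing.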
@justincc We can determine truncation with a very basic fastq validator - the number of lines should be divisible by four, and there should be a quality score for every base called.
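A rough Python sketch of such a basic validator follows. It assumes strict four-line fastq records (no wrapped sequence lines) and treats a `.gz` suffix as gzip; the function name `check_fastq` and the wording of the messages are made up for illustration.

```python
import gzip


def check_fastq(path):
    """Return a list of problems; an empty list means the basic checks passed.

    Assumes four lines per record: header, sequence, '+' line, qualities.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    problems = []
    record = []
    n_lines = 0
    with opener(path, "rt") as handle:
        for line in handle:
            n_lines += 1
            record.append(line.rstrip("\r\n"))
            if len(record) == 4:
                header, seq, plus, qual = record
                rec_no = n_lines // 4
                if not header.startswith("@"):
                    problems.append(f"record {rec_no}: header does not start with '@'")
                if not plus.startswith("+"):
                    problems.append(f"record {rec_no}: third line does not start with '+'")
                if len(seq) != len(qual):
                    # One quality score is expected for every base called.
                    problems.append(
                        f"record {rec_no}: {len(seq)} bases but {len(qual)} quality scores"
                    )
                record = []
    if record:
        # Leftover lines mean the file ended part-way through a record.
        problems.append(f"line count {n_lines} is not divisible by four")
    return problems
```

A file cut off exactly at the end of a record passes both checks, so this only catches truncation that lands inside a record.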
I am not sure where checksums are being performed at this time, but I am curious if the initial upload does any. I'd reckon @willrockout and @hewgreen might know.
@claymfischer That will only tell us if the file has been truncated mid fastq block; it won't tell us if it has been truncated at a fastq block boundary.
I am fairly certain client-side checksums or file sizes are not being collected at the moment, so there would need to be changes to the upload tool to make this work.
@lauraclarke Agreed! I think for the future we should have labs transfer checksums over with files so we can compare across our process.
@claymfischer I believe checksums are being created when the files are uploaded to the UI.
@willrockout maybe we should find out how much work it would be for those to be auto-generated on the client side by the upload tool, which would save the contributor from having to calculate and provide them.
@lauraclarke I can't imagine it would be that difficult, as it's just running a one-liner that creates a file with two columns (checksums, files) and another one-liner to check files against that file.
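For a sense of scale, the two steps could look roughly like the Python sketch below. It writes a two-column manifest (checksum, then file path, separated by two spaces, the layout GNU `md5sum` produces) and later checks files against it. The function names and the choice of md5 are assumptions for illustration, not the upload tool's actual behaviour.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Write one 'checksum  path' line per file (hypothetical helper)."""
    with open(manifest_path, "w") as out:
        for path in paths:
            out.write(f"{md5_of(path)}  {path}\n")


def verify_manifest(manifest_path):
    """Return the listed paths that are missing or whose checksum changed."""
    failed = []
    with open(manifest_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, name = line.split("  ", 1)
            if not Path(name).is_file() or md5_of(name) != expected:
                failed.append(name)
    return failed
```

The contributor side would call `write_manifest` before upload and send the manifest along with the data; any later stage can call `verify_manifest` on its own copy and report the returned paths.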
This sounds like a cli ticket rather than ingest per se. Can we move it to the cli repository for further consideration?
From https://docs.google.com/spreadsheets/d/1TuTNj6CrBMnMAVwtMA0rj2zA1-wTKDsTFlEijpF4gQ8/edit#gid=533186563