CDCgov / datasets-sars-cov-2

Benchmark datasets for WGS analysis of SARS-CoV-2. (https://peerj.com/articles/13821/)
Apache License 2.0

CoronaHiT routine Illumina read files are truncated #13

Closed. BioWilko closed 2 years ago

BioWilko commented 2 years ago

Not sure what's causing this. I can manually fix the files, but I really shouldn't need to. It causes any paired-end analysis to fail instantly.

Upon further testing, the VOI/VOC read files are truncated too. I am fairly certain the method you are using to split the reads is flawed.

I've done a little digging and I think the use of fastq-dump --gzip is to blame. It is known to be buggy, and the user's built-in gzip should probably be used instead.

Correction: the cause was actually fastq-dump --split-files, not --gzip. I have submitted a PR.
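For anyone wanting to confirm the symptom before running a paired-end pipeline, here is a minimal sketch that compares record counts between the two mate files. It assumes gzipped, four-line-per-record FASTQ; the function names `count_fastq_records` and `pairs_match` are hypothetical and are not part of this repository.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            lines += 1
    if lines % 4 != 0:
        # A partial record usually means the file was cut off mid-write.
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def pairs_match(r1_path, r2_path):
    """Return True when both mate files hold the same number of records."""
    return count_fastq_records(r1_path) == count_fastq_records(r2_path)
```

A mismatch between the two counts is the kind of inconsistency that makes paired-end tools stop immediately. Matching counts do not prove the read names are in sync, so this is only a first check.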

lskatz commented 2 years ago

Thank you for this contribution. I'm going to look over this in a bit.

lskatz commented 2 years ago

So far this seems to work. I will hold off on accepting the PR until I square away all the new hashsums.
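Since changing how the reads are split changes the file contents, the recorded hashsums have to be regenerated. A minimal sketch of recomputing one follows, assuming SHA-256 (the thread does not say which hash is used); the helper name `sha256sum` is hypothetical.

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large read files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note that hashing the compressed file ties the hashsum to the exact gzip output, which can differ between gzip implementations and settings even when the reads are identical.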

BioWilko commented 2 years ago

Okay good to know, cheers for the update

lskatz commented 2 years ago

Thank you for your help on this. I have used your advice in the changes in #17.