divonlan / genozip

A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

FASTQ read change after genozipping #25

Closed MatthewPace98 closed 2 years ago

MatthewPace98 commented 2 years ago

We recently noticed an issue with our paired-end DNA FASTQ files compressed with genozip: after genounzipping, some reads had changed. We used genozip v12.0.37 with the following command: genozip --reference Homo_sapiens_assembly38.ref.genozip --pair file_R1.fastq file_R2.fastq --threads 8

We also used Process from Python's multiprocessing library to run 8 instances of genozip simultaneously.
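For readers trying to picture this setup, here is a minimal sketch of how such a parallel launch might look. It is not the reporter's actual script: the helper names (build_command, compress_pair, compress_all) are hypothetical, and the only genozip options used are the ones quoted in the command above.

```python
import subprocess
from multiprocessing import Process

# Reference file name taken from the command quoted in the report.
REFERENCE = "Homo_sapiens_assembly38.ref.genozip"


def build_command(r1, r2, threads=8):
    """Return the genozip argument list for one pair of FASTQ files."""
    return ["genozip", "--reference", REFERENCE,
            "--pair", r1, r2, "--threads", str(threads)]


def compress_pair(r1, r2):
    """Run genozip on one pair; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(r1, r2), check=True)


def compress_all(pairs):
    """Start one worker process per pair, wait for all, return exit codes."""
    workers = [Process(target=compress_pair, args=pair) for pair in pairs]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # A non-zero exit code means the worker failed (for example, genozip errored).
    return [worker.exitcode for worker in workers]
```

Calling compress_all with 8 pairs would start 8 genozip processes at once. Since each is given --threads 8, such a run could request up to 64 threads in total, so checking the returned exit codes is worthwhile on a heavily loaded machine.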

The original and genounzipped files are identical in size; only the nucleotide sequence line differs, and only in sporadic reads. Here’s an extract from one of the genounzipped files:

+
FF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF
@A00553:69:HYL2YDSXY:4:1101:6876:1000 1:N:0:TTGGACTC+CTGCTTCC
TATGCATTTCAATACTATAGGATTCACGTTAATAGAAATAACCAGATGAAATGCTTCTGGTATGTCACCTTCCCTACCCACATAAGCCAGTGTTTTTTTCTGTGAATAACAAAAACAGCAGAATTTACTTGCCTATCCGTAAGAAGTTACC

And the corresponding extract from the original file (note that the only difference is in the nucleotide sequence):

+
FF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF
@A00553:69:HYL2YDSXY:4:1101:6876:1000 1:N:0:TTGGACTC+CTGCTTCC
TCAGATCACAATGTATACAAATTTTTTTCCTGCTAGTTTTCTTTCACATTACTGCAATCTATCTCTTTTAAAAAAAGTATATAGTGCAGCTATTTCAGCCAGGCACGGTGGTTCATGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGG

We could not reproduce this result, and we could find no errors in the log files. Do you have any suggestions for a potential cause/remedy, please?
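A comparison like the one described above can be automated. The sketch below is one possible approach, not a tool from the thread: it assumes well-formed FASTQ files with exactly four lines per record, and the names read_records and diff_records are hypothetical. It reports, for each differing record, which of the four lines changed.

```python
from itertools import zip_longest

# The four lines of a FASTQ record, in order.
FIELDS = ("header", "sequence", "separator", "quality")


def read_records(path):
    """Yield FASTQ records as 4-tuples of lines (assumes 4 lines per record)."""
    with open(path) as handle:
        while True:
            record = tuple(handle.readline().rstrip("\n") for _ in range(4))
            if not record[0]:  # an empty header line means end of file
                break
            yield record


def diff_records(original_path, restored_path):
    """Yield (record_index, changed_field_names) for every differing record."""
    pairs = zip_longest(read_records(original_path), read_records(restored_path))
    for index, (original, restored) in enumerate(pairs):
        if original is None or restored is None:
            # One file has more records than the other.
            yield index, ("missing",)
        elif original != restored:
            changed = tuple(name for name, a, b in zip(FIELDS, original, restored)
                            if a != b)
            yield index, changed
```

For a case like the extract shown here, each affected read would be reported with ("sequence",) as the only changed field, which makes it easy to count how many reads were altered and where they sit in the file.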

divonlan commented 2 years ago

Under investigation. So far, unable to reproduce.

divonlan commented 2 years ago

Closing for now, as neither I nor the user can reproduce this, and it is possibly fixed in a newer version.

divonlan commented 2 years ago

Update: this issue does not reproduce. It happened on a specific machine that concurrently displayed unexplained errors in other tools, so faulty hardware is a possible cause. I will add another layer of validation in Genozip, to prevent this (very rare) environmental issue from affecting Genozip.
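The thread does not say what form the added validation takes. As a generic illustration only, and not Genozip's implementation, a user can guard against silent corruption by comparing a checksum of the original file with a checksum of the decompressed output before deleting the original. The function names below are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_ok(original_path, restored_path):
    """True if the decompressed file is byte-identical to the original."""
    return file_digest(original_path) == file_digest(restored_path)
```

A check like this only detects a mismatch; it cannot tell whether the corruption happened during compression, decompression, or on disk.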

divonlan commented 2 years ago

Update: this issue is now resolved in 13.0.21. See Release Notes.