algbio / themisto

Space-efficient pseudoalignment with a colored de Bruijn graph
GNU General Public License v2.0
50 stars 4 forks source link

Incorrect number of read alignments returned #36

Closed harry-thorpe closed 5 months ago

harry-thorpe commented 5 months ago

I have just updated to v3.2.1 and am experiencing a strange error, where the forward reads return the correct number of alignments, but the reverse reads don't. The problem only occurs when gzipping and sorting the output file, and the number of alignments returned for the reverse reads varies each time I run the themisto command. I'm happy to provide input files so you can reproduce the problem, but these are close to 100mb so not sure if GitHub is the best place to do that. I have illustrated the problem below:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output zcat ali_1.aln.gz | wc -l 253658 -- correct number of output alignments

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output zcat ali_2.aln.gz | wc -l 223647 -- incorrect number of output alignments. The combination of gzipped and sorted output produces the error with these input reads

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --gzip-output zcat ali_1.aln.gz | wc -l 253658

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --gzip-output zcat ali_2.aln.gz | wc -l 253658

harry-thorpe commented 5 months ago

It also works with sorted output, but no gzipping:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output cat ali_2.aln | wc -l 253658

jnalanko commented 5 months ago

Thanks for reporting this. I will investigate with high priority.

harry-thorpe commented 5 months ago

Thanks, I just updated my original comment - the problem appears to be the combination of sorting and gzipping - it works fine with either sorting or gzipping, but not both. And only for one input read file, which happens to be the reverse (_2) file. Perhaps the typically worse quality reverse reads are more likely to induce an edge case (more variable characters on the read quality lines)?

jnalanko commented 5 months ago

Meanwhile, could you check on your data if this error occurs if you limit the number of threads to 1?

harry-thorpe commented 5 months ago

Yep, still occurs:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 1 --sort-output-lines --gzip-output

zcat ali_2.aln.gz | wc -l 253645

I checked with 1 thread for gzipping and sorting separately too - no error

jnalanko commented 5 months ago

I could not find a reason for why this might happen. Sound like it might be memory corruption since the answer varies between runs. Could you send the input files somehow? For example, Google drive, Dropbox, or similar?

harry-thorpe commented 5 months ago

Yep - here: https://drive.google.com/drive/u/5/folders/1X7K7M6K8qN4_o6ntGvSrtlBNuK01VYAp

jnalanko commented 5 months ago

Thank you for the files! I was able to reproduce this and found the bug. It turns out that the output flush function in the zstr library is a lie, and it doesn't actually flush anything. This resulted in the output sometimes not being fully written to disk before sorting, which meant that some of the last lines of the output sometimes went missing. The number of missing lines was nondeterministic probably because with parallelism, the buffer flushing can happen at unpredictable times, leading a different number of missing lines each time.

It's now fixed in this commit: https://github.com/algbio/themisto/commit/a0e0698d3e1699c1040913d931f466ee277451cb

This fix will be included in the binaries for Themisto v3.2.2, which I hope to release today, or tomorrow the latest.

jnalanko commented 5 months ago

Version 3.2.2 is out, fixing this bug: https://github.com/algbio/themisto/releases/tag/v3.2.2.

harry-thorpe commented 5 months ago

Amazing, thanks! Will try it today :)

harry-thorpe commented 5 months ago

Thanks, works fine after the update!

jnalanko commented 3 months ago

I just noticed I had accidentally compiled the v3.2.2 release binary with max k = 255. It still works, but on the most common use case of k = 31, during index construction it uses up to 8 times more disk and is up to 8 times slower than with the usual max k = 31 setting. I recompiled with k = 31 and updated the binary. If you're running construction on that binary, and are finding it slow, I recommend using the updated binary attached to the release.