divonlan / genozip

A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

Possible bug in BAM compression? #6

Closed lweasel closed 3 years ago

lweasel commented 3 years ago

Hi,

Really nice tool! The speed and compression improvements over, e.g. gzip, are very impressive.

I think there may be a potential bug in the compression of BAM files. Although the BAM file that I was originally trying has millions of records, I narrowed it down to the following. If I run genozip (v11.0.2) on a SAM file containing the following line, it works fine (genozip --threads 1 -f test.sam):

NS500125:680:HNHVYBGXG:2:11209:16805:14650 256 4 145637796 1 9M1494270N67M * 0 GAGTACGGGGAAGTCATGGAGGGAGACTAGTGCCTAGTATTTGCGGTGCCTGAAAACTTTCTTAAGAAGCAGTTGT A/AAAEEEEEEEEEEEEEAE/EAEEEEEE6AEAEEEEEEEEAEEE<EAAEEEEEEEEEEEEE/EEEAEEEEAAEAE NH:i:4 HI:i:4 AS:i:69 nM:i:1 XS:A:+

However, if I convert that SAM file to a BAM file (I'm using sambamba: sambamba view -S -f bam test.sam -o test.bam), and run genozip --threads 1 -f test.bam, I get the following output:

genozip test.bam : 0% op_len=1 too long in vb=1494270:
[1] 28905 abort (core dumped) genozip --threads 1 -f test.bam

I think that it is complaining about the length of the number in the middle of the CIGAR string (i.e. 1494270). If I remove one digit from that number, and reconvert the SAM file to BAM, then genozip works without error.
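As a cross-check on the suspicion above, here is a minimal Python sketch of how the BAM format packs CIGAR operations, following the SAM/BAM specification: each operation is stored as a 32-bit little-endian integer, `op_len << 4 | op`, which leaves 28 bits for the length. It shows that 1494270 is a valid length in BAM itself. The function names are hypothetical, and this is not genozip's code; it says nothing about where genozip's own check went wrong.

```python
import re
import struct

# Per the SAM/BAM specification, CIGAR operations map to codes 0..8 in this order.
CIGAR_OPS = "MIDNSHP=X"
# The low 4 bits hold the op code, so the length has 28 bits available.
MAX_OP_LEN = (1 << 28) - 1


def parse_cigar(cigar):
    """Split a SAM CIGAR string into (length, op) pairs (hypothetical helper)."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings with anything left over after the length/op pairs.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def encode_cigar(cigar):
    """Pack a CIGAR string into BAM's binary layout: uint32 = op_len << 4 | op."""
    values = []
    for length, op in parse_cigar(cigar):
        if length > MAX_OP_LEN:
            raise ValueError(f"op_len {length} does not fit in 28 bits")
        values.append((length << 4) | CIGAR_OPS.index(op))
    return struct.pack(f"<{len(values)}I", *values)


def decode_cigar(raw):
    """Unpack BAM CIGAR bytes back into a SAM CIGAR string."""
    values = struct.unpack(f"<{len(raw) // 4}I", raw)
    return "".join(f"{v >> 4}{CIGAR_OPS[v & 0xF]}" for v in values)


if __name__ == "__main__":
    cigar = "9M1494270N67M"  # the CIGAR from the record in this report
    raw = encode_cigar(cigar)
    print(len(raw), "bytes;", "round trip:", decode_cigar(raw))
```

The CIGAR from the failing record encodes and decodes cleanly here, so the long `N` (skipped region) operation is within the limits of the BAM format.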

divonlan commented 3 years ago

Thanks Owen for reporting this! I have fixed the bug.

lweasel commented 3 years ago

That's great, thank you!