quip compresson of SAM file

sandoro commented 11 years ago

I am evaluating quip for possible installation on a High Performance Computing cluster available to investigators at the National institutes of Health.

Using quip to compress a SAM file obtained from tophat output from mRNA data input, I see that in the quip decompression of the quip compressed SAM file, the TLEN field is always 0 (zero). Is this as expected? Assuming the answer is 'yes' I have these questions

How can the compression be validated after this change of the TLEN field?

Using the mRNA data avilable to me, I have verified that, in the case of two-segment alignments, the TLEN field is redundant in that it be recomputed from knowlege of the mated alignment segments.

Is this true in general? If so why doesn't quip put the TLEN value back during decompression?
Why do we expect the decompressed SAM file to be usable after the change of the TLEN field?
Why do we expect potential quip users to have confidence in quip?

dcjones commented 11 years ago

The TLEN issue was not intentional. There was a bug in how I was reading SAM/BAM files that I didn't catch.

I just released a new version to correct this: http://homes.cs.washington.edu/~dcjones/quip/quip-1.1.5.tar.gz

There are slight differences in the compressed output that make byte-for-byte comparison impossible. In particular, the order of SAM/BAM auxiliary fields is not preserved, and in FASTQ, if the read identifier is repeated prior to the quality string, it's discarded.

For that reason I wrote two programs fastqmd5 and bammd5 which compute md5 checksums while ignoring these "cosmetic" differences. They were always included but in 1.1.5 they are built and installed by default. This allows comparison like:

fastqmd5 < reads.fastq
quip -c -d reads.fastq.qp | fastqmd5

I use these to test correctness, using randomly generated data and actual datasets.

There are also redundant checksums stored in the compressed data, which can be verified with quip --test. Since this bug was in how I was reading SAM/BAM and and not in the compression algorithm itself, it unfortunately slipped through the cracks.

Verifying the correctness of quip is much harder than for general purpose compression algorithms as FASTQ is so ill-defined, and SAM/BAM is a very complex format that is not always correctly adhered to, but I take it very seriously and always fix reported bugs within a day or two if I'm able to confirm them.

sandoro commented 11 years ago

Thanks very much for your rapid response to my issue. I downloaded, built and tried quip-1.1.5 on a SAM file. It looks very good; the SAM file and the decompression of its quip-compression have the same byte count. I did encounter one problem you will want to correct:Making all in tests make[1]: Entering directory /usr/local/apps/quip/1.1.5/quip-1.1.5/tests' CC fastqmd5.o CC md5.o CCLD fastqmd5 fastqmd5.o: In functionmain': /usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:56: undefined reference to quip_in_open_file' /usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:62: undefined reference toquip_read' /usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:73: undefined reference to quip_in_close' collect2: ld returned 1 exit status make[1]: *** [fastqmd5] Error 1 make[1]: Leaving directory/usr/local/apps/quip/1.1.5/quip-1.1.5/tests' make: *\ [all-recursive] Error 1 The bammd5 build also had these undefined references and a few more. They are all in the archive `src/libquip.a' and this is missing from the tests/Makefile. My fix for this was$ cd tests $ ln -s ../src/libquip.a . And then in the tests/Makefile I replaced the line LIBS = -lpthread -lbz2 -lz by the line LIBS = -lpthread -lbz2 -lz -lquip Thanks again. --sandy Daniel Jones wrote:

The TLEN issue was not intentional. There was a bug in how I was reading SAM/BAM files that I didn't catch.

I just released a new version to correct this: http://homes.cs.washington.edu/~dcjones/quip/quip-1.1.5.tar.gz

There are slight differences in the compressed output that make byte-for-byte comparison impossible. In particular, the order of SAM/BAM auxiliary fields is not preserved, and in FASTQ, if the read identifier is repeated prior to the quality string, it's discarded.

For that reason I wrote two programs fastqmd5 and bammd5 which compute md5 checksums while ignoring these "cosmetic" differences. They were always included but in 1.1.5 they are built and installed by default. This allows comparison like:

fastqmd5 < reads.fastq
quip -c -d reads.fastq.qp | fastqmd5

I use these to test correctness, using randomly generated data and actual datasets.

There are also redundant checksums stored in the compressed data, which can be verified with quip --test. Since this bug was in how I was reading SAM/BAM and and not in the compression algorithm itself, it unfortunately slipped through the cracks.

Verifying the correctness of quip is much harder than for general purpose compression algorithms as FASTQ is so ill-defined, and SAM/BAM is a very complex format that is not always correctly adhered to, but I take it very seriously and always fix reported bugs within a day or two if I'm able to confirm them.

Reply to this email directly or view it on GitHub: https://github.com/dcjones/quip/issues/16#issuecomment-14412623

-- Sanford M. Orlow, Computer Engineer Center for Information Technology, National Institutes of Health 12 South Dr. 12/2208 , Bethesda, MD 20892-5680 PHONE 301.496.5362 FAX 301.496.0223 EMAIL sandor@mail.nih.gov

dcjones / quip

quip compresson of SAM file #16