Closed sandoro closed 11 years ago
The TLEN issue was not intentional. There was a bug in how I was reading SAM/BAM files that I didn't catch.
I just released a new version to correct this: http://homes.cs.washington.edu/~dcjones/quip/quip-1.1.5.tar.gz
There are slight differences in the compressed output that make byte-for-byte comparison impossible. In particular, the order of SAM/BAM auxiliary fields is not preserved, and in FASTQ, if the read identifier is repeated prior to the quality string, it's discarded.
For that reason I wrote two programs fastqmd5
and bammd5
which compute md5 checksums while ignoring these "cosmetic" differences. They were always included but in 1.1.5 they are built and installed by default. This allows comparison like:
fastqmd5 < reads.fastq
quip -c -d reads.fastq.qp | fastqmd5
I use these to test correctness, using randomly generated data and actual datasets.
There are also redundant checksums stored in the compressed data, which can be verified with quip --test
. Since this bug was in how I was reading SAM/BAM and and not in the compression algorithm itself, it unfortunately slipped through the cracks.
Verifying the correctness of quip is much harder than for general purpose compression algorithms as FASTQ is so ill-defined, and SAM/BAM is a very complex format that is not always correctly adhered to, but I take it very seriously and always fix reported bugs within a day or two if I'm able to confirm them.
Thanks very much for your rapid response to my issue. I downloaded,
built and tried quip-1.1.5 on a SAM file. It looks very good; the SAM
file and the decompression of its quip-compression have the same byte
count.
I did encounter one problem you will want to correct:Making all in tests
make[1]: Entering directory
/usr/local/apps/quip/1.1.5/quip-1.1.5/tests' CC fastqmd5.o CC md5.o CCLD fastqmd5 fastqmd5.o: In function
main':
/usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:56: undefined
reference to quip_in_open_file' /usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:62: undefined reference to
quip_read'
/usr/local/apps/quip/1.1.5/quip-1.1.5/tests/fastqmd5.c:73: undefined
reference to quip_in_close' collect2: ld returned 1 exit status make[1]: *** [fastqmd5] Error 1 make[1]: Leaving directory
/usr/local/apps/quip/1.1.5/quip-1.1.5/tests'
make: *\ [all-recursive] Error 1
The bammd5 build also had these undefined references and a few more.
They are all in the archive `src/libquip.a'
and this is missing from the tests/Makefile. My fix for this was$ cd tests
$ ln -s ../src/libquip.a .
And then in the tests/Makefile I replaced the line
LIBS = -lpthread -lbz2 -lz
by the line
LIBS = -lpthread -lbz2 -lz -lquip
Thanks again.
--sandy
Daniel Jones wrote:
The TLEN issue was not intentional. There was a bug in how I was reading SAM/BAM files that I didn't catch.
I just released a new version to correct this: http://homes.cs.washington.edu/~dcjones/quip/quip-1.1.5.tar.gz
There are slight differences in the compressed output that make byte-for-byte comparison impossible. In particular, the order of SAM/BAM auxiliary fields is not preserved, and in FASTQ, if the read identifier is repeated prior to the quality string, it's discarded.
For that reason I wrote two programs fastqmd5
and bammd5
which compute md5 checksums while ignoring these "cosmetic" differences. They were always included but in 1.1.5 they are built and installed by default. This allows comparison like:
fastqmd5 < reads.fastq
quip -c -d reads.fastq.qp | fastqmd5
I use these to test correctness, using randomly generated data and actual datasets.
There are also redundant checksums stored in the compressed data, which can be verified with quip --test
. Since this bug was in how I was reading SAM/BAM and and not in the compression algorithm itself, it unfortunately slipped through the cracks.
Verifying the correctness of quip is much harder than for general purpose compression algorithms as FASTQ is so ill-defined, and SAM/BAM is a very complex format that is not always correctly adhered to, but I take it very seriously and always fix reported bugs within a day or two if I'm able to confirm them.
Reply to this email directly or view it on GitHub: https://github.com/dcjones/quip/issues/16#issuecomment-14412623
-- Sanford M. Orlow, Computer Engineer Center for Information Technology, National Institutes of Health 12 South Dr. 12/2208 , Bethesda, MD 20892-5680 PHONE 301.496.5362 FAX 301.496.0223 EMAIL sandor@mail.nih.gov
I am evaluating quip for possible installation on a High Performance Computing cluster available to investigators at the National institutes of Health.
Using quip to compress a SAM file obtained from tophat output from mRNA data input, I see that in the quip decompression of the quip compressed SAM file, the TLEN field is always 0 (zero). Is this as expected? Assuming the answer is 'yes' I have these questions
Using the mRNA data avilable to me, I have verified that, in the case of two-segment alignments, the TLEN field is redundant in that it be recomputed from knowlege of the mated alignment segments.