dcjones / quip

Compressing next-generation sequencing data with extreme prejudice.
http://www.cs.washington.edu/homes/dcjones/quip/
BSD 3-Clause "New" or "Revised" License

'*' in quality strings #9

Closed jkbonfield closed 12 years ago

jkbonfield commented 12 years ago

Firstly, great work on making quip such a good all-rounder. I'm liking it.

However, in my tests I've found a couple of issues which are probably trivial to fix.

Firstly, any '*' character in a quality string is converted to space. This equates to quality -1, so I suspect some function errored and the return code wasn't checked?

Secondly, and more major, a quality string starting with a '*' character is completely replaced with just "*". I.e. it is treated as a read with no quality. It's good that quip supports this, but obviously it needs a quick strlen check to make sure the string is only 1 character long.

James
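The two points above can be illustrated with a minimal sketch in Python (not quip's own C code; the function names `phred33` and `qual_is_missing` are hypothetical). It assumes the usual Phred+33 encoding of SAM quality strings, where '*' is a valid score of 9 and a space would decode to -1, and it treats a quality field as absent only when the whole field is exactly "*".

```python
def phred33(ch: str) -> int:
    """Decode one Phred+33 quality character to its integer score."""
    return ord(ch) - 33


def qual_is_missing(qual: str) -> bool:
    """Return True only when the whole QUAL field is the single character '*'.

    A longer string that merely starts with '*' is a real quality string
    whose first score happens to be 9.
    """
    return qual == "*"


print(phred33("*"))                    # 9
print(phred33(" "))                    # -1
print(qual_is_missing("*"))            # True
print(qual_is_missing("*4((%89@99B"))  # False
```

Comparing the whole string, rather than only its first character, plays the role of the strlen check suggested above.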

dcjones commented 12 years ago

Thanks James!

I'm having some difficulty reproducing this. I've been generating random fastq reads with '*' in the quality string, but haven't run into problems yet. Could you give me a read that it fails on?
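For reference, a test generator of the kind described here might look like the following sketch. It is an assumption about the approach, not the script actually used: the function name `random_fastq_record` and its parameters are hypothetical, and quality characters are drawn from Phred+33 scores 0 to 40. The `leading_star` flag forces the case where the quality string begins with '*'.

```python
import random


def random_fastq_record(name, length, rng, star_rate=0.1, leading_star=False):
    """Build one four-line FASTQ record with '*' scattered in the qualities."""
    bases = "".join(rng.choice("ACGTN") for _ in range(length))
    # Phred+33 characters '!' (Q0) through 'I' (Q40); '*' (Q9) is in range.
    alphabet = [chr(c) for c in range(33, 74)]
    qual = [rng.choice(alphabet) for _ in range(length)]
    # Overwrite a fraction of positions with '*' so the case is well covered.
    for i in range(length):
        if rng.random() < star_rate:
            qual[i] = "*"
    if leading_star and length > 0:
        qual[0] = "*"
    return "@{}\n{}\n+\n{}\n".format(name, bases, "".join(qual))


rng = random.Random(0)  # fixed seed so a failing read can be regenerated
print(random_fastq_record("r1", 20, rng, leading_star=True), end="")
```

Note that this only exercises FASTQ input; the report below uses SAM input (`-i sam`).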

jkbonfield commented 12 years ago

I'm not sure how to get tabs quoted perfectly here, but I can email you direct if you need.

This is a minimal subset of some larger data, with auxiliary fields removed to simplify the comparison.

$ cat _.sam
@HD VN:1.0 SO:unsorted
@SQ SN:chr2 LN:243199373
@SQ SN:chr11 LN:135006516
SRR027520.15|SL-XBD_812090924:7:1:0:530~1 0 chr2 58521014 42 76M * 0 0 NGGTTGCAGTGCTTTCTGGAGTAATGAAGGGATCCAGTGCACAGGGTTTAACTTACTGTGCACAATCTCTAAGCCA !_4:886.(32::9:::6/7009:777:67(79:99)37799632439778#########################
SRR027520.17|SL-XBD812090924:7:1:0:1755~1 16 chr11 3887487 42 76M * 0 0 TTACCTTTCCTTCTTTAGATCTACTTCCAGTCCTCTATGAAGTCTTTCCTGACACTAATTCTAGCTGTGAGATACN ########################################9429987709;;6(;9994(49;907;0;;6;/!
SRR027520.389|SL-XBD_812090924:7:1:3:73~1 4 * 0 0 * * 0 0 TAAATCCCGTCTCCATCACAACCAGTCACGCGTGGCGGCGTCCTCCCCCGCTTCCACCCCCTCTCGCGTCTGTATC 4((%89@99B;010%3<##########################################################

quip-111 is my build of quip-1.1.1 from source tarball.

$ ./quip-111 -i sam _.sam
[samopen] SAM header is present: 2 sequences.

$ cp _.sam.qp _2.sam.qp; ./quip-111 -d -o sam _2.sam.qp
$ diff _.sam _2.sam
4,6c4,6
< SRR027520.15|SL-XBD_812090924:7:1:0:530~1 0 chr2 58521014 42 76M * 0 0 NGGTTGCAGTGCTTTCTGGAGTAATGAAGGGATCCAGTGCACAGGGTTTAACTTACTGTGCACAATCTCTAAGCCA !_4:886.(32::9:::6/7009:777:67(79:99)37799632439778#########################
< SRR027520.17|SL-XBD812090924:7:1:0:1755~1 16 chr11 3887487 42 76M * 0 0 TTACCTTTCCTTCTTTAGATCTACTTCCAGTCCTCTATGAAGTCTTTCCTGACACTAATTCTAGCTGTGAGATACN ########################################9429987709;;6(;9994(49;907;0;;6;/*!
< SRR027520.389|SL-XBD_812090924:7:1:3:73~1 4 * 0 0 * * 0 0 TAAATCCCGTCTCCATCACAACCAGTCACGCGTGGCGGCGTCCTCCCCCGCTTCCACCCCCTCTCGCGTCTGTATC *4((%89@99B;010%3<
---
> SRR027520.15|SL-XBD_812090924:7:1:0:530~1 0 chr2 58521014 42 76M * 0 0 NGGTTGCAGTGCTTTCTGGAGTAATGAAGGGATCCAGTGCACAGGGTTTAACTTACTGTGCACAATCTCTAAGCCA ! 4:886.(32::9:::6/7009:777:67(79:99)37799632439778#########################
> SRR027520.17|SL-XBD_812090924:7:1:0:1755~1 16 chr11 3887487 42 76M * 0 0 TTACCTTTCCTTCTTTAGATCTACTTCCAGTCCTCTATGAAGTCTTTCCTGACACTAATTCTAGCTGTGAGATACN ########################################9429987709;;6(;9994(49;907;0;;6;/ !
> SRR027520.389|SL-XBD_812090924:7:1:3:73~1 4 * 0 0 * * 0 0 TAAATCCCGTCTCCATCACAACCAGTCACGCGTGGCGGCGTCCTCCCCCGCTTCCACCCCCTCTCGCGTCTGTATC *

As you can see, the '*' characters in the quality strings have either changed to spaces or, in the last case, where the string started with '*', the whole string was truncated to that character only.
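The two failure modes described here can be told apart mechanically. The sketch below is a hypothetical helper, not part of quip: it assumes tab-separated SAM records with the quality string in the eleventh column, and the function names `qual_field` and `describe_change` as well as the short sample records are made up for illustration.

```python
def qual_field(sam_line: str) -> str:
    """Return the QUAL column (11th tab-separated field) of a SAM record."""
    return sam_line.rstrip("\n").split("\t")[10]


def describe_change(before: str, after: str) -> str:
    """Classify how a quality string changed across a round trip."""
    if before == after:
        return "unchanged"
    if after == "*" and before.startswith("*") and len(before) > 1:
        return "truncated to '*'"
    if len(before) == len(after) and all(
        a == b or (b == "*" and a == " ") for b, a in zip(before, after)
    ):
        return "'*' replaced by space"
    return "other difference"


# Made-up records: 10 fixed columns followed by the quality string.
prefix = "read1\t0\tchr2\t100\t42\t4M\t*\t0\t0\tACGT\t"
print(describe_change(qual_field(prefix + "#*#!"), qual_field(prefix + "# #!")))
# '*' replaced by space
print(describe_change(qual_field(prefix + "*4(("), qual_field(prefix + "*")))
# truncated to '*'
```

Running such a check over every record of the original and decompressed files would list exactly which reads are affected.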

jkbonfield commented 12 years ago

Ugh sorry for the rubbish formatting. I see it's stripped out the tabs and replaced by * too. Anyway hopefully it's sufficient.

dcjones commented 12 years ago

Fixed now. Thanks for finding this.