GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Support multiple clipping operations at the beginning of a read #11

Closed ernfrid closed 9 years ago

ernfrid commented 9 years ago

It's possible to have multiple hard clip and soft clip operations at the beginning of a read. Currently, I believe samblaster only parses the first of these as a "sclip" event and any subsequent as an end clip.

Here's a SAM example of the behavior. I believe 3/4 reads should be marked as duplicates but only 2 of 4 are.

@HD VN:1.3  SO:queryname
@SQ SN:1    LN:249250621
r1  99  1   15729   27  151M    =   16153   575 CAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r1  147 1   16153   27  151M    =   15729   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r2  1123    1   15734   27  5S146M  =   16153   575 CAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r2  1171    1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r3  1123    1   15734   27  5H146M  =   16153   575 CTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT  KAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A  NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r3  1171    1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r4  99  1   15734   27  2H3S146M    =   16153   575 GGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT   AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r4  147 1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
GregoryFaust commented 9 years ago

Yes, it is currently the case that only one "clip" opcode is expected at the beginning or end of a CIGAR string. If there are more than one at the beginning, all but the first will not be included in the start clip, and the last clip in the entire CIGAR string will be treated as the end clip.

Thanks for pointing this out. This is clearly a bug, and I will fix it as soon as I can. However, I am just curious as to what aligner or other tool produced this SAM file.

Thanks, Greg

ernfrid commented 9 years ago

I faked this data to illustrate the bug. I expect it is not a common occurrence.

GregoryFaust commented 9 years ago

Fixed in release 0.1.22.