amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

Duplicate marking produces malformed QS tags in SAM output #145

Open eboyden opened 2 years ago

eboyden commented 2 years ago

With -so and writing to bam (file or stdout), every read gets a QS tag, including duplicates. But when writing to sam (file or stdout), it looks like reads marked as duplicates are missing the QS tag; instead, an empty field is present, which can cause downstream tools to crash when additional tags are appended (thus the empty field appears to be a "tag" with an improper format).

Command:

/snap/v2.0.0/snap-aligner paired /snap/index/ -pairedInterleavedFastq - -t 8 -mrl 30 -s 0 500 -fs -so -R "@RG\tID:id\tLB:lb\tPL:ILLUMINA\tPU:pu\tSM:sm" -o -sam NOINPUT.sam

Output:

cat -et NOINPUT.sam | egrep -v "^@" | head -11
SN0796:846:HK7WNBCX3:1:1201:20613:31700^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHIIIIHG@HHHHEC@HDHEGHHFCHCE=HHEGH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1207:5416:83002^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIGIIIIIIIIIIIIIIHIIIIIHIHHIIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1213:4603:47822^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHHG@G<1<<@HEFCEHCC<1CD1HGE=HD111H^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:2125:28883^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGCA^IHIIIIIIIIIHIIHIIIHIHHEH?GHHIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:5543:45261^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IHIHIIHHHIHIIIHIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2215:16958:24783^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IHIHD1FIIIHF1IHGG<<1CC<<<0C/<G?HEH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1110:10832:12681^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIIIIIIIIIHIIIIIIIIIHHHHIIIIIIHIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:14120:23567^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IGDHGIHHIHHG1HHGF@HHIIHHEHC<<IIIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:5463:88475^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IFIHGCFEEHCHHIIIHHIHHHEGC0C=FHHIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1201:13401:6342^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IFIIIHH@?HHIIHIHHHEGHGHGHIHCCHC@EH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:2203:7573:55404^I147^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^IQS:i:1240$
bolosky commented 2 years ago

I put a partial fix for this in the dev branch. It at least trims the trailing tab from the line so that it's syntactically correct. However, it still removes the QS tag.

Unfortunately, fixing it properly will be more work. The SAM code writes SAM lines into string form into a buffer, and then can't extend the string size when it comes to mark duplicates later. The code clips out the QS tag to make extra space for the (possibly) expanded flags field. We'll need to figure out how not to do this ugliness, or else make it more ugly but functional.

Check it out and see if it helps before I put it in master. And let's keep this issue open until it's not clipping out tags.

eboyden commented 2 years ago

This seems to have done the trick, thanks. The missing QS tags aren't great but not the end of the world either; the empty fields are gone and it no longer breaks downstream tools when they append additional tags, and that's what I really care about.

bolosky commented 2 years ago

The fix for not being malformed is in 2.0.1, but it will still strip the QS tags, so I'm leaving this open.