alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Corrupt SAM/BAM file due to NULL injected into MD tag with two-pass #2226

Open btownshend opened 1 month ago

btownshend commented 1 month ago

I encountered a situation where the output from STAR was not readable by samtools, causing 'samtools view' to give an exception due to "corrupt" aux data: [E::sam_format1_append] Corrupted aux data for read LH00192:82:22GMCNLT3:1:1128:12373:5198_TCGATGTCGGCTCT It took a while to track down and to get it to occur in SAM output to be able to see why, but what appears to be happening is that there are some null's embedded in the MD tag causing premature ending of the tag data. This results in parsers assuming that the remaining part of the MD tag is another tag, which turns out to have an invalid type code. Looking at the SAM line that gives the problem doesn't show it immediately:

2421241:LH00192:82:22GMCNLT3:1:1128:12373:5198_TCGATGTCGGCTCT 83 NT_187594.1 191674 255 104S14M19S = 191674 -14 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCTCAAAAAAAAAAACGACACCCCAGATGTCTGAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCACCGACAACAACCTTTCTGCTTCACAATGATAGGA IIIIII9III9999I99I9999999999999-------9I9--9-I------9-9-I-I9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:1 HI:i:1 AS:i:21 nM:i:0 NM:i:3 MD:Z:11000

But looking at a hex dump of that line shows that it ends with: ... 0 \t N M : i : 3 \t M D : Z : 1 1 \0 0 \0 0 \0 0 \n

This only occurs in two-pass mode and I think it has something to do with the alignment occurring at the very end of the chr NT_187594.1 -- it seems to be reading past the end of that template and grabbing some nulls to use in the MD tag.

Note that this is pretty hard to reproduce -- I only saw it on one file that had ~10M reads before hitting this -- shortening the input removed the error.