COMBINE-lab / RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
GNU General Public License v3.0
89 stars 23 forks source link

Samtools fails parsing SAM due to CIGAR and query sequence are of different length #9

Closed vals closed 8 years ago

vals commented 8 years ago

Here is a segment of an output sam from RapMap:

$ sed -n '104528,104540'p rapmap.pseudo.sam
@SQ SN:ERCC-00160   LN:719
@SQ SN:ERCC-00162   LN:499
@SQ SN:ERCC-00163   LN:519
@SQ SN:ERCC-00164   LN:999
@SQ SN:ERCC-00165   LN:848
@SQ SN:ERCC-00168   LN:999
@SQ SN:ERCC-00170   LN:999
@SQ SN:ERCC-00171   LN:481
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 66  ENSMUST00000068798  1369    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000161918  1251    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000160640  1480    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.4:CELL_AGCCACT:UMI_GGGGT 66  ENSMUST00000170335  931 255 35M941S *   0   0   TTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ceeggfgehihiiihihgecccccccccccacccccccc_aWW[ac  NH:i:2
SRR1043412.4:CELL_AGCCACT:UMI_GGGGT 2370    ENSMUST00000185804  2093    255 46M *   0   0   TTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ceeggfgehihiiihihgecccccccccccacccccccc_aWW[ac  NH:i:2

If I try to use htslib (e.g. samtools) to parse this, I get errors due to 35M941S not corresponding to 46 nts in the read.

$ samtools view rapmap.pseudo.sam | head
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 104539
[main_samview] truncated file.
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 66  ENSMUST00000068798  1369    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000161918  1251    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000160640  1480    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3

If I change 35M941S to 35M11S the read parses fine. Is there a sensible reason for the S to be so long in the RapMap output?

rob-p commented 8 years ago

Nope --- no good reason. It's just that clipping support is a newer feature and it looks as if there was a bug. Thanks for the fix!

rob-p commented 8 years ago

It turns out there were a few other corner cases that ticked off samtools --- these (at least the ones I was able to find) have been fixed as of e1cd39b804801ba6a6992ea0167fde28ff8fcbf5, and so samtools should happily process these files now.