ValueError: quality and sequence mismatch: 98 != 97

mkankain-hruh commented 2 years ago

Bamsurgeon downloaded 20/05/2022. Using pysam 0.19.0 py39h5030a8b_0 from bioconda. Below last log lines before crash. Inputing variants to ultradeep BAM file with mean x500

INFO 2022-05-20 02:16:51,517 secondary reads count:0 INFO 2022-05-20 02:16:51,517 supplementary reads count:0 INFO 2022-05-20 02:16:51,517 loaded 12686 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)

Traceback (most recent call last): File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 479, in run() File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 476, in run main(args) File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 422, in main rr.replace_reads(args.bamFileName, outbam_mutsfile, args.outBamFile, keepqual=True, seed=args.seed) File "/usr/local/source/bamsurgeon/bin/bamsurgeon/replace_reads.py", line 165, in replace_reads rdict[extqname].qual = read.qual File "pysam/libcalignedsegment.pyx", line 2772, in pysam.libcalignedsegment.AlignedSegment.qual.set File "pysam/libcalignedsegment.pyx", line 1514, in pysam.libcalignedsegment.AlignedSegment.query_qualities.set ValueError: quality and sequence mismatch: 98 != 97

drmrgd commented 2 years ago

I am getting a very similar error too, which I think may be related to an issue matching the quality string and sequence string for mate pairs:

INFO 2022-08-11 21:18:20,977 loading donor reads into dictionary...

INFO 2022-08-11 21:18:21,042 secondary reads count:0
INFO 2022-08-11 21:18:21,042 supplementary reads count:0
INFO 2022-08-11 21:18:21,042 loaded 28448 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)

Traceback (most recent call last):
  File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 389, in <module>
    run()
  File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 386, in run
    main(args)
  File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 334, in main
    rr.replace_reads(args.bamFileName, outbam_mutsfile, args.outBamFile, keepqual=True, seed=args.seed)
  File "/Users/jg7758/OneDrive - Exact Sciences/biofx_tools/bamsurgeon/bin/bamsurgeon/replace_reads.py", line 165, in replace_reads
    rdict[extqname].qual = read.qual
  File "pysam/libcalignedsegment.pyx", line 2772, in pysam.libcalignedsegment.AlignedSegment.qual.__set__
  File "pysam/libcalignedsegment.pyx", line 1514, in pysam.libcalignedsegment.AlignedSegment.query_qualities.__set__
ValueError: quality and sequence mismatch: 90 != 74

When I tried to look into this, I found it interesting that the lengths correspond to each read in a mate pair in a trans way. For example, in my original BAM, I have this read pair:

220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083   163 chr10   89692878    60  90M =   89692894    90  CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG  @DFEEHDDCEAAFEF?7>DEG+GFEBE9DECIGDAGB?EEGDCAF>@EFCE?DEFACFABC9F8DEDEC9)9FFFE)77G:FCD8EECEF  ZA:i:0  MB:Z:TCTTGATCGTACAC ZB:i:11 MC:Z:74M    ZC:f:0.108911   MD:Z:90 PG:Z:bwa    NM:i:0  MQ:i:60OQ:Z:FFFFFFFF:FFFFFF:,:FFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFF:,:FFFF,::F:FFF:FFFFFFFF:F:,:FF:   UQ:i:0  AS:i:90 CS:Z:GCCGTCGTTTTAT  XS:i:75 QX:Z:FFFFFFFFFFFFFF-FFFFFFFFFFFFF   RX:Z:TCTTGATCGTACAC-GCCGTCGTTTTAT
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083   83  chr10   89692894    60  74M =   89692878    -90 TGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG  CDGFGHCDFB;GFBGDD:EDAHDEGEAEDDEGH:BCFFAEEBBHCE2DDD@H:G;EFGE:HH7CGHAHF8EHB9  MB:Z:TCTTGATCGTACAC MC:Z:90M11S MD:Z:74 PG:Z:bwa    NM:i:0  MQ:i:60 OQ:Z:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFF:F:F:FFFF:FF,FFFFFF,FFF: UQ:i:0  AS:i:74 CS:Z:GCCGTCGTTTTAT  XS:i:59 QX:Z:FFFFFFFFFFFFFF-FFFFFFFFFFFFF   RX:Z:TCTTGATCGTACAC-GCCGTCGTTTTAT

The first is 90 bases long and the second is 74, and the sequence length and quality length are correct. In the temporary mutation BAM that is being compared to the original BAM, I can see that the sequence strings look correct (i.e. the 5bp deletion I requested is in the correct location), the CIGAR is updated correctly, and the string lengths look correct:

samtools view addindel.48284eaa-12e3-47fd-b813-277a15003e8d.muts.bam | grep 220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083   163 chr10   89692878    60  48M5D42M    =   89692894    95  CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGGCCCTA  @DFEEHDDCEAAFEF?7>DEG+GFEBE9DECIGDAGB?EEGDCAF>@EFCE?DEFACFABC9F8DEDEC9)9FFFE)77G:FCD8EECEF  NM:i:5  MD:Z:48^CATAT42 MC:Z:32M5D42M   MQ:i:60 AS:i:79 XS:i:64 XA:Z:chr9,-33676194,42M5D48M,8;BS:i:1
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083   83  chr10   89692894    60  32M5D42M    =   89692878    -95 TGGAAAGGGACGAACTGGTGTAATGATATGTGTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGGCCCTA  CDGFGHCDFB;GFBGDD:EDAHDEGEAEDDEGH:BCFFAEEBBHCE2DDD@H:G;EFGE:HH7CGHAHF8EHB9  NM:i:5  MD:Z:32^CATAT42 MC:Z:48M5D42M   MQ:i:60AS:i:63  XS:i:48 BS:i:1

So, I think that part is correct.

I dug into replace_reads.py, which is where the error is occurring I think, and it seems like it's choking on the very first read. With some debugging cod in there, I can see that the lengths within this record are correct:

~/Dropbox/biofx_tools/bamsurgeon/bin/bamsurgeon/replace_reads.py -b nosc.bam -r addindel.48284eaa-12e3-47fd-b813-277a15003e8d.muts.bam -o sim.bam --keepqual --progress
[E::idx_find_and_load] Could not retrieve index file for 'nosc.bam'
INFO 2022-08-12 08:57:39,936 loading donor reads into dictionary...

INFO 2022-08-12 08:57:40,017 secondary reads count:0
INFO 2022-08-12 08:57:40,017 supplementary reads count:0
INFO 2022-08-12 08:57:40,018 loaded 28448 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)

new Reads -> []

Curr Read:
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083   163 #10 89692878    60  90M #10 89692894    90  CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG  array('B', [31, 35, 37, 36, 36, 39, 35, 35, 34, 36, 32, 32, 37, 36, 37, 30, 22, 29, 35, 36, 38, 10, 38, 37, 36, 33, 36, 24, 35, 36, 34, 40, 38, 35, 32, 38, 33, 30, 36, 36, 38, 35, 34, 32, 37, 29, 31, 36, 37, 34, 36, 30, 35, 36, 37, 32, 34, 37, 32, 33, 34, 24, 37, 23, 35, 36, 35, 36, 34, 24, 8, 24, 37, 37, 37, 36, 8, 22, 22, 38, 25, 37, 34, 35, 23, 36, 36, 34, 36, 37])  [('ZA', 0), ('MB', 'TCTTGATCGTACAC'), ('ZB', 11), ('MC', '74M'), ('ZC', 0.10891088843345642), ('MD', '90'), ('PG', 'bwa'), ('NM', 0), ('MQ', 60), ('OQ', 'FFFFFFFF:FFFFFF:,:FFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFF:,:FFFF,::F:FFF:FFFFFFFF:F:,:FF:'), ('UQ', 0), ('AS', 90), ('CS', 'GCCGTCGTTTTAT'), ('XS', 75), ('QX', 'FFFFFFFFFFFFFF-FFFFFFFFFFFFF'), ('RX', 'TCTTGATCGTACAC-GCCGTCGTTTTAT')]

Seq Len: 90
qual len: 90

So, I'm wondering if the issue comes from this record being compared to the mate pair and since the lengths of the two mates are different, this is causing the issue. Does that seem possible or am I on the wrong track? Is this something that you've seen before and have a suggestion on how to fix?

drmrgd commented 2 years ago

One more thing of note: it seems to be related to newer code in this repo. When I downgraded and pulled bamsurgeon==1.3 just now to test, it seems to work. So, is there a parity check that's either causing an issue here, or detecting an issue with my input BAM?

adamewing commented 2 years ago

Hmm, will try to make a mixed-length .bam for testing and report back.

Thanks fot the details and debugging.

adamewing commented 2 years ago

This may have been related to #212, if still trying to use bamsurgeon let me know how you go with the latest updates.

apraga commented 1 year ago

Hi, I'm have the same issue with both 1.4.1 and latest on the master branch. It does not happen with version 1.3, as mentioned above.

adamewing / bamsurgeon

ValueError: quality and sequence mismatch: 98 != 97 #204