Open mkankain-hruh opened 2 years ago
I am getting a very similar error too, which I think may be related to an issue matching the quality string and sequence string for mate pairs:
INFO 2022-08-11 21:18:20,977 loading donor reads into dictionary...
INFO 2022-08-11 21:18:21,042 secondary reads count:0
INFO 2022-08-11 21:18:21,042 supplementary reads count:0
INFO 2022-08-11 21:18:21,042 loaded 28448 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)
Traceback (most recent call last):
File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 389, in <module>
run()
File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 386, in run
main(args)
File "/Users/jg7758/Dropbox/biofx_tools/bamsurgeon/bin/addindel.py", line 334, in main
rr.replace_reads(args.bamFileName, outbam_mutsfile, args.outBamFile, keepqual=True, seed=args.seed)
File "/Users/jg7758/OneDrive - Exact Sciences/biofx_tools/bamsurgeon/bin/bamsurgeon/replace_reads.py", line 165, in replace_reads
rdict[extqname].qual = read.qual
File "pysam/libcalignedsegment.pyx", line 2772, in pysam.libcalignedsegment.AlignedSegment.qual.__set__
File "pysam/libcalignedsegment.pyx", line 1514, in pysam.libcalignedsegment.AlignedSegment.query_qualities.__set__
ValueError: quality and sequence mismatch: 90 != 74
When I tried to look into this, I found it interesting that the lengths correspond to each read in a mate pair in a trans way. For example, in my original BAM, I have this read pair:
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083 163 chr10 89692878 60 90M = 89692894 90 CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG @DFEEHDDCEAAFEF?7>DEG+GFEBE9DECIGDAGB?EEGDCAF>@EFCE?DEFACFABC9F8DEDEC9)9FFFE)77G:FCD8EECEF ZA:i:0 MB:Z:TCTTGATCGTACAC ZB:i:11 MC:Z:74M ZC:f:0.108911 MD:Z:90 PG:Z:bwa NM:i:0 MQ:i:60OQ:Z:FFFFFFFF:FFFFFF:,:FFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFF:,:FFFF,::F:FFF:FFFFFFFF:F:,:FF: UQ:i:0 AS:i:90 CS:Z:GCCGTCGTTTTAT XS:i:75 QX:Z:FFFFFFFFFFFFFF-FFFFFFFFFFFFF RX:Z:TCTTGATCGTACAC-GCCGTCGTTTTAT
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083 83 chr10 89692894 60 74M = 89692878 -90 TGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG CDGFGHCDFB;GFBGDD:EDAHDEGEAEDDEGH:BCFFAEEBBHCE2DDD@H:G;EFGE:HH7CGHAHF8EHB9 MB:Z:TCTTGATCGTACAC MC:Z:90M11S MD:Z:74 PG:Z:bwa NM:i:0 MQ:i:60 OQ:Z:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFF:F:F:FFFF:FF,FFFFFF,FFF: UQ:i:0 AS:i:74 CS:Z:GCCGTCGTTTTAT XS:i:59 QX:Z:FFFFFFFFFFFFFF-FFFFFFFFFFFFF RX:Z:TCTTGATCGTACAC-GCCGTCGTTTTAT
The first is 90 bases long and the second is 74, and the sequence length and quality length are correct. In the temporary mutation BAM that is being compared to the original BAM, I can see that the sequence strings look correct (i.e. the 5bp deletion I requested is in the correct location), the CIGAR is updated correctly, and the string lengths look correct:
samtools view addindel.48284eaa-12e3-47fd-b813-277a15003e8d.muts.bam | grep 220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083 163 chr10 89692878 60 48M5D42M = 89692894 95 CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGGCCCTA @DFEEHDDCEAAFEF?7>DEG+GFEBE9DECIGDAGB?EEGDCAF>@EFCE?DEFACFABC9F8DEDEC9)9FFFE)77G:FCD8EECEF NM:i:5 MD:Z:48^CATAT42 MC:Z:32M5D42M MQ:i:60 AS:i:79 XS:i:64 XA:Z:chr9,-33676194,42M5D48M,8;BS:i:1
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083 83 chr10 89692894 60 32M5D42M = 89692878 -95 TGGAAAGGGACGAACTGGTGTAATGATATGTGTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGGCCCTA CDGFGHCDFB;GFBGDD:EDAHDEGEAEDDEGH:BCFFAEEBBHCE2DDD@H:G;EFGE:HH7CGHAHF8EHB9 NM:i:5 MD:Z:32^CATAT42 MC:Z:48M5D42M MQ:i:60AS:i:63 XS:i:48 BS:i:1
So, I think that part is correct.
I dug into replace_reads.py
, which is where the error is occurring I think, and it seems like it's choking on the very first read. With some debugging cod in there, I can see that the lengths within this record are correct:
~/Dropbox/biofx_tools/bamsurgeon/bin/bamsurgeon/replace_reads.py -b nosc.bam -r addindel.48284eaa-12e3-47fd-b813-277a15003e8d.muts.bam -o sim.bam --keepqual --progress
[E::idx_find_and_load] Could not retrieve index file for 'nosc.bam'
INFO 2022-08-12 08:57:39,936 loading donor reads into dictionary...
INFO 2022-08-12 08:57:40,017 secondary reads count:0
INFO 2022-08-12 08:57:40,017 supplementary reads count:0
INFO 2022-08-12 08:57:40,018 loaded 28448 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)
new Reads -> []
Curr Read:
220211_A00995_0203_BHW2YGDRXY-Crick:2:2101:4182:26083 163 #10 89692878 60 90M #10 89692894 90 CAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTACATCGGGGCAAATTTTTAAAGGCACAAGAGG array('B', [31, 35, 37, 36, 36, 39, 35, 35, 34, 36, 32, 32, 37, 36, 37, 30, 22, 29, 35, 36, 38, 10, 38, 37, 36, 33, 36, 24, 35, 36, 34, 40, 38, 35, 32, 38, 33, 30, 36, 36, 38, 35, 34, 32, 37, 29, 31, 36, 37, 34, 36, 30, 35, 36, 37, 32, 34, 37, 32, 33, 34, 24, 37, 23, 35, 36, 35, 36, 34, 24, 8, 24, 37, 37, 37, 36, 8, 22, 22, 38, 25, 37, 34, 35, 23, 36, 36, 34, 36, 37]) [('ZA', 0), ('MB', 'TCTTGATCGTACAC'), ('ZB', 11), ('MC', '74M'), ('ZC', 0.10891088843345642), ('MD', '90'), ('PG', 'bwa'), ('NM', 0), ('MQ', 60), ('OQ', 'FFFFFFFF:FFFFFF:,:FFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFF:,:FFFF,::F:FFF:FFFFFFFF:F:,:FF:'), ('UQ', 0), ('AS', 90), ('CS', 'GCCGTCGTTTTAT'), ('XS', 75), ('QX', 'FFFFFFFFFFFFFF-FFFFFFFFFFFFF'), ('RX', 'TCTTGATCGTACAC-GCCGTCGTTTTAT')]
Seq Len: 90
qual len: 90
So, I'm wondering if the issue comes from this record being compared to the mate pair and since the lengths of the two mates are different, this is causing the issue. Does that seem possible or am I on the wrong track? Is this something that you've seen before and have a suggestion on how to fix?
One more thing of note: it seems to be related to newer code in this repo. When I downgraded and pulled bamsurgeon==1.3 just now to test, it seems to work. So, is there a parity check that's either causing an issue here, or detecting an issue with my input BAM?
Hmm, will try to make a mixed-length .bam for testing and report back.
Thanks fot the details and debugging.
This may have been related to #212, if still trying to use bamsurgeon let me know how you go with the latest updates.
Hi, I'm have the same issue with both 1.4.1 and latest on the master branch. It does not happen with version 1.3, as mentioned above.
Bamsurgeon downloaded 20/05/2022. Using pysam 0.19.0 py39h5030a8b_0 from bioconda. Below last log lines before crash. Inputing variants to ultradeep BAM file with mean x500
INFO 2022-05-20 02:16:51,517 secondary reads count:0 INFO 2022-05-20 02:16:51,517 supplementary reads count:0 INFO 2022-05-20 02:16:51,517 loaded 12686 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)
Traceback (most recent call last): File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 479, in
run()
File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 476, in run
main(args)
File "/usr/local/source/bamsurgeon/bin/addsnv.py", line 422, in main
rr.replace_reads(args.bamFileName, outbam_mutsfile, args.outBamFile, keepqual=True, seed=args.seed)
File "/usr/local/source/bamsurgeon/bin/bamsurgeon/replace_reads.py", line 165, in replace_reads
rdict[extqname].qual = read.qual
File "pysam/libcalignedsegment.pyx", line 2772, in pysam.libcalignedsegment.AlignedSegment.qual.set
File "pysam/libcalignedsegment.pyx", line 1514, in pysam.libcalignedsegment.AlignedSegment.query_qualities.set
ValueError: quality and sequence mismatch: 98 != 97