adamewing / bamsurgeon

tools for adding mutations to existing .bam files, used for testing mutation callers
MIT License
231 stars 86 forks source link

Unsuccesful replacement of all reads #174

Open LauraVP1994 opened 3 years ago

LauraVP1994 commented 3 years ago

Dear,

This is a very interesting tool and I'm busy exploring how it works. However, it seems to struggle with replacing in some reads. In the image, you can see the result in IGV. After using bamsurgeon, there still remain 155 reads a C instead of T. When looking in IGV it seems that the colored reads (green, red, blue) seem to be the reads that have not been replaced. I was thus wondering what has to be adapted so that I'm at 100% T at that position? Can something be adapted in the bamsurgeon command so that it will replace these reads or is there a way to filter these reads out?

I used this command: python addsnv.py -v {input.WT_mut} -f {input.SORT_BAM} -r {input.BWA_INDEX} -o {output.BAM_SURGEON} -p 10 --force -d 0 --ignorepileup --mindepth 1 --minmutreads 1 --maxdepth 1000000

with this in {input.WT_mut}: Reference 913 913 1 T

Thanks in advance!

IGV

The output was: INFO 2021-02-07 23:20:33,287 starting /data_rd/laura/COVID/SRA_UK/BAMsurgeon/bamsurgeon-master/bin/addsnv.py called with args: /data_rd/laura/COVID/SRA_UK/BAMsurgeon/bamsurgeon-master/bin/addsnv.py -v /data_rd/laura/COVID/SRA_UK/BAMsurgeon/WT_input/WT_full_snv_C913T.txt -f /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/ERR5059083_sort.bam -r /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta -o /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/ERR5059083_bamsurgeon_C913T.bam -p 10 --force -d 0 --ignorepileup --mindepth 1 --minmutreads 1 --maxdepth 1000000 INFO 2021-02-07 23:20:33,317 haplo_Reference_913_913 creating tmp bam: addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam INFO 2021-02-07 23:29:51,488 haplo_Reference_913_913 len(readlist): 11030 INFO 2021-02-07 23:29:51,497 haplo_Reference_913_913 selected VAF: 1.000000 INFO 2021-02-07 23:29:51,513 haplo_Reference_913_913 picked: 11030 INFO 2021-02-07 23:29:53,117 haplo_Reference_913_913 wrote: 11033, mutated: 11030 INFO 2021-02-07 23:29:53,123 haplo_Reference_913_913 mapping 1st end, cmd: bwa aln -q 5 -l 32 -k 3 -t 1 -o 1 -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.1.sai -b1 /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [bwa_aln] 17bp reads: max_diff = 2 [bwa_aln] 38bp reads: max_diff = 3 [bwa_aln] 64bp reads: max_diff = 4 [bwa_aln] 93bp reads: max_diff = 5 [bwa_aln] 124bp reads: max_diff = 6 [bwa_aln] 157bp reads: max_diff = 7 [bwa_aln] 190bp reads: max_diff = 8 [bwa_aln] 225bp reads: max_diff = 9 [bwa_read_seq] 0.0% bases are trimmed. [bwa_aln_core] calculate SA coordinate... 0.43 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 11033 sequences have been processed. [main] Version: 0.7.17-r1188 [main] CMD: bwa aln -q 5 -l 32 -k 3 -t 1 -o 1 -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.1.sai -b1 /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [main] Real time: 0.522 sec; CPU: 0.517 sec INFO 2021-02-07 23:29:53,664 haplo_Reference_913_913 mapping 2nd end, cmd: bwa aln -q 5 -l 32 -k 3 -t 1 -o 1 -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.2.sai -b2 /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [bwa_aln] 17bp reads: max_diff = 2 [bwa_aln] 38bp reads: max_diff = 3 [bwa_aln] 64bp reads: max_diff = 4 [bwa_aln] 93bp reads: max_diff = 5 [bwa_aln] 124bp reads: max_diff = 6 [bwa_aln] 157bp reads: max_diff = 7 [bwa_aln] 190bp reads: max_diff = 8 [bwa_aln] 225bp reads: max_diff = 9 [bwa_read_seq] 0.0% bases are trimmed. [bwa_aln_core] calculate SA coordinate... 0.62 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 11033 sequences have been processed. [main] Version: 0.7.17-r1188 [main] CMD: bwa aln -q 5 -l 32 -k 3 -t 1 -o 1 -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.2.sai -b2 /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [main] Real time: 0.706 sec; CPU: 0.708 sec INFO 2021-02-07 23:29:54,382 haplo_Reference_913_913 pairing ends, building .sam, cmd: bwa sampe -P -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.sam /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.1.sai addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.2.sai addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [bwa_read_seq] 0.0% bases are trimmed. [bwa_read_seq] 0.0% bases are trimmed. [bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] (25, 50, 75) percentile: (386, 386, 386) [infer_isize] fail to infer insert size: weird pairing [bwa_sai2sam_pe_core] time elapses: 0.05 sec [bwa_sai2sam_pe_core] changing coordinates of 0 alignments. [bwa_sai2sam_pe_core] align unmapped mate... [bwa_sai2sam_pe_core] time elapses: 0.00 sec [bwa_sai2sam_pe_core] refine gapped alignments... 0.04 sec [bwa_sai2sam_pe_core] print alignments... 0.08 sec [bwa_sai2sam_pe_core] 11033 sequences have been processed. [main] Version: 0.7.17-r1188 [main] CMD: bwa sampe -P -f addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.sam /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.1.sai addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.2.sai addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam [main] Real time: 0.392 sec; CPU: 0.342 sec INFO 2021-02-07 23:29:54,784 haplo_Reference_913_913 sam --> bam, cmd: samtools view -bt /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/reference/ERR5059083.fasta.fai -o addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.sam INFO 2021-02-07 23:29:55,072 haplo_Reference_913_913 sorting, cmd: samtools sort -T addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.sort.bam -o addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam.sort.bam addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam INFO 2021-02-07 23:29:55,351 haplo_Reference_913_913 indexing, cmd: samtools index addsnv.tmp/haplo_Reference_913_913.tmpbam.76438050-4027-41d3-8074-b125b1ba5b51.bam INFO 2021-02-07 23:29:55,734 haplo_Reference_913_913 avgincover: 7988.000000, avgoutcover: 8005.000000 INFO 2021-02-07 23:29:55,765 done making mutations, merging mutations into /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/ERR5059083_sort.bam --> /data_rd/laura/COVID/SRA_UK/BAMsurgeon/ERR5059083/ERR5059083_bamsurgeon_C913T.bam [E::idx_find_and_load] Could not retrieve index file for 'addsnv.1b62ba77-be2f-4eae-a41d-01fe0006bed3.muts.bam' INFO 2021-02-07 23:29:55,923 secondary reads count:0 INFO 2021-02-07 23:29:55,923 supplementary reads count:0 INFO 2021-02-07 23:30:39,291 vcf output written to ERR5059083_bamsurgeon_C913T.addsnv.WT_full_snv_C913T.vcf loading donor reads into dictionary... loaded 22066 reads, (0 excluded, 0 null or secondary or supplementary--> ignored) replaced 22066 reads (0 excluded ) kept 0 secondary reads. kept 0 supplementary reads. ignored 0 non-primary reads in target BAM.

adamewing commented 3 years ago

That is an interesting pileup pattern. What kind of sequencing data are you using? Looking at the output it may be related to this being extremely high coverage, I'll need to test whether there's a limitation coming from somewhere. It could also result from a mix of paired and single-end reads... if you're able to spot anything that the unmutated reads have in common vs the mutated ones that's a good clue. The other thing I would mention is to make sure you're using as close an alignment method as possible to the original .bam file, the default is still "bwa aln" which is super outdated. Most likely you want --aligner mem.

LauraVP1994 commented 3 years ago

This is from a Novaseq instrument with amplicon targeting sequencing. The data comes from SRA (https://www.ncbi.nlm.nih.gov/sra/?term=err5059083). I don't know if the coverage is the problem, because almost all reads were replaced. We want to also test our method with different percentages eventually, including 1% of the mutation, which makes 155 unchanged reads from 11 000 a bit of a problem.

What did become apparent in IGV is that these unchanged reads are the colored reads, which are reads that were mapped differently (first forward read, then reverse; forward read and reverse read don't overlap).

Before doing bamsurgeon these steps were done:

trimmomatic.sh PE -baseout {params.basename_output} -threads {threads} {input.FQ_fwd} {input.FQ_rev} ILLUMINACLIP:$TRIMMOMATIC_ADAPTER_DIR/NexteraPE-PE.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:{params.min_len}

bwa index reference.fa

bwa mem -t=8 {input.BWA_INDEX} {input.FQ_1P} {input.FQ_2P} > {output.SAM}

samtools view -F 2048 -F 0x08 -S {input.SAM} -bo {output.BAM} (#this was done to try to filter out supplementary reads because before bamsurgeon mentioned "ignored 5267 non-primary reads in target bam")

samtools sort -o {output.SORT_BAM} --output-fmt bam --threads 8 {input.BAM}

samtools index {input.SORT_BAM}

LauraVP1994 commented 3 years ago

In addition, I tried now with --aligner mem but unfortunately this did not change anything.

adamewing commented 3 years ago

I suspect it has to do with these reads being skipped or replaced improperly:

What did become apparent in IGV is that these unchanged reads are the colored reads, which are reads that were mapped differently (first forward read, then reverse; forward read and reverse read don't overlap).

But I need to know more about them, is it possible to pull out just the offending reads into a small .bam and send it to me? Alternatively if the full .bam isn't large you could just send that to save time (I know it's available on SRA, but I'd prefer the version coming out of your pipeline to have a better shot at reproducing this).

LauraVP1994 commented 3 years ago

I made a temporary link (https://we.tl/t-YMQVHvIQdv) the sorted bam file (one week).

Thanks for the help!

LauraVP1994 commented 3 years ago

Dear,

Did you already have time to check this? I ran it also with addindel and the same thing happens. Or will the only way be to remove these reads... ?

Kind regards Laura

adamewing commented 3 years ago

If you have a way to filter out these reads e.g. via samtools flags that would likely be the most expedient way of dealing with it. I've downloaded your .bam file but haven't been able to really dig into what's going on yet.