Kennedy-Lab-UW / Duplex-Sequencing

problem with muts_by_read_position.py #54

Closed ron-geller closed 5 years ago

ron-geller commented 5 years ago

Hi, I am getting an error with the script. I have checked the size of the reads and they are all the same length (I filter the length to 133).

Error:
  File "~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 253, in <module>
    main()
  File "~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 205, in main
    counter.reads[readNum-skips].addN()
IndexError: list index out of range
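For context on what this error means (this is a simplified sketch, not the actual script): a counter that allocates one bucket per read cycle based on the -l argument will raise exactly this IndexError if a pileup column indexes a cycle at or past the configured read length.

```python
# Hypothetical, simplified sketch of a per-cycle counter like the one
# in muts_by_read_position.py; names here are illustrative only.

class PositionCounter:
    def __init__(self, read_length):
        # One bucket per read cycle, sized from the -l argument.
        self.reads = [{"N": 0} for _ in range(read_length)]

    def add_n(self, position):
        # Raises IndexError if position >= read_length, i.e. when a
        # pileup column maps past the expected final cycle.
        self.reads[position]["N"] += 1

counter = PositionCounter(read_length=133)
counter.add_n(0)        # fine: first cycle
try:
    counter.add_n(133)  # one past the end of a 133bp read
except IndexError:
    print("list index out of range")
```

So a single read (or pileup artifact) reaching cycle 134 or beyond would be enough to trigger the crash even if almost all reads are 133bp.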

My commands are: samtools mpileup -B -A -d 500000 -f ~/refs/hifi.fa DCS-final.bam > DCS-final.bam.mpileup

python2.7 ~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py -i DCS-final.bam.mpileup -l 133 -o test -C 0.3

The protocol I use to generate the input BAM file is more or less what is in the Nat Protocols paper, with the only exceptions being:

  1. I align with bwa mem
  2. I filter the reads to keep only 133bp reads after running UnifiedConsensusMaker.py, using VariantBam. Without this filter I get ~200 reads of 30-70bp (mostly 49bp) compared to ~100,000 at 133bp, and I am not sure whether they inflate the mutation rate.
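The length filter in step 2 amounts to a simple per-read predicate. As a minimal sketch (not what VariantBam does internally; with pysam, the same predicate would be applied to each record's query_length):

```python
# Hypothetical sketch: keep only reads matching the expected
# consensus length (133bp in this pipeline).

def keep_read(read_length, expected=133):
    """Return True if a read has the expected length."""
    return read_length == expected

# Example read lengths like those described above.
lengths = [49, 133, 70, 133, 30, 133]
kept = [length for length in lengths if keep_read(length)]
print(kept)  # -> [133, 133, 133]
```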

I think this script is very useful to know if I am clipping enough bases on either end, so it would be really great to get it to work!

bkohrn commented 5 years ago

One of the issues with the muts_by_read_position.py program is that debugging it basically requires the pileup file that is causing problems. However, I will note that the pipeline as a whole has moved on since the Nature Protocols version: while that version will still work, we have added some steps to the overall pipeline and removed others.

We have actually discontinued use of the muts_by_read_position.py program in more recent versions of the pipeline, in favor of other, easier-to-use scripts. The method we use now is GATK's ErrorRatePerCycle. Using GATK 3.8.1, the command is (roughly) as follows:

java -jar GenomeAnalysisTK.jar \
-T ErrorRatePerCycle \
-R ~/refs/hifi.fa \
-dfrac 1 \
-I DCS-final.bam \
-o DCS-final.ErrorRatePerCycle.txt \
--allow_potentially_misencoded_quality_scores

followed by a program "plot_error_by_cycle.py" (attached; just remove the .txt extension: plot_error_by_cycle.py.txt), which is run as follows:

python plot_error_by_cycle.py \
DCS-final.ErrorRatePerCycle.txt \
DCS-final
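The attached script is not reproduced here, but post-processing of this kind boils down to parsing the per-cycle table that ErrorRatePerCycle emits. As a rough sketch (the column names "cycle" and "errorrate" are assumptions; check the header of your actual GATKReport output):

```python
# Hypothetical parser for a whitespace-delimited per-cycle error table.

def read_error_by_cycle(lines):
    """Map cycle number -> error rate from an iterable of table lines.

    Assumes a header row containing 'cycle' and 'errorrate' columns;
    lines before the header are skipped.
    """
    header = None
    rates = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        lowered = [f.lower() for f in fields]
        if header is None:
            if "cycle" in lowered:
                header = lowered
            continue
        row = dict(zip(header, fields))
        rates[int(row["cycle"])] = float(row["errorrate"])
    return rates

sample = """\
cycle errorrate
1 0.0012
2 0.0008
"""
rates = read_error_by_cycle(sample.splitlines())
print(rates)  # -> {1: 0.0012, 2: 0.0008}
```

From there, plotting error rate against cycle (e.g. with matplotlib) makes it easy to see whether the read ends need more clipping.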

I'd also suggest adding overlapping read clipping from FGBio (http://fulcrumgenomics.github.io/fgbio/); this avoids double-counting when your reads overlap. The command looks something like this:

java -jar fgbio.jar ClipBam \
-i realigned_reads.bam \
-o realigned_no_overlap_reads.bam \
-r ~/refs/hifi.fa \
--clip-overlapping-reads true

This step goes after the fixed clipping and local realignment steps, and is (basically) the last step before pileup creation.
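Conceptually, overlap clipping works on reference coordinates: when the two mates of a pair overlap on the reference, one mate is clipped back so each reference base is observed at most once per pair. A toy sketch of the idea (not FGBio's implementation; reads are modeled as half-open (start, end) intervals):

```python
# Conceptual sketch of overlapping-read clipping for a read pair.

def clip_overlap(r1, r2):
    """Each read is a (start, end) half-open reference interval.

    If the mates overlap, trim the start of r2 forward to the end of
    r1 so the overlapping bases are counted only once.
    """
    start1, end1 = r1
    start2, end2 = r2
    if end1 > start2:             # mates overlap on the reference
        start2 = min(end1, end2)  # push r2's start past the overlap
    return (start1, end1), (start2, end2)

# Overlapping pair: the shared bases 150-180 are removed from R2.
print(clip_overlap((100, 180), (150, 233)))
# -> ((100, 180), (180, 233))
```

Non-overlapping pairs pass through unchanged, which is why the step is safe to apply to the whole BAM.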

One final note; if you want to switch out the consensus-making steps for the UnifiedConsensusMaker (from the main directory of the github), you'll find it to be significantly faster than the Nature Protocols version of the consensus maker. It replaces all steps from tag_to_header.py through DuplexMaker.py, although you'll still need to align the SSCS if you want to process it.

Hope that helps you, Brendan

ron-geller commented 5 years ago

Thanks. The ErrorRatePerCycle tool is exactly what I was looking for. I will give the clip-overlapping-reads tool a go, although from the documentation it is a bit difficult for me to figure out exactly what it is doing. Thanks again for the help.