Closed ron-geller closed 5 years ago
One of the issues with the muts_by_read_position.py program is that debugging it basically requires that I have the pileup file which is causing problems. However, I will note that the pipeline as a whole has moved on since the Nature Protocols version of the pipeline, and while that version will still work, we have added some steps to the overall pipeline, as well as removing some steps.
We have actually discontinued use of the muts_by_read_position.py program more recent versions of the pipeline, in favor of other, easier to use scripts. The method we use now is from GATK, and is called ErrorRatePerCycle. Using GATK 3.8.1, the command is (roughly) as follows:
java -jar GenomeAnalysisTK.jar \
-T ErrorRatePerCycle \
-R ~/refs/hifi.fa \
-dfrac 1 \
-I DCS-final.bam \
-o DCS-final.ErrorRatePerCycle.txt \
--allow_potentially_misencoded_quality_scores
followed by a program "plot_error_by_cycle.py" (attached; just remove the .txt extension: plot_error_by_cycle.py.txt), which is run as follows:
python plot_error_by_cycle.py \
DCS-final.ErrorRatePerCycle.txt \
DCS-final
I'd also suggest adding overlapping read clipping from FGBio (http://fulcrumgenomics.github.io/fgbio/); this avoids double-counting when your reads overlap. The command looks something like this:
java -jar fgbio.jar ClipBam \
-i realinged_reads.bam
-o realigned_no_overlap_reads.bam \
-r ~/refs/hifi.fa \
--clip-overlapping-reads true
This step goes after the fixed clipping and local realignment steps, and is (basically) the last step before pileup creation).
One final note; if you want to switch out the consensus-making steps for the UnifiedConsensusMaker (from the main directory of the github), you'll find it to be significantly faster than the Nature Protocols version of the consensus maker. It replaces all steps from tag_to_header.py through DuplexMaker.py, although you'll still need to align the SSCS if you want to process it.
Hope that helps you, Brendan
Thanks. The ErrorRatePerCycle tool is exactly what I was looking for. I will give the clip-overlapping-reads tool a go. However, from the documentation it is a bit difficult for me to figure out what it is doing exactly. Thanks again for the help.
Hi, I am getting an error with the script. I have checked the size of the reads and they are all the same length (I filter the length to 133).
Error: File "~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 253, in
main()
File ~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 205, in main
counter.reads[readNum-skips].addN()
IndexError: list index out of range
My commands are: samtools mpileup -B -A -d 500000 -f ~/refs/hifi.fa DCS-final.bam > DCS-final.bam.mpileup
python2.7 ~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py -i DCS-final.bam.mpileup -l 133 -o test -C 0.3
The protocol I use to get to generate the input bam file is more or less what is in the Nat Protocols paper, with the only exceptions being:
I think this script is very useful to know if I am clipping enough bases on either end, so it would be really great to get it to work!