PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
19 stars 21 forks source link

Is it necessary to sort the alignment bam file to run the ipdSummary correctly? #63

Closed uceleste closed 5 years ago

uceleste commented 5 years ago

Hi All,

I tried ipdSummary on two parallel pipelines using the same reads and reference. The only difference is that in one case the alignment file in bam format was sorted and in the other case no. I obtained two different output with some shared information.

What is the correct way to run the process? Is it always better to do the sorting of the bam alignment?

Here my command lines and a semplified example of both output:

Pipeline 1

$ blasr PacBio.bam Assembly.fasta --nproc 30 --bam --out Aligned.bam $ samtools sort Aligned.bam -o Aligned_sorted.bam --threads 30 $ pbindex Aligned_sorted.bam $ samtools faidx Assembly.fasta $ ipdSummary Aligned_sorted.bam --reference Assembly.fasta --identify m6A,m4C --gff basemods.gff

Pipeline 2

$ blasr PacBio.bam Assembly.fasta --nproc 30 --bam --out Aligned.bam $ pbindex Aligned.bam $ samtools faidx Assembly.fasta $ ipdSummary Aligned.bam --reference Assembly.fasta --identify m6A,m4C --gff basemods.gff

Pipeline1       Pipeline2      
Seq1 modified_base 2650 2650 Seq1 modified_base 2623 2623
Seq1 modified_base 4377 4377 Seq1 modified_base 2650 2650
Seq1 modified_base 4833 4833 Seq1 modified_base 2824 2824
Seq1 modified_base 6059 6059 Seq1 modified_base 2916 2916
Seq1 m6A 6674 6674 Seq1 modified_base 3665 3665
Seq1 modified_base 7280 7280 Seq1 modified_base 4377 4377
Seq1 modified_base 7461 7461 Seq1 modified_base 6441 6441
Seq1 modified_base 8005 8005 Seq1 m6A 6674 6674
Seq1 modified_base 8591 8591 Seq1 modified_base 7280 7280
Seq1 modified_base 8648 8648 Seq1 modified_base 7461 7461
Seq1 modified_base 8790 8790 Seq1 modified_base 8648 8648
Seq1 modified_base 9030 9030 Seq1 modified_base 9901 9901
Seq1 modified_base 9101 9101 Seq1 modified_base 10912 10912
Seq1 modified_base 9901 9901 Seq1 modified_base 10992 10992
Seq1 modified_base 10988 10988 Seq1 m6A 13677 13677
Seq1 m6A 13677 13677 Seq1 modified_base 15001 15001
Seq1 m6A 16922 16922 Seq1 modified_base 16213 16213
Seq1 m6A 20081 20081 Seq1 m6A 16922 16922
Seq1 modified_base 20938 20938 Seq1 modified_base 20081 20081
Seq1 modified_base 21987 21987 Seq1 modified_base 21987 21987

Thank you

rhallPB commented 5 years ago

Both the pipelines you suggest are not really supported, or tested, although I expect the pipeline with the sort is doing the correct thing. As a check I would use the supported workflow pbalign, followed by ipdSummary. pbalign sorts and indexes the ouput bam, so it should be equivalent to your pipeline 1, although it's possible that the default alignment parameters are also different. https://github.com/PacificBiosciences/pbalign

uceleste commented 5 years ago

Thank you for your kind explanation.