Is it necessary to sort the alignment bam file to run the ipdSummary correctly?

uceleste commented 6 years ago

Hi All,

I tried ipdSummary on two parallel pipelines using the same reads and reference. The only difference is that in one case the alignment file in bam format was sorted and in the other case no. I obtained two different output with some shared information.

What is the correct way to run the process? Is it always better to do the sorting of the bam alignment?

Here my command lines and a semplified example of both output:

Pipeline 1

$ blasr PacBio.bam Assembly.fasta --nproc 30 --bam --out Aligned.bam $ samtools sort Aligned.bam -o Aligned_sorted.bam --threads 30 $ pbindex Aligned_sorted.bam $ samtools faidx Assembly.fasta $ ipdSummary Aligned_sorted.bam --reference Assembly.fasta --identify m6A,m4C --gff basemods.gff

Pipeline 2

$ blasr PacBio.bam Assembly.fasta --nproc 30 --bam --out Aligned.bam $ pbindex Aligned.bam $ samtools faidx Assembly.fasta $ ipdSummary Aligned.bam --reference Assembly.fasta --identify m6A,m4C --gff basemods.gff

Pipeline1				Pipeline2
Seq1	modified_base	2650	2650	Seq1	modified_base	2623	2623
Seq1	modified_base	4377	4377	Seq1	modified_base	2650	2650
Seq1	modified_base	4833	4833	Seq1	modified_base	2824	2824
Seq1	modified_base	6059	6059	Seq1	modified_base	2916	2916
Seq1	m6A	6674	6674	Seq1	modified_base	3665	3665
Seq1	modified_base	7280	7280	Seq1	modified_base	4377	4377
Seq1	modified_base	7461	7461	Seq1	modified_base	6441	6441
Seq1	modified_base	8005	8005	Seq1	m6A	6674	6674
Seq1	modified_base	8591	8591	Seq1	modified_base	7280	7280
Seq1	modified_base	8648	8648	Seq1	modified_base	7461	7461
Seq1	modified_base	8790	8790	Seq1	modified_base	8648	8648
Seq1	modified_base	9030	9030	Seq1	modified_base	9901	9901
Seq1	modified_base	9101	9101	Seq1	modified_base	10912	10912
Seq1	modified_base	9901	9901	Seq1	modified_base	10992	10992
Seq1	modified_base	10988	10988	Seq1	m6A	13677	13677
Seq1	m6A	13677	13677	Seq1	modified_base	15001	15001
Seq1	m6A	16922	16922	Seq1	modified_base	16213	16213
Seq1	m6A	20081	20081	Seq1	m6A	16922	16922
Seq1	modified_base	20938	20938	Seq1	modified_base	20081	20081
Seq1	modified_base	21987	21987	Seq1	modified_base	21987	21987

Thank you

rhallPB commented 6 years ago

Both the pipelines you suggest are not really supported, or tested, although I expect the pipeline with the sort is doing the correct thing. As a check I would use the supported workflow pbalign, followed by ipdSummary. pbalign sorts and indexes the ouput bam, so it should be equivalent to your pipeline 1, although it's possible that the default alignment parameters are also different. https://github.com/PacificBiosciences/pbalign

uceleste commented 6 years ago

Thank you for your kind explanation.

PacificBiosciences / kineticsTools

Is it necessary to sort the alignment bam file to run the ipdSummary correctly? #63