broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

pilon gives different results--does it have any randomness? #27

Closed dgordon562 closed 7 years ago

dgordon562 commented 7 years ago

If I run pilon to correct a contig using an ENORMOUS bam file that has reads mapped to many other contigs, I get result A.

If I run pilon to correct the same contig using a tiny bam file (generated from the ENORMOUS one by using samtools view to extract all alignments relevant to the contig), I get result B.

I've tried this with many different contigs. The results are VERY similar (but not identical). For example:

contig length   discrepancies with   discrepancies between the
                unpolished contig    two pilon output fastas

54kb            65                   2
50kb            0                    0
1MB             551                  0
1MB             1154                 8
2.5MB           2717                 32

So the number of discrepancies between the 2 polished outputs is roughly 0 to 3% of the number of discrepancies with the unpolished sequence.
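For reference, the ratios can be recomputed directly from the table. This is only a minimal sketch: the row values are copied from the table above, and the helper name `discrepancy_ratio` is made up for illustration.

```python
# Rows copied from the table: (contig length, discrepancies vs. unpolished,
# discrepancies between the two pilon outputs).
rows = [
    ("54kb", 65, 2),
    ("50kb", 0, 0),
    ("1MB", 551, 0),
    ("1MB", 1154, 8),
    ("2.5MB", 2717, 32),
]


def discrepancy_ratio(vs_unpolished, between_runs):
    """Return between_runs as a percentage of vs_unpolished, or None if undefined."""
    if vs_unpolished == 0:
        return None  # nothing was changed by polishing, so the ratio is undefined
    return 100.0 * between_runs / vs_unpolished


for length, vs_unpolished, between_runs in rows:
    ratio = discrepancy_ratio(vs_unpolished, between_runs)
    label = "n/a" if ratio is None else f"{ratio:.1f}%"
    print(f"{length:>6}  {label}")
```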

So why is this? Is there some randomness in the running of pilon? Would the results be different if, for example, the bam file had the alignments in a different order?

Has anyone else seen this? Which of the results is more reliable?

w1bw commented 7 years ago

The reason for this is that when Pilon does local reassemblies to fill gaps and/or reassemble potential local misassembled regions, it brings in not only the reads which align to the suspicious region, but also their mates (regardless of if and where they align). There are a bunch of reasons that is important. So ideally, if you could subset your bam file by including reads which align to the specific scaffold/contig AND their mates, it should be reproducible. There shouldn't be anything random!
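One way to build such a subset is a two-pass filter on read names: first collect the names of all reads with an alignment on the contig, then keep every record carrying one of those names, wherever it maps. The sketch below is not part of Pilon. It assumes the alignments are available as SAM text lines (for example from `samtools view -h`), and the function name `subset_sam_with_mates` and the toy records are hypothetical.

```python
def subset_sam_with_mates(sam_lines, contig):
    """Keep header lines plus every record whose read name has at least one
    alignment on `contig`, so mates aligned elsewhere are kept too."""
    lines = list(sam_lines)

    # Pass 1: collect the names (QNAME, column 1) of reads aligned to the
    # contig (RNAME, column 3). Header lines start with "@" and are skipped.
    names = set()
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] == contig:
            names.add(fields[0])

    # Pass 2: keep all header lines and every record with a collected name.
    return [
        line for line in lines
        if line.startswith("@") or line.split("\t", 1)[0] in names
    ]


# Toy input: r1 maps entirely to contigA, r2 has one mate on contigB,
# r3 maps entirely to contigB.
example = [
    "@SQ\tSN:contigA\tLN:1000",
    "@SQ\tSN:contigB\tLN:1000",
    "r1\t99\tcontigA\t10\t60\t5M\t=\t200\t195\tACGTA\tIIIII",
    "r1\t147\tcontigA\t200\t60\t5M\t=\t10\t-195\tACGTA\tIIIII",
    "r2\t97\tcontigA\t50\t60\t5M\tcontigB\t300\t0\tACGTA\tIIIII",
    "r2\t145\tcontigB\t300\t60\t5M\tcontigA\t50\t0\tACGTA\tIIIII",
    "r3\t99\tcontigB\t400\t60\t5M\t=\t600\t205\tACGTA\tIIIII",
    "r3\t147\tcontigB\t600\t60\t5M\t=\t400\t-205\tACGTA\tIIIII",
]

kept = subset_sam_with_mates(example, "contigA")
```

Here the r2 mate on contigB is kept because its partner aligns to contigA, while both r3 records are dropped. A plain region extraction would have discarded that r2 mate. The filtered SAM would presumably still need to be converted back to a sorted, indexed bam before being passed to pilon.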