broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Question re: uncorrected ONP errors using Pilon #39

Closed by cjfields 7 years ago

cjfields commented 7 years ago

We have an Oxford Nanopore assembly of a bacterial genome. Pilon corrected many of the indels present, but a small number remain. When we re-aligned the raw reads to the Pilon-corrected genome to analyze these, we noticed that some reads (at very low frequency) overlap into the indel region. We have tried a few things, including indel realignment prior to the Pilon run, with very little improvement. Is there any reason these might be problematic for Pilon? Is there a way to adjust the expected allele frequency to account for them?

image

We also see some instances where a SNP is present but not corrected:

image

These are at 40x coverage, with no alternate bases at the position. Interestingly, SNP corrections look fine elsewhere, so these 'odd ones out' are a little more puzzling.

w1bw commented 7 years ago

Looking at these examples, it's hard to see why Pilon wouldn't be making these corrections. Alignments often don't reflect indels near the ends of reads, because substituting bases is usually less expensive than opening a gap, and only a few substitutions are needed to fit the reference near the end of a read. That's why Pilon by default ignores the ends of reads (see the --flank option).
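To make the effect of ignoring read ends concrete, here is a minimal sketch of the idea, not Pilon's actual code. It assumes alignments are described only by half-open reference coordinates and skips clipping and indels inside the read. The function names are hypothetical, and the default of 10 bases is what I recall from the `--flank` help text, so check `--help` for your version.

```python
from typing import Optional, Tuple


def usable_interval(aln_start: int, aln_end: int, flank: int = 10) -> Optional[Tuple[int, int]]:
    """Return the half-open reference interval [start, end) left after
    ignoring `flank` bases at each end of an alignment, or None if the
    alignment is too short to leave anything."""
    start = aln_start + flank
    end = aln_end - flank
    return (start, end) if start < end else None


def read_supports_position(aln_start: int, aln_end: int, pos: int, flank: int = 10) -> bool:
    """True if reference position `pos` falls in the part of the read
    that is still considered once the flanks are ignored."""
    interval = usable_interval(aln_start, aln_end, flank)
    return interval is not None and interval[0] <= pos < interval[1]
```

Under this simplification, a read that only reaches an indel with its last few bases contributes no evidence there, which is one way a site can look covered in a browser yet have little usable support.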

Pilon doesn't do particularly well with raw PacBio reads because of the error model (prevalence of indel errors), though circular consensus or HGAP/Falcon-corrected reads are generally fine. I imagine the raw ONP data might have the same problem, but I have never tried it. If you are able to make your BAM and genome files available to me for download, I'm happy to try to diagnose what's going on.

cjfields commented 7 years ago

Hi @w1bw, I'll check in with the PI to see if they are willing to share the data; if so, I may contact you offline. The genome in question was assembled via Canu (including error correction) and polished using nanopolish. One thing I'm not sure of is how well reads aligned to these regions prior to correction.

We could see if multiple rounds of Pilon might address these (I recall another older correction tool, iCORN from the Sanger group, that used a similar approach with Illumina + 454 assemblies).
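The multiple-rounds idea amounts to a simple convergence loop: polish, count the changes made, and stop when a round changes nothing. A rough sketch follows. `run_round` is a hypothetical callback that would re-align the reads to the latest assembly, run Pilon, and return the number of changes reported (for example by counting lines in the file written with `--changes`); none of that is implemented here.

```python
from typing import Callable, List


def polish_until_stable(run_round: Callable[[int], int], max_rounds: int = 5) -> List[int]:
    """Call `run_round(round_number)` repeatedly and collect the number of
    changes each round reports. Stop early once a round makes no changes."""
    changes_per_round: List[int] = []
    for round_number in range(1, max_rounds + 1):
        n_changes = run_round(round_number)
        changes_per_round.append(n_changes)
        if n_changes == 0:
            break  # nothing left that this polisher is willing to change
    return changes_per_round


# Stand-in for real polishing rounds: pretend the change counts shrink.
fake_counts = iter([120, 7, 0, 0])
history = polish_until_stable(lambda _round: next(fake_counts))
print(history)  # [120, 7, 0]
```

The `max_rounds` cap is there because a polisher can oscillate between two alleles at an ambiguous site and never reach zero changes.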

cjfields commented 7 years ago

Hi @w1bw, apologies for not following up. The data in question come from a sample that has unusually variable sequence coverage, and it appears that the uncorrected regions correspond to low-coverage regions.
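For anyone hitting the same thing, here is a rough sketch of why low-coverage sites can be skipped. It is based on my reading of the `--mindepth` help text, not on Pilon's source. The assumed rule is that a value of 1 or more is an absolute depth, and a fraction below 1 is multiplied by the mean depth with a floor of 5 (default 0.1). The function names are hypothetical, and treating the whole input list as the "region" for the mean is an assumption.

```python
from typing import List, Sequence


def min_call_depth(mean_depth: float, mindepth: float = 0.1, floor: int = 5) -> float:
    """Depth below which no variant would be called, under the assumed rule:
    absolute if mindepth >= 1, otherwise a fraction of the mean depth,
    never lower than `floor`."""
    if mindepth >= 1:
        return mindepth
    return max(floor, mindepth * mean_depth)


def low_depth_positions(depths: Sequence[int], mindepth: float = 0.1) -> List[int]:
    """Return the indices whose depth falls below the calling threshold.
    `depths` holds per-base coverage, e.g. parsed from `samtools depth -a`."""
    mean_depth = sum(depths) / len(depths)
    threshold = min_call_depth(mean_depth, mindepth)
    return [i for i, depth in enumerate(depths) if depth < threshold]
```

With highly variable coverage, a site can sit well below a tenth of the mean even if it looks adequately covered in isolation, so listing such positions next to the uncorrected sites is a quick way to confirm the overlap.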