broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Handling of single "N" characters in reference genome #76

Open ewilbanks opened 6 years ago

ewilbanks commented 6 years ago

Hi folks,

Thanks for this great tool! I'm polishing a genome which contains a number of single N characters as ambiguous bases, and I'm confused about how Pilon handles these. For many of them there should be good read support to correct the N to an A, C, T, or G, but they aren't being touched by my current attempts. Ideas? Pilon is correcting other ambiguous bases (e.g. R, Y, K) to the correct base, but is ignoring Ns.

The command I'm running is:

    java -Xmx120g -jar ~/software/anaconda2/pkgs/pilon-1.22-1/share/pilon-1.22-1/pilon-1.22.jar \
        --genome ref.fasta \
        --frags aln.sorted.bam \
        --unpaired u.sorted.bam \
        --changes --vcf --tracks \
        --threads 16 \
        --fix bases,amb \
        --outdir pilon_02

w1bw commented 5 years ago

I'm finally catching up on long overdue Pilon support. You are correct that Pilon isn't attempting to correct single Ns in the input genome. The reason is related to the way Pilon does gap filling, which is based on local reassembly rather than pileup information. Right now, it treats all Ns as gaps, but it wouldn't be too hard to let it handle isolated Ns as regular sequence to be corrected with "--fix bases", for instance. I'll queue that up for a future release.
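In the meantime, it may help to see how many Ns in the reference are isolated single bases and how many sit in longer runs that look like real gaps. The Python sketch below is a standalone diagnostic, not part of Pilon and not its internal logic. The helper names `read_fasta` and `classify_n_runs` are made up for illustration, and the rule "a run of exactly one N is isolated, anything longer is a gap" is an assumption you may want to adjust.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify_n_runs(seq: str) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (isolated_n_positions, gap_runs) in 1-based coordinates.

    Assumption for illustration: a run of exactly one N is "isolated";
    any longer run is reported as an inclusive (start, end) gap.
    """
    isolated, gaps = [], []
    for match in re.finditer(r"[Nn]+", seq):
        start, end = match.start() + 1, match.end()
        if start == end:
            isolated.append(start)
        else:
            gaps.append((start, end))
    return isolated, gaps


# Small in-memory example; for a real file, pass open("ref.fasta") instead.
demo = [">contig1 demo", "ACGTNACG", "TNNNNACGRT"]
for name, seq in read_fasta(demo):
    isolated, gaps = classify_n_runs(seq)
    print(name, isolated, gaps)  # prints: contig1 [5] [(10, 13)]
```

The isolated positions it reports are the ones Pilon currently leaves alone, so they are the places to inspect by hand in the pileup. Other ambiguity codes such as R, Y, or K are deliberately not matched, since Pilon already corrects those with "--fix amb".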