broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Problem polishing deletion in polynucleotide run with heterozygosity #81

Closed zephyris closed 5 years ago

zephyris commented 5 years ago

I'm having problems polishing a particular corner case: a short polynucleotide (homopolymer) run in which the two haplotypes differ, for example:

AGTTTTTCA   Draft genome
AGTTTTTTCA  True sequence, first haplotype
AGTTCTTTCA  True sequence, second haplotype

In this case it seems that Pilon identifies two separate variants, one per haplotype, because insertions within homopolymers are shifted to the left:

AG-TTTTTCA Draft genome
AGtTTTTTCA Insertion of a T, shifted to the leftmost (first) position

AGTT-TTTCA Draft genome
AGTTcTTTCA Insertion of a C at the third position
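To make the shifting concrete, here is a minimal Python sketch of generic left-alignment of an insertion. It is not Pilon's code; the function names `left_align_insertion` and `apply_insertion` are hypothetical, and positions are 0-based insertion points in the draft sequence.

```python
def left_align_insertion(ref: str, pos: int, ins: str) -> tuple[int, str]:
    """Shift an insertion (placed before ref[pos]) as far left as possible
    without changing the resulting sequence."""
    # While the base to the left matches the last inserted base, the insertion
    # can be rotated one step left and still yield the same sequence.
    while pos > 0 and ref[pos - 1] == ins[-1]:
        ins = ins[-1] + ins[:-1]
        pos -= 1
    return pos, ins


def apply_insertion(ref: str, pos: int, ins: str) -> str:
    """Insert `ins` before ref[pos]."""
    return ref[:pos] + ins + ref[pos:]


draft = "AGTTTTTCA"

# Extra T in the run: wherever it is first placed, it slides to the start of the run.
t_pos, t_ins = left_align_insertion(draft, 7, "T")

# Inserted C: its left neighbour is a T, so it cannot move.
c_pos, c_ins = left_align_insertion(draft, 4, "C")

print(t_pos, apply_insertion(draft, t_pos, t_ins))
print(c_pos, apply_insertion(draft, c_pos, c_ins))
```

Under this scheme the two haplotypes produce insertions at different draft positions (2 and 4), even though both describe the same missing base in the run.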

It seems that this ends up with no change to the draft genome, presumably because there are two differences relative to the draft genome at two different positions, each with only 50% support. This is particularly annoying as it leaves a persistent single-base deletion in the polished genome and breaks my ORFs!

The preferred behaviour would be to insert a base (either a C or a T), to give the correct number of bases, with evidence for a heterozygous C/T difference.
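As a rough illustration of the difference between the two behaviours, the Python sketch below tallies insertion evidence per site and then pooled across the run. Everything here is an assumption for illustration: the read counts are invented, the strict-majority threshold `MIN_FRACTION` is hypothetical, and Pilon's actual decision rule is not described in this thread.

```python
from collections import Counter

# Invented evidence: each read reports (insertion point in draft, inserted base)
# after left-shifting. Half the reads come from each haplotype.
calls = [(2, "T")] * 10 + [(4, "C")] * 10
depth = len(calls)
MIN_FRACTION = 0.5  # assumed threshold; support must exceed it


def per_site_calls(calls, depth, min_fraction):
    """Keep only insertions whose support at a single site exceeds the threshold."""
    counts = Counter(calls)
    return {site: n / depth for site, n in counts.items() if n / depth > min_fraction}


def pooled_insertion(calls, depth, run_start, run_end, min_fraction):
    """Pool all insertions whose insertion point lies in [run_start, run_end].

    Returns a Counter of inserted bases if the pooled support exceeds the
    threshold, otherwise None.
    """
    in_run = [base for pos, base in calls if run_start <= pos <= run_end]
    if len(in_run) / depth <= min_fraction:
        return None
    return Counter(in_run)


# Per-site view: each insertion has exactly 50% support, so neither is accepted.
print(per_site_calls(calls, depth, MIN_FRACTION))

# Pooled view: every read supports a one-base insertion somewhere in the T run
# (insertion points 2 through 7 in AGTTTTTCA), split evenly between T and C.
print(pooled_insertion(calls, depth, 2, 7, MIN_FRACTION))
```

In the pooled view the length change is fully supported, and the even T/C split is the evidence for a heterozygous site.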

Firstly, is this the expected behaviour and am I interpreting the behaviour correctly? Secondly, is there any way to avoid this through settings? If not, any chance of a change to handle this case more usefully?

w1bw commented 5 years ago

Thank you for writing. This is indeed a nasty corner case, and as you point out, it's caused by the shifting of homopolymer indels for the one haplotype but not the other. If I ever get time to improve handling of diploid indel calling, I will keep this case in mind, but I don't have any real timeline on when that might be, as my time for Pilon support is very limited. I'm sorry this has caused you problems!