broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0
340 stars 60 forks source link

pilon fails to correct hetozygous indel errors #35

Closed dgordon562 closed 7 years ago

dgordon562 commented 7 years ago

''' AAGAA allele1 AAATA allele2 AAAA sequence NOT changed by pilon ''' Notice the G and T are in different positions in the two alleles.

Such an error in genes erroneously indicates the sequence has a gene-killing 1-base deletion. This happens perhaps a hundred times in the human genome in coding sequence--terrible. I've tried version 1.21 and get the same problem.

Any suggestions?

w1bw commented 7 years ago

Sorry for the delay in responding. I'm finally catching up on reported pilon issues. I think you're saying that the sample has different inserted bases in the two alleles with respect to the reference, and Pilon isn't calling anything. To get a sense of what's going on, could you please send relevant lines of such a region from the VCF output?

dgordon562 commented 7 years ago

HI, Bruce,

I have zillions of examples. Here is a random one:

gunzip -c ../filtered_freebayes1.vcf.gz | grep 18056357 000016F_1_26158149_quiver_pilon 18056357 . TTACAA TTACAAA,TTACAGA 47762.4 . AC=1,1;AF=0.5,0.5;AN=2;MQM=60,60;MQMR=60;NS=1;PQR=850;QA=36408,4911;QR=1181 GT:DP:GL 1/2:45:-133.775,-72.1093,-63.9815,-63.8398,0,-54.8089 {dgordon}e181:/net/eichler/vol25/projects/whole_genome_assembly/nobackups/chimp/Drafts/V3.0a-falcon/freebayes_polishing5/why_worse/igv

Take a look at this IGV screenshot of this particular error (attached): it shows you that the reference (result of pilon) is clearly wrong and that it should be replaced by either allele.

We are wrestling right now with various home-grown solutions but if you can fix it in pilon, we would be grateful.

Best wishes, David

On Wed, 15 Mar 2017, Bruce Walker wrote:

Sorry for the delay in responding. I'm finally catching up on reported pilon issues. I think you're saying that the sample has different inserted bases in the two alleles with respect to the reference, and Pilon isn't calling anything. To get a sense of what's going on, could you please send relevant lines of such a region from the VCF output?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.[AK_rixbh2jC3E-13L6E7dje--_ChYcerks5rl__fgaJpZM4L8m3c.gif]

dgordon562 commented 7 years ago

image

dgordon562 commented 7 years ago

Bruce: be aware that the 2 indels are often further away (in this case the insertions are just 1bp away). I've seen many such errors where the indels are 5 or 6 bp away.

dgordon562 commented 7 years ago

Here is the command we used: java -Xmx110g -jar ~chrismh/src/pilon-1.17.jar --genome " + str( inp ut[2] ) + " --changes --frags " + str( input[0] ) + " --diploid --fix all --output " + szPrefix + " --threads 16

Is there some other parameters that would sove the problems shown above?

dgordon562 commented 7 years ago

I should mention that I tried pilon without the diploid and I tried pilon version 1.21 but no joy.

w1bw commented 7 years ago

Thanks for all the info, David. However, this is not going to be easy to fix in the short term. Pilon does not know how to do read phasing yet. While it should successfully call the heterozygous insertions at the respective locations in the Pilon VCF output, it doesn't know that they are mutually exclusive, and it probably won't incorporate either into the actual FASTA changes. There is certainly more work like this which could be done to make Pilon better at diploid genome improvement, but my Pilon development time is extremely limited these days, so no timeline.

Thanks for calling it to my attention, though!