broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0
340 stars 60 forks source link

Discordant Pilon Results for Equivalent Files #136

Open cwarden45 opened 3 years ago

cwarden45 commented 3 years ago

Hi,

I am reviewing some results in preparation for submission of a paper. So, I am trying to make sure that I am using the exact right files and can replicate results (as much as possible).

However, I am encountering an odd issue with the Pilon results for 1 sample. The Pilon results match for the sample with no differences, and there is another sample where the original Pilon call for 1 variation decreases to 0 variants. However, there is 1 sample with a big difference, and that is why I wanted to double-check with this question.

Essentially, there was previously a short list of variants that I thought mostly matched the other results (in terms of the fraction of possibly variable regions). That earlier file has 14 rows in the pilon.changes file.

However, in the more recently created version, there are 75,820 rows in the pilon.changes file.

There are also a few strange formatting problems. For example, here are 3 that I notice in the more recent pilon.changes file.:

190m7:61 190m7_pilon:61 
 T
190m7:63 190m7_pilon:63 T C
190m7:121 190m7_pilon:121 
 G
190m7:122 190m7_pilon:122 G C
190m7:181 190m7_pilon:181 
 T
190m7:182 190m7_pilon:182 T A

In both cases, the alignment and Pilon reference are a little different because I align to a reference with 2 sequences and I provide 1 sequence to Pilon (since I want to filter reads aligned to E. coli, but I only want to polish the assembled sequence).

If I go back and check all relevant files, the only issue that I see is the exact name of the assembly sequence:

cwarden$ diff Pilon_OK/arrow_var.fasta Pilon_Problem/190m7.fasta
1c1
< >190M|arrow
---
> >190m7
cwarden$ diff Pilon_OK/190M-Canu-rearranged-Arrow2x_with_Ecoli.fa Pilon_Problem/190m7_with_Ecoli.fa
1c1
< >190M|arrow
---
> >190m7

Additionally, the first line of the Pilon .vcf is off for the more recent files:

190m7   1   .   
    .   0   LowCov  DP=0;TD=879;BQ=0;MQ=0;QD=0;BC=0,0,0,0;QP=0,0,0,0;PC=0;IC=0;DC=0;XC=4;AC=0;AF=0.00   GT  0/0
190m7   2   .   G   .   0   LowCov  DP=0;TD=896;BQ=0;MQ=0;QD=0;BC=0,0,0,0;QP=0,0,0,0;PC=0;IC=0;DC=0;XC=7;AC=0;AF=0.00   GT  0/0

The first nucleotide in the sequence is "G", which is listed in the .vcf as position 2 instead of position 1.

I also re-ran the BWA-MEM alignment and re-created all other files, and I am still getting a strange result in the newer file. I went back to the original location, and I can in fact reproduce the more normal results with the earlier set of files.

My guess is that it would probably be reasonable to use the earlier result. However, as I put together detailed methods, I want to make sure that get the exact same result with the exact files that I am using.

Do you have any suggestions about what may be causing this problem, and how to resolve that?

Please let me know if there are any additional files that I should provide.

Thank you very much.

Sincerely, Charles