althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0

Inconsistent start score computed for some genes #19

Open althonos opened 1 year ago

althonos commented 1 year ago

While adding tests for the GFF output (in order to fix #18), I noticed that the start score of some genes deviates from the Prodigal reference results. This was not caught before because GFF is the only output format that contains these statistics. The change in start score marginally affects the main score and the confidence of each gene.

Genes scored with Prodigal:

NODE_23_length_79939_cov_26.984653  Prodigal_v2.6.3 CDS 1   177 8.4 -   0   ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.237;conf=90.13;score=9.62;cscore=10.74;sscore=-1.12;rscore=-5.22;uscore=-1.07;tscore=3.94;
NODE_23_length_79939_cov_26.984653  Prodigal_v2.6.3 CDS 168 386 25.1    -   0   ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.251;conf=99.77;score=26.33;cscore=27.03;sscore=-0.70;rscore=-6.04;uscore=0.68;tscore=3.41;
NODE_23_length_79939_cov_26.984653  Prodigal_v2.6.3 CDS 389 1483    186.7   -   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.254;conf=99.99;score=186.70;cscore=168.23;sscore=18.47;rscore=14.49;uscore=0.04;tscore=3.94;
NODE_23_length_79939_cov_26.984653  Prodigal_v2.6.3 CDS 1632    2981    218.9   -   0   ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=3-4bp;gc_cont=0.296;conf=99.99;score=218.26;cscore=200.52;sscore=17.74;rscore=14.49;uscore=-0.04;tscore=3.94;
NODE_23_length_79939_cov_26.984653  Prodigal_v2.6.3 CDS 3569    3925    25.5    +   0   ID=1_5;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.266;conf=99.72;score=25.49;cscore=21.09;sscore=4.41;rscore=1.46;uscore=-1.00;tscore=3.94;

Genes scored with Pyrodigal v0.6.4:

NODE_23_length_79939_cov_26.984653_1    pyrodigal_v0.6.4    CDS 1   177 8.4 -   0   ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.237;conf=90.13;score=9.62;cscore=10.74;sscore=-1.12;rscore=-5.22;uscore=-1.07;tscore=3.94;
NODE_23_length_79939_cov_26.984653_2    pyrodigal_v0.6.4    CDS 168 386 25.1    -   0   ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.251;conf=99.77;score=26.33;cscore=27.03;sscore=-0.70;rscore=-6.04;uscore=0.68;tscore=3.41;
NODE_23_length_79939_cov_26.984653_3    pyrodigal_v0.6.4    CDS 389 1483    186.7   -   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.254;conf=99.99;score=186.70;cscore=168.23;sscore=18.47;rscore=14.49;uscore=0.04;tscore=3.94;
NODE_23_length_79939_cov_26.984653_4    pyrodigal_v0.6.4    CDS 1632    2981    218.9   -   0   ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=3-4bp;gc_cont=0.296;conf=99.99;score=218.26;cscore=200.52;sscore=17.74;rscore=14.49;uscore=-0.04;tscore=3.94;
NODE_23_length_79939_cov_26.984653_5    pyrodigal_v0.6.4    CDS 3569    3925    25.5    +   0   ID=1_5;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.266;conf=99.72;score=25.49;cscore=21.09;sscore=4.41;rscore=1.46;uscore=-1.00;tscore=3.94;

Genes scored with Pyrodigal v1.1.2:

NODE_23_length_79939_cov_26.984653_1    pyrodigal_v1.1.2    CDS 1   177 8.4 -   0   ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.237;conf=87.32;score=8.39;cscore=10.74;sscore=-2.35;rscore=-5.22;uscore=-1.07;tscore=3.94
NODE_23_length_79939_cov_26.984653_2    pyrodigal_v1.1.2    CDS 168 386 25.1    -   0   ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.251;conf=99.69;score=25.07;cscore=27.03;sscore=-1.96;rscore=-6.04;uscore=0.68;tscore=3.41
NODE_23_length_79939_cov_26.984653_3    pyrodigal_v1.1.2    CDS 389 1483    186.7   -   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.254;conf=99.99;score=186.70;cscore=168.23;sscore=18.47;rscore=14.49;uscore=0.04;tscore=3.94
NODE_23_length_79939_cov_26.984653_4    pyrodigal_v1.1.2    CDS 1632    2981    218.9   -   0   ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=3-4bp;gc_cont=0.296;conf=99.99;score=218.91;cscore=200.52;sscore=18.39;rscore=14.49;uscore=-0.04;tscore=3.94
NODE_23_length_79939_cov_26.984653_5    pyrodigal_v1.1.2    CDS 3569    3925    25.5    +   0   ID=1_5;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.266;conf=99.72;score=25.49;cscore=21.09;sscore=4.41;rscore=1.46;uscore=-1.00;tscore=3.94

After bisecting, I found that the bug was introduced between v0.6.4 and v0.7.0.
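The per-gene comparison above can be automated with a small script. The following is a minimal sketch, not part of Pyrodigal: the helper names `parse_attributes` and `diff_attributes` are hypothetical, and it assumes the attribute column is the last whitespace-separated field of each GFF line. It is run here on the first gene as reported by Prodigal and by Pyrodigal v1.1.2.

```python
def parse_attributes(gff_line):
    """Return the key=value pairs of the last GFF column as a dict."""
    # Assumption: the attribute column is the last whitespace-separated field.
    column = gff_line.split()[-1]
    # Skip empty items so a trailing ';' does not break the parsing.
    pairs = (item.split("=", 1) for item in column.split(";") if item)
    return {key: value for key, value in pairs}


def diff_attributes(reference, candidate):
    """Map each differing attribute to its (reference, candidate) values."""
    return {
        key: (value, candidate.get(key))
        for key, value in reference.items()
        if candidate.get(key) != value
    }


# First gene, copied from the Prodigal and Pyrodigal v1.1.2 outputs above.
prodigal = parse_attributes(
    "NODE_23_length_79939_cov_26.984653 Prodigal_v2.6.3 CDS 1 177 8.4 - 0 "
    "ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;"
    "gc_cont=0.237;conf=90.13;score=9.62;cscore=10.74;sscore=-1.12;"
    "rscore=-5.22;uscore=-1.07;tscore=3.94;"
)
pyrodigal = parse_attributes(
    "NODE_23_length_79939_cov_26.984653_1 pyrodigal_v1.1.2 CDS 1 177 8.4 - 0 "
    "ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;"
    "gc_cont=0.237;conf=87.32;score=8.39;cscore=10.74;sscore=-2.35;"
    "rscore=-5.22;uscore=-1.07;tscore=3.94"
)

diff = diff_attributes(prodigal, pyrodigal)
for key, (expected, actual) in diff.items():
    print(f"{key}: expected {expected}, got {actual}")
```

For this gene, only `conf`, `score` and `sscore` are reported as different; the individual `rscore`, `uscore` and `tscore` components are identical in both outputs.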

althonos commented 1 year ago

It looks like the bug may be coming from a weird Prodigal behaviour, and only occurs in metagenomic mode.

In the original Prodigal code, the gene data string is created as soon as the best genes are identified, but the nodes may still be modified afterwards, so the gene data string can disagree with the actual attributes of the start node. This only occurs for genes that have been corrected with eliminate_bad_genes.
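The ordering problem described here can be illustrated with a toy model. This is not Prodigal's or Pyrodigal's actual code: the `Node` class, the `record_gene_data` function and the numeric values are all hypothetical, and only the order of operations matters.

```python
from dataclasses import dataclass


@dataclass
class Node:
    sscore: float  # start score stored on the node (hypothetical field)


def record_gene_data(node):
    # Format the score as text; the result no longer tracks the node.
    return f"sscore={node.sscore:.2f}"


node = Node(sscore=1.0)

# Step 1: the gene data string is built when the gene is selected.
gene_data = record_gene_data(node)

# Step 2: a later correction pass changes the node (hypothetical adjustment).
node.sscore -= 0.5

# The string written in step 1 is now stale compared to the node itself.
fresh = record_gene_data(node)
print(gene_data)  # sscore=1.00
print(fresh)      # sscore=0.50
```

An implementation that formats the attributes from the nodes at output time would print the second value, while one that reuses the early string would print the first, which is the kind of mismatch seen in the GFF outputs above.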