apcamargo / prodigal-gv

A fork of Prodigal meant to improve gene calling for giant viruses and viruses that use alternative genetic codes
GNU General Public License v3.0

Differences with prodigal regarding SD and RBS score #5

Closed FlorianTrigodet closed 3 months ago

FlorianTrigodet commented 3 months ago

Hello!

I was looking into replacing prodigal with prodigal-gv in my routine workflows (and maybe changing the default gene caller in the platform anvi'o), so I ran some tests to investigate the potential differences with prodigal.

I used a small metagenome available in this tutorial and extracted the gene calls that were not identical between the two programs. I used -p meta for both prodigal and prodigal-gv.

Around 5-6% of the total gene calls were not quite identical between prodigal and prodigal-gv, with a noticeable difference at the start position (or at the stop position if the gene is on the reverse strand). I am focusing on results where the model and genetic code are comparable between prodigal and prodigal-gv.
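A minimal sketch of this kind of comparison, assuming each gene call is a (contig, start, stop, direction) tuple with start < stop, as in the anvi'o-style example that follows. The function names and the "f" label for forward-strand genes are assumptions made for illustration; only "r" appears in this thread.

```python
def stop_key(call):
    """Identify a gene by its stop codon: the `stop` coordinate on the
    forward strand, the `start` coordinate on the reverse strand."""
    contig, start, stop, direction = call
    return (contig, direction, stop if direction == "f" else start)


def differing_calls(calls_a, calls_b):
    """Return (a, b) pairs that share a stop codon but are not identical."""
    by_stop = {stop_key(call): call for call in calls_b}
    pairs = []
    for call in calls_a:
        other = by_stop.get(stop_key(call))
        if other is not None and other != call:
            pairs.append((call, other))
    return pairs


# The two calls from the example in this post (coordinates copied as-is).
gv_calls = [("Day17a_QCcontig1008", 41696, 42635, "r")]
og_calls = [("Day17a_QCcontig1008", 41696, 42647, "r")]
print(differing_calls(gv_calls, og_calls))
```

Keying on the stop codon pairs up calls for the same open reading frame, so what remains is the difference in the chosen start.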

Here is a random example (program 'og' is original prodigal, 'gv' is prodigal-gv):

**** gene_callers_id contig start stop direction gene_length program aa_sequence
33143 3406 Day17a_QCcontig1008 41696 42635 r 939 gv MDETEEGINIDNTQHDLELDRDSSTQQPAQTHEDDGDLLGLDKPIKLKTRAKIAKVDNQRIFNHNGIPLLVKTHSKLLRTLKKNDKNFYSEPRSSISKSQKFEHEYENLSSVLQFYQLWCHGLFPKATFKDCIHLIRALGARSPQLRLYRRELIAAELHKLKVAKGIIADENQDAPSIPEEENTTDPSNEEWNSMHMSALVPGSSNKNGLFVDSNSNEDFETTNEVNAAASLADKDALSTDDKAEQTNAITSDTHNNDVDSDDPFSDDDDINIDAHTENLHPASGTQHQDRPKETTEENEDLELELMREYGA
3409 3409 Day17a_QCcontig1008 41696 42647 r 951 og MSYVMDETEEGINIDNTQHDLELDRDSSTQQPAQTHEDDGDLLGLDKPIKLKTRAKIAKVDNQRIFNHNGIPLLVKTHSKLLRTLKKNDKNFYSEPRSSISKSQKFEHEYENLSSVLQFYQLWCHGLFPKATFKDCIHLIRALGARSPQLRLYRRELIAAELHKLKVAKGIIADENQDAPSIPEEENTTDPSNEEWNSMHMSALVPGSSNKNGLFVDSNSNEDFETTNEVNAAASLADKDALSTDDKAEQTNAITSDTHNNDVDSDDPFSDDDDINIDAHTENLHPASGTQHQDRPKETTEENEDLELELMREYGA

And here is the detailed output of each program for this region:

Prodigal:
Beg End Std Total CodPot StrtSc Codon RBSMot Spacer RBSScr UpsScr TypeScr GCCont
41697 41798 - -114.60 -77.88 -36.73 TTG None None -12.29 0.38 -24.31 0.441
41697 41960 - -33.55 -22.14 -11.40 GTG AAA 4bp -1.81 -4.14 -4.96 0.428
41697 42002 - -9.77 -7.66 -2.11 GTG AAAA 11bp 2.56 0.79 -4.96 0.412
41697 42047 - -9.62 -7.52 -2.10 ATG None None -4.87 -0.20 3.47 0.407
41697 42053 - -8.65 -6.51 -2.14 ATG None None -4.87 -0.23 3.47 0.409
41697 42164 - 24.11 29.72 -5.61 TTG TAA 11bp 6.07 -2.06 -9.63 0.402
41697 42194 - 16.67 37.37 -20.69 TTG None None -4.87 -6.20 -9.63 0.400
41697 42269 - 24.19 37.86 -13.67 TTG None None -4.87 0.82 -9.63 0.403
41697 42284 - 17.64 35.24 -17.61 TTG None None -4.87 -3.11 -9.63 0.403
41697 42299 - 17.24 35.84 -18.60 TTG None None -4.87 -4.11 -9.63 0.400
41697 42311 - 20.70 36.51 -15.81 TTG None None -4.87 -1.32 -9.63 0.400
41697 42407 - 35.38 47.06 -11.69 TTG AAAA 11bp 2.56 -4.62 -9.63 0.388
41697 42425 - 38.50 48.08 -9.58 GTG None None -4.87 0.25 -4.96 0.387
41697 42515 - 38.76 61.28 -22.51 TTG None None -4.87 -8.02 -9.63 0.385
41697 42524 - 40.63 60.10 -19.47 TTG None None -4.87 -4.98 -9.63 0.384
41697 42635 - 108.91 111.22 -2.30 ATG None None -4.87 -0.90 3.47 0.387
41697 42638 - 105.28 112.22 -6.94 GTG None None -4.87 2.89 -4.96 0.386
41697 42647 - 124.74 113.26 11.48 ATG TAA 12bp 6.07 1.94 3.47 0.387
Prodigal-gv:
Beg End Std Total CodPot StrtSc Codon RBSMot Spacer RBSScr UpsScr TypeScr GCCont
41697 41798 - -121.90 -84.99 -36.90 TTG ATA 8bp 0.34 -0.13 -36.61 0.451
41697 41960 - -42.15 -21.87 -20.28 GTG None None -5.55 -2.88 -11.35 0.428
41697 42002 - -30.13 -7.59 -22.53 GTG None None -5.55 -5.13 -11.35 0.415
41697 42047 - -7.58 -1.53 -6.06 ATG None None -5.55 -3.80 3.80 0.410
41697 42053 - -9.48 -1.90 -7.58 ATG None None -5.55 -5.32 3.80 0.409
41697 42164 - 16.35 32.07 -15.72 TTG ATA 9bp 0.85 -2.07 -14.50 0.404
41697 42194 - 9.59 39.18 -29.60 TTG None None -5.55 -9.54 -14.50 0.400
41697 42269 - 21.34 44.67 -23.33 TTG None None -5.55 -3.28 -14.50 0.401
41697 42284 - 22.69 42.59 -19.90 TTG TAT 14bp -2.45 -2.95 -14.50 0.405
41697 42299 - 16.44 40.66 -24.22 TTG None None -5.55 -4.17 -14.50 0.401
41697 42311 - 21.88 41.63 -19.75 TTG TAT 6bp -2.01 -3.25 -14.50 0.400
41697 42407 - 27.47 51.32 -23.85 TTG None None -5.55 -3.79 -14.50 0.390
41697 42425 - 43.68 52.25 -8.56 GTG TATA 9bp 5.16 -2.37 -11.35 0.388
41697 42515 - 39.60 69.68 -30.08 TTG None None -5.55 -10.03 -14.50 0.383
41697 42524 - 37.39 69.17 -31.78 TTG None None -5.55 -11.73 -14.50 0.385
41697 42635 - 119.55 114.25 5.30 ATG ATA 4bp 3.62 -2.12 3.80 0.387
41697 42638 - 101.29 113.96 -12.67 GTG ATA 15bp -1.22 -0.09 -11.35 0.387
41697 42647 - 120.24 115.24 5.00 ATG ATA 6bp 0.85 0.36 3.80 0.386

The selected hits are the rows ending at 42647 (Prodigal) and 42635 (prodigal-gv). I can see that the two programs compute different scores, especially for the Shine-Dalgarno sequence and the ribosome binding site. But I am not sure why the selected gene call is not the one with the highest score.
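To make the mismatch concrete, here is a small sketch that parses a few rows of the prodigal-gv table above and picks the candidate with the highest Total score. The `Candidate` type and `parse_rows` helper are hypothetical names for this illustration. The sketch only ranks candidates by Total; it does not reproduce how either program makes its final selection.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    beg: int
    end: int
    total: float
    start_score: float
    codon: str


# Last three prodigal-gv rows, copied from the table above.
GV_ROWS = """\
41697 42635 - 119.55 114.25 5.30 ATG ATA 4bp 3.62 -2.12 3.80 0.387
41697 42638 - 101.29 113.96 -12.67 GTG ATA 15bp -1.22 -0.09 -11.35 0.387
41697 42647 - 120.24 115.24 5.00 ATG ATA 6bp 0.85 0.36 3.80 0.386
"""


def parse_rows(text):
    """Parse whitespace-separated rows with the 13 columns shown above."""
    candidates = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 13:
            continue  # skip headers or malformed lines
        candidates.append(
            Candidate(
                beg=int(fields[0]),
                end=int(fields[1]),
                total=float(fields[3]),
                start_score=float(fields[5]),
                codon=fields[6],
            )
        )
    return candidates


candidates = parse_rows(GV_ROWS)
best_by_total = max(candidates, key=lambda c: c.total)
print(best_by_total.end, best_by_total.total)  # 42647 120.24
```

The top-scoring row ends at 42647, while the prodigal-gv call in the gene table ends at 42635, which is the discrepancy being asked about.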

Do you have more information about the change in the scoring system between prodigal and prodigal-gv? And why would the shorter gene call be the best one in this case?

Thanks for your response!

apcamargo commented 3 months ago

Hi @FlorianTrigodet!

Differences between prodigal-gv and Prodigal are due to two main factors: (1) a couple of bugfixes from @althonos, some of which were not incorporated into vanilla Prodigal (https://github.com/apcamargo/prodigal-gv/commit/745d3e8e366da3339c8aa06e73f57116d8c8d617, https://github.com/apcamargo/prodigal-gv/commit/d71a02eda26b29eb79f3ca62979ece126375b7ef, https://github.com/apcamargo/prodigal-gv/commit/1f891d67f6d69360e0310ac5c3977ad8d63c1930, https://github.com/apcamargo/prodigal-gv/commit/ba4b7dbdde8bde2ca1df2f3e2e7c632336d23609); (2) additional gene models in the metagenome mode, some of which use translation table 15.

Because of (1), Prodigal and prodigal-gv can give you distinct gene calls even when they use the same gene model in the metagenome mode, but the differences should be very small. Can you check if Prodigal and prodigal-gv picked the same model? This is easy to get from the GFF output.
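A rough sketch of that check, assuming the GFF output carries `# Sequence Data:` and `# Model Data:` comment lines made of `key=value` pairs separated by semicolons, with `seqhdr`, `model` and `transl_table` among the keys. That layout and the helper names are assumptions here, so please verify them against your own files.

```python
import re

# key=value pairs; values are either quoted strings or run up to the next ';'
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|[^;]*)')


def parse_fields(comment_line):
    """Turn '# Something: a=1;b="x"' into {'a': '1', 'b': 'x'}."""
    _, _, payload = comment_line.partition(":")
    return {k: v.strip().strip('"') for k, v in KEY_VALUE.findall(payload)}


def models_by_sequence(gff_lines):
    """Map each sequence header to its (model, transl_table) pair."""
    models = {}
    current = None
    for line in gff_lines:
        if line.startswith("# Sequence Data:"):
            current = parse_fields(line).get("seqhdr")
        elif line.startswith("# Model Data:") and current is not None:
            fields = parse_fields(line)
            models[current] = (fields.get("model"), fields.get("transl_table"))
    return models


def model_mismatches(gff_a, gff_b):
    """Return sequences present in both outputs whose models differ."""
    a, b = models_by_sequence(gff_a), models_by_sequence(gff_b)
    return {seq: (a[seq], b[seq]) for seq in a.keys() & b.keys() if a[seq] != b[seq]}
```

Usage would be along the lines of `model_mismatches(open("og.gff"), open("gv.gff"))`, where both file names are placeholders. An empty result means both programs picked the same model for every shared sequence.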

A more 1:1 comparison would be to compare pyrodigal and pyrodigal-gv, since pyrodigal incorporates all the fixes and the only difference between the two tools is that pyrodigal-gv includes the additional gene models. On top of that, pyrodigal/pyrodigal-gv are faster than Prodigal/prodigal-gv.
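A sketch of what that comparison could look like in Python, assuming the `pyrodigal` and `pyrodigal_gv` packages are installed and that each predicted gene exposes `begin`, `end` and `strand` attributes. The `call_genes` and `summarize` helpers are hypothetical names for this illustration. The imports are kept inside `call_genes` so the comparison helper can be used on its own.

```python
def call_genes(sequence: bytes, viral: bool):
    """Run metagenomic gene calling and return (begin, end, strand) tuples."""
    if viral:
        import pyrodigal_gv

        finder = pyrodigal_gv.ViralGeneFinder(meta=True)
    else:
        import pyrodigal

        finder = pyrodigal.GeneFinder(meta=True)
    return [(gene.begin, gene.end, gene.strand) for gene in finder.find_genes(sequence)]


def summarize(calls_a, calls_b):
    """Count gene calls shared by both tools and those unique to each."""
    a, b = set(calls_a), set(calls_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }
```

For a contig loaded as bytes in a variable such as `contig_seq` (a placeholder), `summarize(call_genes(contig_seq, viral=False), call_genes(contig_seq, viral=True))` would give a quick count of how many calls differ.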

p.s.: is there a reason for the starting position being constant in your table?

FlorianTrigodet commented 3 months ago

Hi @apcamargo!

Thanks a lot for the detailed response, I really appreciate it! I only investigated contigs where prodigal and prodigal-gv picked the same model, and that's why I was concerned about the similar, yet slightly different output.

I just read about all the issues and fixes in pyrodigal/pyrodigal-gv and it looks like the difference I was seeing is due to the SD or RBS detection/scoring issue in prodigal. I will continue with pyrodigal/pyrodigal-gv for now!

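And as for the table with the constant start position: it comes from the output of -s, which lists all possible genes. These are reverse-strand candidates that share the same stop codon, so the Beg column stays the same.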

Thanks!

apcamargo commented 3 months ago

Ohh, I don't think I've ever used -s. This is very useful!

Please let me know if you need anything else!