Differences with prodigal regarding SD and RBS score

FlorianTrigodet commented 3 months ago

Hello!

I was looking into replacing prodigal with prodigal-gv in my routine workflows (and maybe changing the default gene caller in the anvi'o platform), so I ran some tests to investigate the potential differences with prodigal.

I used a small metagenome available in this tutorial and extracted the gene calls that were not identical between the two programs. I used -p meta for both prodigal and prodigal-gv.

Around 5-6% of the total gene calls were not identical between prodigal and prodigal-gv, with a noticeable difference at the start position (or the stop position if the gene is on the reverse strand). I am focusing on results where the model and genetic code are the same for prodigal and prodigal-gv.

Here is a random example (program 'og' is original prodigal, 'gv' is prodigal-gv):

****	gene_callers_id	contig	start	stop	direction	gene_length	program	aa_sequence
33143	3406	Day17a_QCcontig1008	41696	42635	r	939	gv	MDETEEGINIDNTQHDLELDRDSSTQQPAQTHEDDGDLLGLDKPIKLKTRAKIAKVDNQRIFNHNGIPLLVKTHSKLLRTLKKNDKNFYSEPRSSISKSQKFEHEYENLSSVLQFYQLWCHGLFPKATFKDCIHLIRALGARSPQLRLYRRELIAAELHKLKVAKGIIADENQDAPSIPEEENTTDPSNEEWNSMHMSALVPGSSNKNGLFVDSNSNEDFETTNEVNAAASLADKDALSTDDKAEQTNAITSDTHNNDVDSDDPFSDDDDINIDAHTENLHPASGTQHQDRPKETTEENEDLELELMREYGA
3409	3409	Day17a_QCcontig1008	41696	42647	r	951	og	MSYVMDETEEGINIDNTQHDLELDRDSSTQQPAQTHEDDGDLLGLDKPIKLKTRAKIAKVDNQRIFNHNGIPLLVKTHSKLLRTLKKNDKNFYSEPRSSISKSQKFEHEYENLSSVLQFYQLWCHGLFPKATFKDCIHLIRALGARSPQLRLYRRELIAAELHKLKVAKGIIADENQDAPSIPEEENTTDPSNEEWNSMHMSALVPGSSNKNGLFVDSNSNEDFETTNEVNAAASLADKDALSTDDKAEQTNAITSDTHNNDVDSDDPFSDDDDINIDAHTENLHPASGTQHQDRPKETTEENEDLELELMREYGA
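
A minimal sketch of how such mismatches can be pulled out, assuming gene calls are stored with `start < stop` regardless of strand (as in the two rows above). The names `GeneCall`, `anchor` and `differing_starts` are hypothetical, and the second gene in each list is made up for illustration. Calls are paired on the coordinate that holds the stop codon, which both programs should agree on, and only pairs whose other end differs are reported.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the table columns above.
GeneCall = namedtuple("GeneCall", "contig start stop direction")

def anchor(call):
    """Key built from the stop-codon side of the gene call."""
    # With start < stop on both strands, the stop codon sits at 'stop'
    # for forward genes and at 'start' for reverse genes.
    fixed = call.stop if call.direction == "f" else call.start
    return (call.contig, call.direction, fixed)

def differing_starts(calls_a, calls_b):
    """Pairs of calls that share a stop codon but differ in their start."""
    by_anchor = {anchor(c): c for c in calls_b}
    pairs = []
    for a in calls_a:
        b = by_anchor.get(anchor(a))
        if b is not None and (a.start, a.stop) != (b.start, b.stop):
            pairs.append((a, b))
    return pairs

# First entry of each list is the example above; the second is invented.
og = [GeneCall("Day17a_QCcontig1008", 41696, 42647, "r"),
      GeneCall("Day17a_QCcontig1008", 100, 400, "f")]
gv = [GeneCall("Day17a_QCcontig1008", 41696, 42635, "r"),
      GeneCall("Day17a_QCcontig1008", 100, 400, "f")]

diffs = differing_starts(og, gv)
for a, b in diffs:
    length_gap = abs((a.stop - a.start) - (b.stop - b.start))
    print(a.contig, a.direction, length_gap)
```

For the example gene this reports a 12 bp gap, which matches the four extra residues (MSYV) at the N-terminus of the original prodigal call.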

And here is the detailed output of each program for this region:

Prodigal:
Beg	End	Std	Total	CodPot	StrtSc	Codon	RBSMot	Spacer	RBSScr	UpsScr	TypeScr	GCCont
41697	41798	-	-114.60	-77.88	-36.73	TTG	None	None	-12.29	0.38	-24.31	0.441
41697	41960	-	-33.55	-22.14	-11.40	GTG	AAA	4bp	-1.81	-4.14	-4.96	0.428
41697	42002	-	-9.77	-7.66	-2.11	GTG	AAAA	11bp	2.56	0.79	-4.96	0.412
41697	42047	-	-9.62	-7.52	-2.10	ATG	None	None	-4.87	-0.20	3.47	0.407
41697	42053	-	-8.65	-6.51	-2.14	ATG	None	None	-4.87	-0.23	3.47	0.409
41697	42164	-	24.11	29.72	-5.61	TTG	TAA	11bp	6.07	-2.06	-9.63	0.402
41697	42194	-	16.67	37.37	-20.69	TTG	None	None	-4.87	-6.20	-9.63	0.400
41697	42269	-	24.19	37.86	-13.67	TTG	None	None	-4.87	0.82	-9.63	0.403
41697	42284	-	17.64	35.24	-17.61	TTG	None	None	-4.87	-3.11	-9.63	0.403
41697	42299	-	17.24	35.84	-18.60	TTG	None	None	-4.87	-4.11	-9.63	0.400
41697	42311	-	20.70	36.51	-15.81	TTG	None	None	-4.87	-1.32	-9.63	0.400
41697	42407	-	35.38	47.06	-11.69	TTG	AAAA	11bp	2.56	-4.62	-9.63	0.388
41697	42425	-	38.50	48.08	-9.58	GTG	None	None	-4.87	0.25	-4.96	0.387
41697	42515	-	38.76	61.28	-22.51	TTG	None	None	-4.87	-8.02	-9.63	0.385
41697	42524	-	40.63	60.10	-19.47	TTG	None	None	-4.87	-4.98	-9.63	0.384
41697	42635	-	108.91	111.22	-2.30	ATG	None	None	-4.87	-0.90	3.47	0.387
41697	42638	-	105.28	112.22	-6.94	GTG	None	None	-4.87	2.89	-4.96	0.386
41697	42647	-	124.74	113.26	11.48	ATG	TAA	12bp	6.07	1.94	3.47	0.387

Prodigal-gv:
Beg	End	Std	Total	CodPot	StrtSc	Codon	RBSMot	Spacer	RBSScr	UpsScr	TypeScr	GCCont
41697	41798	-	-121.90	-84.99	-36.90	TTG	ATA	8bp	0.34	-0.13	-36.61	0.451
41697	41960	-	-42.15	-21.87	-20.28	GTG	None	None	-5.55	-2.88	-11.35	0.428
41697	42002	-	-30.13	-7.59	-22.53	GTG	None	None	-5.55	-5.13	-11.35	0.415
41697	42047	-	-7.58	-1.53	-6.06	ATG	None	None	-5.55	-3.80	3.80	0.410
41697	42053	-	-9.48	-1.90	-7.58	ATG	None	None	-5.55	-5.32	3.80	0.409
41697	42164	-	16.35	32.07	-15.72	TTG	ATA	9bp	0.85	-2.07	-14.50	0.404
41697	42194	-	9.59	39.18	-29.60	TTG	None	None	-5.55	-9.54	-14.50	0.400
41697	42269	-	21.34	44.67	-23.33	TTG	None	None	-5.55	-3.28	-14.50	0.401
41697	42284	-	22.69	42.59	-19.90	TTG	TAT	14bp	-2.45	-2.95	-14.50	0.405
41697	42299	-	16.44	40.66	-24.22	TTG	None	None	-5.55	-4.17	-14.50	0.401
41697	42311	-	21.88	41.63	-19.75	TTG	TAT	6bp	-2.01	-3.25	-14.50	0.400
41697	42407	-	27.47	51.32	-23.85	TTG	None	None	-5.55	-3.79	-14.50	0.390
41697	42425	-	43.68	52.25	-8.56	GTG	TATA	9bp	5.16	-2.37	-11.35	0.388
41697	42515	-	39.60	69.68	-30.08	TTG	None	None	-5.55	-10.03	-14.50	0.383
41697	42524	-	37.39	69.17	-31.78	TTG	None	None	-5.55	-11.73	-14.50	0.385
41697	42635	-	119.55	114.25	5.30	ATG	ATA	4bp	3.62	-2.12	3.80	0.387
41697	42638	-	101.29	113.96	-12.67	GTG	ATA	15bp	-1.22	-0.09	-11.35	0.387
41697	42647	-	120.24	115.24	5.00	ATG	ATA	6bp	0.85	0.36	3.80	0.386

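The selected hits (in bold in the original post) are the candidate ending at 42647 for prodigal and the candidate ending at 42635 for prodigal-gv. I can see that the two programs compute different scores, especially regarding the Shine-Dalgarno sequence and the ribosome binding site. But I am not sure why the gene call selected by prodigal-gv (total score 119.55) is not the one with the highest score (120.24, ending at 42647).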
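
A small sketch of that comparison, using the last three prodigal-gv rows from the table above. It assumes the first six columns are Beg, End, Std, Total, CodPot and StrtSc, in that order; `parse_rows` is a hypothetical helper, not part of either program. It also checks an observation that holds for the rows shown here: Total matches CodPot + StrtSc to within rounding.

```python
# Last three prodigal-gv candidates from the -s output above (tab-separated).
GV_ROWS = (
    "41697\t42635\t-\t119.55\t114.25\t5.30\tATG\tATA\t4bp\t3.62\t-2.12\t3.80\t0.387\n"
    "41697\t42638\t-\t101.29\t113.96\t-12.67\tGTG\tATA\t15bp\t-1.22\t-0.09\t-11.35\t0.387\n"
    "41697\t42647\t-\t120.24\t115.24\t5.00\tATG\tATA\t6bp\t0.85\t0.36\t3.80\t0.386\n"
)

def parse_rows(text):
    """Parse candidate rows, keeping only the columns used below."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        rows.append({
            "beg": int(fields[0]),
            "end": int(fields[1]),
            "strand": fields[2],
            "total": float(fields[3]),
            "codpot": float(fields[4]),
            "strtsc": float(fields[5]),
        })
    return rows

rows = parse_rows(GV_ROWS)

# Candidate with the highest total score versus the one prodigal-gv reported.
best = max(rows, key=lambda r: r["total"])
chosen = next(r for r in rows if r["end"] == 42635)
gap = round(best["total"] - chosen["total"], 2)
print(best["end"], best["total"], gap)

# In these rows, Total equals CodPot + StrtSc up to rounding.
consistent = all(abs(r["total"] - (r["codpot"] + r["strtsc"])) < 0.015 for r in rows)
print(consistent)
```

The top-scoring candidate here ends at 42647, while the reported gene ends at 42635, so the final call is evidently not a simple maximum over this column.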

Do you have more information about that change in scoring system between prodigal and gv? And why the shorter gene call would be the best in this case?

Thanks for your response!

apcamargo commented 3 months ago

Hi @FlorianTrigodet!

Differences between prodigal-gv and Prodigal are due to two main factors: (1) a couple of bugfixes from @althonos, some of which were not incorporated into vanilla Prodigal (https://github.com/apcamargo/prodigal-gv/commit/745d3e8e366da3339c8aa06e73f57116d8c8d617, https://github.com/apcamargo/prodigal-gv/commit/d71a02eda26b29eb79f3ca62979ece126375b7ef, https://github.com/apcamargo/prodigal-gv/commit/1f891d67f6d69360e0310ac5c3977ad8d63c1930, https://github.com/apcamargo/prodigal-gv/commit/ba4b7dbdde8bde2ca1df2f3e2e7c632336d23609); (2) additional gene models in the metagenome mode, some of which use translation table 15.

Because of (1), Prodigal and prodigal-gv can give you distinct gene calls even when they use the same gene model in the metagenome mode, but the differences should be very small. Can you check if Prodigal and prodigal-gv picked the same model? This is easy to get from the GFF output.
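
A rough sketch of that check, assuming the GFF output carries one `# Sequence Data:` and one `# Model Data:` comment line per contig, with `seqhdr="..."`, `model="..."` and `transl_table=` fields. The sample text and model strings below are made up for illustration, and `models_by_contig` is a hypothetical helper.

```python
import re

SEQHDR_RE = re.compile(r'seqhdr="([^"]*)"')
MODEL_RE = re.compile(r'model="([^"]*)"')
TABLE_RE = re.compile(r"transl_table=(\d+)")

def models_by_contig(gff_text):
    """Map contig name -> (model string, translation table) from GFF comments."""
    models = {}
    contig = None
    for line in gff_text.splitlines():
        if line.startswith("# Sequence Data:"):
            match = SEQHDR_RE.search(line)
            words = match.group(1).split() if match else []
            contig = words[0] if words else None
        elif line.startswith("# Model Data:") and contig is not None:
            model = MODEL_RE.search(line)
            table = TABLE_RE.search(line)
            models[contig] = (
                model.group(1) if model else None,
                int(table.group(1)) if table else None,
            )
    return models

# Illustrative header lines only; real files also contain the gene records.
OG_GFF = (
    '# Sequence Data: seqnum=1;seqlen=50000;seqhdr="contig_A"\n'
    '# Model Data: version=X;run_type=Metagenomic;model="model_1";gc_cont=40.0;transl_table=11;uses_sd=1\n'
    '# Sequence Data: seqnum=2;seqlen=30000;seqhdr="contig_B"\n'
    '# Model Data: version=X;run_type=Metagenomic;model="model_2";gc_cont=45.0;transl_table=11;uses_sd=1\n'
)
GV_GFF = (
    '# Sequence Data: seqnum=1;seqlen=50000;seqhdr="contig_A"\n'
    '# Model Data: version=X;run_type=Metagenomic;model="model_1";gc_cont=40.0;transl_table=11;uses_sd=1\n'
    '# Sequence Data: seqnum=2;seqlen=30000;seqhdr="contig_B"\n'
    '# Model Data: version=X;run_type=Metagenomic;model="model_3";gc_cont=45.0;transl_table=15;uses_sd=0\n'
)

og_models = models_by_contig(OG_GFF)
gv_models = models_by_contig(GV_GFF)

# Contigs where both programs picked the same model and genetic code.
same_model = sorted(c for c in og_models if og_models[c] == gv_models.get(c))
print(same_model)
```

Restricting the comparison to the contigs in `same_model` keeps differences caused by the additional gene models out of the picture.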

A more 1:1 comparison would be between pyrodigal and pyrodigal-gv, since pyrodigal incorporates all the fixes and the only difference between the two programs is that pyrodigal-gv includes the additional gene models. On top of that, pyrodigal/pyrodigal-gv are faster than Prodigal/prodigal-gv.

p.s.: is there a reason for the starting position being constant in your table?

FlorianTrigodet commented 3 months ago

Hi @apcamargo!

Thanks a lot for the detailed response, I really appreciate it! I only investigated contigs where prodigal and prodigal-gv picked the same model, and that's why I was concerned about the similar, yet slightly different, output.

I just read about all the issues and fixes in pyrodigal/pyrodigal-gv and it looks like the difference I was seeing is due to the SD or RBS detection/scoring issue in prodigal. I will continue with pyrodigal/pyrodigal-gv for now!

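As for the table with the constant start position: it comes from the -s output, which lists all possible genes. The gene is on the reverse strand, so the constant coordinate is the stop codon side, shared by every candidate start.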

Thanks!

apcamargo commented 3 months ago

Ohh, I don't think I've ever used -s. This is very useful!

Please let me know if you need anything else!
