Closed apcamargo closed 10 months ago
Hi Antonio, psyched to hear this!
First of all, I've been aware of prodigal-gv for some time, and I recently recommended it to a colleague working with viruses!
I've been thinking about how to allow custom metagenomic models to be passed to Pyrodigal, and actually not a lot would have to be changed for everything to work. The more complicated part would be how to store the models so that you (or other use cases of custom models) can load them efficiently. But otherwise it would be feasible to have an OrfFinder.find_genes call that takes an additional argument which would be a list of MetagenomicModel objects, or use the default ones if None is given.
For the second question, I'd have to think about how to integrate it efficiently; I think you could actually try to count the number of extracted nodes without actually scoring them for putative gene density. But indeed, this may be a bit more out of scope compared to the metagenomic stuff.
> I've been thinking about how to allow custom metagenomic models to be passed to Pyrodigal, and actually not a lot would have to be changed for everything to work. The more complicated part would be how to store the models so that you (or other use cases of custom models) can load them efficiently. But otherwise it would be feasible to have an OrfFinder.find_genes call that takes an additional argument which would be a list of MetagenomicModel objects, or use the default ones if None is given.
Great to hear! I like this interface idea. If None is used, would you also restrict the model search to a range of GC values, or would you allow users to do a full search?
> For the second question, I'd have to think about how to integrate it efficiently; I think you could actually try to count the number of extracted nodes without actually scoring them for putative gene density. But indeed, this may be a bit more out of scope compared to the metagenomic stuff.
This would make more sense than what people (me included) usually do. Although it is not within the scope right now, it would be interesting to have something like this in the future or for another project. That's a feature that is lacking in all gene callers.
It took almost a year, but I've started updating the interface to allow this. At the moment I can compile an external package that depends on pyrodigal but uses your prodigal-gv model, but I'm working on a way that doesn't need compiling (using training info in some other format) so that it's easier to distribute :)
That's great! Thanks @althonos
Not sure if I understand the interface, though. The gene models would be packaged in a separate package and then read by pyrodigal?
Yes, I'll make a repo and invite you to that 👍
Version 3.0.0 of Pyrodigal now supports using user-provided metagenomic models to run gene finding in meta mode. The giant-virus models are distributed in pyrodigal-gv.
Thank you! Really good idea to store the models in json files to avoid compilation.
I actually did something even more hacky to avoid storing them in JSON once installed :see_no_evil:
Thanks for adding these models!
Is there a way to use the models with pyrodigal in meta mode on the CLI?
@rohansachdeva: I have added a CLI for pyrodigal-gv in the latest version, v0.3.0. Use pyrodigal-gv instead of prodigal in the shell and you'll be all set :smiley:
Awesome - thank you!
Hey @althonos,
Thanks for your work on pyrodigal! It is an amazing tool!
I'm wondering if you would be interested in adding additional gene models to the metagenome mode. In prodigal-gv I added a couple of models trained on genomes of giant viruses (the main reason for that is that none of the pre-trained models in Prodigal detect the TATATA RBS motif that is super common in those viruses) and, more importantly, models for phages that use the translation table 15 (very prevalent in Crassvirales).
These models proved to be really useful in geNomad and they ended up improving the detection of giant viruses and phages with code 15 in IMG/VR. But I can see two main disadvantages of using them:
I'm planning to adopt pyrodigal for my next projects and I'd use these models a lot, but I can always change the code locally if you feel that adding additional models is not within the scope of this project. No worries :)
Somewhat related to this:
I'm doing a large-scale gene prediction for hundreds of thousands of genomes and some of them will have alternative genetic codes. I wrote a function that takes part of the genome (not the whole thing, to speed things up) and tests different translation tables (4, 11, and 15) to evaluate whether the genome potentially uses an alternative code (I just compare the gene density between the codes). Do you think a function like this could be useful in pyrodigal?
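To make the idea concrete, here is a minimal sketch of the density comparison, not the actual function I use. The names `coding_density` and `guess_translation_table`, the `predict` callback, and the `min_gain` threshold of 0.1 are all hypothetical and only there for illustration; `predict` stands in for whatever gene caller is used and is assumed to return 1-based inclusive `(begin, end)` coordinates.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

# A predicted gene as 1-based inclusive (begin, end) coordinates (assumed convention).
Gene = Tuple[int, int]


def coding_density(genes: Iterable[Gene], length: int) -> float:
    """Fraction of positions covered by at least one gene (overlaps counted once)."""
    if length <= 0:
        return 0.0
    covered = bytearray(length)
    for begin, end in genes:
        start = max(begin, 1) - 1   # clamp to the sequence, convert to 0-based
        stop = min(end, length)
        if stop > start:
            covered[start:stop] = b"\x01" * (stop - start)
    return sum(covered) / length


def guess_translation_table(
    sequence: str,
    predict: Callable[[str, int], Iterable[Gene]],
    tables: Sequence[int] = (4, 11, 15),
    default: int = 11,
    min_gain: float = 0.1,  # hypothetical threshold, not a validated value
) -> Tuple[int, Dict[int, float]]:
    """Pick the table with the highest coding density.

    An alternative table is only chosen if it beats the default table
    by at least `min_gain`; otherwise the default is kept.
    """
    if default not in tables:
        raise ValueError("default table must be one of the tested tables")
    densities = {
        table: coding_density(predict(sequence, table), len(sequence))
        for table in tables
    }
    best = max(densities, key=densities.get)
    if best != default and densities[best] - densities[default] < min_gain:
        best = default
    return best, densities
```

In practice `predict` would wrap a gene caller run with the given translation table on a slice of the genome, which is where the speed-up from not using the whole sequence comes from.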
There are multiple papers where people look for alt-coded bacteria/viruses by running Prodigal multiple times and comparing the gene density. Having this implemented in an elegant and efficient solution could be very useful. Databases (NCBI, IMG, etc.) are full of Crassvirales with truncated genes.
Just an idea! Please ignore all of that if you feel it would be out of the scope for this package.
Thanks again!