althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0

Add additional gene models to the metagenome mode #24

Closed by apcamargo 10 months ago

apcamargo commented 1 year ago

Hey @althonos,

Thanks for your work on pyrodigal! It is an amazing tool!

I'm wondering if you would be interested in adding additional gene models to the metagenome mode. In prodigal-gv I added a couple of models trained on genomes of giant viruses (mainly because none of the pre-trained models in Prodigal detect the TATATA RBS motif, which is very common in those viruses) and, more importantly, models for phages that use translation table 15 (very prevalent in Crassvirales).

These models proved to be really useful in geNomad and they ended up improving the detection of giant viruses and phages with code 15 in IMG/VR. But I can see two main disadvantages of using them:

I'm planning to adopt pyrodigal for my next projects and I'd use these models a lot, but I can always change the code locally if you feel that adding additional models is not within the scope of this project. No worries :)

Somewhat related to this:

I'm doing a large-scale gene prediction for hundreds of thousands of genomes and some of them will have alternative genetic codes. I wrote a function that takes part of the genome (not the whole thing, to speed things up) and tests different translation tables (4, 11, and 15) to evaluate whether the genome potentially uses an alternative code (I just compare the gene density between the codes). Do you think a function like this could be useful in pyrodigal?
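A minimal sketch of the density comparison described above, assuming a hypothetical `predict(sequence, table)` callable that returns gene coordinates for one translation table. The function names, the `min_gain` threshold, and the use of table 11 as the default are illustrative assumptions, not the function mentioned in this comment and not part of pyrodigal.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (begin, end) gene coordinates


def coding_density(genes: Iterable[Interval], length: int) -> float:
    """Fraction of the sequence covered by at least one predicted gene."""
    if length <= 0:
        return 0.0
    covered = 0
    last_end = 0  # rightmost position already counted
    for begin, end in sorted(genes):
        begin = max(begin, last_end + 1)  # do not count overlaps twice
        if end >= begin:
            covered += end - begin + 1
            last_end = end
    return covered / length


def pick_translation_table(
    sequence: bytes,
    predict: Callable[[bytes, int], List[Interval]],  # hypothetical gene caller
    tables: Tuple[int, ...] = (11, 4, 15),            # first entry is the default
    min_gain: float = 0.05,                           # assumed threshold
) -> Tuple[int, Dict[int, float]]:
    """Return the chosen table and the coding density observed for each table.

    An alternative code is chosen only if it raises the coding density by at
    least `min_gain` over the default table; otherwise the default is kept.
    """
    densities = {
        table: coding_density(predict(sequence, table), len(sequence))
        for table in tables
    }
    default = tables[0]
    best = max(tables, key=lambda table: densities[table])
    if densities[best] - densities[default] < min_gain:
        best = default
    return best, densities
```

With Pyrodigal, `predict` could be a thin wrapper that trains on the subsequence with the requested `translation_table` and returns the begin and end of each predicted gene. That wiring is left out here so the sketch runs without dependencies.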

There are multiple papers where people look for alt-coded bacteria/viruses by running Prodigal multiple times and comparing the gene density. Having this implemented in an elegant and efficient solution could be very useful. Databases (NCBI, IMG, etc.) are full of Crassvirales with truncated genes.

Just an idea! Please ignore all of that if you feel it would be out of the scope for this package.

Thanks again!

althonos commented 1 year ago

Hi Antonio, psyched to hear this!

First of all I've been aware of prodigal-gv for some time, and I recently recommended it to a colleague working with virae!

I've been thinking about how to allow custom metagenomic models to be passed to Pyrodigal, and actually not a lot would have to be changed for everything to work. The more complicated part would be how to store the models so that you (or other users of custom models) can load them efficiently. But otherwise it would be feasible to have an OrfFinder.find_genes call that takes an additional argument, which would be a list of MetagenomicModel objects, or uses the default ones if None is given.

For the second question, I'd have to think about how to integrate it efficiently; I think you could actually try to count the number of extracted nodes without actually scoring them for putative gene density. But indeed, this may be a bit more out of scope compared to the metagenomic stuff.
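As a rough illustration of a scoring-free comparison, the sketch below measures how many bases fall in long stop-free stretches under each code's stop-codon set. This is a simplified proxy and not Prodigal's node extraction; the function names and the `min_codons` cutoff are assumptions made for the example.

```python
from typing import Dict, FrozenSet

# Stop codons per NCBI translation table: table 4 reads TGA as Trp,
# table 15 reads TAG as Gln, table 11 uses all three stops.
STOP_CODONS: Dict[int, FrozenSet[str]] = {
    11: frozenset({"TAA", "TAG", "TGA"}),
    4: frozenset({"TAA", "TAG"}),
    15: frozenset({"TAA", "TGA"}),
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (other symbols are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def long_orf_bases(seq: str, stops: FrozenSet[str], min_codons: int = 30) -> int:
    """Bases in stop-free runs of at least `min_codons` codons, over six frames."""
    total = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = 0  # codons seen since the last stop in this frame
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in stops:
                    if run >= min_codons:
                        total += 3 * run
                    run = 0
                else:
                    run += 1
            if run >= min_codons:  # run that reaches the end of the sequence
                total += 3 * run
    return total


def open_frame_profile(seq: str, min_codons: int = 30) -> Dict[int, int]:
    """Map each translation table to the bases found in long open frames."""
    seq = seq.upper()
    return {
        table: long_orf_bases(seq, stops, min_codons)
        for table, stops in STOP_CODONS.items()
    }
```

A genome that recodes TAG or TGA should show a clear jump for the matching table, since in-frame "stops" no longer break its genes. Low-complexity regions can inflate the counts for every table, so only the difference between tables is informative.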

apcamargo commented 1 year ago

> I've been thinking about how to allow custom metagenomic models to be passed to Pyrodigal, and actually not a lot would have to be changed for everything to work. The more complicated part would be how to store the models to allow you (or other usecases of custom models) to load efficiently. But otherwise it would be feasible to have an OrfFinder.find_genes call that takes an additional argument which would be a list of MetagenomicModel objects, or use the default ones if None given.

Great to hear! I like this interface idea. If None is used, would you also restrict the model search to a range of GC values, or would you allow users to do a full search?
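To make the two options concrete, here is a hedged sketch of the proposed interface: an optional list of models that falls back to defaults when `None` is given, plus an optional GC window that can be disabled for a full search. The `MetagenomicModel` dataclass, `DEFAULT_MODELS`, `candidate_models`, and the window size are hypothetical stand-ins, not Pyrodigal's actual classes or values.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MetagenomicModel:
    """Hypothetical stand-in for a pre-trained model (not Pyrodigal's class)."""
    name: str
    gc: float               # GC content of the training genome, in [0, 1]
    translation_table: int


# Hypothetical built-in models; the real defaults ship with Prodigal.
DEFAULT_MODELS = (
    MetagenomicModel("low_gc", 0.30, 11),
    MetagenomicModel("mid_gc", 0.50, 11),
    MetagenomicModel("high_gc", 0.70, 11),
)


def gc_content(sequence: str) -> float:
    """GC fraction over unambiguous bases; 0.0 if there are none."""
    seq = sequence.upper()
    known = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / known if known else 0.0


def candidate_models(
    sequence: str,
    models: Optional[Sequence[MetagenomicModel]] = None,
    gc_window: Optional[float] = 0.10,  # assumed width; None means full search
) -> List[MetagenomicModel]:
    """Select the models that would be tried on `sequence`."""
    pool = DEFAULT_MODELS if models is None else tuple(models)
    if gc_window is None:
        return list(pool)  # full search: try every model
    gc = gc_content(sequence)
    return [model for model in pool if abs(model.gc - gc) <= gc_window]
```

Passing `gc_window=None` corresponds to the full search asked about here, while a custom `models` list replaces the defaults entirely rather than extending them. Both behaviors are design choices of this sketch.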

> For the second question, I'd have to think about how to integrate it efficiently; I think you could actually try to count the number of extracted nodes without actually scoring them for putative gene density. But indeed, this may be a bit more out of scope compared to the metagenomic stuff.

This would make more sense than what people (me included) usually do. Although it is not within the scope right now, it would be interesting to have something like this in the future or for another project. That's a feature that is lacking in all gene callers.

althonos commented 10 months ago

It took almost a year, but I've started updating the interface to allow this. At the moment I can compile an external package that depends on pyrodigal but uses your prodigal-gv models. I'm also working on a way that doesn't need compiling (using training info in some other format) so that it's easier to distribute :)

apcamargo commented 10 months ago

That's great! Thanks @althonos

Not sure if I understand the interface, though. The gene models would be packaged in a separate package and then read by pyrodigal?

althonos commented 10 months ago

Yes, I'll make a repo and invite you to that 👍

althonos commented 10 months ago

Version 3.0.0 of Pyrodigal now supports using user-provided metagenomic models to run gene finding in meta mode. The giant-virus models are distributed in pyrodigal-gv.

apcamargo commented 10 months ago

Thank you! Really good idea to store the models in JSON files to avoid compilation.

althonos commented 10 months ago

I actually did something even more hacky to avoid storing them in JSON once installed :see_no_evil:

rohansachdeva commented 8 months ago

Thanks for adding these models!

Is there a way to use the models with pyrodigal in meta mode on the CLI?

althonos commented 7 months ago

@rohansachdeva: I have added a CLI to pyrodigal-gv in the latest version, v0.3.0. Use pyrodigal-gv instead of prodigal in the shell and you'll be all set :smiley:

rohansachdeva commented 7 months ago

Awesome - thank you!