arpcard / rgi

Resistance Gene Identifier (RGI). Software to predict resistomes from protein or nucleotide data, including metagenomics data, based on homology and SNP models.

Use Pyrodigal instead of Prodigal for ORF prediction #200

Closed althonos closed 1 year ago

althonos commented 1 year ago

Hi @raphenya !

This PR proposes to replace Prodigal with Pyrodigal for the ORF prediction stage. Pyrodigal is a Python library binding to Prodigal with additional performance enhancements. I'm the author of Pyrodigal, so of course this is not a completely neutral list, but it has several advantages over Prodigal that I'll try to list:

Single-threaded speed

Pyrodigal comes with a SIMD pre-filter to skip score computation for invalid gene pairs. This typically saves around half of the runtime for processing a genome in single mode (and more than that in metagenomic mode) on platforms with supported CPU features (SSE or NEON). I did a small writeup about this in the paper.

I ran some benchmarks on a single closed genome (NC_004129) to compare the runtime (still using BLAST for the downstream analysis):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
|---|---|---|
| Default | 245s | 205s |
| Low quality | 340s | 272s |
Multi-threading

Pyrodigal supports re-entrant multithreading, so ORF prediction can run multi-threaded even in single mode. With Prodigal, the code currently only runs multi-threaded prediction in --low_quality mode. This improves the runtime even more on fragmented genomes (e.g. 548.SAMN21245456):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
|---|---|---|
| Default | 231s | 153s |
| Low quality | 241s | 165s |
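
The per-contig parallelism described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `find_genes_parallel` and `find_genes_with_pyrodigal` are made up, and the sketch uses metagenomic mode to stay short (single mode would need a training step on the genome before the worker threads start).

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")
G = TypeVar("G")


def find_genes_parallel(
    find_genes: Callable[[S], G], contigs: Iterable[S], threads: int = 4
) -> List[G]:
    """Apply `find_genes` to every contig with a pool of worker threads.

    Results come back in the same order as the input contigs.
    """
    with ThreadPool(threads) as pool:
        return pool.map(find_genes, contigs)


def find_genes_with_pyrodigal(contigs: Iterable[str], threads: int = 4):
    """Hypothetical wrapper; requires the third-party `pyrodigal` package."""
    import pyrodigal  # imported lazily so the helper above has no hard dependency

    # Assumption: the finder class is named OrfFinder in Pyrodigal v2
    # and GeneFinder in v3, so look up whichever one is available.
    finder_cls = getattr(pyrodigal, "GeneFinder", None) or getattr(pyrodigal, "OrfFinder")
    # Metagenomic mode relies on pre-trained profiles, so no training is needed.
    # Single mode would require calling finder.train(...) on the genome first.
    finder = finder_cls(meta=True)
    results = find_genes_parallel(finder.find_genes, contigs, threads)
    # Keep only plain coordinates so the result is easy to inspect or compare.
    return [[(g.begin, g.end, g.strand) for g in genes] for genes in results]
```

A thread pool is enough here because one finder object is shared by all worker threads and each contig is processed independently, which is why fragmented assemblies with many contigs benefit the most.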

Simpler installation

Unlike Prodigal, Pyrodigal can be installed with pip, so it's one less dependency to worry about for people who don't use conda. It's also available in Bioconda.

Same results

Despite the faster speed, Pyrodigal and Prodigal produce exactly[^1] the same output.

[^1]: Well, almost. During the refactor I found a bug in Prodigal that caused all genes on the reverse strand to be penalized. It was fixed here but Prodigal never got a new release, so unless you recompile the code yourself you're still getting a buggy version. Pyrodigal, by contrast, contains the fix. So the "recompiled/fixed" Prodigal and Pyrodigal predict exactly the same thing (this is tested for), but the buggy Prodigal and Pyrodigal may occasionally diverge.

raphenya commented 1 year ago

@althonos Thank you, Martin! This looks awesome. I will review the code, but I think the best way is to offer the ORF tools (i.e., Prodigal and Pyrodigal) as an option. That way it will be easy to compare them, and it also makes sense in light of the anticipated Prodigal 3 release.

althonos commented 1 year ago

Fine by me! I updated the code so the ORF finder is selected from the CLI, as is already done for the aligner tool.
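
A CLI switch of this kind could look roughly like the sketch below. The flag name `--orf_finder`, its choices, the default, and the two runner functions are all assumptions for illustration; they are not taken from the PR diff.

```python
import argparse
from typing import Callable, Dict, List


def run_prodigal(fasta_path: str) -> str:
    # Placeholder: a real implementation would shell out to the `prodigal` binary.
    return f"prodigal:{fasta_path}"


def run_pyrodigal(fasta_path: str) -> str:
    # Placeholder: a real implementation would call the Pyrodigal library in-process.
    return f"pyrodigal:{fasta_path}"


# Dispatch table mapping the CLI choice to the function that runs ORF prediction.
ORF_FINDERS: Dict[str, Callable[[str], str]] = {
    "PRODIGAL": run_prodigal,
    "PYRODIGAL": run_pyrodigal,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ORF finder selection sketch")
    parser.add_argument("-i", "--input_sequence", required=True, help="input FASTA file")
    parser.add_argument(
        "--orf_finder",  # hypothetical flag name
        type=str.upper,  # accept lower-case spellings as well
        choices=sorted(ORF_FINDERS),
        default="PRODIGAL",  # assumed default: keep the existing behaviour
        help="tool used for ORF prediction",
    )
    return parser


def main(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    return ORF_FINDERS[args.orf_finder](args.input_sequence)
```

Keeping the tools behind a dispatch table means a later addition, such as a future Prodigal release, only needs one new entry and no changes to the calling code.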

althonos commented 1 year ago

Please don't merge yet, I'm making some breaking API changes regarding output formatting in Pyrodigal, so I'll update the PR later to use Pyrodigal v2 after it's properly released.

althonos commented 1 year ago

Just updated to v2.0, which has been verified to produce exactly the same results as Prodigal.

nickp60 commented 1 year ago

Excited for this!

raphenya commented 1 year ago

@althonos Thank you, I will merge away!

althonos commented 1 year ago

Yay, thank you!