arpcard / rgi

Resistance Gene Identifier (RGI). Software to predict resistomes from protein or nucleotide data, including metagenomics data, based on homology and SNP models.

Deprecate --split_prodigal_jobs #275

Open agmcarthur opened 3 months ago

agmcarthur commented 3 months ago

Analysis of simulated validation data by the CARD team (unpublished) revealed that Prodigal undercalls ORFs when the --split_prodigal_jobs option is used. This was particularly noticeable in an Acinetobacter baumannii genomic context. As this leads to false negatives, please deprecate support for --split_prodigal_jobs in RGI.

fmaguire commented 3 months ago

Looking at the implementation, this is likely because no single training file is generated for the whole genome and reused before the sub-job ORF calling runs. Each split therefore tunes the ORF-finding model on only the sequence subset it receives, hence the lower accuracy.

A similar issue can occur when running on a large set of related genomes. Low-quality genomes will have even worse ORF calling because the model trained on them will be poorer. Training on all genomes and then using that training file would maximise accuracy and consistency (or moving to a ggCaller approach!)
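
For illustration, here is a rough sketch of the "train once, reuse everywhere" idea described above. It is not RGI's actual implementation: the function names, the `whole_genome.trn` filename, and the assumption that the genome has already been split into per-chunk FASTA files are all hypothetical. It relies only on Prodigal's documented `-t` behaviour (write a training file if the path does not exist, otherwise read and use it).

```python
import subprocess
from pathlib import Path


def training_command(genome_fasta, training_file):
    """Build the Prodigal call that trains on the whole genome.

    With -t, Prodigal writes a training file when the path does not
    exist yet, so the caller must make sure no stale file is present.
    """
    return ["prodigal", "-i", str(genome_fasta), "-t", str(training_file), "-q"]


def split_commands(chunk_fastas, training_file, out_dir):
    """Build one Prodigal call per chunk, all reusing the same training file."""
    out_dir = Path(out_dir)
    commands = []
    for chunk in map(Path, chunk_fastas):
        commands.append([
            "prodigal", "-i", str(chunk),
            "-t", str(training_file),                  # shared whole-genome model
            "-a", str(out_dir / f"{chunk.stem}.faa"),  # protein translations
            "-o", str(out_dir / f"{chunk.stem}.gff"),  # gene coordinates
            "-f", "gff", "-q",
        ])
    return commands


def call_orfs(genome_fasta, chunk_fastas, out_dir):
    """Hypothetical driver: train once, then call ORFs on every chunk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    training_file = out_dir / "whole_genome.trn"
    # An existing file would be read instead of rewritten, so remove it first.
    training_file.unlink(missing_ok=True)
    subprocess.run(training_command(genome_fasta, training_file), check=True)
    for cmd in split_commands(chunk_fastas, training_file, out_dir):
        subprocess.run(cmd, check=True)
```

The per-chunk commands are independent once the training file exists, so they could still be dispatched in parallel. For the related-genomes case, the same pattern would apply with a pooled FASTA passed as `genome_fasta`, though whether pooling is appropriate for a given dataset is a separate question.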

github-actions[bot] commented 1 month ago

Issue is stale and will be closed in 7 days unless there is new activity

agmcarthur commented 3 weeks ago

Re-opening to assess whether we should handle training better or deprecate the option.