arpcard / rgi

Resistance Gene Identifier (RGI). Software to predict resistomes from protein or nucleotide data, including metagenomics data, based on homology and SNP models.

[Feature Request] support for input with annotated sequence #287

Closed Hocnonsense closed 6 days ago

Hocnonsense commented 6 days ago

Thanks for this useful tool. I'm going to use it to annotate my sequences.

However, I'm puzzled by the parameter [-t {contig,protein}]. It seems that in contig mode the input is treated as nucleotide sequence and is first annotated to CDS using Prodigal, while in protein mode the input should be amino acid sequences. If I use protein mode, I can only provide proteins and cannot apply "mutation screening depending upon the detection model type".

I'm interested in the mutation results, but I would also like to use my existing gene prediction results. Is that possible?

Thanks for your help!

agmcarthur commented 6 days ago

Hello @Hocnonsense,

You are correct, -t contig is for genomes or assembly contigs (nucleotides). ORFs are predicted using Prodigal and then the encoded proteins are compared against CARD's reference sequences using BLASTP (including secondary screening for AMR-conferring amino acid substitutions). rRNA genes are also examined for key SNPs.

For the -t protein option, you provide a FASTA file of protein sequences (i.e. your own proteome predictions) and these are compared against CARD's reference sequences using BLASTP as above, including screening for key substitutions.

Documentation for RGI main can be found here as it has the details: https://github.com/arpcard/rgi/blob/master/docs/rgi_main.rst
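
To make the two input types concrete, here is a minimal sketch that only builds the corresponding `rgi main` command lines and does not run them. The file names and the `rgi_main_command` helper are hypothetical; the flags (`-i`, `-o`, `-t`, `-n`) are as I understand them from the RGI main documentation linked above, so please confirm them with `rgi main -h` for your installed version.

```python
import shlex


def rgi_main_command(input_fasta, output_prefix, input_type, threads=1):
    """Build (but do not run) an `rgi main` argument list.

    Hypothetical helper for illustration; flag names should be checked
    against `rgi main -h` for the installed RGI version.
    """
    if input_type not in ("contig", "protein"):
        raise ValueError("input_type must be 'contig' or 'protein'")
    return [
        "rgi", "main",
        "-i", input_fasta,     # input FASTA (nucleotide or protein)
        "-o", output_prefix,   # output file prefix
        "-t", input_type,      # contig = nucleotide, protein = amino acid
        "-n", str(threads),    # number of threads
    ]


# Nucleotide input: genes are predicted by Prodigal inside RGI.
contig_cmd = rgi_main_command("assembly.fasta", "rgi_contig", "contig")
# Protein input: your own predicted proteins are used directly.
protein_cmd = rgi_main_command("proteins.faa", "rgi_protein", "protein")

print(shlex.join(contig_cmd))
print(shlex.join(protein_cmd))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` once RGI and the CARD database are set up locally.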

Hocnonsense commented 6 days ago

Thanks for your kind reply!

Is it possible to make RGI accept the nucleotide sequences of genes as input (such as the output of prodigal -d), which are neither contigs nor translated proteins? I think it could be useful.

Also, you mentioned "rRNA genes". Do you mean genes such as 16S rRNA? If so, a FASTA of gene nucleotide sequences would not contain that information. Would it be possible to add a parameter that passes the gene regions to RGI (e.g., a GFF file)?

agmcarthur commented 6 days ago

Yes, you can use -t contig with a FASTA of ORFs. Prodigal will still run, but it will recognize the ORFs.

For rRNA, CARD does indeed have curated data for mutations in 16S and 23S rRNA. With -t contig, RGI uses BLASTN to find the rRNA genes in your data and then checks for SNPs.
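
If your existing annotation is a GFF file rather than a FASTA of ORFs, one possible way to produce that FASTA is sketched below. This is not part of RGI: the helper names and the toy contigs are hypothetical, and the sketch assumes a standard 9-column, tab-separated GFF with 1-based inclusive coordinates and `CDS` features. Whether Prodigal then recovers exactly the same gene boundaries from these ORFs is something to check on your own data.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(contigs, gff_text):
    """Return [(name, nucleotide_sequence)] for every CDS row in the GFF."""
    genes = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        seq = contigs[seqid][start - 1:end]  # GFF is 1-based, inclusive
        if strand == "-":
            seq = reverse_complement(seq)
        genes.append((f"{seqid}_{start}_{end}_{strand}", seq))
    return genes


def to_fasta(genes):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in genes)


# Toy data (hypothetical): one forward-strand and one reverse-strand gene.
contigs = read_fasta(">c1\nCCATGAAATAGGG\n>c2\nGGCTATTTCATCC\n")
gff = (
    "##gff-version 3\n"
    "c1\tmy_tool\tCDS\t3\t11\t.\t+\t0\tID=g1\n"
    "c2\tmy_tool\tCDS\t3\t11\t.\t-\t0\tID=g2\n"
)
orfs = extract_cds(contigs, gff)
print(to_fasta(orfs), end="")
```

Note that a FASTA built this way contains only the annotated CDS regions, so rRNA genes would not be present in it unless they are added separately.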

Hocnonsense commented 6 days ago

Thanks! So I should always try running with contigs first.

What I want is to replace the default Prodigal annotation inside RGI with my own gene annotation (or one from a published database). I hope this suggestion can make RGI better, but I'll also try to work around it myself.

agmcarthur commented 6 days ago

Ok, great. You can run your own predicted proteins for now, but if you would be willing to share an example input file with us at card@mcmaster.ca, we can look at supporting your type of data. I'll close this issue for now.
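
For running your own predicted proteins, gene nucleotide sequences first need to be translated. A minimal sketch in plain Python follows; it is an illustration and not RGI code. The `translate_orf` helper is hypothetical, it assumes each input is an in-frame, forward-oriented ORF, it uses the standard codon table, and it treats a leading GTG or TTG as a methionine start, which is an assumption suited to bacterial genes. An established library or your gene caller's own protein output is likely the safer choice for real data.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption: alternative bacterial start codons are read as methionine.
ALTERNATIVE_STARTS = {"GTG", "TTG"}


def translate_orf(nucleotides):
    """Translate an in-frame ORF to protein, stopping at the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if i == 0 and codon in ALTERNATIVE_STARTS:
            protein.append("M")
            continue
        aa = CODON_TABLE.get(codon, "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_orf("ATGAAATAG"))
print(translate_orf("GTGAAATAA"))
```

Keep in mind that, as described earlier in this thread, the rRNA SNP checks rely on BLASTN against nucleotide input, so a protein-only run would not cover them.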