apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

plasmid classified as virus? #108

Open xinehc opened 1 week ago

xinehc commented 1 week ago

Hi,

When classifying some RefSeq sequences, I noticed that some plasmids are being classified as viruses. One example is the reference assembly of Klebsiella pneumoniae (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000240185.1/).

It seems that sequence NC_016838.1 has more virus hallmarks than plasmid hallmarks, which makes geNomad classify it as a virus. However, this sequence carries the AMR gene blaCTX-M, which is unlikely to show up in viruses.

I am only interested in classifying sequences as chromosomes or plasmids, so I wonder whether it is possible to prevent geNomad from outputting viruses. Thanks.

    seq_name    length  topology    coordinates n_genes genetic_code    virus_score fdr n_hallmarks marker_enrichment   taxonomy
1   NC_016845.1|provirus_1288374_1340563    52190   Provirus    1288374-1340563 75  11  0.9788  NA  11  91.4045 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
2   NC_016845.1|provirus_2282085_2324920    42836   Provirus    2282085-2324920 62  11  0.9781  NA  10  79.4788 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
3   NC_016845.1|provirus_4049987_4084759    34773   Provirus    4049987-4084759 43  11  0.9748  NA  21  52.8533 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
4   NC_016845.1|provirus_1778390_1811349    32960   Provirus    1778390-1811349 39  11  0.9664  NA  20  40.6072 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
5   NC_016845.1|provirus_4818868_4834971    16104   Provirus    4818868-4834971 25  11  0.9648  NA  7   24.4392 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
6   NC_016838.1 122799  No terminal repeats NA  136 11  0.9562  NA  18  97.9193 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;

apcamargo commented 1 week ago

Thank you for bringing this to my attention. This is an interesting case because at least a portion of the genes in this replicon do seem to be of viral origin (lots of genes encoding tail proteins side by side), suggesting that this might be a hybrid element.

Can you check the chromosome, plasmid, and virus scores of these sequences in the <prefix>_aggregated_classification/<prefix>_aggregated_classification.tsv file? If the plasmid score is substantially higher than the chromosome score, I can evaluate adding a parameter to ignore proviruses identified within sequences classified as plasmids (maybe this could be enabled by default, if cases like this are common).

xinehc commented 1 week ago

NC_016838.1 has a higher plasmid_score than chromosome_score. In this case, should this sequence be classified as a plasmid if I don't care about viruses (given that viruses rarely carry AMR genes)?

seq_name        chromosome_score        plasmid_score   virus_score
NC_016845.1     0.6210  0.3022  0.0768
NC_016838.1     0.0098  0.0340  0.9562
NC_016846.1     0.0011  0.9952  0.0037
NC_016839.1     0.0016  0.9936  0.0048
NC_016840.1     0.0014  0.9941  0.0046
NC_016847.1     0.0057  0.9869  0.0075
NC_016841.1     0.0022  0.9894  0.0085
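The question above amounts to ignoring virus_score and comparing only the other two scores. A minimal sketch of that post hoc relabeling is shown below. It is not a geNomad feature: the function name `relabel_without_virus` is hypothetical, and the code assumes a tab-separated file with the four columns shown in the table above (the real file may contain more columns).

```python
import csv
import io


def relabel_without_virus(tsv_text):
    """Map seq_name -> 'plasmid' or 'chromosome', ignoring virus_score.

    Hypothetical helper: a post hoc heuristic, not geNomad's own logic.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    labels = {}
    for row in reader:
        chrom = float(row["chromosome_score"])
        plasmid = float(row["plasmid_score"])
        # Pick whichever of the two non-virus scores is larger.
        labels[row["seq_name"]] = "plasmid" if plasmid > chrom else "chromosome"
    return labels


# Three rows copied from the table above, joined with tabs.
example = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "NC_016845.1\t0.6210\t0.3022\t0.0768\n"
    "NC_016838.1\t0.0098\t0.0340\t0.9562\n"
    "NC_016846.1\t0.0011\t0.9952\t0.0037\n"
)
labels = relabel_without_virus(example)
for name, label in labels.items():
    print(name, label)
```

Note that a relabeling like this says nothing about whether the sequence is a genuine plasmid, a phage mislabeled as a plasmid, or a hybrid element.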

This situation is not very common: I classified 47306 AMR gene-carrying plasmid sequences (retrieved from PLSDB, RefSeq complete genomes, and IMG/PR), and only 152 of them were classified as viruses. Among these, the minimum plasmid/chromosome score ratio is 1.9637. Here are the classification summaries, in case they are useful.
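For reference, the plasmid/chromosome score ratio can be computed as in the rough sketch below. The helper name `plasmid_chromosome_ratio` and the `eps` guard against division by zero are assumptions for illustration; the scores are the ones listed in the table above.

```python
def plasmid_chromosome_ratio(chromosome_score, plasmid_score, eps=1e-6):
    """Ratio of plasmid_score to chromosome_score.

    Hypothetical helper; eps avoids dividing by zero when the
    chromosome score is reported as 0.
    """
    return plasmid_score / max(chromosome_score, eps)


# (chromosome_score, plasmid_score, virus_score) from the table above.
scores = {
    "NC_016845.1": (0.6210, 0.3022, 0.0768),
    "NC_016838.1": (0.0098, 0.0340, 0.9562),
    "NC_016846.1": (0.0011, 0.9952, 0.0037),
}

# Keep only sequences whose virus score is the highest of the three,
# then compute the plasmid/chromosome ratio for each of them.
virus_called = {
    name: plasmid_chromosome_ratio(chrom, plasmid)
    for name, (chrom, plasmid, virus) in scores.items()
    if virus > max(chrom, plasmid)
}

for name, ratio in virus_called.items():
    print(f"{name}\t{ratio:.4f}")

min_ratio = min(virus_called.values())
```

A ratio above 1 means the plasmid score exceeds the chromosome score; where to set a cutoff is a judgment call that this sketch does not settle.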

Archive.zip

apcamargo commented 1 week ago

If you're not interested in viruses at all, you can just delete the <prefix>_find_proviruses directory and then run the end-to-end command with the --disable-find-proviruses parameter. No provirus will be detected in that sequence and it will be classified as a plasmid.

If you have more cases like this, please share them. I think it might make sense to disable provirus detection by default in cases where sequences have strong evidence of being plasmids.

apcamargo commented 1 week ago

From experience, PLSDB (or at least the previous version of it) contained a couple of actual phages (I couldn't find evidence of them being hybrid elements). This is not a problem with PLSDB itself; it stems from the fact that some submitters tag all secondary replicons as plasmids, and the error gets propagated to RefSeq.

apcamargo commented 1 week ago

I just noticed that NC_016838 got a virus score higher than the plasmid score, so disabling provirus discovery won't help. I need to take a look at this manually. You can experiment with the ratio of marker enrichments, as you mentioned in the previous comment.

xinehc commented 1 week ago

Thanks for the suggestions and comments.

Yes, there are many mislabelled/misassembled plasmid sequences in RefSeq, and my goal was to remove these fake plasmids. NC_016838.1 is possibly misassembled somehow, despite being labelled by RefSeq as part of a reference genome. I will play around with the parameter to see whether this sequence should be kept.

Here are all the sequences classified as virus/provirus among the 47306 AMR gene-carrying (putative) plasmid sequences. I noticed that some IMG/PR sequences are also being classified as viruses, maybe due to a version upgrade.

plasmid_virus.fna.zip

apcamargo commented 1 week ago

Thanks a lot! This will be very helpful.

I'll work on this soon and make a new release.