Open xinehc opened 1 week ago
Thank you for bringing this to my attention. This is an interesting case because it seems that at least a portion of the genes of this replicon are indeed of viral origin (lots of genes encoding tail proteins side-by-side), suggesting that this might be a hybrid element.
Can you check the chromosome, plasmid, and virus scores of these sequences in the <prefix>_aggregated_classification/<prefix>_aggregated_classification.tsv file? If the plasmid score is substantially higher than the chromosome score, I can evaluate adding a parameter to ignore proviruses identified within sequences classified as plasmid (maybe this could be enabled by default, if cases like this are common).
NC_016838.1 has a higher plasmid_score than chromosome_score. In this case, should this sequence be classified as a plasmid if I don't care about viruses (given that viruses rarely carry AMR genes)?
seq_name chromosome_score plasmid_score virus_score
NC_016845.1 0.6210 0.3022 0.0768
NC_016838.1 0.0098 0.0340 0.9562
NC_016846.1 0.0011 0.9952 0.0037
NC_016839.1 0.0016 0.9936 0.0048
NC_016840.1 0.0014 0.9941 0.0046
NC_016847.1 0.0057 0.9869 0.0075
NC_016841.1 0.0022 0.9894 0.0085
This situation is not very common: of the 47306 AMR gene-carrying plasmid sequences I classified (retrieved from PLSDB, RefSeq complete genomes, and IMG/PR), only 152 were classified as virus. The minimum plasmid/chromosome score ratio is 1.9637. Here are the classification summaries if necessary.
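For reference, the plasmid/chromosome comparison discussed here can be sketched in a few lines of Python. This is only an illustration, not part of geNomad: the function name classify_ignoring_virus and the min_ratio cutoff are hypothetical, and the sketch assumes the aggregated classification file is tab-separated with the column names shown in the table above.

```python
import csv
import io


def classify_ignoring_virus(tsv_text, min_ratio=1.0):
    """Label each sequence using only the chromosome and plasmid scores.

    Hypothetical helper: min_ratio is an illustrative cutoff on
    plasmid_score / chromosome_score, not a geNomad parameter.
    Returns {seq_name: (label, ratio)}.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    results = {}
    for row in reader:
        chromosome = float(row["chromosome_score"])
        plasmid = float(row["plasmid_score"])
        # Guard against division by zero when the chromosome score is 0.
        ratio = plasmid / chromosome if chromosome > 0 else float("inf")
        label = "plasmid" if ratio > min_ratio else "chromosome"
        results[row["seq_name"]] = (label, ratio)
    return results


if __name__ == "__main__":
    # Two rows copied from the table above; the virus score is read but ignored.
    sample = (
        "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
        "NC_016845.1\t0.6210\t0.3022\t0.0768\n"
        "NC_016838.1\t0.0098\t0.0340\t0.9562\n"
    )
    for name, (label, ratio) in classify_ignoring_virus(sample).items():
        print(f"{name}\t{label}\tratio={ratio:.4f}")
```

With the default cutoff, a sequence is labelled a plasmid whenever its plasmid score exceeds its chromosome score, regardless of how high the virus score is. A stricter min_ratio would leave borderline sequences as chromosome.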
If you're not interested in viruses at all, you can just delete the <prefix>_find_proviruses directory and then run the end-to-end command with the --disable-find-proviruses parameter. No provirus will be detected in that sequence and it will be classified as a plasmid.
If you have more cases like this, please share. I think it might make sense to disable provirus detection by default in cases where sequences have strong evidence of being plasmids.
From experience, PLSDB (or at least the previous version of it) had a couple of actual phages there (I couldn't find evidence of them being hybrid elements). This is not a problem with PLSDB itself, but related to the fact that some submitters will tag all secondary replicons as plasmids and the error gets propagated to RefSeq.
I just noticed that NC_016838 got a virus score higher than the plasmid score, so disabling provirus discovery won't help. I need to take a look at this manually. You can experiment with the ratio of marker enrichments, as you mentioned in the previous comment.
Thanks for the suggestions and comments.
Yes, there are many mislabelled/misassembled plasmid sequences in RefSeq, and my goal was to remove these fake plasmids. NC_016838.1 is possibly misassembled somehow, despite being labelled by RefSeq as a reference genome. I will play around with the parameter to see whether this sequence should be kept.
Here are all the sequences classified as virus/provirus among the 47306 AMR gene-carrying (putative) plasmid sequences. I noticed that some IMG/PR sequences are also being classified as virus, maybe due to a version upgrade.
Thanks a lot! This will be very helpful.
I'll work on this soon and make a new release.
Hi,
When classifying some RefSeq sequences, I noticed that some plasmids are being classified as virus. One example is the reference assembly of Klebsiella pneumoniae (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000240185.1/).
It seems sequence NC_016838.1 has more virus hallmarks than plasmid hallmarks, which makes geNomad classify this sequence as virus. However, this sequence carries the AMR gene blaCTX-M, which is unlikely to show up in viruses.
I am particularly interested in classifying sequences as only chromosomes or plasmids, so I wonder whether it is possible to prevent geNomad from outputting virus. Thanks