apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Interpretation of results - plasmids with 0 genes? #87

Closed jimen210 closed 2 months ago

jimen210 commented 3 months ago

Hello,

First, thanks for developing geNomad. It ran smoothly on my metagenomic dataset from lakes.

After running geNomad, I got a summary stating "15,657 plasmid(s) and 6,922 virus(es) were identified".

When I looked at contigs_plasmid_summary.tsv, I counted 15,657 rows, which I understand are the plasmids identified by geNomad. Summing the n_genes column gives 66,989 genes in total, which matches the number of rows in contigs_plasmid_genes.tsv, so the number of plasmid genes is clear to me. However, contigs_plasmid_summary.tsv contains several plasmids with an n_genes value of zero. Should those be interpreted as empty plasmids?
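The check described above (count the rows, sum n_genes, list the rows with zero genes) can be sketched in Python. This is only an illustration: the helper name `summarize_gene_counts` and the in-memory example rows are made up, and it assumes the summary file is tab-separated with `seq_name` and `n_genes` columns, as in the output quoted in this thread.

```python
import csv
import io


def summarize_gene_counts(handle):
    """Return (row count, total n_genes, names of rows with n_genes == 0)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    total_genes = sum(int(row["n_genes"]) for row in rows)
    zero_gene = [row["seq_name"] for row in rows if int(row["n_genes"]) == 0]
    return len(rows), total_genes, zero_gene


# Tiny in-memory stand-in for the summary file (illustrative values only).
example = io.StringIO(
    "seq_name\tlength\tn_genes\n"
    "example_a\t2717\t0\n"
    "example_b\t5400\t6\n"
    "example_c\t3100\t3\n"
)
n_rows, total_genes, zero_gene = summarize_gene_counts(example)
print(n_rows, total_genes, zero_gene)
```

For a real run, the same function could be given `open("contigs_plasmid_summary.tsv", newline="")` instead of the in-memory example.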

Thanks again

apcamargo commented 3 months ago

It's strange to have a sequence with zero genes classified as a plasmid. Can you share some of those with me? I want to take a look to understand better what might be going on.

jimen210 commented 3 months ago

Sure, here are some from contigs_plasmid_summary.tsv:

| seq_name | length | topology | n_genes | genetic_code | plasmid_score | fdr | n_hallmarks | marker_enrichment | conjugation_genes | amr_genes |
|---|---|---|---|---|---|---|---|---|---|---|
| c_000000004590 | 2717 | No terminal repeats | 0 | 11 | 0.732 | NA | 0 | 0 | NA | NA |
| c_000000008890 | 2543 | No terminal repeats | 0 | 11 | 0.7117 | NA | 0 | 0 | NA | NA |
| c_000000017907 | 2942 | No terminal repeats | 0 | 11 | 0.7887 | NA | 0 | 0 | NA | NA |
| c_000000036041 | 2807 | No terminal repeats | 0 | 11 | 0.7324 | NA | 0 | 0 | NA | NA |

Also, I forgot to mention that I used the "genomad end-to-end --cleanup" command for my dataset.

apcamargo commented 3 months ago

These sequences are too long to not have any genes. Do you think you can share the FASTA file with me?

jimen210 commented 2 months ago

Hi, sorry for the delay. Yes, here is a subset that includes the sequences mentioned above. I converted it to .txt so I could upload it. contigs_plasmid.txt

apcamargo commented 2 months ago

Thank you!

What is happening here is that the neural network is classifying these sequences as plasmids. Because they don't have markers, the weight of the neural network classifier is higher than that of the marker-based classifier (which doesn't classify them as plasmids). I'll implement an additional filter for the next release.

Right now, you can use the marker_enrichment column to remove these cases. Just remove rows where marker_enrichment == 0.
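A minimal sketch of that filter in Python, using only the standard library. The function name `drop_zero_enrichment` and the second example row are hypothetical; the sketch assumes the summary file is tab-separated and that marker_enrichment is always numeric.

```python
import csv
import io


def drop_zero_enrichment(in_handle, out_handle, column="marker_enrichment"):
    """Copy a TSV, skipping rows whose `column` equals 0. Returns (kept, dropped)."""
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = dropped = 0
    for row in reader:
        if float(row[column]) == 0:
            dropped += 1  # no marker enrichment: likely a neural-network-only call
            continue
        writer.writerow(row)
        kept += 1
    return kept, dropped


# In-memory example with a reduced set of columns. The first row is taken
# from this thread; the second row is invented for illustration.
source = io.StringIO(
    "seq_name\tlength\tn_genes\tplasmid_score\tmarker_enrichment\n"
    "c_000000004590\t2717\t0\t0.732\t0\n"
    "example_plasmid\t5400\t6\t0.95\t4.2\n"
)
filtered = io.StringIO()
kept, dropped = drop_zero_enrichment(source, filtered)
print(kept, dropped)
print(filtered.getvalue(), end="")
```

With real files, the two handles could be `open("contigs_plasmid_summary.tsv", newline="")` and a writable output file of your choosing; the original summary is left untouched.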

jimen210 commented 2 months ago

Thanks so much for your help!

apcamargo commented 2 months ago

I just released version 1.8.0, which has new options and default parameters that will avoid cases like this.