apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

How to export the viral DNA gene sequence? #116

Open wfgui opened 1 month ago

wfgui commented 1 month ago

Hi, in the example above I can see the protein FASTA file GCF_009025895.1_virus_proteins.faa. I want to calculate gene abundance using the viral gene sequences. Can geNomad also output the corresponding nucleotide sequences?

Thanks!

apcamargo commented 1 month ago

There is currently no option to do this, but I could implement it as a feature in the future. In the meantime, you can obtain the nucleotide sequences of the CDSs by extracting them from the genomes using the gene coordinates.
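
As a rough illustration of that workaround, here is a minimal Python sketch. It assumes the gene table is tab-separated with `gene`, `start`, `end`, and `strand` columns, that coordinates are 1-based and inclusive, that strand is `1` or `-1`, and that gene names are the contig name plus a numeric suffix (`<contig>_<n>`). These are assumptions to check against your own geNomad output, and the helper names (`extract_cds`, `genes_to_fasta`) are hypothetical. For proviruses, also check whether the coordinates refer to the excised region or to the original contig.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(contig_seq, start, end, strand):
    """Slice a CDS out of a contig (1-based, inclusive coordinates assumed)."""
    sub = contig_seq[start - 1:end]
    return reverse_complement(sub) if strand < 0 else sub


def read_fasta(handle):
    """Read a FASTA file into a {name: sequence} dict."""
    chunks = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def genes_to_fasta(fasta_handle, genes_handle):
    """Return (gene_name, nucleotide_sequence) pairs for every row of the gene table."""
    contigs = read_fasta(fasta_handle)
    records = []
    for row in csv.DictReader(genes_handle, delimiter="\t"):
        # Assumption: the contig name is the gene name minus its last "_<n>" suffix.
        contig = row["gene"].rsplit("_", 1)[0]
        seq = extract_cds(
            contigs[contig], int(row["start"]), int(row["end"]), int(row["strand"])
        )
        records.append((row["gene"], seq))
    return records


# Toy data standing in for the genome FASTA and the gene table.
demo_fasta = io.StringIO(">contig1\nAAATGCCCTAAGG\n")
demo_genes = io.StringIO(
    "gene\tstart\tend\tstrand\n"
    "contig1_1\t3\t11\t1\n"
    "contig1_2\t1\t6\t-1\n"
)
demo_records = genes_to_fasta(demo_fasta, demo_genes)
for name, seq in demo_records:
    print(f">{name}\n{seq}")
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the virus FASTA file and the gene table.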

wfgui commented 1 month ago

I also have a seemingly simple question: can I format the taxonomy output, for example by converting it to k__; p__; c__; o__; f__; g__; s__?

Thanks!

apcamargo commented 3 weeks ago

You can use taxopy for that. geNomad's taxdump is inside the database directory, and you can find the TaxIds in the <prefix>_annotate/<prefix>_taxonomy.tsv file.

For instance:

import taxopy

# Load geNomad's taxdump from the database directory
taxdb = taxopy.TaxDb(
    nodes_dmp="genomad_db/nodes.dmp",
    names_dmp="genomad_db/names.dmp",
    keep_files=True
)
# Build the taxon from a TaxId taken from the taxonomy table
taxon = taxopy.Taxon(5797, taxdb)
# Reverse the lineage so the highest rank is printed first
for rank, name in reversed(taxon.ranked_name_lineage):
    if name != "root":
        print(f"{rank}__{name}")

Output:

realm__Duplodnaviria
kingdom__Heunggongvirae
phylum__Uroviricota
class__Caudoviricetes
order__Crassvirales

wfgui commented 3 weeks ago

What's the difference between "Unclassified" and "Viruses;;;;;;"?

apcamargo commented 3 weeks ago

"Unclassified" means that the genes in the sequence had no matches to markers with taxonomy information. "Viruses" means that the classification is uncertain at a high rank.
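
If the lineage string from the taxonomy table is all you have, a rank-prefixed version can also be built without taxopy. The sketch below is only an illustration: it assumes the lineage column holds seven semicolon-separated positions in the order listed in `RANKS` (with "Viruses" in the first position), which is a guess based on the "Viruses;;;;;;" value above and should be checked against your own output. The function name `format_lineage` and the `superkingdom` label are hypothetical.

```python
# Assumed rank for each position of the lineage string (not confirmed by geNomad's docs).
RANKS = ("superkingdom", "realm", "kingdom", "phylum", "class", "order", "family")


def format_lineage(lineage, ranks=RANKS):
    """Turn 'Viruses;Duplodnaviria;...' into 'superkingdom__Viruses; realm__Duplodnaviria; ...'."""
    if lineage == "Unclassified":
        # No taxonomy-informed marker matches: nothing to prefix.
        return lineage
    fields = lineage.split(";")
    # Skip empty positions, i.e. ranks that could not be assigned.
    parts = [f"{rank}__{name}" for rank, name in zip(ranks, fields) if name]
    return "; ".join(parts)


examples = [
    "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Crassvirales;",
    "Viruses;;;;;;",
    "Unclassified",
]
formatted = [format_lineage(item) for item in examples]
for item in formatted:
    print(item)
```

Empty positions are dropped, so a sequence assigned only to "Viruses" yields a single prefixed entry, while "Unclassified" is passed through unchanged.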