bhattlab / phanta

Workflow to rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes
MIT License

Question about the database species #33

Closed by Iris7788 1 year ago

Iris7788 commented 1 year ago

Hi,

I have downloaded unmasked_db_v1 and would like to know which viral species the database covers. I noticed that the file "seqid2taxid.map" in the database folder maps genome names to taxon IDs. However, the taxon IDs assigned to genomes from MGV do not return any lineage information when I look them up on the NCBI website. Is there a way to get their lineages?

meenachakra commented 1 year ago

Hi, you can get the lineage information from the names/nodes files in the taxonomy subfolder of the database. These files follow the standard names/nodes format used by Kraken2 databases (see, e.g., https://github.com/DerrickWood/kraken2/issues/436). The methods section of our paper describes how taxonomic assignments were made for each viral genome. Hope this helps!
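For readers who want to script this lookup, here is a minimal sketch of walking a taxon ID up to the root using the names/nodes files. It assumes the files are named `names.dmp` and `nodes.dmp` (the usual Kraken2 layout) and use the NCBI-style `\t|\t` field separator, and that `seqid2taxid.map` has two tab-separated columns (sequence ID, taxon ID). The function names are hypothetical and not part of phanta.

```python
from pathlib import Path


def read_dmp(path):
    """Yield the list of fields for each row of an NCBI-style .dmp file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # Rows end with "\t|" and separate fields with "\t|\t".
            if line.endswith("\t|"):
                line = line[:-2]
            yield line.split("\t|\t")


def load_taxonomy(taxonomy_dir):
    """Return (parents, ranks, names) dicts keyed by taxon ID (as strings)."""
    taxonomy_dir = Path(taxonomy_dir)
    parents, ranks, names = {}, {}, {}
    # nodes.dmp: taxon ID, parent taxon ID, rank, ...
    for fields in read_dmp(taxonomy_dir / "nodes.dmp"):
        parents[fields[0]] = fields[1]
        ranks[fields[0]] = fields[2]
    # names.dmp: taxon ID, name, unique name, name class
    for fields in read_dmp(taxonomy_dir / "names.dmp"):
        if fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return parents, ranks, names


def load_seqid_map(path):
    """Return a dict mapping sequence ID to taxon ID (assumed two columns)."""
    mapping = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
    return mapping


def lineage(taxid, parents, ranks, names):
    """Return [(rank, name), ...] from the root down to the given taxon ID."""
    path, seen = [], set()
    while taxid in parents and taxid not in seen:
        seen.add(taxid)  # guard against cycles in a malformed file
        path.append((ranks[taxid], names.get(taxid, taxid)))
        if parents[taxid] == taxid:  # the root node is its own parent
            break
        taxid = parents[taxid]
    return list(reversed(path))
```

To use it, call `load_taxonomy` on the database's taxonomy subfolder, look up a genome's taxon ID with `load_seqid_map`, and pass that ID to `lineage`. An ID that is absent from the nodes file yields an empty list.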

Iris7788 commented 1 year ago

Thank you for your kind reply. I would also like to know whether phanta provides host and lifestyle information for predicted phages.

meenachakra commented 1 year ago

Hi, no problem. Yes, host and lifestyle information is included in the database, in the files species_name_to_vir_score.txt and host_prediction_to_genus.tsv. Virulence scores for phage species range from 0 to 1, and host predictions are made at the prokaryotic genus level.