bhattlab / phanta

Workflow to rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes
MIT License

Most taxa have no taxonomy name at species level #44

Open DrYoungOG opened 2 weeks ago

DrYoungOG commented 2 weeks ago

Hi!

Thanks for the excellent tool phanta.

I have run phanta (after raw-read quality control and host contamination removal) on my samples (human fecal samples, Illumina 150 bp paired-end reads) and obtained final_merged_outputs including: counts.txt, relative_read_abundance.txt, relative_taxonomic_abundance.txt, and total_reads.tsv.

My questions are:

  1. If I want to run a differential analysis only for viruses at the species level, how should I calculate the relative abundance of viruses without taking bacteria into account? Should I extract the read counts of viruses annotated to the species level and then calculate the proportions among these viruses?

  2. The vast majority of the identified virus species do not have specific taxonomic names at the species level, but are instead named in a manner similar to “species_OTU-57497”. Are such results normal? If downstream differential analysis reveals that a virus differing significantly between groups is “species_OTU-xxxxx” (without a specific taxonomic name), how can we learn about the biological characteristics of this virus in order to interpret the biological significance of the result?

Thank you!

Supplemental information:

  1. The phanta script that I used was like:

         conda activate phanta_env
         snakemake -s /path/to/phanta/Snakefile --configfile /path/to/phanta/config.yaml --jobs 80 --cores 10 --max-threads 40

  2. Parameters used for the config.yaml file:

         database: /path/to/phanta/phanta_dbs/masked_db_v1
         confidence_threshold: 0.1
         gzipped: True
         class_mem_mb: 524288
         class_threads: 40
         single_end_krak: False
         cov_thresh_viral: 0.05
         minimizer_thresh_viral: 0
         cov_thresh_bacterial: 0.01
         minimizer_thresh_bacterial: 0
         cov_thresh_arc: 0.01
         minimizer_thresh_arc: 0
         cov_thresh_euk: 0
         minimizer_thresh_euk: 0
         read_length: 150
         filter_thresh: 10
         delete_intermediate: True

  3. The final_merged_outputs counts.txt and total_reads.tsv are attached: counts.txt total_reads.txt

meenachakra commented 2 weeks ago

Hi, thanks for the question!

  1. That method works! I think you can also just extract the rows of the abundance table that contain "superkingdom_Viruses" and then do differential abundance on that filtered table. @yipinto may have an alternative opinion.

  2. That's normal, yes! Please see our paper for additional details. Based on the files included in the "taxonomy" folder of the database, you can figure out which genomes are strains of OTU-XXXX. You can then see which of these genomes were assigned reads by Kraken2, if you change delete_intermediate to False. All of the genome sequences are available here: https://github.com/bhattlab/phanta/issues/31. So that should help you figure out the characteristics of the virus.
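The row filtering suggested in point 1 could be sketched roughly as follows. This is only an illustration: the column name `Taxon_Lineage_with_Names`, the pipe-separated `rank_name` lineage layout, and every row in `TABLE` are assumptions made up for the example, so check them against your own merged table before reusing the idea.

```python
import csv
import io

# Hypothetical excerpt of a merged table. The header name and the
# pipe-separated lineage strings are assumptions for illustration only.
TABLE = (
    "Taxon_Lineage_with_Names\tsample_A\tsample_B\n"
    "root_root|superkingdom_Bacteria|species_ExampleBacterium\t900\t700\n"
    "root_root|superkingdom_Viruses|genus_ExampleGenus\t40\t10\n"
    "root_root|superkingdom_Viruses|genus_ExampleGenus|species_OTU-57497\t30\t10\n"
    "root_root|superkingdom_Viruses|species_OTU-00001\t70\t90\n"
)


def viral_species_rows(text, lineage_col="Taxon_Lineage_with_Names"):
    """Return rows whose lineage is viral and ends at a species-level entry."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    kept = []
    for row in reader:
        ranks = row[lineage_col].split("|")
        # Keep viral lineages only, and drop rows that stop above species level.
        if "superkingdom_Viruses" in ranks and ranks[-1].startswith("species_"):
            kept.append(row)
    return kept


viral = viral_species_rows(TABLE)
for row in viral:
    species = row["Taxon_Lineage_with_Names"].split("|")[-1]
    print(species, row["sample_A"], row["sample_B"])
```

Matching on the last lineage element rather than on a substring avoids keeping genus-level or higher viral rows, which would otherwise be counted alongside their own species.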

Please let us know if you have additional questions!

DrYoungOG commented 1 week ago

Hi, thanks for your reply!

Regarding question 2: I have downloaded all of the genome sequences mentioned in https://github.com/bhattlab/phanta/issues/31#issuecomment-1666115258, but I am unsure what to do next to explore the biological characteristics of a specific virus species OTU-XXXX. (I changed the parameter delete_intermediate to True in my run).

Sorry if I asked a stupid question.

Thanks for your patience.

meenachakra commented 6 days ago

Not a stupid question!

Step 1 - determine which strains of that OTU are present in each sample of interest by using the intermediate files ending in krak.report.filtered. The strains are listed right underneath the given OTU. You can see how many reads were assigned to each strain using the third column. There's no set rule for this, but you should probably decide on a read-count threshold for considering a strain "present" in a sample.

Step 2 - then find the genome sequence of the strains of interest in the files you downloaded.

Step 3 - determine biological characteristics. There are a few options here. For example, annotate the genomes using bakta. Or predict host via iPHoP. Note that we already provide a likely host for each species in the host_species_to_genus file that comes with the database - you can determine the taxid of your OTU (needed to look up the species in host_species_to_genus) using the names.dmp file in the "taxonomy" subfolder of the database.
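As a rough sketch of Step 1 and of the names.dmp lookup in Step 3, the snippet below walks a report and a names file held in memory. It assumes the filtered report keeps the usual Kraken2 report layout (tab-separated, reads assigned directly to a taxon in the third column, an indented name in the last column) and that names.dmp uses the NCBI-style `|`-delimited layout. All taxids, strain names, read counts, and the threshold of 10 reads are hypothetical.

```python
# Hypothetical report excerpt: percent, clade reads, direct reads,
# rank code, taxid, indented name (layout assumed, values invented).
REPORT = "\n".join([
    "1.25\t125\t0\tG\t100\t    ExampleGenus",
    "0.95\t95\t0\tS\t200\t      OTU-57497",
    "0.90\t90\t90\tS1\t201\t        strain_alpha",
    "0.05\t5\t5\tS1\t202\t        strain_beta",
    "0.30\t30\t30\tS\t300\t      OTU-00001",
])

# Hypothetical names.dmp excerpt in the NCBI "taxid | name | unique | class |" layout.
NAMES_DMP = (
    "200\t|\tOTU-57497\t|\t\t|\tscientific name\t|\n"
    "201\t|\tstrain_alpha\t|\t\t|\tscientific name\t|\n"
)


def strains_under(report_text, otu_name, min_reads=10):
    """List (name, reads) for entries nested under otu_name with reads >= min_reads."""
    hits, otu_depth = [], None
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        raw_name = fields[-1]
        depth = len(raw_name) - len(raw_name.lstrip(" "))  # indentation encodes nesting
        name = raw_name.strip()
        if otu_depth is None:
            if name == otu_name:
                otu_depth = depth
            continue
        if depth <= otu_depth:
            break  # left the OTU's subtree
        reads = int(fields[2])  # third column: reads assigned to this entry
        if reads >= min_reads:
            hits.append((name, reads))
    return hits


def taxid_for_name(names_dmp_text, wanted):
    """Return the taxid (as a string) whose name equals wanted, or None."""
    for line in names_dmp_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[1] == wanted:
            return fields[0]
    return None


present = strains_under(REPORT, "OTU-57497", min_reads=10)
otu_taxid = taxid_for_name(NAMES_DMP, "OTU-57497")
print(present, otu_taxid)
```

With real files you would read each krak.report.filtered and names.dmp from disk and pass their text in. Whether the OTU appears in the report with or without a rank prefix is worth checking in your own files first.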

@yipinto may have further comments. Hope this is helpful!

yipinto commented 1 day ago

Hi @DrYoungOG, regarding your first question: yes, you may keep only the viral species and renormalize their proportions. If you do so, I would recommend using relative read abundance (to avoid two normalizations). As for biological characteristics, in addition to everything Meena has mentioned, you can also take a look at the MGV database metadata, which includes info for most of the viral species in the default Phanta database. Good luck!
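A minimal sketch of that renormalization, assuming the viral species rows have already been pulled out of relative_read_abundance.txt into a nested dictionary. The OTU names, sample names, and values below are hypothetical.

```python
def renormalize(abundances):
    """Rescale values so that, within each sample, the given taxa sum to 1.

    `abundances` maps taxon -> {sample -> value}; the result has the same shape.
    """
    samples = sorted({s for per_sample in abundances.values() for s in per_sample})
    totals = {
        s: sum(per_sample.get(s, 0.0) for per_sample in abundances.values())
        for s in samples
    }
    result = {}
    for taxon, per_sample in abundances.items():
        result[taxon] = {
            # Samples with no viral signal stay at 0 instead of dividing by zero.
            s: (per_sample.get(s, 0.0) / totals[s] if totals[s] > 0 else 0.0)
            for s in samples
        }
    return result


# Hypothetical relative read abundances for viral species only.
viral_rel = {
    "species_OTU-57497": {"sample_A": 0.03, "sample_B": 0.01},
    "species_OTU-00001": {"sample_A": 0.01, "sample_B": 0.03},
}

viral_only = renormalize(viral_rel)
for taxon, per_sample in viral_only.items():
    print(taxon, {s: round(v, 3) for s, v in per_sample.items()})
```

After this step each sample's viral species sum to 1, so the values describe composition within the virome rather than within the whole community.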