RyanCook94 / inphared

Providing up-to-date phage genome databases, metrics and useful input files for a number of bioinformatic pipelines.
GNU Affero General Public License v3.0

Host taxonomy #10

Closed snayfach closed 2 years ago

snayfach commented 2 years ago

Would it be easy to include the full host taxonomy in the data.tsv output file? Or at least the full genus+species name? Currently only the genus is displayed.

RyanCook94 commented 2 years ago

Hi Stephen,

I've definitely considered adding full taxonomy for the hosts (from genus to phylum, that is) using something like the R package taxonomizr. For now, though, I've decided to keep the reliance on other packages/libraries to a minimum, so I haven't incorporated this.

As for species-level host taxa, this is information I would love to have, but it's very difficult to come by. The "Host" column of my output file is derived mostly from phage names/descriptions on GenBank, where I take the word that precedes "phage" or "virus" (e.g. for Serratia phage vB_SmaM_Haymo, I can easily grab Serratia because it precedes the word phage). I then periodically check the list of hosts and add lines to the script to clean up nonsense ones (e.g. Capybara virus).
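For readers who want to see the idea concretely, here is a minimal sketch of the "word before phage/virus" heuristic described above. It is not the actual inphared code: the function name `guess_host_genus`, the `NONSENSE_HOSTS` blocklist, and the "Unspecified" fallback value are all assumptions made for illustration.

```python
import re

# Hypothetical blocklist of words that precede "phage"/"virus" but are not
# bacterial hosts; the real script's cleanup rules are not shown in this thread.
NONSENSE_HOSTS = {"Capybara"}

# Capture the single word immediately before "phage" or "virus".
HOST_PATTERN = re.compile(r"(\S+)\s+(?:phage|virus)\b", re.IGNORECASE)


def guess_host_genus(description):
    """Return the word preceding 'phage'/'virus', or 'Unspecified'."""
    match = HOST_PATTERN.search(description)
    if match is None:
        return "Unspecified"
    host = match.group(1)
    if host in NONSENSE_HOSTS:
        return "Unspecified"
    return host


print(guess_host_genus("Serratia phage vB_SmaM_Haymo"))
```

Because only one word is captured, a description that puts a full binomial before "phage" would yield the species epithet rather than the genus, which is one reason a heuristic like this needs periodic manual cleanup.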

Generally, the host species is not recorded in the GenBank file. A number of GenBank files have an additional tag for host/isolation_host, and for some of these the value is the host species. I've created a version of the script which outputs this (will push to GitHub shortly), although only ~50% of entries have this tag, and a very large number of those give the sample material (e.g. cow faeces) rather than the isolation host as the value.
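As a rough illustration of pulling those tag values out of a record, the sketch below scans GenBank flat-file text for `/host="..."` and `/isolation_host="..."` qualifiers with a regular expression. The function name `extract_host_tags` is hypothetical, and this is only one possible approach under the assumption that the tags appear as quoted qualifiers; it is not the script's actual implementation.

```python
import re

# Match quoted qualifier values; [^"]* also spans values wrapped over lines.
QUALIFIER_PATTERN = re.compile(r'/(host|isolation_host)="([^"]*)"')


def extract_host_tags(genbank_text):
    """Return a dict mapping each tag name to the first value found."""
    tags = {}
    for name, value in QUALIFIER_PATTERN.findall(genbank_text):
        # Collapse the line breaks and indentation of wrapped values.
        tags.setdefault(name, " ".join(value.split()))
    return tags


record = '''     source          1..1000
                     /organism="Serratia phage vB_SmaM_Haymo"
                     /host="Serratia marcescens"
'''
print(extract_host_tags(record))
```

Records without either tag simply return an empty dict, and values such as sample material would still need to be filtered out separately.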

snayfach commented 2 years ago

As you suggested, the host and isolation_host fields work pretty well. I was able to get host species names for 18400/21221 inphared genomes. Only 2401 of these are unlabeled species ("sp.").
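A tally like the one above could be produced along the following lines. This is a hedged sketch, not the method actually used in this thread: `classify_host` and `summarise_hosts` are hypothetical names, and the rule (two or more words means a species name, a second word of "sp." means unlabeled) is a deliberately naive assumption that would also count sample material such as "cow faeces" as a named species unless it is filtered first.

```python
from collections import Counter


def classify_host(value):
    """Bucket a host string as named, unlabeled ('sp.'), or no species."""
    words = value.split()
    if len(words) < 2:
        return "no_species"
    if words[1].rstrip(".") == "sp":
        return "unlabeled_species"
    return "named_species"


def summarise_hosts(values):
    """Count how many host strings fall into each bucket."""
    return Counter(classify_host(v) for v in values)


print(summarise_hosts(["Serratia marcescens", "Vibrio sp.", "Serratia", ""]))
```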

RyanCook94 commented 2 years ago

Hi Stephen,

I've just pushed a version of the script which grabs these tag values automatically and outputs them into a new column in the TSV output files. Files from a run I did on 1st March are also being uploaded to GitHub as we speak, so these will be available without having to run the script.

Thanks again for the input and hope you find the script helpful!

All the best, Ryan

snayfach commented 2 years ago

Thanks!