Closed by snayfach 2 years ago
Hi Stephen,
I've definitely considered adding full taxonomy for the hosts (from genus to phylum, that is) using something like the R package taxonomizr, but I've decided to keep the reliance on other packages/libraries to a minimum for now, so I haven't incorporated this.
As for species-level host taxa, this is information I would love to have, but it's very difficult to come by. The "Host" column of my output file is derived mostly from phage names/descriptions on GenBank, where I take the word that precedes "phage" or "virus" (e.g. for Serratia phage vB_SmaM_Haymo, I can easily grab Serratia because it precedes the word "phage"). I then periodically check the list of hosts and add lines to the script to clean up nonsense ones (e.g. Capybara virus).
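For anyone who wants to try the same idea outside the script, here is a minimal Python sketch of the "word before phage/virus" approach described above. It is not the script's actual code: the function name `host_from_description` and the `NOT_HOSTS` blocklist are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical blocklist: words that precede "phage"/"virus" in a description
# but are not bacterial hosts. In practice this would grow as new cases turn up.
NOT_HOSTS = {"Capybara"}

# Capture the single word immediately before "phage" or "virus".
_HOST_WORD = re.compile(r"(\S+)\s+(?:phage|virus)\b", re.IGNORECASE)


def host_from_description(description):
    """Return the word preceding 'phage' or 'virus', or None if unusable."""
    match = _HOST_WORD.search(description)
    if match is None:
        return None
    word = match.group(1)
    if word in NOT_HOSTS:
        return None
    return word


print(host_from_description("Serratia phage vB_SmaM_Haymo"))
print(host_from_description("Capybara virus"))
```

Only the genus can be recovered this way, because the description rarely includes the species name.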
Generally, the host species is not recorded in the GenBank file. A number of GenBank files have an additional host/isolation_host tag, and for some of these the value is the host species. I've created a version of the script that outputs this (I will push it to GitHub shortly), although only ~50% of entries have this tag, and a very large number of those give the sample material (e.g. cow faeces) rather than the isolation host as the value.
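As a rough illustration of pulling these tags out, the Python sketch below scans the text of a GenBank flat file for `/host="..."` and `/isolation_host="..."` qualifiers. It assumes the tags appear as quoted feature qualifiers in the flat file; the function name `host_qualifiers` is hypothetical, and this is not how the script itself does it.

```python
import re

# Match /host="..." or /isolation_host="..." qualifiers. The leading slash keeps
# this from matching other qualifiers that merely end in "host" (e.g. /lab_host).
_QUALIFIER = re.compile(r'/(host|isolation_host)="([^"]*)"')


def host_qualifiers(genbank_text):
    """Return a dict of the first value found for each host-related qualifier."""
    values = {}
    for name, value in _QUALIFIER.findall(genbank_text):
        # Long qualifier values wrap across lines in flat files, so collapse
        # any run of whitespace (including newlines) into a single space.
        values.setdefault(name, " ".join(value.split()))
    return values


# Made-up record fragment, for illustration only.
record = '''
     source          1..1000
                     /organism="Serratia phage example"
                     /host="Serratia marcescens"
                     /lab_host="not captured"
'''
print(host_qualifiers(record))
```

The values still need manual review, since many of them describe the sample material rather than the host.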
As you suggested, the host and isolation_host fields work pretty well. I was able to get host species names for 18,400 of the 21,221 inphared genomes. Only 2,401 of these are unlabeled species ("sp.").
Hi Stephen,
I've just pushed a version of the script that grabs these tag values automatically and outputs them into a new column in the TSV output files. Files from a run I did on 1st March are also being uploaded to GitHub as we speak, so these will be available without having to run the script.
Thanks again for the input, and I hope you find the script helpful!
All the best, Ryan
Thanks!
Would it be easy to include the full host taxonomy in the data.tsv output file? Or at least the full genus+species name? Currently only the genus is displayed.