rcedgar closed this issue 3 years ago
VirusHostDB:
ftp://ftp.genome.jp/pub/db/virushostdb/
Annotations from Genomes Online DB with help from Dr. Reddy: GOLD-virus-host-associations.xlsx
Here's the host information for the sequences from the cov3ma.fa file, as scraped from NCBI using BioPython:
Format: accession <tab> virus taxon ID <tab> host free text
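For anyone consuming that file, here is a minimal sketch of reading the three-column format described above. It assumes the file has no header row and that the host field may be empty; the function name `read_host_rows` and the example rows are hypothetical, not taken from the real data.

```python
import csv
import io

def read_host_rows(handle):
    """Yield (accession, virus_taxid, host_text) from a 3-column TSV.

    Assumes no header row; lines with fewer than 3 fields are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # blank or malformed line
        accession, virus_taxid, host_text = row[0], row[1], row[2]
        yield accession.strip(), int(virus_taxid), host_text.strip()

# Hypothetical example rows, for illustration only.
example = "ACC0001\t11111\tdromedary camel\nACC0002\t22222\t\n"
rows = list(read_host_rows(io.StringIO(example)))
```

An empty host string marks a record where the host still has to be resolved.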
OK @rcedgar, reassigning to you. There are many ways to combine these three different datasets. I need more guidance regarding the desired format (e.g., columns) for the combined TSV. Thanks!
Multiple tsvs are fine, provided all tsvs have both Virus_TaxID and Host_TaxID. That's all I need for Fig. 1; I can easily combine those columns from multiple tsvs.
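A rough sketch of that combination step, assuming each tsv has a header row containing the columns `Virus_TaxID` and `Host_TaxID` (any other columns are ignored). The function name `taxid_pairs` and the numbers in the example are placeholders, not real associations.

```python
import csv
import io

def taxid_pairs(handles, virus_col="Virus_TaxID", host_col="Host_TaxID"):
    """Collect unique (virus_taxid, host_taxid) pairs across several TSVs."""
    pairs = set()
    for handle in handles:
        for row in csv.DictReader(handle, delimiter="\t"):
            virus, host = row.get(virus_col), row.get(host_col)
            if virus and host:  # skip rows where either TaxID is missing
                pairs.add((int(virus), int(host)))
    return pairs

# Two placeholder tables with different extra columns.
tsv_a = "Accession\tVirus_TaxID\tHost_TaxID\nA1\t111\t222\nA2\t333\t\n"
tsv_b = "Virus_TaxID\tHost_TaxID\tSource\n111\t222\tdb\n444\t555\tdb\n"
combined = taxid_pairs([io.StringIO(tsv_a), io.StringIO(tsv_b)])
```

Using a set drops exact duplicates across files, but it does not reconcile entries at different taxonomic levels.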
I think that we need to think a bit harder about how to combine, as entries in two datasets (or even within a single dataset!) might be talking about the same thing conceptually, but at different levels of granularity ("Canis lupus" vs. "Canis lupus familiaris"). These would require code aware of subsumption relationships between the NCBI Taxonomy IDs to reconcile.
Unless you want me to just drop a hot mess of duplicate entries at varying levels of specificity on your plate. :-)
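To make the subsumption idea concrete, here is a minimal sketch. It assumes a child-to-parent map of TaxIDs is available (in practice it could be built from `nodes.dmp` in the NCBI taxdump); the tiny `PARENT_OF` map below is a toy excerpt, and the helper names are hypothetical.

```python
def lineage(taxid, parent_of):
    """Return [taxid, parent, grandparent, ...] as far as the map goes."""
    chain = [taxid]
    # The NCBI root lists itself as its own parent, hence the second check.
    while taxid in parent_of and parent_of[taxid] != taxid:
        taxid = parent_of[taxid]
        chain.append(taxid)
    return chain

def subsumes(ancestor, descendant, parent_of):
    """True if `ancestor` equals `descendant` or lies above it."""
    return ancestor in lineage(descendant, parent_of)

def most_specific(taxids, parent_of):
    """Drop any TaxID that subsumes another TaxID in the same set."""
    taxids = set(taxids)
    return {t for t in taxids
            if not any(o != t and subsumes(t, o, parent_of) for o in taxids)}

# Toy excerpt: Canis lupus familiaris (9615) -> Canis lupus (9612) -> Canis (9611)
PARENT_OF = {9615: 9612, 9612: 9611}

hosts = most_specific({9612, 9615, 9838}, PARENT_OF)
```

Here 9612 is dropped because the more specific 9615 is present, while 9838 (Camelus dromedarius) is unrelated and kept.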
Yes, drop me the hot mess -- as long as I have taxids that's my problem in Fig. 1.
Ok, I won't over-engineer it. Thanks!
With a mixed list of entries, there is a nifty tool from the Taxonomy DB that can clean this up and give us the full taxid record (species, genus, family, order, ...).
Nifty, indeed! I think the back-end is using the Entrez name reconciliation API, but I'll try this as well. Thanks!
@rcedgar, you want the host Order as TaxID as well, right?
Here's the latest version of the cov3ma dataset:
s3://serratus-taltman/scratch/cov3ma-host-data.tsv
The columns represent: NCBI accession, virus taxID, NCBI Nucleotide host term, host taxID, NCBI Taxonomy host term, Order taxID, and NCBI Taxonomy Order term.
For convenience, here's the same data, in XLSX format:
To keep me sane and to get you this faster, I only manually curated the terms that NCBI couldn't auto-associate when the term was found in three or more entries. @rcedgar Please let me know if this is sufficient for the analysis that you want to perform, and I'll update the other metadata sources as well.
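The "three or more entries" cutoff can be sketched as follows, assuming the rows have been reduced to (host term, host TaxID) pairs with an empty TaxID meaning NCBI could not auto-associate the term. The function name and the example rows are hypothetical.

```python
from collections import Counter

def terms_needing_curation(rows, min_entries=3):
    """Count unresolved host terms and keep those seen >= min_entries times."""
    counts = Counter(term for term, taxid in rows if term and not taxid)
    return {term: n for term, n in counts.items() if n >= min_entries}

# Placeholder rows: (host term, host TaxID or "" when unresolved).
example_rows = [
    ("broiler", ""), ("broiler", ""), ("broiler", ""),
    ("dromedary camel", ""), ("dromedary camel", ""),
    ("Homo sapiens", "9606"),
    ("", ""),
]
to_curate = terms_needing_curation(example_rows)
```

Lowering `min_entries` to 1 would list every unresolved term, which is the full manual-curation workload.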
@ababaian @rcedgar Please remind me where the 'master' file is, with the SerraTax & SerraPlace info for each SRA run?
There is/are no master file(s) yet, this will be the end product(s) of combining all the individual tsvs. For virus-host, there will be at least two tsvs, one for known associations and one for our assemblies, as noted in #210. It's fine to have more than two tsvs, e.g. one for GB records, one for some other public database etc.
Re. cov3ma-host-data.tsv, it's incomplete -- there are many hosts with recognizable names but without TaxIDs; these should be filled in. Examples (google the name to get the scientific name, then NCBI Taxonomy search to get the TaxID):
white-rumped munia = Lonchura striata, TaxID 1766.
grey-backed thrush = Turdus hortulorum, TaxID 411519.
broiler = Gallus gallus, TaxID 9031.
dromedary camel = Camelus dromedarius, TaxID 9838.
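One way to apply fixes like these is a small hand-maintained lookup consulted only when the TaxID is missing. This is a sketch, not the curation actually used: the map is seeded with just two of the examples above, the names `MANUAL_HOST_TAXIDS` and `fill_host_taxid` are hypothetical, and every entry should be re-checked against NCBI Taxonomy before use.

```python
# Partial, hand-curated map of free-text host names to NCBI TaxIDs.
MANUAL_HOST_TAXIDS = {
    "broiler": 9031,          # Gallus gallus
    "dromedary camel": 9838,  # Camelus dromedarius
}

def fill_host_taxid(host_text, host_taxid=None):
    """Return the existing TaxID, else a manual match, else None."""
    if host_taxid:
        return int(host_taxid)
    key = " ".join(host_text.lower().split())  # normalize case and spacing
    return MANUAL_HOST_TAXIDS.get(key)

resolved = [
    fill_host_taxid("Dromedary  camel"),
    fill_host_taxid("broiler", ""),
    fill_host_taxid("Homo sapiens", "9606"),
    fill_host_taxid("unknown bird"),
]
```

Terms that come back as `None` remain on the to-do list for manual lookup.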
> For virus-host, there will be at least two tsvs, one for known associations and one for our assemblies, as noted in #210.
Passive voice isn't great for issues. WHO will be creating these two TSVs? I'm planning on creating both unless I hear otherwise.
As I stated above, I only did a limited amount of manual curation, otherwise I'd need to spend several hours resolving all of the typos in GenBank. Not interested.
I'll see which of these accessions overlap with the coronavirus genomes that we recovered, and I'll resolve manually any of the unresolved hosts in that intersection. I'll leave all of the rest of the unresolved hosts for you to fix, since you are an expert in white-rumped munias and grey-backed thrushes. :-)
@taltman Apologies for the lack of clarity. Confirming it was my understanding that you would create both tsvs. I'm not understanding which host names you are planning to resolve manually and which you are asking me to do, but that should become clear once you hand the tsvs over to me. For previously known associations (GB, virushostdb...), IMO we need all host names resolved. For the SRA, I think we need manual host fixes only for assemblies where we recovered an RdRP alignment, because this is the go-to confirmation that we have a valid CoV. I can provide a list of those assemblies, which is probably close to complete, depending on how many fell out like Frank.
Here is the cov3 accession table cleaned up for host-source.
GOLD + virushostdb + GB records + ...?