ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Combine host-virus association databases into one tsv file #202

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

GOLD + virushostdb + GB records + ...?

taltman commented 3 years ago

VirusHostDB: ftp://ftp.genome.jp/pub/db/virushostdb/

taltman commented 3 years ago

Annotations from Genomes Online DB with help from Dr. Reddy: GOLD-virus-host-associations.xlsx

taltman commented 3 years ago

Here's the host information for the sequences from the cov3ma.fa file, as scraped from NCBI using BioPython:

Format: accession <tab> virus taxon ID <tab> host free text

host-report.txt

taltman commented 3 years ago

OK @rcedgar, reassigning to you. There are many ways to combine these three datasets. I need more guidance regarding the desired format (e.g., columns) for the combined TSV. Thanks!

rcedgar commented 3 years ago

Multiple tsvs are fine, provided all of them have both Virus_TaxID and Host_TaxID columns. That's all I need for Fig. 1; I can easily combine those columns from multiple tsvs.
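As a rough illustration of the combining step described above, here is a minimal Python sketch that pulls the two shared columns out of several TSVs and keeps the unique pairs. The function name `collect_pairs`, the extra columns, and the virus TaxIDs in the demo data are hypothetical; only the `Virus_TaxID` and `Host_TaxID` column names come from this thread.

```python
import csv
import io

def collect_pairs(tsv_streams):
    """Collect unique (Virus_TaxID, Host_TaxID) pairs from several TSV streams.

    Each stream is assumed to start with a header row that includes both
    column names; rows missing either value are skipped.
    """
    pairs = set()
    for stream in tsv_streams:
        reader = csv.DictReader(stream, delimiter="\t")
        for row in reader:
            virus = (row.get("Virus_TaxID") or "").strip()
            host = (row.get("Host_TaxID") or "").strip()
            if virus and host:
                pairs.add((int(virus), int(host)))
    return pairs

# Made-up demo data: the virus TaxIDs (11, 12, 13) are placeholders,
# not real records.
genbank = io.StringIO(
    "Accession\tVirus_TaxID\tHost_TaxID\n"
    "X1\t11\t9031\n"
    "X2\t12\t\n"
)
other = io.StringIO(
    "Virus_TaxID\tHost_TaxID\tSource\n"
    "11\t9031\tdemo\n"
    "13\t9838\tdemo\n"
)
pairs = collect_pairs([genbank, other])
print(sorted(pairs))
```

Because the pairs are stored in a set, an association reported by more than one source is counted once, and the other columns in each file can differ freely.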

taltman commented 3 years ago

I think that we need to think a bit harder about how to combine them, as entries in two datasets (or even within a single dataset!) might be talking about the same thing conceptually, but at different levels of granularity ("Canis lupus" vs. "Canis lupus familiaris"). Reconciling these would require code that is aware of subsumption relationships between the NCBI Taxonomy IDs.
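To make the subsumption idea concrete, here is a minimal Python sketch that walks a child-to-parent map and collapses a set of TaxIDs to its most specific members. The helper names (`lineage`, `subsumes`, `collapse`) are hypothetical, and the tiny `parent_of` map is hand-written for the dog example; in practice the map would presumably be built from the NCBI Taxonomy dump rather than typed in.

```python
def lineage(taxid, parent_of):
    """Return taxid followed by its ancestors, walking up the parent map."""
    chain = [taxid]
    # Stop at a node with no recorded parent, or at a self-parented root.
    while taxid in parent_of and parent_of[taxid] != taxid:
        taxid = parent_of[taxid]
        chain.append(taxid)
    return chain

def subsumes(general, specific, parent_of):
    """True if `general` is `specific` itself or one of its ancestors."""
    return general in lineage(specific, parent_of)

def collapse(taxids, parent_of):
    """Drop every taxid that is a strict ancestor of another taxid in the set."""
    taxids = set(taxids)
    redundant = set()
    for t in taxids:
        redundant.update(a for a in lineage(t, parent_of)[1:] if a in taxids)
    return taxids - redundant

# Hand-written toy map (child -> parent):
# 9615 Canis lupus familiaris -> 9612 Canis lupus -> 9611 Canis -> 9608 Canidae
parent_of = {9615: 9612, 9612: 9611, 9611: 9608}

print(subsumes(9612, 9615, parent_of))    # Canis lupus subsumes the dog subspecies
print(collapse({9612, 9615}, parent_of))  # keeps only the more specific entry
```

Whether to keep the most specific entry or roll everything up to a fixed rank is a design choice; this sketch keeps the most specific one.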

taltman commented 3 years ago

Unless you want me to just drop a hot mess of duplicate entries at varying levels of specificity on your plate. :-)

rcedgar commented 3 years ago

Yes, drop me the hot mess -- as long as I have taxids that's my problem in Fig. 1.

taltman commented 3 years ago

Ok, I won't over-engineer it. Thanks!

ababaian commented 3 years ago

With a mixed list of entries, there is a nifty tool from Taxonomy DB that can clean this up and give us the full taxid record (species, genus, family, order ... )

taltman commented 3 years ago

Nifty, indeed! I think the back-end is using the Entrez name reconciliation API, but I'll try this as well. Thanks!

taltman commented 3 years ago

@rcedgar, you want the host Order as TaxID as well, right?

taltman commented 3 years ago

Here's the latest version of the cov3ma dataset:

s3://serratus-taltman/scratch/cov3ma-host-data.tsv

The columns represent: NCBI accession, virus taxID, NCBI Nucleotide host term, host taxID, NCBI Taxonomy host term, Order taxID, and NCBI Taxonomy Order term.

For convenience, here's the same data, in XLSX format:

cov3ma-host-data.xlsx

To keep me sane and to get you this faster, I only manually curated the terms that NCBI couldn't auto-associate when the term was found in three or more entries. @rcedgar Please let me know if this is sufficient for the analysis that you want to perform, and I'll update the other metadata sources as well.
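The curation threshold described above can be sketched in Python as follows: count the host terms that have no host taxID and report those seen at least three times. This is only an illustration. The column keys in `COLUMNS`, the assumption that the TSV has no header row, the function name, and the demo rows are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical keys for the seven columns listed above, in the same order.
COLUMNS = [
    "accession", "virus_taxid", "host_term", "host_taxid",
    "host_name", "order_taxid", "order_name",
]

def frequent_unresolved(stream, min_count=3):
    """Return {host term: count} for terms lacking a host taxID.

    Only terms seen at least `min_count` times are reported. The stream is
    assumed to be tab-separated with no header row.
    """
    reader = csv.DictReader(stream, fieldnames=COLUMNS, delimiter="\t")
    counts = Counter()
    for row in reader:
        term = (row["host_term"] or "").strip().lower()
        if term and not (row["host_taxid"] or "").strip():
            counts[term] += 1
    return {term: n for term, n in counts.items() if n >= min_count}

# Made-up rows; accessions and virus taxIDs are placeholders, and the
# order columns are left blank in this toy data.
rows = [
    ["ACC1", "100", "broiler", "", "", "", ""],
    ["ACC2", "100", "Broiler", "", "", "", ""],
    ["ACC3", "100", "broiler", "", "", "", ""],
    ["ACC4", "101", "dromedary camel", "", "", "", ""],
    ["ACC5", "102", "Gallus gallus", "9031", "Gallus gallus", "", ""],
]
demo = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
print(frequent_unresolved(demo))
```

Terms are lower-cased before counting so that capitalization variants of the same free-text host are tallied together.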

taltman commented 3 years ago

@ababaian @rcedgar Please remind me where the 'master' file is, with the SerraTax & SerraPlace info for each SRA run?

rcedgar commented 3 years ago

There are no master file(s) yet; these will be the end product(s) of combining all the individual tsvs. For virus-host, there will be at least two tsvs, one for known associations and one for our assemblies, as noted in #210. It's fine to have more than two tsvs, e.g. one for GB records, one for some other public database, etc.

Re. cov3ma-host-data.tsv, it's incomplete -- there are many hosts with recognizable names but without TaxIDs; these should be filled in. Examples (google the name to get the scientific name, then NCBI Taxonomy search to get the TaxID):

white-rumped munia = Lonchura striata TaxID 40157.

grey-backed thrush = Turdus hortulorum TaxID 411519.

broiler = Gallus gallus TaxID 9031.

dromedary camel = Camelus dromedarius TaxID 9838.
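One way to record manual fixes like these is a small lookup table that is consulted only when a host term has no TaxID yet. The Python sketch below is illustrative: the table and function names are hypothetical, and the table is seeded with a few of the examples above.

```python
# Hypothetical curation table: lower-cased common name -> NCBI host TaxID.
COMMON_NAME_TO_TAXID = {
    "grey-backed thrush": 411519,  # Turdus hortulorum
    "broiler": 9031,               # Gallus gallus
    "dromedary camel": 9838,       # Camelus dromedarius
}

def resolve_host(term, existing_taxid=None):
    """Return the existing TaxID if there is one, else look up the common name.

    Returns None when the term is not in the curation table.
    """
    if existing_taxid:
        return existing_taxid
    return COMMON_NAME_TO_TAXID.get(term.strip().lower())

print(resolve_host(" Broiler "))         # matched after trimming and lower-casing
print(resolve_host("unknown host"))      # not in the table
print(resolve_host("camel", 9838))       # an existing TaxID is kept as-is
```

Keeping the fixes in a table rather than editing the TSV by hand means the same corrections can be reapplied whenever the source data is regenerated.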

taltman commented 3 years ago

"For virus-host, there will be at least two tsvs, one for known associations and one for our assemblies, as noted in #210."

Passive voice isn't great for issues. WHO will be creating these two TSVs? I'm planning on creating both unless I hear otherwise.

As I stated above, I only did a limited amount of manual curation; otherwise I'd need to spend several hours resolving all of the typos in GenBank. Not interested.

I'll see which of these accessions overlap with the coronavirus genomes that we recovered, and I'll resolve manually any of the unresolved hosts in that intersection. I'll leave all of the rest of the unresolved hosts for you to fix, since you are an expert in white-rumped munias and grey-backed thrushes. :-)

rcedgar commented 3 years ago

@taltman Apologies for the lack of clarity. Confirming it was my understanding that you would create both tsvs. I'm not understanding which host names you are planning to resolve manually and which you are asking me to do, but that should become clear once you hand the tsvs over to me. For previously known associations (GB, virushostdb...), IMO we need all host names resolved. For the SRA, I think we need manual host fixes only for assemblies where we recovered an RdRP alignment, because this is the go-to confirmation that we have a valid Cov. I can provide a list of those assemblies which is probably close to complete, depending on how many fell out like Frank.

ababaian commented 3 years ago

Here is the cov3 accession table cleaned up for host-source.

cov3ma_genbank_host_taxid.xlsx