ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Procedure to construct assembly-host association table #219

rcedgar closed 3 years ago

rcedgar commented 3 years ago

@taltman has drawn our attention to the semantic mess inhabiting the SRA TaxID and Scientific Name fields. First, the host may be specified only by a generic name ("pig"). This can be fixed systematically (albeit tediously) by manual review. A more serious problem is identifying cases such as "beetles tending mouse carcass", where the host TaxID is given as beetle but the viral host is more likely to be mouse. I would have missed it and assigned beetle as the host, so manual review needs a biologist's eye, i.e. Artem. As a starting point for discussion, here is a proposal for how to deal with this.

  1. Do we know of other examples where the SRA host is probably not the viral host? A list would be useful to raise our collective awareness of what kinds of problem cases to look for. The only example I know of is the beetles (ERR2744268). If there are others, post them as a comment and I'll build a list.

  2. @taltman Make a tsv mapping informal_name (pig, broiler...) to TaxID. This should be a straightforward wrangle of the GenBank host-virus table after it has been filled out by manual review. Quite likely some of the same informal names are found in the SRA Scientific Name fields, in which case this table will enable us to resolve those.

  3. Restrict the assembly-host association table to assemblies where we have a good RdRp. This is needed to assign a species or OTU. This list will be substantially shorter than the complete list of assembly targets, ~4k RdRps vs. ~12k targets.

  4. @rcedgar Wrangle the tsvs prepared in the previous steps and generate a naive assembly_with_RdRp-host table, assuming that the fixed-up TaxID is correct. I expect: (a) The naive TaxID will be correct in a large majority of cases, with a few anomalies such as beetle. (b) In a large majority of cases, the SRA host will be an already-known host and will be noted as such in the known_viral_host.tsv which Tomer is working on. If the naive SRA TaxID is a known host, we believe it and no review is necessary. Therefore, the next step should be to extract cases where the naive table reports a novel virus species and/or a novel virus-host association. My guess is that this will be a short list of around 20 - 100 SRAs. These are the suspicious cases that need manual review.

  5. @ababaian Manual review of suspicious list from step 4.
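
To make steps 2 and 4 concrete, here is a minimal Python sketch of the wrangle, not the actual Serratus code. Everything in it is an assumption for illustration: the column names (`informal_name`, `taxid`, `virus_otu`, `host_taxid`, `run`, `sra_taxid`, `sra_name`), the layout of known_viral_host.tsv, the run and OTU labels, and the toy rows. It resolves informal names to a TaxID, builds the naive table, and keeps only the rows that report a novel virus or a novel virus-host association.

```python
import csv
import io

# Toy inputs. All rows, run labels and OTU labels are made up for illustration.
NAME_MAP_TSV = "informal_name\ttaxid\npig\t9823\nbroiler\t9031\n"
KNOWN_HOSTS_TSV = "virus_otu\thost_taxid\nOTU_1\t9823\nOTU_2\t9031\n"
ASSEMBLIES_TSV = (
    "run\tvirus_otu\tsra_taxid\tsra_name\n"
    "RUN_A\tOTU_1\t\tpig\n"               # informal name only, no TaxID
    "RUN_B\tOTU_2\t9031\tGallus gallus\n"
    "RUN_C\tOTU_1\t7041\tColeoptera\n"    # beetle-like anomaly
    "RUN_D\tOTU_9\t9823\tSus scrofa\n"    # OTU absent from the known-host table
)


def read_tsv(text):
    """Parse TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fix_taxid(row, name_map):
    """Step 2: use the informal-name mapping when it applies, else the SRA TaxID."""
    informal = row["sra_name"].strip().lower()
    return name_map.get(informal, row["sra_taxid"])


def naive_table(assemblies, name_map):
    """Step 4: assembly-host table that trusts the fixed-up TaxID."""
    return [
        {
            "run": row["run"],
            "virus_otu": row["virus_otu"],
            "host_taxid": fix_taxid(row, name_map),
        }
        for row in assemblies
    ]


def suspicious_rows(table, known_pairs):
    """Keep rows with a novel virus or a novel virus-host association."""
    known_otus = {otu for otu, _ in known_pairs}
    flagged = []
    for row in table:
        if row["virus_otu"] not in known_otus:
            flagged.append({**row, "reason": "novel_virus"})
        elif (row["virus_otu"], row["host_taxid"]) not in known_pairs:
            flagged.append({**row, "reason": "novel_host_association"})
    return flagged


name_map = {r["informal_name"].lower(): r["taxid"] for r in read_tsv(NAME_MAP_TSV)}
known_pairs = {(r["virus_otu"], r["host_taxid"]) for r in read_tsv(KNOWN_HOSTS_TSV)}

table = naive_table(read_tsv(ASSEMBLIES_TSV), name_map)
suspicious = suspicious_rows(table, known_pairs)

for row in suspicious:
    print(row["run"], row["virus_otu"], row["host_taxid"], row["reason"], sep="\t")
```

Rows whose (virus, host) pair is already in the known-host table are believed and dropped, so only the flagged rows go to manual review in step 5. Note that a filter like this would flag the beetle case only because beetle is not a known host for that virus; it cannot tell which host is the right one.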

ababaian commented 3 years ago

That sounds very reasonable. I think you're right: once we deal with the easily fixable cases such as pig --> TaxID and then filter again for cases where virus host != SRA organism, we will have a relatively short list that we can curate manually.

I'd be happy to do this.