ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Use 97% RdRp a.a. identity for species OTUs #207

Closed by rcedgar 4 years ago

rcedgar commented 4 years ago

Per ICTV "[Corona]viruses that share more than 90% aa sequence identity in the conserved replicase domains are considered to belong to the same species."

This threshold is much too low to distinguish known from novel species. As a check, I took the RdRp genes from the ~800 complete genomes in GenBank, which are assigned to 77 species. The best-fit threshold is around 97% (see histogram below). About 20 RdRps are missing due to VADR bugs, but this shouldn't matter. At 90%, we get ~30 OTUs vs. ~80 species.
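
To illustrate how the OTU count depends on the identity threshold, here is a minimal Python sketch. It is not the method used for the histogram: the clustering algorithm is not stated in this thread, so single-linkage clustering via union-find is an assumption, as are the function names `identity` and `count_otus` and the toy pre-aligned sequences (which are not real RdRp data).

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_otus(seqs: dict[str, str], threshold: float) -> int:
    """Count single-linkage clusters (assumed method) at the given identity threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge any two sequences whose identity meets the threshold.
    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    return len({find(name) for name in seqs})


# Hypothetical toy "alignment" of 100 columns, for illustration only.
toy = {
    "s1": "A" * 100,
    "s2": "C" * 2 + "A" * 98,   # 98% identical to s1
    "s3": "C" * 8 + "A" * 92,   # 92% identical to s1, 94% to s2
    "s4": "G" * 50 + "A" * 50,  # 50% identical to the others
}

for t in (0.97, 0.90):
    print(f"threshold {t:.2f}: {count_otus(toy, t)} OTUs")
```

Lowering the threshold merges clusters, which is the same effect as the drop from ~80 species to ~30 OTUs at 90% described above.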

As another check, there is a series of ~20 GenBank records for short RdRp fragments named "Myotis Bat Alphacoronavirus strain XXX", e.g. KP895506.1 and KP895513.1. Most of these are assigned to separate species, but some pairwise a.a. identities are >98%; for example, those two accessions share 120/122 identical a.a. = 98.4%.
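
The arithmetic behind that figure can be checked with a short Python sketch. The helper name `percent_identity` is hypothetical; the counts (120 identical residues over 122 aligned positions) are the ones quoted above.

```python
def percent_identity(identical: int, aligned_length: int) -> float:
    """Percent identity from a count of identical residues and the alignment length."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * identical / aligned_length


pid = percent_identity(120, 122)
print(f"{pid:.1f}%")  # 98.4%

# The pair would be merged at both the ICTV 90% cutoff and the proposed 97% cutoff.
print(pid > 90.0, pid > 97.0)
```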

I'm posting this issue to invite feedback, especially @ababaian since it appears necessary to deviate quite far from ICTV guidelines.

[Image: the histogram referenced above]

ababaian commented 4 years ago

I think we can function with an operational definition of OTUs at 97% RdRp identity, but if we use the word "species" or "genus", those are treated as reserved words matching ICTV criteria. It's not our job to ensure that current taxa meet these criteria, only that the listings we have meet them. We can and should discuss this in the paper with respect to OTUs, which we can define however we like and/or however best describes the data.

I think fleshing out that histogram with GenBank, then stacking Serratus sequences on top of it, may be a really good choice for the middle of the cladogram. Take it all the way down to 1 OTU for CoV so the threshold is clearly demarcated (i.e. CoV with Toro as the outgroup).

rcedgar commented 4 years ago

Thanks for the comments, @asl & @ababaian. I think I got too deep into this one and started seeing too many irrelevant problems in the details. Using 97% RdRp OTUs as a rough approximation to species is solid based on this histogram; that's the way to go.