ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Plan for Fig 1 for discussion on July 24th status call #225

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

Data is organized into OTUs defined by clustering RdRp a.a. sequences.

Fig. 1 uses two OTU thresholds: 99% (~sub-strain) and 97% (~strain).
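As a rough illustration of how the two thresholds nest, here is a minimal sketch of greedy centroid clustering at a fixed identity cutoff. It is not the pipeline actually used for Serratus: the `identity` function assumes pre-aligned, equal-length sequences, and the sequence names and toy data are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering: each sequence joins the first centroid it matches
    at >= threshold, otherwise it becomes a new centroid (assumed scheme)."""
    otus: dict[str, list[str]] = {}
    for name, seq in seqs.items():
        for centroid in otus:
            if identity(seq, seqs[centroid]) >= threshold:
                otus[centroid].append(name)
                break
        else:
            otus[name] = [name]
    return otus


# Hypothetical toy "RdRp" sequences, 100 residues each.
seqs = {
    "a": "A" * 100,
    "b": "C" * 1 + "A" * 99,   # 99% identical to a
    "c": "C" * 3 + "A" * 97,   # 97% identical to a
    "d": "C" * 10 + "A" * 90,  # 90% identical to a
}
otus99 = greedy_otus(seqs, 0.99)  # 3 clusters: {a, b}, {c}, {d}
otus97 = greedy_otus(seqs, 0.97)  # 2 clusters: {a, b, c}, {d}
```

In this toy case the 97% OTU centered on `a` contains two 99% OTUs, which is the kind of count Ring 3 is meant to show.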

Central to the figure is a radial cladogram tree, something like this:


https://drive5.com/tmp/pol.svg

The tree will be constructed by @Pbdas using CoV OTUs (GB+Serratus) plus three Toro OTUs as an outgroup.

Each leaf on the tree is one 97% OTU.

Segments of the tree are colored according to:

  1. Genus.
  2. Sub-genus.

Novel segments discovered by Serratus (e.g. Epsiloncoronavirus) are visually distinguished.

Exterior to the tree are Circos-like rings.

Ring 1. Previously known virus-host associations.

Ring 2. Virus-host associations added by Serratus.

Hosts are classified by order (Primate, Rodent...). There will be ~10 orders. Ten is too many colors for a key, so we will have to use additional visual features such as cross-hatching.
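One possible way to cover ~10 orders with only a few colors is to pair each color with a hatch pattern. The sketch below is only an assumption about how that could be set up: the color and hatch strings follow matplotlib naming, and every order name other than Primate and Rodent is a hypothetical placeholder.

```python
from itertools import product

# Assumed palette: 5 colors x 2 hatch styles = 10 distinct key entries.
COLORS = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]
HATCHES = ["", "//"]  # "" = solid fill, "//" = diagonal cross-hatch


def assign_styles(orders):
    """Map each host order to a unique (color, hatch) pair, solid fills first."""
    combos = [(color, hatch) for hatch, color in product(HATCHES, COLORS)]
    if len(orders) > len(combos):
        raise ValueError("not enough color/hatch combinations for all orders")
    return dict(zip(orders, combos))


# Only Primate and Rodent come from the plan; the rest are placeholders.
orders = ["Primate", "Rodent", "Order3", "Order4", "Order5",
          "Order6", "Order7", "Order8", "Order9", "Order10"]
styles = assign_styles(orders)
```

With this layout the key needs five colors, and the hatch distinguishes the second group of five orders from the first.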

Ring 3. Diversity of each 97% OTU, measured as the number of 99% OTUs it contains, divided into three categories:

  3a. OTUs in GenBank only.
  3b. OTUs in GenBank and Serratus.
  3c. New discoveries, i.e. Serratus only.

Visualization of these numbers is TBD; they may be small enough to show one dot per 99% OTU.
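The Ring 3 tallies could be computed along these lines. This is a sketch under assumptions: the record layout (one tuple per 99% OTU with its parent 97% OTU and two source flags) and the OTU names are hypothetical, not the project's actual data format.

```python
from collections import Counter


def categorize(in_genbank: bool, in_serratus: bool) -> str:
    """Return the Ring 3 category label for one 99% OTU."""
    if in_genbank and in_serratus:
        return "3b"  # GenBank and Serratus
    if in_genbank:
        return "3a"  # GenBank only
    if in_serratus:
        return "3c"  # Serratus only (new discovery)
    raise ValueError("OTU must come from at least one source")


def ring3_counts(records):
    """records: iterable of (otu97_id, in_genbank, in_serratus), one per 99% OTU.
    Returns {otu97_id: Counter of category labels}."""
    counts = {}
    for otu97_id, in_genbank, in_serratus in records:
        counts.setdefault(otu97_id, Counter())[categorize(in_genbank, in_serratus)] += 1
    return counts


# Hypothetical example: two 97% OTUs containing four 99% OTUs in total.
records = [
    ("otu97_A", True, False),
    ("otu97_A", False, True),
    ("otu97_A", False, True),
    ("otu97_B", True, True),
]
counts = ring3_counts(records)
```

The total diversity of a 97% OTU is then the sum over its three categories, e.g. `sum(counts["otu97_A"].values())`.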

pierrebarbera commented 3 years ago

trees are here: https://serratus-public.s3.amazonaws.com/pb/results/otus_plus_toro.tar.gz

Rooted versions have the .rooted suffix.
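For anyone picking out the rooted trees after downloading the archive, something like the following should work. It is a minimal sketch: the local path is whatever you saved the tarball as, and the file names inside the archive are not assumed beyond the `.rooted` ending.

```python
import tarfile


def rooted_members(archive_path):
    """List the regular files in a .tar.gz archive whose names end in '.rooted'."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".rooted")
        ]
```

Usage would be `rooted_members("otus_plus_toro.tar.gz")` after saving the file locally under that name.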

pierrebarbera commented 3 years ago

Note that the above is the result of 100 searches + 1000 bootstraps. If we decide to go with this, I can still scale it up. This run took approx. 2 h 40 min.

ababaian commented 3 years ago

For the tree search, can we include all Toro (I would opt for all Nido) and clip down to 3 leaves post hoc? We don't know ahead of time which Toro sequences are closest to CoV/Eps for inclusion.

rcedgar commented 3 years ago

There are four Toro RefSeqs. I checked identities to CoV and they cluster close to each other and far from CoV. For this round, I think including three as the smallest non-trivial outgroup is the right approach. For the next round, after more discovery, I agree we should do a deeper dive into Nido and check outgroups more carefully. This is a fair amount of additional work: getting the PFAM alignments, reviewing how many non-RefSeqs we need to include to get good diversity, and so on. SNW IMHO.