ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

OTU and tree construction for Fig. 1 #194

Closed by rcedgar 3 years ago

rcedgar commented 3 years ago

We need OTUs for Fig. 1, and perhaps also for supplementary tables where per-OTU is more useful than per-sequence.

Using the rule "species if available otherwise OTU" for organizing figures and tables is harder than it looks at first sight; the explanation is long so I won't put it here. Instead, I propose we use OTUs for all sequences to be included in Fig. 1, including those which have species names in GenBank.

OTU identifiers will be assigned by sequence clustering (UCLUST) using the first 8kb extracted from all complete genomes and complete (or near-complete) assemblies. Fragments will not be assigned to OTUs.

The OTU clustering threshold will be set to 95% nt identity (EDIT: previously 90%; 95% is now looking better). My original idea was to use an approximation to species, but the nt identity threshold for Cov species appears to be well below 95% identity and may not be measurable by nt alignment. Using 90% means that the sequences in one OTU would probably be considered closely-related strains.
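To make the proposed procedure concrete, here is a minimal Python sketch of greedy centroid clustering on the first 8 kb of each genome at a 95% identity threshold. It is not UCLUST: the function names (`ungapped_identity`, `greedy_otus`) are hypothetical, the identity measure is a toy ungapped comparison standing in for a real nucleotide alignment, and each sequence joins the first centroid that passes the threshold rather than a best hit found by an indexed search.

```python
from typing import Dict, List, Tuple

PREFIX_LEN = 8000      # "first 8kb" from the issue text
OTU_THRESHOLD = 0.95   # 95% nt identity from the issue text


def ungapped_identity(a: str, b: str) -> float:
    """Toy stand-in for alignment identity (assumption, not UCLUST's measure):
    fraction of matching positions over the shorter sequence, no gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_otus(
    seqs: Dict[str, str],
    threshold: float = OTU_THRESHOLD,
    prefix_len: int = PREFIX_LEN,
) -> Dict[str, List[str]]:
    """Greedy centroid clustering: a sequence joins the first centroid it
    matches at >= threshold; otherwise it becomes the centroid of a new OTU."""
    centroids: List[Tuple[str, str]] = []  # (centroid id, truncated sequence)
    otus: Dict[str, List[str]] = {}        # centroid id -> member ids
    for seq_id, seq in seqs.items():
        query = seq[:prefix_len].upper()   # keep only the first prefix_len nt
        for cid, cseq in centroids:
            if ungapped_identity(query, cseq) >= threshold:
                otus[cid].append(seq_id)
                break
        else:
            # No centroid was close enough: this sequence founds a new OTU.
            centroids.append((seq_id, query))
            otus[seq_id] = [seq_id]
    return otus


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    genomes = {
        "genomeA": "ACGT" * 25,
        "genomeB": "ACGT" * 24 + "ACGA",  # one mismatch vs genomeA (99%)
        "genomeC": "TTGCA" * 20,          # unrelated toy sequence
    }
    print(greedy_otus(genomes))
```

Because the clustering is greedy, the result depends on input order: whichever sequence comes first becomes the exemplar (centroid) of its OTU. Fragments would simply be left out of the input, matching the rule that they are not assigned to OTUs.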

I will construct a preliminary set of OTUs and send the exemplar sequences to @Pbdas in the next day or two so that he can test the tree-making pipeline using e.g. monophyly. We will very likely add a few more OTUs as we assemble distant Covs discovered by PSI-Serratus. This means that the final tree construction and generating Fig. 1 will take at least 2-3 days after the final assembly is completed.

rcedgar commented 3 years ago

Closing this issue because we're going to use RdRp at 90% instead of nt sequences; the issue description above is now obsolete.