ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Reference-based assembly #77

Closed rcedgar closed 3 years ago

rcedgar commented 4 years ago

The Zoonotic run found what looks like some low-coverage CoV in the data.

I think reference-based assembly will work better than de novo for at least some of these datasets, but I can't find a suitable assembler.

Please either point me at an assembler or explain why you think I'm wrong.

Thanks!

JustinChu commented 4 years ago

Which libraries contain these low coverage reads?

ababaian commented 4 years ago

rce

SRR10951665.summary:acc=pan_genome;hits=866;len=30000;depth=2.92;pctid=95.4;tax=?;cov=1.0000;coverage=...............................O;desc=Pan-genome;
SRR10951664.summary:acc=pan_genome;hits=1142;len=30000;depth=3.84;pctid=97.8;tax=?;cov=1.0000;coverage=OoOOOOooo.OoOoooooooooooOooooOOO;desc=Pan-genome;
SRR10951663.summary:acc=pan_genome;hits=746;len=30000;depth=2.51;pctid=95.1;tax=?;cov=1.0000;coverage=o..oo..............o........oooO;desc=Pan-genome;
SRR10951662.summary:acc=pan_genome;hits=1182;len=30000;depth=3.98;pctid=98.1;tax=?;cov=1.0000;coverage=OooOOOOooOOoOOoOOooOoooOooOoOOOO;desc=Pan-genome;
SRR10951661.summary:acc=pan_genome;hits=677;len=30000;depth=2.28;pctid=94.8;tax=?;cov=1.0000;coverage=............................o..O;desc=Pan-genome;
SRR10951660.summary:acc=pan_genome;hits=1047;len=30000;depth=3.52;pctid=97.9;tax=?;cov=1.0000;coverage=OOoOOO.oooOOooooooooooOOOOOoOOOO;desc=Pan-genome;
SRR10951659.summary:acc=pan_genome;hits=1058;len=30000;depth=3.56;pctid=94.6;tax=?;cov=1.0000;coverage=ooooooooo.oooooOoo.ooooo.oooOoOO;desc=Pan-genome;
SRR10951658.summary:acc=pan_genome;hits=1982;len=30000;depth=6.67;pctid=98.1;tax=?;cov=1.0000;coverage=OOOOOOOOOoOoOOOOOooOOoOOooOOOOOo;desc=Pan-genome;
SRR10951657.summary:acc=pan_genome;hits=1139;len=30000;depth=3.83;pctid=94.0;tax=?;cov=1.0000;coverage=Ooooooooooo.ooOooo.ooooo.oooOoOO;desc=Pan-genome;
SRR10951656.summary:acc=pan_genome;hits=2361;len=30000;depth=7.95;pctid=98.2;tax=?;cov=1.0000;coverage=OOOOOOooOoOoOoooOooOooOOooOOOOOo;desc=Pan-genome;
SRR10951655.summary:acc=pan_genome;hits=1011;len=30000;depth=3.4;pctid=94.7;tax=?;cov=1.0000;coverage=oo.o.o......o..o...oo.oo..o.oooO;desc=Pan-genome;
SRR10951654.summary:acc=pan_genome;hits=1830;len=30000;depth=6.16;pctid=98.0;tax=?;cov=1.0000;coverage=OoOOOOooooOoooooOooOOoOOooOoOOOo;desc=Pan-genome;
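
For anyone who wants to sort or filter these lines, here is a minimal parsing sketch. It assumes only what is visible above: a file name, a colon, then `key=value` pairs separated by semicolons. The function name `parse_summary_line` is made up for illustration and is not part of Serratus, and the sketch deliberately does not interpret the symbols in the `coverage` string.

```python
def parse_summary_line(line):
    """Split 'FILE:key=value;key=value;...' into (file name, dict of fields)."""
    source, _, payload = line.strip().partition(":")
    fields = {}
    for item in payload.split(";"):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = item.partition("=")
        fields[key] = value
    # Convert the fields that are numeric in the pasted output.
    for key in ("hits", "len"):
        fields[key] = int(fields[key])
    for key in ("depth", "pctid", "cov"):
        fields[key] = float(fields[key])
    return source, fields


example = (
    "SRR10951658.summary:acc=pan_genome;hits=1982;len=30000;depth=6.67;"
    "pctid=98.1;tax=?;cov=1.0000;"
    "coverage=OOOOOOOOOoOoOOOOOooOOoOOooOOOOOo;desc=Pan-genome;"
)
source, fields = parse_summary_line(example)
print(source, fields["hits"], fields["depth"], fields["pctid"])
```

With the fields in a dict, the libraries can be ranked by `depth` or grouped by `pctid` before deciding which ones to assemble.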

ababaian commented 4 years ago

So a very naive way to approach this would be to take the best-matching sequence, generate a VCF of the most common variants, and "apply" those variants to the reference with bcftools to make a new consensus sequence. Finally, remove any parts of the sequence that have a coverage of 0 so as not to introduce artifacts.

This isn't so much an assembly as it is a variant consensus sequence.
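
A toy sketch of that idea, to make the two steps concrete. It is not the bcftools workflow itself: it handles only single-base substitutions (no indels), takes the variants as a plain dict rather than a VCF, and assumes a per-base depth list is already available. The function name and inputs are hypothetical. Zero-coverage stretches are dropped by splitting the sequence into separate segments, so that bases on either side of a gap are not joined into one artificial contig.

```python
def build_consensus_segments(reference, snps, depth):
    """Apply SNPs to the reference, then keep only covered stretches.

    reference: reference sequence as a string
    snps:      dict mapping 0-based position -> alternate base (SNPs only)
    depth:     per-base read depth, one integer per reference base
    Returns a list of consensus segments, one per contiguous covered region.
    """
    if len(depth) != len(reference):
        raise ValueError("depth must have one entry per reference base")

    # Step 1: substitute the called variants into the reference.
    bases = list(reference)
    for pos, alt in snps.items():
        bases[pos] = alt

    # Step 2: cut the sequence wherever coverage is 0.
    segments, current = [], []
    for base, d in zip(bases, depth):
        if d > 0:
            current.append(base)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


print(build_consensus_segments(
    "ACGTACGT",
    {1: "T", 6: "A"},
    [3, 2, 0, 0, 1, 1, 4, 2],
))  # prints ['AT', 'ACAT']
```

An alternative to splitting is to mask uncovered bases with `N`, which keeps reference coordinates intact at the cost of leaving gaps in the sequence.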

JustinChu commented 4 years ago

I'd prefer to at least try to fish for sequences in the libraries again with kollector if the coverage is low due to sequence divergence. If the sequences are not divergent, then a reference-based assembly makes sense.

In terms of cost, though, targeted de novo assembly can be pretty expensive.
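
As a rough illustration of that triage, the sketch below picks a strategy from the `pctid` value reported in the summary lines earlier in the thread. The 97.0 cutoff is a placeholder chosen only because the pasted libraries fall into two groups (about 94 to 95 and about 98); it is an assumption, not a validated threshold, and `choose_strategy` is a hypothetical helper.

```python
def choose_strategy(pctid, min_pctid=97.0):
    """Pick an assembly approach from the percent identity to the reference.

    pctid:     percent identity reported for the library
    min_pctid: hypothetical cutoff; at or above it the reads are treated
               as close enough to the reference for reference-based assembly
    """
    if pctid >= min_pctid:
        return "reference-based"
    return "targeted de novo"


# pctid values taken from two of the summary lines in this thread.
for accession, pctid in [("SRR10951662", 98.1), ("SRR10951661", 94.8)]:
    print(accession, choose_strategy(pctid))
```

Because targeted de novo is the expensive branch, a cutoff like this mainly controls how many libraries get sent down that path.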

ababaian commented 4 years ago

Let's figure out what works best and how to manage resources for that later. So far it's a handful of libraries that are interesting, so running a few thousand assemblies de novo is within reason.

JustinChu commented 4 years ago

One of the tools I was planning on experimenting with is RNA-Bloom, which has a reference-guided mode for multiple transcripts, so we don't have to pick one reference genome to perform consensus calling on.

Other than that, we could also explore scaffolding tools that can take genomes as input and perform gap filling afterwards (which may be reference-guided).

rcedgar commented 4 years ago

Great comments, thanks. I have some ideas for a specialized tool and hope to work on that over the next couple of days. Would be great to compare with the other approaches if people have some time to implement them.