Closed rcedgar closed 3 years ago
Which libraries contain these low-coverage reads?
rce
SRR10951665.summary:acc=pan_genome;hits=866;len=30000;depth=2.92;pctid=95.4;tax=?;cov=1.0000;coverage=...............................O;desc=Pan-genome;
SRR10951664.summary:acc=pan_genome;hits=1142;len=30000;depth=3.84;pctid=97.8;tax=?;cov=1.0000;coverage=OoOOOOooo.OoOoooooooooooOooooOOO;desc=Pan-genome;
SRR10951663.summary:acc=pan_genome;hits=746;len=30000;depth=2.51;pctid=95.1;tax=?;cov=1.0000;coverage=o..oo..............o........oooO;desc=Pan-genome;
SRR10951662.summary:acc=pan_genome;hits=1182;len=30000;depth=3.98;pctid=98.1;tax=?;cov=1.0000;coverage=OooOOOOooOOoOOoOOooOoooOooOoOOOO;desc=Pan-genome;
SRR10951661.summary:acc=pan_genome;hits=677;len=30000;depth=2.28;pctid=94.8;tax=?;cov=1.0000;coverage=............................o..O;desc=Pan-genome;
SRR10951660.summary:acc=pan_genome;hits=1047;len=30000;depth=3.52;pctid=97.9;tax=?;cov=1.0000;coverage=OOoOOO.oooOOooooooooooOOOOOoOOOO;desc=Pan-genome;
SRR10951659.summary:acc=pan_genome;hits=1058;len=30000;depth=3.56;pctid=94.6;tax=?;cov=1.0000;coverage=ooooooooo.oooooOoo.ooooo.oooOoOO;desc=Pan-genome;
SRR10951658.summary:acc=pan_genome;hits=1982;len=30000;depth=6.67;pctid=98.1;tax=?;cov=1.0000;coverage=OOOOOOOOOoOoOOOOOooOOoOOooOOOOOo;desc=Pan-genome;
SRR10951657.summary:acc=pan_genome;hits=1139;len=30000;depth=3.83;pctid=94.0;tax=?;cov=1.0000;coverage=Ooooooooooo.ooOooo.ooooo.oooOoOO;desc=Pan-genome;
SRR10951656.summary:acc=pan_genome;hits=2361;len=30000;depth=7.95;pctid=98.2;tax=?;cov=1.0000;coverage=OOOOOOooOoOoOoooOooOooOOooOOOOOo;desc=Pan-genome;
SRR10951655.summary:acc=pan_genome;hits=1011;len=30000;depth=3.4;pctid=94.7;tax=?;cov=1.0000;coverage=oo.o.o......o..o...oo.oo..o.oooO;desc=Pan-genome;
SRR10951654.summary:acc=pan_genome;hits=1830;len=30000;depth=6.16;pctid=98.0;tax=?;cov=1.0000;coverage=OoOOOOooooOoooooOooOOoOOooOoOOOo;desc=Pan-genome;
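For anyone who wants to rank libraries by how patchy their coverage is, here is a minimal Python sketch that parses summary records like the ones above. It assumes that each character of the `coverage` string is one bin along the reference and that `.` marks a bin with little or no coverage; that reading is my assumption, not something stated in this thread. The function names and the 0.5 cutoff are hypothetical.

```python
import re
from typing import Dict, List

# Matches "<run>.summary:<key=value;key=value;...>" records, whether they are
# separated by spaces or by newlines.
RECORD_RE = re.compile(r"(\w+)\.summary:(\S+)")


def parse_summary(text: str) -> List[Dict[str, str]]:
    """Turn concatenated summary records into a list of dicts of raw strings."""
    records = []
    for run, fields in RECORD_RE.findall(text):
        record = {"run": run}
        for field in fields.split(";"):
            if not field:
                continue  # trailing ';' leaves an empty field
            key, _, value = field.partition("=")
            record[key] = value
        records.append(record)
    return records


def empty_fraction(coverage: str) -> float:
    """Fraction of bins drawn as '.' (ASSUMED to mean little or no coverage)."""
    return coverage.count(".") / len(coverage) if coverage else 0.0


def sparse_libraries(records: List[Dict[str, str]],
                     min_empty_fraction: float = 0.5) -> List[str]:
    """Runs whose coverage string is mostly '.'; the cutoff is a hypothetical default."""
    return [r["run"] for r in records
            if empty_fraction(r.get("coverage", "")) >= min_empty_fraction]
```

Calling `sparse_libraries(parse_summary(text))` on the pasted summary text would list the runs whose coverage strings are mostly dots; the cutoff should be tuned to whatever "low coverage" means for the question at hand.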
So a very naive way to approach this would be to take the best-matching sequence, generate a VCF of the most common variants, and "apply" the variants to the reference with bcftools to make a new consensus sequence. Finally, remove any parts of the sequence that have a coverage of 0 so as not to introduce artifacts.
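To make the idea concrete, here is a toy Python sketch of the two steps: apply variants to a reference, then drop zero-coverage positions. It is not what bcftools does internally; it handles single-base substitutions only, uses 0-based positions (VCF is 1-based), and the function names and tuple layout are made up for illustration.

```python
from typing import Iterable, List, Tuple

# (0-based position, expected reference base, alternate base); SNPs only.
Variant = Tuple[int, str, str]


def apply_snps(reference: str, variants: Iterable[Variant]) -> str:
    """Return the reference with each single-base substitution applied."""
    seq = list(reference)
    for pos, ref_base, alt_base in variants:
        if seq[pos].upper() != ref_base.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos] = alt_base
    return "".join(seq)


def drop_uncovered(sequence: str, depth: List[int]) -> List[str]:
    """Split the sequence into contiguous segments with depth > 0.

    Zero-coverage positions are removed. The segments are returned separately
    rather than joined, so bases that were not adjacent stay apart.
    """
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    segments: List[str] = []
    current: List[str] = []
    for base, d in zip(sequence, depth):
        if d > 0:
            current.append(base)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments
```

For example, `drop_uncovered(apply_snps("ACGTACGT", [(1, "C", "T"), (6, "G", "A")]), [0, 3, 3, 0, 0, 2, 2, 1])` gives the covered pieces of the edited sequence. Returning separate segments is a design choice: concatenating them would create junctions that do not exist in the sample.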
This isn't so much of an assembly as it is a variant consensus sequence.
I'd prefer to at least try to fish for sequences in the libraries again with kollector if the coverage is low due to sequence divergence. If the sequences are not divergent, then a reference-based assembly makes sense.
In terms of cost, though, targeted de novo assembly can be pretty expensive.
Let's figure out what works best and how to manage resources for that later. So far only a handful of libraries are interesting, so running a few thousand de novo assemblies is within reason.
One of the tools I was planning to experiment with is RNA-Bloom, which has a reference-guided mode for multiple transcripts, so we don't have to pick one reference genome to perform consensus calling on.
Other than that, we could also explore scaffolding tools that can take genomes as input and perform gap filling afterwards (which may be reference-guided).
Great comments, thanks. I have some ideas for a specialized tool and hope to work on that over the next couple of days. It would be great to compare it with the other approaches if people have some time to implement them.
The Zoonotic run found what looks like some low-coverage CoV in the data.
I think reference-based assembly will work better than de novo for at least some of these datasets, but I can't find a suitable assembler.
Please either point me at an assembler or explain why you think I'm wrong.
Thanks!