chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

Co-assembly #119

Closed by andrewjmc 2 years ago

andrewjmc commented 2 years ago

I have a dataset which is too large to co-assemble on my HPC. I would like to analyse a species of interest across hundreds of human samples. I've extracted reads from the relevant genus for each sample (kraken), and tested out a co-assembly with MEGAHIT for simplicity. However, I get very short contigs (max 28k bases; shorter than a single assembly of the best covered sample). Looking at the assembly graph (k=141) you can see high complexity, and I believe that the short contigs occur because of the high variation between genomes.

[image: MEGAHIT co-assembly graph, k=141]

Assuming the assembly with metaSPAdes produces a similar graph and small/fragmented contigs, do you think this will be amenable to strain resolution with STRONG?

Thanks,

Andrew

Sebastien-Raguideau commented 2 years ago

Hi Andrew,

From STRONG's standpoint, the size of the contigs is not important; what matters is the state of the assembly graph. One thing to note is that while we use metaSPAdes, we don't go all the way up to a k-mer size as large as the read length (141): while that tends to increase N50, it also fragments the graph (i.e. edges are lost) and penalises low coverage.

We also work on a "high-resolution" assembly graph, which metaSPAdes does not usually output: it exists before the simplification heuristics are applied to, for instance, resolve bubbles.

So, in summary, I would not worry too much about what your graph looks like with MEGAHIT, but rather about whether kraken is able to select all the reads of interest without creating gaps in the graph. For STRONG to work, you would just need to be sure that you are not missing too many reads around the SCGs (single-copy core genes), since STRONG only focuses on a select set of 36 SCGs. I haven't tested this strategy, but SCGs are quite conserved and I would guess that kraken would be able to find them. Also untested: you could go for a more involved approach, such as running HMMs for these SCGs on six-frame-translated reads. Though I worry about the binning results when using just SCGs.
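To make the six-frame idea concrete, here is a minimal, untested-in-practice sketch of only the translation step, using the standard genetic code. The function names are made up for illustration and are not part of STRONG; the actual protein HMM search (with a tool such as HMMER) is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(read):
    """Return the six translations: three forward frames, then three reverse frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

Each read then yields six short peptides that could be searched against the SCG profiles; reads with a hit in any frame would be kept.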

Best, Seb

andrewjmc commented 2 years ago

Thanks for this. In terms of extracting reads with kraken, the conservation of SCG within my species of interest will save us, but conservation with other species will hinder us (since reads from regions like this will be assigned above genus level).

I can check this out though within my alignment to the reference genome.
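A rough sketch of how that check could look, assuming per-position depth in the three-column format (reference, 1-based position, depth) that `samtools depth` writes, and assuming the SCG coordinates on the reference are already known. The function name, region names and threshold are hypothetical.

```python
from collections import defaultdict


def scg_coverage(depth_lines, scg_regions, min_depth=1):
    """Fraction of positions in each SCG region with depth >= min_depth.

    depth_lines: iterable of "ref<TAB>pos<TAB>depth" lines (1-based positions).
    scg_regions: {name: (ref, start, end)} with inclusive 1-based coordinates.
    Positions absent from depth_lines are treated as depth 0.
    """
    depth = defaultdict(dict)
    for line in depth_lines:
        if not line.strip():
            continue
        ref, pos, d = line.split()[:3]
        depth[ref][int(pos)] = int(d)

    fractions = {}
    for name, (ref, start, end) in scg_regions.items():
        covered = sum(
            1 for p in range(start, end + 1) if depth[ref].get(p, 0) >= min_depth
        )
        fractions[name] = covered / (end - start + 1)
    return fractions
```

Running this once on the alignment of all reads and once on the kraken-extracted reads, then comparing the two results per SCG, would show where the genus-level extraction drops reads.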

Sebastien-Raguideau commented 2 years ago

I would not worry too much about this either :) While assembly cannot guarantee to separate sequences at the level of strains, it can at least do so at the level of species. Also, to be clear, there is a binning step after assembly and before strain deconvolution through BayesPaths. So there are no issues with recruiting more than one species; the base pipeline assumes this is the case.

andrewjmc commented 2 years ago

OK, really helpful, thanks!