Clustering cONtigs with COverage and ComposiTion

Avoid cutting up contigs #219

Closed alneberg closed 5 years ago

alneberg commented 5 years ago

It would be much easier if the user could supply the real contigs, as output by the assembler, instead of the cut-up ones. One idea would be to use an estimate of the coverage variance to internally sample sub-contigs from longer contigs. I think, for example, that the script jgi_summarize_bam_contig_depths used by MetaBAT gives the variance of the coverage as well.

This is related to

The downside is that if one uses e.g. Kallisto to get coverage values, no variance is given. In that case, one could either opt in to using cut-up contigs or use a crude estimate of the variance based on an assumed distribution. A third alternative would be to find a fast quantifier that gives a variance estimate as well, but I haven't found one yet.
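To make the "crude estimate based on an assumed distribution" option concrete, here is a minimal sketch. It assumes per-base coverage is roughly Poisson (variance equal to the mean), with an optional negative-binomial-style overdispersion term. The function name `crude_coverage_variance` and the `dispersion` parameter are hypothetical and are not part of CONCOCT or Kallisto.

```python
def crude_coverage_variance(mean_coverage, dispersion=0.0):
    """Guess a coverage variance when the quantifier reports only a mean.

    Assumption (not from CONCOCT): coverage is Poisson-like, so the
    variance equals the mean. A positive `dispersion` adds the
    negative-binomial term dispersion * mean**2 for overdispersed data.
    """
    if mean_coverage < 0 or dispersion < 0:
        raise ValueError("mean_coverage and dispersion must be non-negative")
    return mean_coverage + dispersion * mean_coverage ** 2


# Hypothetical example values
print(crude_coverage_variance(20.0))       # Poisson assumption
print(crude_coverage_variance(20.0, 0.1))  # with overdispersion
```

Real coverage is usually more variable than Poisson, so the default is best read as a lower bound on the variance.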

alneberg commented 5 years ago

What do you think about the sub-sampling idea @chrisquince?

chrisquince commented 5 years ago

Sorry for being slow to respond to this. Incorporating coverage variation into the underlying CONCOCT algorithm would be possible, but it would be a bit of work. We would need to change the GMM.

alneberg commented 5 years ago

Thank you Chris! Would it have to be changed in the GMM though? I was thinking of something more at the preprocessing level, where smaller contigs could be sampled from the larger one.

I'm close to finishing a script that generates a coverage table for cut-up contigs from BAM files created using the non-cut-up contigs. That would make life easier when regular mappers are used. When e.g. Kallisto is used, the idea discussed above would still be necessary though.
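The core of that idea can be sketched as follows. This is only an illustration, not the actual script: it assumes the per-base depth of one whole contig has already been extracted from the BAM file into an array (for example from `samtools depth -a` output), and that the contig is cut into fixed-size chunks with a short final piece merged into the previous chunk. The function name `subcontig_coverage` and its parameters are hypothetical.

```python
import numpy as np


def subcontig_coverage(depth, chunk_size=10000, merge_last=True):
    """Mean coverage per sub-contig, from per-base depth of the whole contig.

    Returns a list of (start, end, mean_depth) tuples with 0-based,
    end-exclusive coordinates. The chunking rule is an assumption and
    must match however the contigs were actually cut up.
    """
    depth = np.asarray(depth, dtype=float)
    n = len(depth)
    starts = list(range(0, n, chunk_size))
    # Merge a short trailing piece into the previous chunk
    if merge_last and len(starts) > 1 and n - starts[-1] < chunk_size:
        starts.pop()
    bounds = starts + [n]
    return [
        (bounds[i], bounds[i + 1], float(depth[bounds[i]:bounds[i + 1]].mean()))
        for i in range(len(starts))
    ]


# Toy contig of 25 bases, cut into chunks of 10
toy_depth = [1] * 10 + [3] * 10 + [5] * 5
for start, end, mean_depth in subcontig_coverage(toy_depth, chunk_size=10):
    print(start, end, round(mean_depth, 2))
```

Because the reads are mapped once against the original contigs, only the averaging step depends on the chunking, so the cut-up size can be changed without remapping.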

chrisquince commented 5 years ago

Yes, you could easily create pseudo-contigs from real ones using a mean and a variance. You might want to think about what distribution to sample from, but a Gaussian would be an obvious starting point. How about we do this in the next release though?
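A minimal sketch of that Gaussian starting point, for a single sample, might look like the code below. It is not CONCOCT code: the function name `sample_pseudo_contig_coverage`, the choice of one pseudo-contig per full `chunk_size` bases, and the clipping of negative draws to zero are all assumptions made for illustration. It also uses the reported variance directly, which probably overstates the spread of chunk-level means if that variance is a per-base one.

```python
import numpy as np


def sample_pseudo_contig_coverage(mean, variance, contig_length,
                                  chunk_size=10000, rng=None):
    """Draw one coverage value per pseudo-contig from N(mean, variance).

    Assumptions (hypothetical, for illustration only):
      * one pseudo-contig per full chunk_size bases, and at least one
      * negative draws are clipped to zero, since coverage cannot be negative
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    rng = np.random.default_rng() if rng is None else rng
    n_chunks = max(1, contig_length // chunk_size)
    draws = rng.normal(loc=mean, scale=np.sqrt(variance), size=n_chunks)
    return np.clip(draws, 0.0, None)


# Hypothetical contig: 35 kb, mean coverage 12, variance 9
rng = np.random.default_rng(seed=1)
print(sample_pseudo_contig_coverage(12.0, 9.0, 35000, rng=rng))
```

With several samples, this would be repeated per sample column. Drawing the samples independently ignores any correlation between them along the contig, which is another simplification.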

alneberg commented 5 years ago

Yes, I think the subsampling can be put on ice for now. I just realised that you don't get any variance information from Kallisto so the use case for this is diminishing.

alneberg commented 5 years ago

I'll close this for now since there is a fix for the regular mapping case in #222. And for the fast mapper case, e.g. Kallisto, we don't know how to solve it just yet.