Avoid cutting up contigs

BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Avoid cutting up contigs #219

Closed alneberg closed 5 years ago

alneberg commented 5 years ago

It would be much easier if the user could supply the real contigs as output by the assembler instead of the cut-up ones. One idea would be to use an estimate of the coverage variance to internally sample sub-contigs from longer contigs. I think the script jgi_summarize_bam_contig_depths used by MetaBAT, for example, gives the variance of coverage as well.

This is related to https://github.com/bxlab/metaWRAP/issues/76

The downside is that if one uses e.g. Kallisto to get coverage values, no variance is given. In that case, one could either opt in to using cut-up contigs or use a crude estimate of the variance based on an assumed distribution. A third alternative would be to find a fast quantifier that gives a variance estimate as well, but I haven't found one yet.
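To make the "crude estimate based on an assumed distribution" option concrete, here is a minimal sketch. It assumes per-base depth behaves like a Poisson count (variance equal to the mean) or, optionally, like a negative binomial with a user-chosen dispersion. The function name `crude_coverage_variance` and the `dispersion` parameter are hypothetical and not part of CONCOCT or Kallisto; whether either distribution fits real coverage data is untested here.

```python
def crude_coverage_variance(mean_coverage, dispersion=None):
    """Guess a coverage variance when only the mean is available.

    Assumption (not from any tool): depth is Poisson-like, so the
    variance equals the mean. If `dispersion` is given, a negative
    binomial is assumed instead: variance = mean + mean**2 / dispersion.
    """
    if mean_coverage < 0:
        raise ValueError("mean_coverage must be non-negative")
    if dispersion is None:
        # Poisson assumption: variance == mean
        return float(mean_coverage)
    if dispersion <= 0:
        raise ValueError("dispersion must be positive")
    # Negative binomial assumption: extra variance grows with mean squared
    return mean_coverage + mean_coverage ** 2 / dispersion


poisson_var = crude_coverage_variance(20.0)
overdispersed_var = crude_coverage_variance(20.0, dispersion=10.0)
```

A smaller `dispersion` gives a larger variance, which may be the safer choice if coverage along a contig is known to be uneven.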

alneberg commented 5 years ago

What do you think about the sub-sampling idea @chrisquince?

chrisquince commented 5 years ago

Sorry for being slow to respond to this. Incorporating coverage variation into the underlying CONCOCT algorithm would be possible but a bit of work. We would need to change the GMM.

alneberg commented 5 years ago

Thank you Chris! Would it have to be changed in the GMM though? I was thinking of doing this at the preprocessing level, where smaller contigs could be sampled from the larger one.

I'm close to finishing a script that generates a coverage table for cut-up contigs using BAM files created from non-cut-up contigs. That would make life easier when regular mappers are used. When e.g. Kallisto is used, the idea discussed above would still be necessary though.
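The core of such a coverage table could look roughly like the sketch below. This is not the script mentioned above, only an illustration under stated assumptions: per-base depths for the original contig are already available as a list (for example parsed from `samtools depth -a` output), chunks are 10000 bases with any short remainder merged into the last chunk, and sub-contigs are named `<contig_id>.<index>`. The helper names `chunk_bounds` and `cutup_coverage` are hypothetical.

```python
def chunk_bounds(length, chunk_size=10000):
    """Yield (start, end) pairs for non-overlapping chunks.

    A remainder shorter than chunk_size is merged into the last chunk,
    so a contig shorter than chunk_size yields a single chunk.
    """
    n_chunks = max(1, length // chunk_size)
    for i in range(n_chunks):
        start = i * chunk_size
        end = length if i == n_chunks - 1 else start + chunk_size
        yield start, end


def cutup_coverage(contig_id, per_base_depth, chunk_size=10000):
    """Return {sub_contig_name: mean depth} for one original contig.

    `per_base_depth` holds one depth value per base of the uncut contig.
    The "<contig_id>.<index>" naming is an assumption for illustration.
    """
    if not per_base_depth:
        raise ValueError("per_base_depth must not be empty")
    rows = {}
    bounds = chunk_bounds(len(per_base_depth), chunk_size)
    for i, (start, end) in enumerate(bounds):
        window = per_base_depth[start:end]
        rows[f"{contig_id}.{i}"] = sum(window) / len(window)
    return rows


# Toy example: a 25-base contig cut into chunks of 10 bases
depth = [10] * 10 + [20] * 10 + [30] * 5
table = cutup_coverage("contig_1", depth, chunk_size=10)
```

In the toy example the last five bases are merged into the second chunk, so the table has two rows rather than three.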

chrisquince commented 5 years ago

Yes, you could easily create pseudo-contigs from real ones using a mean and a variance. You might want to think about what distribution to sample from, but a Gaussian would be an obvious starting point. How about we do this in the next release though?
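As a rough illustration of the Gaussian starting point, the sketch below draws one coverage value per pseudo-contig from a normal distribution with the given mean and variance. Several things here are assumptions rather than anything CONCOCT does: the function name `sample_pseudo_contig_coverages`, the 10000-base chunk size, clipping negative draws to zero, and using the reported variance directly. If the variance is a per-base figure, the mean over a long chunk would likely vary less than this, so the spread may be overstated.

```python
import random


def sample_pseudo_contig_coverages(mean, variance, contig_length,
                                   chunk_size=10000, seed=None):
    """Draw one Gaussian coverage value per pseudo-contig.

    The number of pseudo-contigs is contig_length // chunk_size
    (at least one). Negative draws are clipped to 0.0 because
    coverage cannot be negative; that clipping is a design choice
    made for this sketch.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    rng = random.Random(seed)  # local generator so runs are reproducible
    n_chunks = max(1, contig_length // chunk_size)
    std_dev = variance ** 0.5
    return [max(0.0, rng.gauss(mean, std_dev)) for _ in range(n_chunks)]


# A 35 kb contig with mean coverage 20 and variance 4 gives 3 pseudo-contigs
coverages = sample_pseudo_contig_coverages(20.0, 4.0, 35000, seed=1)
```

With several samples, the same call would be repeated per sample, which treats samples as independent; that is another simplification worth revisiting.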

alneberg commented 5 years ago

Yes, I think the subsampling can be put on ice for now. I just realised that you don't get any variance information from Kallisto so the use case for this is diminishing.

alneberg commented 5 years ago

I'll close this for now since there is a fix for the regular mapping case in #222. And for the fast mapper case, e.g. Kallisto, we don't know how to solve it just yet.