Closed alneberg closed 5 years ago
What do you think about the sub-sampling idea @chrisquince?
Sorry for being slow to respond to this. Incorporating coverage variation into the underlying CONCOCT algorithm would be possible, but it would be a bit of work. We would need to change the GMM.
Thank you Chris! Would it have to be changed in the GMM though? I was thinking more at a preprocessing level, where smaller contigs could be sampled from the larger ones.
I'm close to finishing a script that generates a coverage table for cut-up contigs from bam files that were created by mapping against the non-cut-up contigs. That would make life easier when regular mappers are used. When e.g. Kallisto is used, the idea discussed above would still be necessary though.
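To illustrate the idea of deriving cut-up coverage from whole-contig mappings, here is a minimal sketch that only describes the sub-contig intervals on the original contigs in BED-style coordinates. Mean depth per interval could then be computed from the bam files with a tool of your choice. This is not the script mentioned above: the function name, the `.part_N` suffix, and the default chunk size of 10000 are all assumptions made for illustration.

```python
def cutup_bed_records(contig_id, length, chunk_size=10000):
    """Yield (contig, start, end, sub_id) records for one contig.

    Coordinates are 0-based and half-open, as in BED. The last chunk
    absorbs the remainder, so no sub-contig is shorter than chunk_size
    unless the contig itself is.
    """
    n_chunks = max(1, length // chunk_size)
    for i in range(n_chunks):
        start = i * chunk_size
        # The final chunk runs to the end of the contig.
        end = length if i == n_chunks - 1 else start + chunk_size
        # Hypothetical naming scheme for the sub-contigs.
        yield (contig_id, start, end, f"{contig_id}.part_{i}")


if __name__ == "__main__":
    for rec in cutup_bed_records("contig_1", 25000):
        print("\t".join(str(field) for field in rec))
```

Letting the last chunk absorb the remainder is one possible design choice; it avoids very short trailing pieces whose coverage estimates would be noisy.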
Yes, you could easily create pseudo-contigs from real ones using a mean and a variance. You might want to think about what distribution to sample from, but a Gaussian would be an obvious starting point. How about we do this in the next release though?
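A minimal sketch of the Gaussian starting point, assuming each long contig comes with one mean and one variance of coverage per sample. The function name, the chunk size, and the decision to use the supplied variance directly as the chunk-level variance are assumptions; in particular, a per-base variance is probably not the right variance for a chunk mean, so this is only a rough illustration.

```python
import math
import random


def sample_pseudo_contig_coverages(length, mean_cov, var_cov,
                                   chunk_size=10000, seed=None):
    """Draw one coverage value per pseudo-contig from a Gaussian.

    Assumption: var_cov is used as-is for every chunk. Negative draws
    are clipped to zero because coverage cannot be negative.
    """
    rng = random.Random(seed)
    n_chunks = max(1, length // chunk_size)
    sd = math.sqrt(max(var_cov, 0.0))
    return [max(0.0, rng.gauss(mean_cov, sd)) for _ in range(n_chunks)]


if __name__ == "__main__":
    # Three pseudo-contigs sampled from a 30 kb contig.
    print(sample_pseudo_contig_coverages(30000, mean_cov=12.0,
                                         var_cov=4.0, seed=1))
```

Clipping at zero slightly biases low-coverage contigs upwards, which is one reason a different distribution might be preferable to a Gaussian.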
Yes, I think the subsampling can be put on ice for now. I just realised that you don't get any variance information from Kallisto so the use case for this is diminishing.
I'll close this for now since there is a fix for the regular mapping case in #222. And for the fast mapper case, e.g. Kallisto, we don't know how to solve it just yet.
It would be much easier if the user could supply the real contigs as output from the assembler instead of the cut-up ones. One idea would be to use a variation estimate of the coverage to internally sample sub-contigs from longer contigs. I think, for example, the script `jgi_summarize_bam_contig_depths` used by metabat gives the variance of coverage as well.
This is related to https://github.com/bxlab/metaWRAP/issues/76
The downside is that if one uses e.g. Kallisto to get coverage values, no variance is given. In that case, one could either opt in to using cut-up contigs or use a crude estimate of the variance based on an assumed distribution. A third alternative would be to find a fast quantifier that gives a variance estimate as well, but I haven't found one yet.
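As one example of a crude estimate based on an assumed distribution: if per-base coverage is assumed to be Poisson distributed, the variance equals the mean. The sketch below encodes that assumption, with an optional dispersion factor for over-dispersed data. The function name and the dispersion parameter are hypothetical, and real coverage is often more variable than a Poisson model suggests.

```python
def crude_coverage_variance(mean_cov, dispersion=1.0):
    """Guess a coverage variance when the quantifier reports only a mean.

    Assumption: Poisson-like coverage, so variance = dispersion * mean,
    where dispersion == 1.0 is the plain Poisson case.
    """
    if mean_cov < 0:
        raise ValueError("mean coverage must be non-negative")
    if dispersion <= 0:
        raise ValueError("dispersion must be positive")
    return dispersion * mean_cov


if __name__ == "__main__":
    print(crude_coverage_variance(12.5))       # plain Poisson assumption
    print(crude_coverage_variance(12.5, 2.0))  # over-dispersed guess
```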