Closed taylorreiter closed 2 years ago
I tried it both ways, and it's wayyyy faster to use the unique number of matched k-mers, (unique_intersect_bp * average_abund) / scaled
...so that's my current preference.
yes :)
(that strikes me as the right number to use for the way the stats are working)
The StatDivLab packages for diversity and richness estimates rely on count data (BreakAway, DivNet). I think the canonical use case these packages were developed around was 16S count data, but they have been used in the context of metagenomes using read mapping counts, and we've talked with Amy about using them with k-mer abundances.
Do we have an analogous type of information in the gather profiles? I think I could re-derive the number of matched k-mers using (`unique_intersect_bp` * `average_abund`) / `scaled`. Alternatively, I could just use `unique_intersect_bp` * `average_abund`, but I'm worried those numbers are so large that they might throw off the stats (as in DESeq2, where you're more likely to get a small p-value as the number of reads mapped to a gene increases, because DESeq2 treats more observations as more confidence in the difference). IDK. @ctb, thoughts?
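For reference, the re-derivation discussed in this thread could look roughly like the sketch below. It assumes the gather CSV has `name`, `unique_intersect_bp`, and `average_abund` columns, and that `scaled` is passed in by hand; the function name `matched_kmer_counts` and the example rows are made up for illustration, not real gather output.

```python
import csv
import io


def matched_kmer_counts(gather_csv, scaled):
    """Map each gather match to (unique_intersect_bp * average_abund) / scaled.

    `gather_csv` is any open file-like object containing gather CSV output.
    `scaled` is the scaled value the sketches were built with (assumed known).
    """
    counts = {}
    for row in csv.DictReader(gather_csv):
        unique_bp = float(row["unique_intersect_bp"])
        avg_abund = float(row["average_abund"])
        # Dividing by `scaled` brings the bp estimate back down to the
        # number of matched hashes, weighted by their average abundance.
        counts[row["name"]] = round(unique_bp * avg_abund / scaled)
    return counts


# Hypothetical rows for illustration only.
example = io.StringIO(
    "name,unique_intersect_bp,average_abund\n"
    "genome_A,500000,4.0\n"
    "genome_B,20000,1.5\n"
)
counts = matched_kmer_counts(example, scaled=2000)
print(counts)
```

Rounding to integers is a choice made here because the downstream packages expect count data; whether rounding is appropriate for a given analysis is not settled in this thread.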