back-generating count data that will be compatible for diversity/richness estimates from gather profiles

taylorreiter commented 2 years ago

The StatDivLab packages for diversity and richness estimates (BreakAway, DivNet) rely on count data. I think the canonical use case these packages were developed around was 16S count data, but they have been used in the context of metagenomes using read mapping counts, and we've talked with Amy about using them with k-mer abundances.

Do we have an analogous type of information in the gather profiles? I think I could re-derive the number of matched k-mers using (unique_intersect_bp * average_abund) / scaled. Alternatively, I could just use unique_intersect_bp * average_abund, but I'm worried those numbers are so large that they might throw off the stats (like in DESeq2, where you're more likely to get a small p value as the number of reads observed to map against a gene increases, because DESeq2 takes that to mean we can be more confident in the difference between things because they're more observed).

IDK. @ctb, thoughts?

taylorreiter commented 2 years ago

I tried it both ways, and it's wayyyy faster to use the unique number of matched k-mers, (unique_intersect_bp * average_abund) / scaled...so that's my current preference.
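
For reference, a minimal sketch of that calculation, assuming the gather results are a CSV with `name`, `unique_intersect_bp`, and `average_abund` columns and that `scaled` is known from how the sketches were built. The function name `estimated_kmer_counts`, the rounding to integers, and the example rows are made up for illustration and are not from the actual pipeline.

```python
import csv
import io


def estimated_kmer_counts(gather_rows, scaled):
    """Return {match name: abundance-weighted count of matched k-mers}.

    unique_intersect_bp / scaled approximates the number of sketch hashes
    uniquely assigned to a match; multiplying by average_abund weights
    that by how often those hashes were observed.
    """
    counts = {}
    for row in gather_rows:
        unique_bp = float(row["unique_intersect_bp"])
        avg_abund = float(row["average_abund"])
        # Rounding to an integer is an assumption made here because the
        # downstream tools are described as expecting count data.
        counts[row["name"]] = round(unique_bp * avg_abund / scaled)
    return counts


# Hypothetical gather output, for illustration only.
example_csv = """name,unique_intersect_bp,average_abund
genome_A,2000000,3.5
genome_B,500000,1.2
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
counts = estimated_kmer_counts(rows, scaled=1000)
print(counts)
```

Dividing by `scaled` keeps the counts on the scale of the sketch rather than the full genome, which is the reason for preferring it over the raw unique_intersect_bp * average_abund product.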

ctb commented 2 years ago

yes :)

ctb commented 2 years ago

(that strikes me as the right number to use for the way the stats are working)

dib-lab / 2022-sra-gather

back-generating count data that will be compatible for diversity/richness estimates from gather profiles #10