rarefy to equal sequencing depth?

It looks like this method is working, but sometimes there are weird outliers/patterns, and I think they're driven by very large differences in sequencing depth. The method looks most promising within a given experiment, where sequencing depth is generally roughly consistent and biological patterns shine through, but then one really deeply sequenced sample becomes an outlier. One way to address this would be to rarefy everything to the same sequencing depth, e.g., everything to 20 million reads. (I could instead stream the download and stop at 1 million reads, but I like the idea of taking advantage of the maximum amount of information possible. Plus, at a scaled value of 100k, 1 million reads might actually be too few. TBD.)

To throw out the minimum amount of data across experiments, I could also band the rarefaction: samples with >= 10M but < 20M reads would all be rarefied to 10M, and samples with >= 20M but < 30M reads would be rarefied to 20M. I would then look for outliers within each rarefied band.
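
A minimal sketch of the banding idea, just to pin down the arithmetic. Everything named here is hypothetical: `rarefaction_target` picks the floor of a sample's 10M-read band, and `rarefy` subsamples reads uniformly in one pass with reservoir sampling. Two assumptions that are not settled above: samples with fewer than 10M reads fall below the lowest band and get `None` (handled separately), and a sample sitting exactly on a band edge (e.g., 20M) stays at that depth.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")

BAND_WIDTH = 10_000_000  # 10M-read bands, as proposed above


def rarefaction_target(n_reads: int, band_width: int = BAND_WIDTH) -> Optional[int]:
    """Return the depth a sample would be rarefied to: the floor of its band.

    Assumption: samples with fewer reads than one band width return None,
    since the proposal does not say what to do with them.
    """
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    target = (n_reads // band_width) * band_width
    return target if target > 0 else None


def rarefy(reads: Iterable[T], target: int, seed: int = 42) -> List[T]:
    """Uniformly subsample `target` reads without replacement in one pass.

    Uses reservoir sampling, so the full read set never has to be held in
    memory. If there are fewer than `target` reads, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    reservoir: List[T] = []
    for i, read in enumerate(reads):
        if i < target:
            reservoir.append(read)
        else:
            # Keep the new read with probability target / (i + 1).
            j = rng.randint(0, i)
            if j < target:
                reservoir[j] = read
    return reservoir


# Hypothetical read counts, only to illustrate the band assignment.
depths = {"sample_a": 12_500_000, "sample_b": 27_000_000, "sample_c": 4_000_000}
targets = {name: rarefaction_target(n) for name, n in depths.items()}
# sample_a -> 10_000_000, sample_b -> 20_000_000, sample_c -> None
```

For paired-end data, the unit passed to `rarefy` would need to be a read pair rather than a single read so that mates stay together. In practice an existing subsampling tool would probably replace `rarefy`; the part specific to this idea is the band assignment.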

Arcadia-Science / 2022-tx-not-in-gx

rarefy to equal sequencing depth? #10