Arcadia-Science / 2022-tx-not-in-gx

MIT License

rarefy to equal sequencing depth? #10

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

It looks like this method is working, but sometimes there are weird outliers or patterns, and I think they're driven by very large differences in sequencing depth. The method looks most promising within a given experiment, but even then there will be one outlier. I think that within an experiment the sequencing depth is generally consistent enough to let biological patterns shine through, but then one really deeply sequenced sample becomes an outlier. One way to address this would be to rarefy everything to the same sequencing depth, e.g., everything to 20 million reads. (I could instead stream a download of only 1 million reads, but I like the idea of taking advantage of as much information as possible. Plus, at a scaled value of 100k, 1 million reads might actually be too few...TBD.)
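As a rough illustration of rarefying to a fixed depth, here is a minimal sketch that uses reservoir sampling (Algorithm R) to draw a uniform random subset of records from a stream. The function name `rarefy_stream`, the `seed` parameter, and the idea of treating each read as one record in an iterable are all assumptions for illustration; nothing here is the repository's actual code.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def rarefy_stream(records: Iterable[T], depth: int, seed: int = 42) -> List[T]:
    """Return a uniform random sample of `depth` records from a stream.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so the
    stream is read once and its total length need not be known up front.
    If the stream holds fewer than `depth` records, all of them are kept.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < depth:
            # Fill the reservoir with the first `depth` records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability depth / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < depth:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Toy example: pretend each integer is one read.
    subsample = rarefy_stream(range(1_000), depth=100)
    print(len(subsample))
```

Note that the reservoir holds `depth` records in memory, so for something like 20 million real reads a dedicated subsampling tool would probably be a better fit than this sketch.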

To throw out as little data as possible across experiments, I could also band the rarefaction. Samples with at least 10M but fewer than 20M reads would all be rarefied to 10M, and samples with at least 20M but fewer than 30M reads would be rarefied to 20M. I would then look for outliers within each rarefied band.
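The banding idea could be sketched roughly as follows. The helper names `rarefaction_target` and `group_by_band` are hypothetical, as are the sample names in the example. Two choices here are assumptions rather than anything decided above: a sample sitting exactly on a boundary (e.g., exactly 20M reads) is assigned to the higher band, and a sample below the first band gets a target of 0, meaning "too shallow for any band".

```python
from collections import defaultdict
from typing import Dict, List


def rarefaction_target(n_reads: int, band_width: int = 10_000_000) -> int:
    """Return the depth a sample would be rarefied to: the floor of its band.

    Hypothetical helper. A result of 0 means the sample has fewer reads
    than the lowest band and would need separate handling.
    """
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    return (n_reads // band_width) * band_width


def group_by_band(
    depths: Dict[str, int], band_width: int = 10_000_000
) -> Dict[int, List[str]]:
    """Group sample names by their rarefaction target depth."""
    bands: Dict[int, List[str]] = defaultdict(list)
    for sample, n_reads in depths.items():
        bands[rarefaction_target(n_reads, band_width)].append(sample)
    return dict(bands)


if __name__ == "__main__":
    # Made-up sample names and read counts, purely for illustration.
    depths = {"sample_a": 12_000_000, "sample_b": 19_000_000, "sample_c": 27_000_000}
    for target, samples in sorted(group_by_band(depths).items()):
        print(target, samples)
```

Outlier detection would then run separately within each group, after every sample in that group has been rarefied to the group's target depth.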

taylorreiter commented 1 year ago

Or could I take advantage of any of the packages from Amy Willis's group that account for uneven sequencing depth when doing statistics? I think breakaway might be the right package to try (https://adw96.github.io/breakaway/articles/breakaway.html). DivNet might be interesting too, but it might be harder to use because it might need a presence/absence data frame.