biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License
54 stars 25 forks source link

Behavior of "minimum feature count" is incorrectly documented #98

Closed fedarko closed 4 years ago

fedarko commented 4 years ago

Problem description

From what I can tell from reading the code (and from #27), this parameter actually corresponds to the minimum number of samples a feature must show up in in order to be retained in the Songbird analysis. Songbird's description of this parameter does not reflect this, and instead makes it seem like it's just the minimum number of counts across all samples in the dataset --

https://github.com/biocore/songbird/blob/beceeec550f775e64f371ede96af33218fff8b35/songbird/parameter_info.py#L30-L37

(I know I wrote this file, but I'm pretty sure the descriptions were like that before :P)

This seems like something that should be fixed soon (either making the behavior actually do frequency-based filtering, or making the description actually say "the minimum number of SAMPLES a feature needs to be present in for that feature to be included in the analysis", or something like that). I don't know why filtering based on the minimum number of samples is used at all (it seems kind of counterintuitive to me), but in any case this should be properly documented.

Details

I just realized this today when I attempted to do filtering (i.e. removing features with total frequency < 10) in advance of running Songbird, and Songbird still filtered out like 8,000 features from my dataset.

For documentation's sake, here's the relevant filtering code in Songbird right now:

https://github.com/biocore/songbird/blob/beceeec550f775e64f371ede96af33218fff8b35/songbird/util.py#L154-L165

Although sample_filter() is doing normal frequency-based filtering as far as I can tell, read_filter() is instead:

TLDR

Sorry to bother you with this, but this seems like a pretty big problem to me (although I've pretty much always kept --min-feature-count at the default of 10, I had no idea that this worked this way -- and I suspect other people are similarly confused). At the very least I need to update this section of the Qurro analysis notebook, in which I demonstrated my ignorance of this :)

I can try to submit a PR for this tomorrow, but I can't guarantee I can get this done quickly.

mortonjt commented 4 years ago

ok, that parameter description is misleading. Yes, it is removing features based on the number of samples it appears in.

We don't want to change this to frequency-based filtering - it boils down to how you want run regression. For instance, if a species only appears in 1 sample with a very high abundance - you still want to exclude that, since you'd be fitting a regression with only 1 point for that species. Whereas, if you have 1 species that appears in 20 samples, but with a low abundance in those samples - that could be still informative.

So the parameter should read minimum number of samples.

fedarko commented 4 years ago

Gonna close this and #55 now that #99 has been merged in.