caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

Dropping features not shared between sinks and sources #105

Closed lkursell closed 6 years ago

lkursell commented 6 years ago

This scenario comes up when the sinks and sources are rarefied before being run through SourceTracker. After rarefaction, the counts of low-abundance features can be set to 0, but the feature columns remain in the dataframes. As a result, a truly empty feature can be "seen" in both the sinks and the sources, because that empty feature is seeded with a small amount of sequence due to alpha1.

My suggestion would be to wrap this up into a command outside gibbs, likely related to #54. The command could take the union of features, and then drop features that were empty across all sinks and sources.
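A minimal sketch of what such a command might look like, assuming the sources and sinks are pandas dataframes with samples as rows and features as columns. The name `drop_empty_features` and its signature are hypothetical and are not part of the SourceTracker2 API.

```python
import pandas as pd


def drop_empty_features(sources, sinks):
    """Hypothetical helper: align two tables on the union of their
    features, then drop features that are 0 in every source and sink."""
    # Union of the feature (column) labels from both tables.
    features = sources.columns.union(sinks.columns)
    # Features missing from one table are filled with 0 counts.
    sources = sources.reindex(columns=features, fill_value=0)
    sinks = sinks.reindex(columns=features, fill_value=0)
    # Total count per feature across all sources and sinks.
    totals = sources.sum(axis=0) + sinks.sum(axis=0)
    keep = totals[totals > 0].index
    return sources[keep], sinks[keep]


# Toy data: f2 is present in both tables but empty everywhere.
sources = pd.DataFrame({'f1': [5, 0], 'f2': [0, 0], 'f3': [2, 7]},
                       index=['src1', 'src2'])
sinks = pd.DataFrame({'f1': [1], 'f2': [0], 'f4': [3]},
                     index=['sink1'])

filtered_sources, filtered_sinks = drop_empty_features(sources, sinks)
```

In this toy example `f2` is removed from both tables, while `f3` and `f4` are kept because each is non-zero in at least one table.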

johnchase commented 6 years ago

@lkursell Is this different than #54?

I would also say we should add simple get_union and get_intersection functions to the code base so you can call them easily. Does that seem reasonable?

lkursell commented 6 years ago

Very related but not identical. It seems that in #54, features that start out as NaNs can get filled with 0s and therefore be counted as features. My argument here is that a feature which is 0 in the source and 0 in the sink would technically be shared (found in both dataframes), but should actually be dropped altogether. If a filter were added to make sure that the union was taken only over non-zero features, the two issues would be the same.
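To illustrate the distinction, here is a rough sketch of how the get_union and get_intersection functions mentioned above could filter on non-zero counts rather than on column presence. The signatures and the `observed_features` helper are assumptions made for illustration; nothing here reflects existing SourceTracker2 code.

```python
import pandas as pd


def observed_features(table):
    """Hypothetical helper: features with at least one non-zero count.
    NaNs are treated as 0, so a NaN-only feature is not 'observed'."""
    counts = table.fillna(0)
    mask = (counts != 0).any(axis=0)
    return set(mask[mask].index)


def get_union(sources, sinks):
    # Features observed (non-zero) in the sources or in the sinks.
    return observed_features(sources) | observed_features(sinks)


def get_intersection(sources, sinks):
    # Features observed (non-zero) in both the sources and the sinks.
    return observed_features(sources) & observed_features(sinks)


sources = pd.DataFrame({'f1': [5, 0], 'f2': [0, 0], 'f3': [2, 7]},
                       index=['src1', 'src2'])
sinks = pd.DataFrame({'f1': [1], 'f2': [0], 'f4': [3]},
                     index=['sink1'])

# Comparing column labels alone counts the empty feature f2 as shared.
shared_by_label = set(sources.columns) & set(sinks.columns)
# Comparing non-zero features does not.
shared_nonzero = get_intersection(sources, sinks)
all_nonzero = get_union(sources, sinks)
```

Here `shared_by_label` contains `f2` even though it is empty in both tables, while `shared_nonzero` contains only `f1`.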

johnchase commented 6 years ago

I think this was one of the goals of that issue. However, if it was not stated explicitly, would you add it there and close this issue?