caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License
65 stars 46 forks source link

`gibbs` unintuitively fails if source and sink tables don't exactly match features #54

Open wdwvt1 opened 8 years ago

wdwvt1 commented 8 years ago

hat-tip to @johnchase for pointing this out.

If the input dataframes for _gibbs have different numbers of features (different columns) the function will fail with an IndexError like the following

IndexError: index 11751 is out of bounds for axis 0 with size 11290

_gibbs should check that there are the same number of features in both sources_df and sinks_df. An additional check would be that the identity of those features are the same.

wdwvt1 commented 8 years ago

@lkursell discovered another strange error. If dataframes with nan's are passed to the current _gibbs function, we get totally non-useful errors. Need to add type and overlap checking to the API.

gregcaporaso commented 8 years ago

@wdwvt1, was this addressed in #71? If so, could you close this issue?

wdwvt1 commented 8 years ago

in #71.

johnchase commented 6 years ago

I am reopening this issue, because I think we could account for non-overlapping features in a more robust way. Specifically I don't think that gibbs should fail if there are features that are not found in both the source and sink tables. For example if one were to gather data from multiple sources (skin, marine, fecal, soil) and want to compare those against a sink of tap water it is unreasonable to expect features to match 100% across the sinks and sources, and in fact SourceTracker is designed to account for just that. What this means is that the user must combine all tables and then immediately separate them. Or add features with 0 counts to each table. Either of these methods is error prone and time consuming.

lkursell commented 6 years ago

The use case of having the sources comprise perhaps features F1, F2, F3 and the sinks just F1, F4 makes sense - and you would expect that SourceTracker would account for the non-overlapping features into the Unknown. Functionally, the tables will still need to be merged within gibbs, and 0's added, so that the probability of p(t|v) can be calculated, and small alpha counts are added across all the features found in the sinks and sources. Above, we'd have to add F4 to the source table so that we could assign it low counts from the specified sources but with alpha2 give it a higher contribution from the Unknown, which would match our intuition. Perhaps its a question then of if we want the user to have to do this explicitly, or have gibbs do it by default? @wdwvt1 - thoughts?

lkursell commented 6 years ago

@johnchase I could see that the user could pass in an arbitrary list of pd.DataFrames as sources, and the SourceTracker reconciles everything behind the scenes.

jakereps commented 6 years ago

That may be outside the context of this issue. Starting with allowing feature independent input tables (assuming the existence of at least 1 feature intersection) is definitely a gateway for an iterable of sources. Or*args inputs.

wdwvt1 commented 6 years ago

I think that gibbs should fail if the tables don't overlap. The user should have the job of knowing what input features/output features they are looking for.

What about a a function make_feature_union_tables that takes any number of input tables and spits out the same tables whose features are the union of all the features in input tables. The tables would be filled with zeros in these locations. You could easily call this before calling gibbs and it could be a flag in the CLI.

@johnchase @lkursell

johnchase commented 6 years ago

The user should have the job of knowing what input features/output features they are looking for.

I'm not sure I follow this logic. The issue is regarding how data is stored. What information would the user gain by merging and then splitting tables? If we add a function that the user must run every time prior to running sourcetracker why not just do it automatically? It seems self apparent that not all of the source features will be in the sinks and vice-versa

wdwvt1 commented 6 years ago

Backing up to a previous comment a bit:

For example if one were to gather data from multiple sources (skin, marine, fecal, soil) and want to compare those against a sink of tap water it is unreasonable to expect features to match 100% across the sinks and sources, and in fact SourceTracker is designed to account for just that.

ST is not designed to handle this. The Unknown is explicitly modeling an unseen source that has the same features in different proportions, not additional unseen features.

My hesitancy to have tables automatically supersetted to match features is that it seems that people would use this functionality unknowingly. Specifically, if you have two tables that have a large number of non-overlapping features, you should probably be hesitant to use ST in the first place and this error is a great way to tell you that information.

My proposed solution resolves my concern (advanced users will have to explicitly call a function when using from python, CLI users will have to pass a flag that says something like "hey, watch out here") and I think it resolves your concern. It will be done automatically with one additional function call.

Happy to discuss and open to changing

johnchase commented 6 years ago

ST is not designed to handle this. The Unknown is explicitly modeling an unseen source that has the same features in different proportions, not additional unseen feature

Good point, TBH I think was misunderstanding the use of the "Unknown". I realize this may be somewhat tangential, but given the approach of modeling mixtures of the entire source community, does it actually make more sense to take the intersection rather than the union?

johnchase commented 6 years ago

Here is what I do currently:

def get_union(df1, df2):
    df = pd.concat([df1, df2]).fillna(0)
    df1 = df.loc[df1.index].copy()
    df2 = df.loc[df2.index].copy()
    return df1, df2

sinks, sources = get_union(sinks, sources)

After thinking about this more. I think that sourcetracker should either merge tables for you so that the user doesn't have to do this extra step each time. Or it should throw a warning/error if there are non-overlapping features. I don't believe that making the tables match features informs the user that they are violating assumptions regarding the sourcetracker algorithm. Take for example the following two scenarios

  1. Sinks and sources with 1% overlapping features are processed in the same table making step (QIIME for example). Sourcetracker runs with no issues though assumptions are being violated

  2. Sinks and sources with 99% overlap are processed separately, valid use case for sourcetracker but sourcetracker will fail without additional munging.

If the goal is to make researchers aware of assumption violations, this should happen by determining the overlap in features in the sinks and sources, rather than how the researcher has formatted her data

wdwvt1 commented 6 years ago

As I understand, Dan's original paper borrowed from LDA/topic modeling in a document corpus. In that arena, any topic (source) could reasonably have contributed words/ideas (features) to any document (sink). Moreover, you couldn't really say that 'topic X couldn't possibly have contributed to words/ideas in article Y' because the vagary of human thought and the fact that inspiration for field A can come from field B. As such, all sources/sinks shared some connection.

In the use of ST to model the physical flow of microbes I think we can definitively say where microbes are not coming from. For example, I have two isolators in my lab that house mice that have never come into contact with one another. They're from different suppliers and the isolators they live in block transmission of any microbes between inside and out. I know that from a physical flow perspective, isolator1 and isolator2 contribute 0% to one another. However, if I run samples of these isolators in ST2 I get a very high degree of contribution from one to the other. This is because a JAX mouse + standard diet has a very similar microbiome to a Charles river mouse + standard diet.

So, ST2 can give very misleading answers about flow based on the samples you give it. What I am trying to enforce with the failure (when features don't overlap) is that people think about this problem. I agree that it isn't a perfect check, but non-overlapping features seem like a sufficient condition to trigger this concern.

As your examples noted @johnchase, this is by no means a necessary condition for the failure (nor is it sufficient). I would argue for the 'error with intuitive error message'. I would also say we should add the simple get_union and get_intersection functions to the code base so you can call easily. Does that seem reasonable?

lkursell commented 6 years ago

The above recommended get_union and get_intersection should also drop features that have no sequences in both the sinks and the sources. @wdwvt1