AlexanderFabisch / gmr

Gaussian Mixture Regression
https://alexanderfabisch.github.io/gmr/
BSD 3-Clause "New" or "Revised" License

Getting Coefficients #16

Closed · ScottGuthart closed this 4 years ago

ScottGuthart commented 4 years ago

Hey there,

I have survey data where each person makes multiple observations across different brands.

Is it possible to use this library to extract each person's coefficients and class membership, as well as the adjusted r2 scores for each class?

Usually I'd use software called Latent Gold, and I was hoping this might be the Python way of performing the same analysis.

AlexanderFabisch commented 4 years ago

Hi,

I'm not familiar with the kind of data that you have or the kind of analysis that you want to do. If it is possible to use this library for your work with minor modifications, I'm willing to invest some time to implement the required features. In that case I would need some assistance to generate test data and a good example.

Can you explain your use case in detail? I guess there is a 1-to-1 relation between "person" and sample, and "observations across different brands" correspond to features in ML terminology. Which kind of coefficient are you referring to? Does "class membership" mean cluster index? I only know R2 scores as a metric for regression models. You can compute that with sklearn. If you want to compute it for individual clusters, you can certainly do that, but not directly in a single function call yet (a rough sketch is below). I would have to read more about the adjusted R2 score.
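To illustrate what I mean, here is a rough sketch, not an existing gmr feature: fit GMR on the joint data, take cluster assignments from a separate sklearn GaussianMixture on the same data, and compute R2 per cluster. The toy data and all names are made up for the example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.mixture import GaussianMixture
from gmr import GMM

# Toy data: three independent variables and one dependent variable.
rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(300)
XY = np.hstack((X, y[:, np.newaxis]))

# Gaussian mixture regression on the joint distribution of (X, y).
gmm = GMM(n_components=3, random_state=0)
gmm.from_samples(XY)
y_pred = gmm.predict(np.array([0, 1, 2]), X).ravel()
print("overall R2:", r2_score(y, y_pred))

# Cluster assignments from a separate sklearn mixture model on the
# same joint data, then R2 within each cluster.
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(XY)
for k in range(3):
    mask = labels == k
    print("cluster", k, "R2:", r2_score(y[mask], y_pred[mask]))
```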

ScottGuthart commented 4 years ago

Yes, adjusted r2 is just a version of r2 that accounts for the number of features and the sample size, but it can definitely be calculated independently. We use it to compare the fit of different models. Usually we take a weighted average of the adjusted r2s of each cluster, weighted by cluster size, and compare this to the results of other regression models, like a Ridge Regression or Shapley Value Regression on the same data.
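For concreteness, this is roughly how we compute it by hand (the helper names are made up for this example): adjusted r2 is 1 - (1 - r2) * (n - 1) / (n - p - 1), and the model-level score is the cluster-size weighted average of the per-cluster values.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

def weighted_adjusted_r2(y_true, y_pred, labels, n_features):
    """Cluster-size weighted average of per-cluster adjusted R2."""
    clusters = np.unique(labels)
    sizes = np.array([np.sum(labels == k) for k in clusters])
    scores = np.array([adjusted_r2(y_true[labels == k],
                                   y_pred[labels == k], n_features)
                       for k in clusters])
    return np.average(scores, weights=sizes)
```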

Here's an example of typical data and the output we look for. The sample is composed of people who have multiple observations for a dependent variable (y) and multiple independent variables (X1, X2, X3...).

I believe Latent Gold is aware of the 1-to-many relationship between each person and their observations and uses this information to make sure that each person's observations end up in the same cluster and that each person's observations inform the model equally, since people can evaluate a different number of brands depending on how many they're aware of. In addition, each person may have a weight, and covariates can also be specified. The software uses a different type of regression depending on whether the dependent variable is dichotomous (logistic), nominal (logit), ordinal (multinomial logit), or continuous.

In the output we usually get, each class/cluster has its own regression model with its own coefficients for each independent variable, and each person is assigned to a class. We usually run multiple models and keep the model with the most clusters that is still stable. A rough sketch of what that output looks like is below.
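Just to illustrate the shape of that output (this is not what Latent Gold actually does internally), one rough way to approximate it in Python would be: cluster the observations, fit one regression per cluster, and assign each person to the cluster that holds most of their observations. The column names and toy data are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

# Toy data: one row per (person, brand) observation.
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "person": rng.randint(0, 40, size=n),  # 40 people, several rows each
    "X1": rng.randn(n), "X2": rng.randn(n), "X3": rng.randn(n),
})
df["y"] = 2.0 * df["X1"] - df["X2"] + 0.1 * rng.randn(n)
feature_cols = ["X1", "X2", "X3"]

# Cluster the observations in the joint (X, y) space.
XY = df[feature_cols + ["y"]].to_numpy()
df["cluster"] = GaussianMixture(n_components=3, random_state=0).fit_predict(XY)

# One regression model (with its own coefficients) per cluster.
coefficients = {}
for k, group in df.groupby("cluster"):
    model = LinearRegression().fit(group[feature_cols], group["y"])
    coefficients[k] = dict(zip(feature_cols, model.coef_))

# Assign each person to the cluster containing most of their observations.
person_class = df.groupby("person")["cluster"].agg(
    lambda c: c.value_counts().idxmax())
print(coefficients)
print(person_class.head())
```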

Let me know if there are any other questions, or if there's any assistance I can provide if you think it's possible to implement this.

AlexanderFabisch commented 4 years ago

I believe the model behind this software is more complex than a Gaussian mixture model; there are corresponding publications. Gaussian mixture models are not able to handle non-continuous features / independent variables well out of the box (a problem that I came across in the past), and they claim to be able to handle this.

> Here's an example of typical data and the output we look for. The sample is composed of people who have multiple observations for a dependent variable (y) and multiple independent variables (X1, X2, X3...).

I don't understand this completely. y looks suspiciously like a class label. In this case Gaussian mixture regression is not a good way to model the data.

I guess I can't help you much here.

ScottGuthart commented 4 years ago

Ah, gotcha. Well, thank you very much for your consideration. Definitely let me know if you come across anything that could be helpful. We do typically get better results with ridge regression, which I am currently able to implement in Python, but it would be nice to have a similar version of their approach as well to help automate the process of running analyses. However, I think it would be too difficult for me to create my own implementation based on their publications.

AlexanderFabisch commented 4 years ago

I think they are not doing anything magical. There are so many Python packages out there that might help you with the implementation:

Maybe the solution is not that complicated if you use these.