[QUESTION] How to perform basic cluster analysis with weighted factors

RubixML / ML

A high-level machine learning and deep learning library for the PHP language.

MIT License

Hello, thanks for such a robust package for PHP. My apologies if this isn't the appropriate forum for this question.

The project I'm working on requires grouping survey respondents into small groups based on the diversity of their answers. I have been researching k-means clustering algorithms to help with this, but have mostly found Python examples.

Here is some general information about the project and the dataset:

There are 20 factors to be considered.
We are looking for similarities in the first 3 factors and differences in the other 17.
For example, we want to group people in the same (or a nearby) timezone, but we want to group people of different sex, age, race, etc.
The ideal size of each group is 4. This can be as few as 1 or as many as 5, depending on the number of samples.
For example, if we have 100 responses, we know there should be 25 groups (i.e. clusters), which should mean K=25, if I understand correctly?
Some factors have greater weight than others. For example, timezone is the most important factor to consider, while something like education is not as important.
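
To make the arithmetic above concrete, here is a minimal sketch in plain Python (not Rubix ML) of how K follows from the number of responses. The function name `cluster_count` and the decision to round up are illustrative assumptions. Note that K only fixes how many clusters there are; k-means by itself does not guarantee that each cluster ends up with 4 members.

```python
import math


def cluster_count(n_responses: int, ideal_group_size: int = 4) -> int:
    """Return the number of groups needed for an ideal group size.

    Rounds up so that leftover respondents get a group of their own
    instead of pushing another group past the ideal size.
    """
    if n_responses <= 0 or ideal_group_size <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(n_responses / ideal_group_size)


k = cluster_count(100)  # 100 / 4 = 25 groups
```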

Here is a snapshot of a few samples:

My questions are:

How can I use the package to group survey responses?
Is it possible to apply weights to each factor?
Is it possible to attach arbitrary data to each point, for example, a user ID? (I need to be able to loop through the clusters and identify individuals in each cluster).

I have looked at the Colors example and tried to modify it according to my needs, but don't quite understand the results. Also, here are a couple of articles I have been referencing (but again, in Python):

https://towardsdatascience.com/using-weighted-k-means-clustering-to-determine-distribution-centres-locations-2567646fc31d
https://medium.com/@dey.mallika/unsupervised-learning-with-weighted-k-means-3828b708d75d
https://towardsdatascience.com/a-little-known-trick-in-hierarchical-clustering-weights-762156a2fce0

Any feedback is greatly appreciated, thank you :)

Hi @LarryBarker, thanks for the question, and I wish you luck on your project. Your use case is sound, but I think it would be helpful to reiterate what machine learning brings to the table as opposed to a symbolic rule-based system. The fact that you've listed a set of a priori assumptions about your features hints that you already know the rule sets; you just need to go and code them out. In machine learning, you do not get to specify how features are interpreted. That is the role of the learning algorithm. What you can do, however, is use your assumptions to validate the trained model. If you don't mind relinquishing your prior assumptions, then you can use a full ML solution. You could also take a hybrid approach where you only use ML on the features that you do not already have rule sets for and combine that signal with your symbolic logic.
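
One possible reading of the hybrid approach, sketched in plain Python rather than Rubix ML: apply the known rule first (here, "same timezone") to split respondents into buckets, then pass only the remaining features of each bucket to a learner. The record layout, the field names (`id`, `timezone`, `features`) and the helper `split_by_rule` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def split_by_rule(respondents, rule_key):
    """Partition respondent records by the value of one rule-based field."""
    buckets = defaultdict(list)
    for respondent in respondents:
        buckets[respondent[rule_key]].append(respondent)
    return dict(buckets)


# Hypothetical records: "features" holds only the values left to the learner.
respondents = [
    {"id": "u1", "timezone": "UTC-5", "features": [0.2, 1.0]},
    {"id": "u2", "timezone": "UTC+1", "features": [0.9, 0.1]},
    {"id": "u3", "timezone": "UTC-5", "features": [0.4, 0.7]},
]

buckets = split_by_rule(respondents, "timezone")

# Each bucket's feature vectors could then be handed to a clusterer separately.
vectors_per_bucket = {
    timezone: [member["features"] for member in members]
    for timezone, members in buckets.items()
}
```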

https://wiki.pathmind.com/symbolic-reasoning#:~:text=One%20of%20the%20main%20differences,are%20created%20through%20human%20intervention.

How can I use the package to group survey responses?

It sounds like clustering is what you need. K-means is a good place to start.

https://docs.rubixml.com/1.0/clusterers/k-means.html
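
For readers who want to see the idea behind K-means, here is a minimal sketch of the standard (Lloyd's) algorithm in plain Python. It is not Rubix ML's implementation or API; the function names, the random initialization and the toy data are assumptions made for illustration. It also shows why K-means groups similar samples together, which is worth keeping in mind when the goal is diverse groups.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(samples, k, iterations=100, seed=0):
    """Cluster samples into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random.
    centroids = [list(s) for s in rng.sample(samples, k)]
    labels = [-1] * len(samples)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two tight pairs of points far apart from each other.
samples = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
labels, centroids = k_means(samples, k=2)
```

With this toy data the two nearby points end up sharing one label and the two distant points share the other; which cluster gets label 0 depends on the random initialization.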

Is it possible to apply weights to each factor?

No, feature weighting is the job of the learning algorithm. If your features are continuous, you can artificially re-weight them by scaling their values up (or down) relative to the other features, but I do not recommend this. Also, remember that with distance-based estimators (most of the clusterers, including K-means) the features should be standardized. That said, it is a somewhat common practice to weight samples, for example if you have an under- or overrepresented class in your dataset.

https://docs.rubixml.com/1.0/transformers/z-scale-standardizer.html
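
The reply above advises against manual re-weighting, so the following is only a sketch of what the two steps mean numerically, written in plain Python rather than with Rubix ML's transformer. The helper names `z_scale` and `apply_weights`, the example columns (a timezone offset and years of education) and the weights are hypothetical.

```python
import statistics


def z_scale(samples):
    """Standardize each column to zero mean and unit standard deviation."""
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 to avoid a zero division.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in samples
    ]


def apply_weights(samples, weights):
    """Multiply each column by its weight (the manual re-weighting step)."""
    return [[x * w for x, w in zip(row, weights)] for row in samples]


# Hypothetical columns: timezone offset in hours, years of education.
samples = [[-5.0, 12.0], [0.0, 16.0], [1.0, 20.0]]
scaled = z_scale(samples)

# A weight of 3 makes the first feature count 9 times as much
# in squared Euclidean distance as a feature with weight 1.
weighted = apply_weights(scaled, [3.0, 1.0])
```

Standardizing first matters: without it, whichever feature has the largest numeric range dominates the distance regardless of any weights.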

Is it possible to attach arbitrary data to each point, for example, a user ID? (I need to be able to loop through the clusters and identify individuals in each cluster).

Yes, this is something that you would do before or after assigning cluster numbers. Since user IDs are most likely not correlated with any particular cluster, I would recommend leaving that feature out of the training and inference sets.
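
As a sketch of that bookkeeping in plain Python (not Rubix ML's API): keep the user IDs in a list that runs parallel to the samples, leave them out of the feature vectors, and zip them with the cluster numbers afterwards. The function name `group_ids_by_cluster` and the example IDs and labels are hypothetical.

```python
from collections import defaultdict


def group_ids_by_cluster(user_ids, labels):
    """Map each cluster number to the user IDs assigned to it."""
    if len(user_ids) != len(labels):
        raise ValueError("user_ids and labels must have the same length")
    clusters = defaultdict(list)
    for user_id, label in zip(user_ids, labels):
        clusters[label].append(user_id)
    return dict(clusters)


# user_ids[i] belongs to the i-th sample; labels[i] is its cluster number.
user_ids = ["u1", "u2", "u3", "u4"]
labels = [0, 1, 0, 1]

groups = group_ids_by_cluster(user_ids, labels)
# groups == {0: ["u1", "u3"], 1: ["u2", "u4"]}
```

This only works if the order of the samples is not changed between building the ID list and reading back the cluster numbers.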

[QUESTION] How to perform basic cluster analysis with weighted factors #195