Open LarryBarker opened 3 years ago
Hi @LarryBarker, thanks for the question, and good luck with your project. Your use case is sound, but I think it helps to reiterate what machine learning brings to the table compared with a symbolic, rule-based system. The fact that you've listed a set of a priori assumptions about your features suggests that you already know the rule sets; you just need to code them out. In machine learning, you do not get to specify how features are interpreted; that is the role of the learning algorithm. What you can do, however, is use your assumptions to validate the trained model. If you don't mind relinquishing your prior assumptions, then you can use a full ML solution. You could also take a hybrid approach in which you apply ML only to the features that you do not already have rule sets for, and combine that signal with your symbolic logic.
How can I use the package to group survey responses?
It sounds like clustering is what you need. K-means is a good place to start.
https://docs.rubixml.com/1.0/clusterers/k-means.html
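To make the idea concrete, here is a minimal, language-agnostic sketch of what K-means does, written in plain Python with only the standard library. It is not the Rubix ML API (the library's own K-means clusterer is documented at the link above); the function name `k_means`, the naive "first k samples" initialization, and the sample values are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(samples, k, iterations=100):
    """Plain Lloyd's algorithm. Returns (centroids, assignments)."""
    # Assumption: naive initialization using the first k samples.
    centroids = [list(sample) for sample in samples[:k]]
    assignments = [None] * len(samples)

    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(sample, centroids[c]))
            for sample in samples
        ]
        if new_assignments == assignments:
            break  # converged, nothing moved
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, assignments


# Hypothetical survey responses with two numeric features each.
responses = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [4.8, 5.2], [0.9, 1.1]]
centroids, assignments = k_means(responses, k=2)
print(assignments)
```

The returned `assignments` list holds one cluster number per sample, in the same order as the input, which is what lets you group respondents afterwards.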
Is it possible to apply weights to each factor?
No, feature weighting is the job of the learning algorithm. If your features are continuous, you can artificially re-weight a feature by scaling its values up (or down) relative to the other features, but I do not recommend this. Also, remember that with distance-based estimators (most of the clusterers, including K-means) the features should be standardized. That said, it is fairly common practice to weight samples, for example if you have an under- or overrepresented class in your dataset.
https://docs.rubixml.com/1.0/transformers/z-scale-standardizer.html
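As a rough illustration of why standardization matters for distance-based clusterers, the sketch below z-scales each feature column to zero mean and unit variance in plain Python. It is not the library's transformer; the function name `z_scale`, the use of the population standard deviation, and the sample values are assumptions for the example.

```python
from statistics import mean, pstdev


def z_scale(samples):
    """Center each feature column at 0 and scale it to unit variance."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; constant columns fall back to 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in samples
    ]


# Hypothetical features on very different scales, e.g. age and a 1-5 rating.
raw = [[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]]
scaled = z_scale(raw)
print(scaled)
```

Before scaling, the first feature would dominate any Euclidean distance simply because its numbers are larger; after scaling, both features contribute on the same footing.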
Is it possible to attach arbitrary data to each point, for example, a user ID? (I need to be able to loop through the clusters and identify individuals in each cluster).
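Yes, this is something that you would do before or after assigning cluster numbers: keep the user IDs alongside the samples, in the same order, and match them back up with the cluster numbers after inference. Since user IDs are most likely not correlated with any particular cluster, I would recommend leaving that feature out of the training and inference sets.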
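One way to do this, sketched in plain Python rather than with the library, is to keep the IDs in a separate list aligned by position with the samples and join them to the cluster numbers after prediction. The IDs, the cluster numbers, and the helper `group_by_cluster` below are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(ids, cluster_numbers):
    """Map each cluster number to the IDs of the samples assigned to it."""
    if len(ids) != len(cluster_numbers):
        raise ValueError("ids and cluster_numbers must be the same length")
    groups = defaultdict(list)
    for uid, cluster in zip(ids, cluster_numbers):
        groups[cluster].append(uid)
    return dict(groups)


# Hypothetical data: IDs kept out of the feature matrix, in sample order.
user_ids = ["u1", "u2", "u3", "u4"]
cluster_numbers = [0, 1, 0, 1]  # e.g. the output of a clusterer's prediction

groups = group_by_cluster(user_ids, cluster_numbers)
for cluster, members in sorted(groups.items()):
    print(cluster, members)
```

This only works if the order of the samples is never shuffled independently of the IDs between training and prediction.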
Hello, thanks for such a robust package for PHP. My apologies if this isn't the appropriate forum for this question.
The project I'm working on requires grouping survey respondents into small groups based on the diversity of their answers. I have been researching k-means clustering algorithms to help with this, and mostly find Python examples.
Here is some general information about the project and the dataset:
Here is a snapshot of a few samples:
My questions are:
I have looked at the Colors example and tried to modify it according to my needs, but don't quite understand the results. Also, here are a couple of articles I have been referencing (but again, in Python):
https://towardsdatascience.com/using-weighted-k-means-clustering-to-determine-distribution-centres-locations-2567646fc31d
https://medium.com/@dey.mallika/unsupervised-learning-with-weighted-k-means-3828b708d75d
https://towardsdatascience.com/a-little-known-trick-in-hierarchical-clustering-weights-762156a2fce0
Any feedback is greatly appreciated, thank you :)