Open yoid2000 opened 7 years ago
Hi Paul, thanks for looking into it. How awesome it would be if we could run that! With my very limited knowledge about the algorithm, I have a few questions:
1) how do you want to do SaD when you don't know the clusters yet? Just omit the users furthest away from the centroid?
2) when you say "the analyst can change `K` and re-run", I presume you mean only after step 4 is complete as well? You say we don't have to worry about the random number seed – but since K-means only produces a local optimum, it is possible that a re-run with the exact same parameters gives a different output – no?
3) I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?
> how do you want to do SaD when you don't know the clusters yet? Just omit the users furthest away from the centroid?
SaD is run on any ranges placed in the WHERE clause before k-means is run. Exactly as we do today. However, later we remove a low-count number of outliers from each k-means cluster.
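A minimal sketch of the per-cluster outlier removal mentioned above, assuming "outlier" means the points farthest from their cluster's centroid. The function name `trim_outliers` and the fixed `n_drop` are hypothetical; the thread only says "a low-count number".

```python
import math
from collections import defaultdict

def trim_outliers(points, labels, centroids, n_drop=2):
    """Return indices of points kept after dropping the n_drop points
    farthest from their centroid in every cluster (n_drop is an assumption)."""
    by_cluster = defaultdict(list)
    for idx, (point, label) in enumerate(zip(points, labels)):
        by_cluster[label].append((math.dist(point, centroids[label]), idx))

    dropped = set()
    for members in by_cluster.values():
        members.sort(reverse=True)  # farthest from the centroid first
        dropped.update(idx for _, idx in members[:n_drop])

    return [i for i in range(len(points)) if i not in dropped]

# Made-up data: two clusters, each with one obvious straggler.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (3.0, 3.0),
          (10.0, 10.0), (10.1, 10.0), (9.9, 10.2), (14.0, 14.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
centroids = [(0.0, 0.0), (10.0, 10.0)]
kept = trim_outliers(points, labels, centroids, n_drop=1)
```

Centroids and counts would then be recomputed from the kept points only.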
> when you say "the analyst can change K and re-run", I presume you mean only after step 4 is complete as well? You say we don't have to worry about the random number seed – but since K-means only produces a local optimum, it is possible that a re-run with the exact same parameters gives a different output – no?
Yes, this is correct. This needs more thought/experimentation, but we remove a few outliers from each cluster each time, so each re-run is relatively safe. But the reruns don't eliminate noise in the same sense that rerunning a differential privacy query would. In the latter case, you really can statistically eliminate the noise. With k-means, you might find multiple local optima, but it is not clear how you would use this to eliminate the randomness...
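To make the local-optimum point concrete, here is a small sketch of plain Lloyd's k-means in pure Python. It is not the proposed anonymized variant, and the `init` and `seed` parameters are assumptions for illustration. Two different starting points on the same four points converge to two different stable solutions, while the same `init` always reproduces the same result.

```python
import math
import random

def kmeans(points, k, seed=None, init=None, max_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [tuple(c) for c in init] if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centre
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Corners of a wide rectangle: the natural split is left pair vs right pair.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = kmeans(points, 2, init=[(0.0, 0.0), (4.0, 0.0)])
bad = kmeans(points, 2, init=[(2.0, 0.0), (2.0, 1.0)])  # stuck splitting bottom vs top
good_inertia = inertia(points, *good)
bad_inertia = inertia(points, *bad)
```

Both runs are converged, but the second has a much larger within-cluster error, which is the sense in which a re-run can give a different output.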
> I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?
Once we build clusters, then if an analyst for instance happens to have user data, he can plug that user into the appropriate cluster. But I really don't think we can release user IDs and their clusters....
Is that the only way our customers get value out of k-means? If so, this needs more thought.
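"Plugging a user into the appropriate cluster" can be read as nearest-centroid assignment. A sketch under that assumption, with made-up centroids over two normalized features:

```python
import math

def assign_cluster(user_vector, centroids):
    """Return the index of the centroid closest to user_vector."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(user_vector, centroids[c]))

# Hypothetical published centroids: (normalized income, price sensitivity).
centroids = [(0.2, 0.9), (0.8, 0.2)]

# The analyst's own record for one user, on the same scale.
cluster_of_user = assign_cluster((0.7, 0.3), centroids)
```

This needs only the centroids from the output, not any per-user data from the database.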
> > I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?
>
> Once we build clusters, then if an analyst for instance happens to have user data, he can plug that user into the appropriate cluster. But I really don't think we can release user IDs and their clusters....
> Is that the only way our customers get value out of k-means? If so, this needs more thought.
In fact I agree with @fjab, and also believe that user-cluster association would be (most likely) the main way clustering would be used.
For example, to then answer questions like: "what is the customer lifetime of users in the cluster of high price sensitivity", or "clustering based on income class and saving habits, what are the shops frequented by the different clusters?".
These ways of analysing data would benefit very much from the clusters being stable though.
A sample query would be:

    SELECT
      cluster,
      extract_match(description, 'Shop1|Shop2|...') as shop,
      count(*)
    FROM (
      SELECT uid, description, kmeans(income) as cluster FROM (
        ...
      )
    ) t
    GROUP BY cluster, shop

> we remove a few outliers from each cluster each time, so each re-run is relatively safe
You mean each manually triggered re-run, or each iteration of the algorithm? Latter would be a problem.
> reruns don't eliminate noise in the same sense that rerunning a differential privacy query would
That's true, probably nothing to worry about.
> I really don't think we can release user IDs and their clusters
Ah, that's not what I meant. Of course that's not going to work.
Let's say the db has the following data so far:
ID | name | age | location |
---|---|---|---|
123 | Paul | 45 | Kaiserslautern |
... | ... | ... | ... |
We run a K-means on location, and it turns out you belong to the "Frankfurt" cluster. Then it would be awesome to have a view (i.e. a table that is not released to the analyst, but rather can be used for further analysis):
ID | name | age | location | location_cluster |
---|---|---|---|---|
123 | Paul | 45 | Kaiserslautern | Frankfurt* |
... | ... | ... | ... | ... |
*: More likely to be some coordinates that mean Frankfurt
Then I can start working with the cluster using `where` ... Makes sense?
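A rough sketch of that idea: build an internal view with an extra `location_cluster` column, then expose only aggregates over it. The coordinates are approximate, every row except the first is made up, and plain Euclidean distance on latitude/longitude is a simplification.

```python
import math
from collections import Counter

# Hypothetical cluster centres as rough (latitude, longitude) pairs.
centroids = {"Frankfurt": (50.11, 8.68), "Berlin": (52.52, 13.40)}

# Made-up rows; only the first mirrors the example row in the thread.
users = [
    {"id": 123, "name": "Paul", "age": 45, "location": (49.44, 7.77)},
    {"id": 124, "name": "A", "age": 31, "location": (50.00, 8.27)},
    {"id": 125, "name": "B", "age": 58, "location": (52.40, 13.07)},
]

def nearest_cluster(location):
    # Crude distance on raw coordinates, good enough for a sketch.
    return min(centroids, key=lambda name: math.dist(location, centroids[name]))

# Internal view: rows gain a cluster column but are never shown to the analyst.
view = [{**row, "location_cluster": nearest_cluster(row["location"])} for row in users]

# Only aggregates over the view would be returned (before any anonymization step).
counts = Counter(row["location_cluster"] for row in view)
```

Here `counts` plays the role of a `GROUP BY location_cluster` over the view.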
> Then I can start working with the cluster using `where` ... Makes sense?
This is also how I would like clustering to potentially be used. The problem is only how to give the clusters useful identifiers, or ways in which they could be described? Or maybe I am overcomplicating things.
It would be cool indeed, but that should be a general analysts' problem, not an Aircloak-specific one. If we can get as far as I've outlined above and have good results, that would be fantastic. Naming is an issue we can take care of later.
But my concern is, how do you have any clue what the different clusters actually mean? I.e. you care about the low-income high-price-sensitivity cluster, but is it `1.21231` or `3.123132` ... :| Maybe I just don't understand enough about the output of clustering :)
Graphical representation would help immensely of course. But let's say you cluster in two dimensions, price sensitivity `P` and income `I`, and you set `K=3`, then I presume you get three clusters, the centroid of one will be low `P` & high `I` ("low" and "high" depending on your input range). Then you name that cluster "whales".
/edit: In Paul's output above, of course you don't actually see the centroids of the clusters. Probably I have not enough clue either about what the algorithm does actually output, but certainly the centroids are one of the results.
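One way to turn centroids into readable descriptions, sketched under the assumption that "low" and "high" are judged against the midpoint of each input range. The ranges, centroid values, and the `describe_centroid` helper are all hypothetical.

```python
def describe_centroid(centroid, ranges, names):
    """Label each coordinate 'low' or 'high' relative to the midpoint of its range."""
    parts = []
    for value, (lo, hi), name in zip(centroid, ranges, names):
        level = "low" if value < (lo + hi) / 2 else "high"
        parts.append(f"{level} {name}")
    return " & ".join(parts)

names = ["P", "I"]                        # price sensitivity, income
ranges = [(0.0, 1.0), (0.0, 100_000.0)]   # assumed input ranges
centroids = [(0.15, 85_000.0), (0.8, 20_000.0), (0.6, 60_000.0)]  # K=3, made up

descriptions = [describe_centroid(c, ranges, names) for c in centroids]

# The analyst can then attach nicknames to the descriptions they care about.
nicknames = {"low P & high I": "whales"}
named = [nicknames.get(d, d) for d in descriptions]
```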
Great thread guys. In my output, you would see the centroids of the clusters, and other statistics of each cluster (number of users, mainly). Given this output, the analyst could play around and produce clusters he likes.
But yes being able to then use these in subsequent queries is a great idea! Very cool...
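A sketch of what such per-cluster output could look like: a centroid and a user count for each cluster. The `min_count` suppression of very small clusters is an assumption added for illustration, not something specified in the thread.

```python
from collections import defaultdict

def cluster_summary(points, labels, min_count=2):
    """Return one record per cluster with its centroid and number of users."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    summary = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < min_count:
            continue  # hypothetical low-count suppression
        centroid = tuple(sum(x) / len(members) for x in zip(*members))
        summary.append({"cluster": label, "centroid": centroid, "users": len(members)})
    return summary

# Made-up points and cluster labels.
points = [(1.0, 1.0), (3.0, 1.0), (10.0, 10.0), (12.0, 12.0), (50.0, 50.0)]
labels = [0, 0, 1, 1, 2]
summary = cluster_summary(points, labels)
```

The single-user cluster 2 is left out of `summary` entirely.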
Cool. Looking forward to seeing that in action 👍
When we assign properties to each user (for use in subsequent queries) we need to think through how these values are exported as SQL... Maybe we need functions for:
Here are some thoughts on how to do anonymized k-means clustering.
As a little background, the data returned by a K-means algorithm includes three categories:
(At least, this is what the Turi folks say, at https://turi.com/learn/userguide/clustering/kmeans.html).
An example of the first type of output is this:
An example of the second is this:
The third output involves individual rows, so we cannot output that anyway.
Here is one idea for how to implement this:
`K`, can change `K` and re-run, and can supply the initial cluster centers.

Thoughts?