Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

ideas for k-means clustering #1127

Open · yoid2000 opened this issue 7 years ago

yoid2000 commented 7 years ago

Here are some thoughts on how to do anonymized k-means clustering.

As a little background, the data returned by a k-means algorithm falls into three categories:

  1. Some stats about the results.
  2. Some stats about the clusters.
  3. A table indicating which cluster each row belongs in.

(At least, this is what the Turi folks say, at https://turi.com/learn/userguide/clustering/kmeans.html).

An example of the first type of output is this:

Schema
------
Number of clusters              : 6
Number of examples              : 86
Number of feature columns       : 410
Number of unpacked features     : 410
Row label name                  : row_id

Training Summary
----------------
Training method                 : elkan
Number of training iterations   : 2
Batch size                      : 86
Total training time (seconds)   : 0.2836

An example of the second is this:

+-----------+-----------+-----------+------------+-----------+-----+
|    FNC1   |    FNC2   |    FNC3   |    FNC4    |    FNC5   | ... |
+-----------+-----------+-----------+------------+-----------+-----+
| 0.1870... | 0.0801... | -0.092... | -0.0957298 | 0.0893... | ... |
|  0.21752  | 0.0363... | -0.027... | -0.063...  | 0.0556... | ... |
| 0.2293... | 0.1017... | -0.046... | -0.051...  | 0.2313... | ... |
| 0.1654... | -0.156... | -0.327... | -0.278...  | -0.033... | ... |
| 0.2549... |  0.02532  | 0.0081... | -0.134...  | 0.3875... | ... |
| 0.1072... | 0.0754... | -0.119422 | -0.312...  | 0.1100... | ... |
+-----------+-----------+-----------+------------+-----------+-----+
[6 rows x 413 columns]

The third output involves individual rows, so we cannot output that anyway.

Here is one idea for how to implement this (a rough sketch in code follows the list):

  1. Do SaD (ranges) or clause dropping (equalities). This handles difference attacks that look for a mere difference in the output of two k-means computations.
  2. Run k-means. I think we can run k-means as it normally runs, meaning we don't need to worry about the random number seed. The analyst can specify K, can change K and re-run, and can supply the initial cluster centers.
  3. When k-means completes, we take the results (output 3) and remove a noisy number of "outliers" from each cluster, where an outlier is defined by its distance from the cluster center.
  4. With the outliers removed, and using the cluster centers as initial centers, we run k-means for one more iteration. We provide outputs 1 and 2 to the analyst exactly as given by the algorithm.
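
A minimal sketch of steps 2-4, assuming numpy and scikit-learn; the Gaussian "noisy number of outliers" below is a placeholder, not Aircloak's actual noise layer, and step 1 (SaD / clause dropping) is assumed to have already happened:

import numpy as np
from sklearn.cluster import KMeans

def anonymized_kmeans(X, k, rng=None, outlier_mean=3, outlier_sd=1):
    if rng is None:
        rng = np.random.default_rng()

    # Step 2: run k-means as it normally runs.
    km = KMeans(n_clusters=k, n_init=10).fit(X)

    # Step 3: from each cluster, drop a noisy number of the points
    # farthest from the cluster center ("outliers").
    keep = np.ones(len(X), dtype=bool)
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        n_drop = max(0, int(round(rng.normal(outlier_mean, outlier_sd))))
        keep[members[np.argsort(dist)[::-1][:n_drop]]] = False

    # Step 4: one more iteration from the old centers, outliers removed;
    # only the centers and per-cluster counts (outputs 1 and 2) are released.
    km2 = KMeans(n_clusters=k, init=km.cluster_centers_,
                 n_init=1, max_iter=1).fit(X[keep])
    return km2.cluster_centers_, np.bincount(km2.labels_, minlength=k)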

Thoughts?

fjab commented 7 years ago

Hi Paul, thanks for looking into it. How awesome it would be if we could run that! With my very limited knowledge about the algorithm, I have a few questions:

1) How do you want to do SaD when you don't know the clusters yet? Just omit the users furthest away from the centroid?

2) When you say "the analyst can change K and re-run", I presume you mean only after step 4 is complete as well? You say we don't have to worry about the random number seed – but since k-means only produces a local optimum, it is possible that a re-run with the exact same parameters gives a different output – no?

3) I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?

yoid2000 commented 7 years ago

how do you want to do SaD when you don't know the clusters yet? Just omit the users furthest away from the centroid?

SaD is run on any ranges placed in the WHERE clause before k-means is run. Exactly as we do today. However, later we remove a low-count number of outliers from each k-means cluster.

when you say "the analyst can change K and re-run", I presume you mean only after step 4 is complete as well? You say we don't have to worry about the random number seed – but since K-means only produces a local optimum, it is possible that a re-run with the exact same parameters gives a different output – no?

Yes, this is correct. This needs more thought/experimentation, but we remove a few outliers from each cluster each time, so each re-run is relatively safe. But the reruns don't eliminate noise in the same sense that rerunning a differential privacy query would. In the latter case, you really can statistically average out the noise. With k-means, you might find multiple local optima, but it's not clear how you would use this to eliminate the randomness...
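
A quick way to see the local-optimum point, assuming scikit-learn (illustrative only, with synthetic data):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 5))
for seed in (1, 2, 3):
    # A single run per seed (n_init=1): different seeds can settle
    # into different local optima, visible as different inertia values.
    km = KMeans(n_clusters=6, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)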

I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?

Once we build clusters, then if an analyst happens to have a user's data, for instance, he can plug that user into the appropriate cluster. But I really don't think we can release user IDs and their clusters....

Is that the only way our customers get value out of k-means? If so, this needs more thought.
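
For what it's worth, a minimal sketch of that "plug the user in" step, assuming numpy and that the analyst holds only the released centroids (the function name is made up):

import numpy as np

def nearest_cluster(point, centers):
    # Assign one locally-known record to the closest released centroid;
    # no data about other users is needed.
    return int(np.argmin(np.linalg.norm(centers - point, axis=1)))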

sebastian commented 7 years ago

I wonder whether outputs 1 and 2 help the analysts (much). What is really interesting is assigning each user to a cluster – i.e. I would imagine a view including user IDs and their respective cluster. Is that possible?

Once we build clusters, then if an analyst happens to have a user's data, for instance, he can plug that user into the appropriate cluster. But I really don't think we can release user IDs and their clusters....

Is that the only way our customers get value out of k-means? If so, this needs more thought.

In fact I agree with @fjab, and also believe that user-cluster association would be (most likely) the main way clustering would be used.

For example, to then answer questions like: "what is the customer lifetime of users in the cluster of high price sensitivity?", or "clustering based on income class and saving habits, what are the shops frequented by the different clusters?".

These ways of analysing data would benefit very much from the clusters being stable though.

A sample query would be:

SELECT
  cluster,
  extract_match(description, 'Shop1|Shop2|...') AS shop,
  count(*)
FROM (
  SELECT uid, description, kmeans(income) AS cluster FROM (
     ...
  ) s
) t
GROUP BY cluster, shop

fjab commented 7 years ago

we remove a few outliers from each cluster each time, so each re-run is relatively safe

You mean each manually triggered re-run, or each iteration of the algorithm? The latter would be a problem.

reruns don't eliminate noise in the same sense that rerunning a differential privacy query would

That's true, probably nothing to worry about.

I really don't think we can release user IDs and their clusters

Ah, that's not what I meant. Of course that's not going to work.

Let's say the db has the following data so far:

ID   name  age  location
123  Paul  45   Kaiserslautern
...  ...   ...  ...

We run k-means on location, and it turns out you belong to the "Frankfurt" cluster. Then it would be awesome to have a view (i.e. a table that is not released to the analyst, but that can be used for further analysis):

ID   name  age  location        location_cluster
123  Paul  45   Kaiserslautern  Frankfurt*
...  ...   ...  ...             ...

*: More likely to be some coordinates that mean Frankfurt

Then I can start working with the cluster using WHERE... Makes sense?
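
A rough sketch of building such an internal view, assuming a pandas DataFrame and the released centroids; the column names mirror the example above and are otherwise invented:

import numpy as np

def add_cluster_column(df, feature_cols, centers, col="location_cluster"):
    # Distance from every row to every centroid, then pick the nearest.
    X = df[feature_cols].to_numpy(dtype=float)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    out = df.copy()
    out[col] = dists.argmin(axis=1)  # centroid index; naming comes later
    return out

# e.g. view = add_cluster_column(users, ["lat", "lon"], centers)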

sebastian commented 7 years ago

Then I can start working with the cluster using WHERE... Makes sense?

This is also how I would like clustering to potentially be used. The problem is just how to give the clusters useful identifiers, or ways in which they can be described. Or maybe I am overcomplicating things.

fjab commented 7 years ago

It would be cool indeed, but that should be a general analysts' problem, not an Aircloak-specific one. If we can get as far as I've outlined above with good results, that would be fantastic. Naming is an issue we can take care of later.

sebastian commented 7 years ago

But my concern is: how do you have any clue what the different clusters actually mean? I.e. you care about the low-income high-price-sensitivity cluster, but is it 1.21231 or 3.123132... :| Maybe I just don't understand enough about the output of clustering :)

fjab commented 7 years ago

Graphical representation would of course help immensely. But let's say you cluster in two dimensions, price sensitivity P and income I, and you set K=3. Then I presume you get three clusters, and the centroid of one of them will be low P & high I ("low" and "high" depending on your input range). You then name that cluster "whales".

/edit: In Paul's output above, of course you don't actually see the centroids of the clusters. I probably don't have enough of a clue about what the algorithm actually outputs either, but the centroids are certainly among the results.
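
A toy illustration of that naming step, with entirely invented thresholds for "low" P and "high" I:

def name_cluster(center, p_cut=0.3, i_cut=60_000):
    # center = (price_sensitivity, income); thresholds are made up.
    p, i = center
    if p < p_cut and i > i_cut:
        return "whales"
    return f"cluster(P={p:.2f}, I={i:.0f})"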

yoid2000 commented 7 years ago

Great thread, guys. In my output, you would see the centroids of the clusters, and other statistics for each cluster (mainly the number of users). Given this output, the analyst could play around and produce clusters he likes.

But yes being able to then use these in subsequent queries is a great idea! Very cool...

fjab commented 7 years ago

Cool. Looking forward to seeing that in action 👍

sebastian commented 7 years ago

When we assign properties to each user (for use in subsequent queries), we need to think through how these values are exported as SQL... Maybe we need functions for: