elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Per category evaluation of a clustering #23

Closed parmegv closed 8 years ago

parmegv commented 8 years ago

Apart from the evaluations of the complete clustering, it would be nice to be able to get per-label statistics, to understand how the quality of the individual categories affects the global clustering quality.

What do you think?

kno10 commented 8 years ago

ELKI computes a confusion matrix (i.e. cluster-label correspondences), because most evaluation measures are based on this matrix.

But I don't think it is a good idea to use it this way. In real use, you will not have labels. By looking at the labels (or even evaluation measures) too much, you risk overfitting your method and parameters to one particular data set. Unfortunately, we have seen a number of published algorithms that apparently only work within very narrow parameter ranges and on very few data sets.

What would make more sense is a general "cluster explanation" functionality. For example, when clustering text data, the most distinguishing words of each cluster are of interest, even in an unsupervised scenario.
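For readers unfamiliar with the term, the cluster-label confusion matrix mentioned above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `ContingencySketch` and `contingency` are hypothetical names, not ELKI classes or methods, and cluster and label ids are assumed to be already mapped to integers starting at 0.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of ELKI: counts cluster/label co-occurrences. */
final class ContingencySketch {
  /**
   * Builds a (k x m) table where entry [c][l] is the number of objects
   * assigned to cluster c that carry ground-truth label l.
   * Assumes cluster ids lie in 0..k-1 and label ids in 0..m-1.
   */
  static int[][] contingency(int[] clusters, int[] labels, int k, int m) {
    if (clusters.length != labels.length) {
      throw new IllegalArgumentException("clusters and labels differ in length");
    }
    int[][] table = new int[k][m];
    for (int i = 0; i < clusters.length; i++) {
      table[clusters[i]][labels[i]]++;
    }
    return table;
  }

  public static void main(String[] args) {
    // Toy data: 8 objects, 3 clusters, 3 labels.
    int[] clusters = {0, 0, 0, 1, 1, 2, 2, 2};
    int[] labels   = {0, 0, 1, 1, 1, 0, 2, 2};
    for (int[] row : contingency(clusters, labels, 3, 3)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Each row is a cluster and each column a label, so a cluster that matches one label well shows a single dominant entry in its row.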

parmegv commented 8 years ago

I am clustering network traces using features from traffic analysis (i.e. number of consecutive packets in each direction, number of incoming packets out of the total number of packets, and derived...).

What I would like to do is see whether, given a certain evaluation measure for the whole clustering, some websites are clustered significantly better than others.

The hypothesis I'm testing is whether I can distinguish websites without building models of them beforehand. I understand that I must not tune a clustering so that some categories come out very good and others not, because of the overfitting you mention.

Does that make sense? I'm fairly new to this area, so I may be making some reasoning mistakes, hehehe.

kno10 commented 8 years ago

Most measures are a sum over individual components (per cluster/label), so they can be computed separately for each cluster. But most will then just boil down to counting, i.e. checking whether a cluster has a high overlap with a label. I don't expect this to be generally useful enough to become standard functionality, but it is straightforward to compute yourself.
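As a rough illustration of how this "counting" could look per label, the sketch below takes a cluster-by-label count table and reports, for each label, the best F1 score it reaches with any single cluster. The class and method names are hypothetical, the table layout (rows are clusters, columns are labels) is an assumption, and F1 is just one possible overlap measure, not the one ELKI necessarily uses.

```java
/** Hypothetical helper, not part of ELKI: per-label overlap scores. */
final class PerLabelSketch {
  /**
   * For each label (column), returns the best F1 over all clusters (rows).
   * With overlap n, cluster size a and label size b, precision is n/a and
   * recall is n/b, so F1 = 2*n / (a + b).
   */
  static double[] bestF1PerLabel(int[][] table) {
    int k = table.length;
    int m = table[0].length;
    int[] clusterSize = new int[k];
    int[] labelSize = new int[m];
    for (int c = 0; c < k; c++) {
      for (int l = 0; l < m; l++) {
        clusterSize[c] += table[c][l];
        labelSize[l] += table[c][l];
      }
    }
    double[] best = new double[m];
    for (int l = 0; l < m; l++) {
      for (int c = 0; c < k; c++) {
        if (table[c][l] == 0) {
          continue; // no overlap, F1 is 0
        }
        double f1 = 2.0 * table[c][l] / (clusterSize[c] + labelSize[l]);
        best[l] = Math.max(best[l], f1);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    int[][] table = {{2, 1, 0}, {0, 2, 0}, {1, 0, 2}};
    double[] f1 = bestF1PerLabel(table);
    for (int l = 0; l < f1.length; l++) {
      System.out.printf("label %d: best F1 = %.3f%n", l, f1[l]);
    }
  }
}
```

Labels with a low best score are the ones that no single cluster captures well, which is the kind of per-category comparison asked about here.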

https://github.com/elki-project/elki/blob/master/elki/src/main/java/de/lmu/ifi/dbs/elki/evaluation/clustering/ClusterContingencyTable.java

This class computes a cluster-label confusion matrix. Most measures are derived from that. You could use this as a starting point for your specific evaluation needs. For example, you can look for the largest values in this matrix, or the largest relative to the margin sizes. The method averageSymmetricGini computes per-cluster and per-label purity values; these may be what you need. But it only returns the aggregate (mean and variance), whereas you want the individual values.
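A minimal sketch of keeping the individual purity values, rather than only an aggregate, might look as follows. It does not call ELKI: `PuritySketch` is a hypothetical class working on a plain count table, and the Gini-style purity used here (the sum of squared relative frequencies within a row) is an assumption about what such a purity value looks like, so check it against the linked source before relying on it.

```java
/** Hypothetical helper, not part of ELKI: individual Gini-style purities. */
final class PuritySketch {
  /**
   * Purity of each row: sum over the row of (count / rowSum)^2.
   * 1.0 means the row is concentrated in one column; empty rows get 0.
   */
  static double[] purityPerRow(int[][] table) {
    double[] purity = new double[table.length];
    for (int r = 0; r < table.length; r++) {
      long sum = 0;
      for (int v : table[r]) {
        sum += v;
      }
      if (sum == 0) {
        continue; // empty cluster or label
      }
      for (int v : table[r]) {
        double rel = v / (double) sum;
        purity[r] += rel * rel;
      }
    }
    return purity;
  }

  /** Swaps rows and columns, so per-label purity reuses purityPerRow. */
  static int[][] transpose(int[][] table) {
    int[][] t = new int[table[0].length][table.length];
    for (int r = 0; r < table.length; r++) {
      for (int c = 0; c < table[r].length; c++) {
        t[c][r] = table[r][c];
      }
    }
    return t;
  }

  public static void main(String[] args) {
    int[][] table = {{2, 1, 0}, {0, 2, 0}, {1, 0, 2}}; // rows: clusters, columns: labels
    System.out.println(java.util.Arrays.toString(purityPerRow(table)));
    System.out.println(java.util.Arrays.toString(purityPerRow(transpose(table))));
  }
}
```

The first printed array holds one purity per cluster and the second one purity per label, so poorly separated categories can be read off directly.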