Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0
603 stars 352 forks

Data with noise class #265

Open gusnunes opened 1 year ago

gusnunes commented 1 year ago

While using "RandomRBFGeneratorEvents" to cluster data, I noticed that when the stream contains noise, the calculation of Purity, for example, is wrong. This happens because in MembershipMatrix the "classmap" doesn't contain the key "-1", which is supposed to map the noise label to the last "workcluster" index. Instead, the noise label key is the number of clusters, and it can be mapped to any "workcluster". Line 52 of the F1 measure is therefore useless: "mm.hasNoiseClass()" always returns false, so the number of classes stays the same.

For example, suppose a cluster has 2 instances of a real class and 5 noise instances. The current implementation would calculate the purity of that group as 5/7, because the noise index is not ignored in "mm.getClusterClassWeight()" during the "for loop". This also happens when the group contains only noise instances, which is completely wrong.
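
To make the arithmetic in this example concrete, here is a minimal, standalone sketch. It is not MOA code: the class `PuritySketch`, both method names, and the weight-array layout (noise in the last column) are hypothetical and only for illustration. It also assumes one possible convention for the corrected version, namely that noise is excluded from both the numerator and the denominator and that a noise-only group gets purity 0. The actual fix may well choose differently.

```java
// Illustrative sketch only: not MOA code, all names here are hypothetical.
final class PuritySketch {

    /** Purity as described in the report: the noise column is treated like any other class. */
    static double purityCountingNoise(double[] classWeights) {
        double max = 0.0;
        double total = 0.0;
        for (double w : classWeights) {
            max = Math.max(max, w);
            total += w;
        }
        return total == 0.0 ? 0.0 : max / total;
    }

    /**
     * Purity with the noise column skipped.
     * Assumption: noise is excluded from both the dominant-class weight and the total,
     * and a group with no real-class weight is given purity 0.
     */
    static double purityIgnoringNoise(double[] classWeights, int noiseIndex) {
        double max = 0.0;
        double total = 0.0;
        for (int i = 0; i < classWeights.length; i++) {
            if (i == noiseIndex) {
                continue; // skip the noise column
            }
            max = Math.max(max, classWeights[i]);
            total += classWeights[i];
        }
        return total == 0.0 ? 0.0 : max / total;
    }

    public static void main(String[] args) {
        int noiseIndex = 2;                    // hypothetical layout: last column is noise
        double[] mixed = {2.0, 0.0, 5.0};      // 2 real instances, 5 noise instances
        double[] noiseOnly = {0.0, 0.0, 4.0};  // a group made only of noise

        System.out.println("mixed, counting noise:      " + purityCountingNoise(mixed));
        System.out.println("mixed, ignoring noise:      " + purityIgnoringNoise(mixed, noiseIndex));
        System.out.println("noise-only, counting noise: " + purityCountingNoise(noiseOnly));
        System.out.println("noise-only, ignoring noise: " + purityIgnoringNoise(noiseOnly, noiseIndex));
    }
}
```

With noise counted, the mixed group comes out at 5/7 and the noise-only group at a perfect 1, which is the behaviour reported above. Skipping the noise column changes both values.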