Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0

Clustream WithKmeans: null center and (radius, weight = 0) for BOTH micro and macro clusters #190

Closed by onofricamila 4 years ago

onofricamila commented 4 years ago

First of all: I am using moa-release-2019.05.0-bin/moa-release-2019.05.0/lib/moa.jar (obtained from https://moa.cms.waikato.ac.nz/downloads/).

Now, to the point: I am trying to use the moa.clusterers.clustream.WithKmeans stream clustering algorithm, and I have no idea why both the micro and macro clusters show a null center and a radius and weight of 0 ...

import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.clustream.WithKmeans;

public class TestingClustream {
    // Build an instance with `size` attributes, each drawn uniformly from [0, 1).
    static DenseInstance randomInstance(int size) {
        DenseInstance instance = new DenseInstance(size);
        for (int idx = 0; idx < size; idx++) {
            instance.setValue(idx, Math.random());
        }
        return instance;
    }

    public static void main(String[] args) {
        WithKmeans wkm = new WithKmeans();
        wkm.kOption.setValue(5);                // number of macro clusters
        wkm.maxNumKernelsOption.setValue(300);  // maximum number of micro clusters
        wkm.resetLearningImpl();
        // Feed 10000 random 2-dimensional instances to the clusterer.
        for (int i = 0; i < 10000; i++) {
            wkm.trainOnInstanceImpl(randomInstance(2));
        }
        // Macro clusters (k-means result) and the underlying micro clusters.
        Clustering clusteringResult = wkm.getClusteringResult();
        Clustering microClusteringResult = wkm.getMicroClusteringResult();
    }
}


I have read the source code many times, and it seems to me that I am calling the correct functions in the correct order. I do not know what I am missing. Any feedback is welcome!


EDIT: Thanks to Anony-Mousse on Stack Overflow, I noticed that the fields I was inspecting are unused, and likely come from some parent class with a different purpose. Using the getter methods getCenter(), getWeight(), and getRadius(), I could get the values.
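
For anyone landing here with the same symptom, here is a minimal, self-contained sketch of how a field can stay null or 0 while the getter still returns a meaningful value. It assumes the cluster keeps running sums (a point count and a per-dimension linear sum, in the spirit of CluStream's cluster features) and derives the center and weight on demand. The class names SphereSketch and KernelSketch and all of their members are hypothetical and are not MOA's actual code.

```java
import java.util.Arrays;

// Hypothetical parent class: declares fields that a subclass may never populate.
class SphereSketch {
    protected double[] center; // stays null unless something assigns it
    protected double weight;   // stays 0.0 unless something assigns it

    double[] getCenter() { return center; }
    double getWeight() { return weight; }
}

// Hypothetical kernel: stores running sums and computes values inside the getters.
class KernelSketch extends SphereSketch {
    private double n;                 // number of points absorbed so far
    private final double[] linearSum; // per-dimension sum of absorbed points

    KernelSketch(int dimensions) {
        linearSum = new double[dimensions];
    }

    void insert(double[] point) {
        n++;
        for (int i = 0; i < linearSum.length; i++) {
            linearSum[i] += point[i];
        }
    }

    @Override
    double[] getCenter() {
        // Center is derived from the sums; the inherited field is never touched.
        double[] c = new double[linearSum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = linearSum[i] / n;
        }
        return c;
    }

    @Override
    double getWeight() {
        return n;
    }
}

class KernelSketchDemo {
    public static void main(String[] args) {
        KernelSketch k = new KernelSketch(2);
        k.insert(new double[] {1.0, 3.0});
        k.insert(new double[] {3.0, 5.0});
        System.out.println("field center  = " + Arrays.toString(k.center));
        System.out.println("getCenter()   = " + Arrays.toString(k.getCenter()));
        System.out.println("field weight  = " + k.weight);
        System.out.println("getWeight()   = " + k.getWeight());
    }
}
```

In this sketch a debugger would show center = null and weight = 0.0 on the object, because only the getters know about the running sums. If MOA's clusters follow a similar pattern, that would explain why inspecting the fields directly is misleading.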

Now, are the values I got "reliable"?

Moreover, what is the purpose of the weight field? It seemed to me that it represented the number of elements in each cluster, but sometimes I get a non-integer value. When the weights are integers, the micro cluster weights do not sum to the total number of samples, and the macro cluster weights do not sum to the number of micro clusters. Thanks in advance!