Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0

java.lang.NullPointerException when trying stream clustering algorithm denstream.WithDBSCAN #191

Closed onofricamila closed 3 years ago

onofricamila commented 4 years ago

I do not know what I am missing ... I chose to use the default parameters configuration, and I do not know why I am getting the error. Any help would be appreciated.

import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.denstream.WithDBSCAN;

public class TestingDenstream {
    static DenseInstance randomInstance(int size) {
        DenseInstance instance = new DenseInstance(size);
        for (int idx = 0; idx < size; idx++) {
            instance.setValue(idx, Math.random());
        }
        return instance;
    }
    public static void main(String[] args) {
        WithDBSCAN withDBSCAN = new WithDBSCAN();
        withDBSCAN.resetLearningImpl();
        for (int i = 0; i < 1500; i++) {
            DenseInstance d = randomInstance(2);
            withDBSCAN.trainOnInstanceImpl(d);
        }
        Clustering clusteringResult = withDBSCAN.getClusteringResult();
        Clustering microClusteringResult = withDBSCAN.getMicroClusteringResult();

        System.out.println(clusteringResult);

    }
}

image

celikmustafa89 commented 4 years ago

I guess you need to set the instances header. If you debug the code, you will find which part is null: the instance's dataset is null, so try setting it.

onofricamila commented 4 years ago

image

The instance header for every instance d is null, but not the data ... I did not mention that in my question because I did not think of it as the cause of the problem. I used the same data generator, and the same code for StreamKM, and there wasn't any problem with that.

This code works:

import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.streamkm.StreamKM;

public class TestingStreamKM {
    static DenseInstance randomInstance(int size) {
        DenseInstance instance = new DenseInstance(size);
        for (int idx = 0; idx < size; idx++) {
            instance.setValue(idx, Math.random());
        }
        return instance;
    }
    public static void main(String[] args) {
        StreamKM streamKM = new StreamKM();
        streamKM.numClustersOption.setValue(5); // default setting
        streamKM. resetLearningImpl();
        for (int i = 0; i < 1000; i++) {
            DenseInstance d = randomInstance(2);
            streamKM.trainOnInstanceImpl(d);
        }
        Clustering result = streamKM.getClusteringResult();
    }
}

image


Now, if the null instance header is the problem, where should I set it? It must be the same for the whole dataset ...

Thanks for answering so fast!

celikmustafa89 commented 4 years ago

I have updated the code and it is working. As I mentioned, you have to assign a header to your instances. Here is the Stack Overflow link: https://stackoverflow.com/questions/58869442/java-lang-nullpointerexception-when-trying-moa-stream-clustering-algorithm-denst/58910104#58910104

Here is the updated code:

import com.yahoo.labs.samoa.instances.Attribute;
import com.yahoo.labs.samoa.instances.DenseInstance;
import com.yahoo.labs.samoa.instances.Instances;
import com.yahoo.labs.samoa.instances.InstancesHeader;
import moa.cluster.Clustering;
import moa.clusterers.denstream.WithDBSCAN;

import java.util.ArrayList;
import java.util.Random;

public class TestingDenstream {
    static DenseInstance randomInstance(int size) {

        // build the list of named features that makes up the InstancesHeader
        ArrayList<Attribute> attributes = new ArrayList<Attribute>();
        for (int i = 0; i < size; i++) {
            attributes.add(new Attribute("feature_" + i));
        }
        // create the instance header from the generated feature names
        InstancesHeader streamHeader = new InstancesHeader(
                new Instances("Mustafa Çelik Instance", attributes, size));

        // generate random data, one value per feature
        double[] data = new double[size];
        Random random = new Random();
        for (int i = 0; i < size; i++) {
            data[i] = random.nextDouble();
        }

        // create an instance with weight 1.0 and assign the data
        DenseInstance inst = new DenseInstance(1.0, data);

        // assign the instance header (feature names) so it is no longer null
        inst.setDataset(streamHeader);

        return inst;
    }

    public static void main(String[] args) {
        WithDBSCAN withDBSCAN = new WithDBSCAN();
        withDBSCAN.resetLearningImpl();
        withDBSCAN.initialDBScan();
        for (int i = 0; i < 1500; i++) {
            DenseInstance d = randomInstance(5);

            withDBSCAN.trainOnInstanceImpl(d);
        }
        Clustering clusteringResult = withDBSCAN.getClusteringResult();
        Clustering microClusteringResult = withDBSCAN.getMicroClusteringResult();

        System.out.println(clusteringResult);
    }
}

Here is a screenshot of the debug session; as you can see, the clustering result is:

Screen Shot 2019-11-18 at 10 52 52 AM
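
On the question of where to set the header so that it is the same for the whole dataset: one option is to build the InstancesHeader once and pass it to every instance, instead of rebuilding it on each call. The following is only a sketch under that assumption. It uses the same MOA calls that appear in the code above; the class name SharedHeaderExample, the helper buildHeader, the two-argument randomInstance, and the relation name "random-stream" are made up for illustration.

```java
import com.yahoo.labs.samoa.instances.Attribute;
import com.yahoo.labs.samoa.instances.DenseInstance;
import com.yahoo.labs.samoa.instances.Instances;
import com.yahoo.labs.samoa.instances.InstancesHeader;
import moa.clusterers.denstream.WithDBSCAN;

import java.util.ArrayList;
import java.util.Random;

// Hypothetical example class: not part of MOA.
public class SharedHeaderExample {

    // Builds one header with `size` numeric attributes named feature_0, feature_1, ...
    static InstancesHeader buildHeader(int size) {
        ArrayList<Attribute> attributes = new ArrayList<Attribute>();
        for (int i = 0; i < size; i++) {
            attributes.add(new Attribute("feature_" + i));
        }
        // "random-stream" is an arbitrary relation name chosen for this sketch
        return new InstancesHeader(new Instances("random-stream", attributes, size));
    }

    // Creates a random instance whose length matches the shared header
    static DenseInstance randomInstance(InstancesHeader header, Random random) {
        double[] data = new double[header.numAttributes()];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextDouble();
        }
        DenseInstance inst = new DenseInstance(1.0, data);
        inst.setDataset(header); // every instance points at the same header
        return inst;
    }

    public static void main(String[] args) {
        InstancesHeader header = buildHeader(2); // built once for the whole stream
        Random random = new Random();

        // Same clusterer calls as in the code above
        WithDBSCAN withDBSCAN = new WithDBSCAN();
        withDBSCAN.resetLearningImpl();
        withDBSCAN.initialDBScan();
        for (int i = 0; i < 1500; i++) {
            withDBSCAN.trainOnInstanceImpl(randomInstance(header, random));
        }
        System.out.println(withDBSCAN.getClusteringResult());
    }
}
```

Building the header once also keeps the number of attributes and the length of the data array in sync, because the data array is sized from the header itself.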

celikmustafa89 commented 4 years ago

Different algorithms have different requirements. StreamKM can work without a header assigned to the instances, but WithDBSCAN needs the header, so you must assign it. They may inherit from the same classes, but they use different data structures and work differently.

Debug your code and try to fill the null parameters. It is a good way to find the gaps.
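
To illustrate why one algorithm can tolerate a missing header while another throws, here is a toy sketch in plain Java. It does not use MOA and does not reproduce MOA's internals; ToyHeader, ToyInstance, and the three methods are hypothetical stand-ins. It only shows the general pattern: code that reads the raw values never touches the header, while code that goes through the header fails with a NullPointerException when the header was never set.

```java
import java.util.Objects;

// Hypothetical stand-ins only: none of these types exist in MOA.
public class HeaderDependencySketch {

    // Stand-in for an instance header: just a list of attribute names
    static final class ToyHeader {
        final String[] attributeNames;
        ToyHeader(String... attributeNames) { this.attributeNames = attributeNames; }
    }

    // Stand-in for an instance: raw values plus an optional header
    static final class ToyInstance {
        final double[] values;
        ToyHeader header; // stays null unless the caller sets it
        ToyInstance(double... values) { this.values = values; }
    }

    // Reads only the raw values, so a missing header is harmless
    static int dimensionFromValues(ToyInstance inst) {
        return inst.values.length;
    }

    // Reads the dimension through the header, so a null header throws NullPointerException
    static int dimensionFromHeader(ToyInstance inst) {
        return inst.header.attributeNames.length;
    }

    // Fail-fast variant: same NullPointerException, but with a message naming the cause
    static int checkedDimensionFromHeader(ToyInstance inst) {
        Objects.requireNonNull(inst.header, "instance header is not set");
        return inst.header.attributeNames.length;
    }

    public static void main(String[] args) {
        ToyInstance inst = new ToyInstance(0.3, 0.7);
        System.out.println(dimensionFromValues(inst)); // works without a header

        try {
            dimensionFromHeader(inst);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException without a header");
        }

        inst.header = new ToyHeader("feature_0", "feature_1");
        System.out.println(dimensionFromHeader(inst)); // works once the header is set
    }
}
```

A check like the one in checkedDimensionFromHeader, placed in your own code before training, points directly at the missing header instead of failing somewhere deep inside the library.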

onofricamila commented 4 years ago

Hey, thanks! You really helped me out here :)

I have a few questions left to ask:

  1. Why is the macro clusters' weight field 0 in the debugger?

image

If you open the nested micro clusters for a given macro cluster, you will see they have a weight defined.

image

  2. Furthermore, the micro clusters have a null center and radius (which is weird, because the MicroCluster class extends CFCluster, which extends SphereCluster). I am able to get those values using the getter methods, but it caught my attention.

image

  3. Also, the N value of a macro cluster is not the sum of the N values of the micro clusters it contains ...

image

It seems something strange is happening ...

Thanks again for the support.