EdwardRaff / JSAT

Java Statistical Analysis Tool, a Java library for Machine Learning
GNU General Public License v3.0

Using MeanShift with given bandwidth? #93

Open brainbytes42 opened 3 years ago

brainbytes42 commented 3 years ago

Maybe I'm missing something - but is it possible to use MeanShift clustering with a given (fixed) bandwidth?

As far as I can follow the implementation, even if I pass a KDE with a preset bandwidth to the MeanShift constructor, that bandwidth gets overwritten in MeanShift's cluster method by the call to mkde.setUsingData(dataSet, parallel):

@Override
public int[] cluster(DataSet dataSet, boolean parallel, int[] designations)
{
    // ...

    final KernelFunction k = mkde.getKernelFunction();
    mkde.setUsingData(dataSet, parallel);
    mkde.scaleBandwidth(scaleBandwidthFactor);

    // ...
}

Scaling the bandwidth is not sufficient for me, because the scaled bandwidth is still derived from the estimate for each data set and is therefore not fixed. And since this happens inside the cluster step, there seems to be no way to intercept or reset the bandwidth.

A simple example of how I've tried to use MeanShift:

SimpleDataSet dataSet = ...
double sigma = ...
MetricKDE metricKDE = new MetricKDE(GaussKF.getInstance(), new EuclideanDistance());
metricKDE.setBandwith(sigma); // <-- gets ignored!
MeanShift meanShift = new MeanShift(metricKDE);
List<List<DataPoint>> clusters = meanShift.cluster(dataSet); // <-- cluster() triggers bandwidth estimation

Of course, it makes sense that the kernel density estimation wants to estimate the bandwidth, but in my case I need a consistent bandwidth across multiple runs and need 'only' the clustering step for the data.
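
To make the request concrete, here is a minimal sketch of the 'clustering step only' behaviour with a caller-supplied bandwidth. It is deliberately independent of JSAT and is not JSAT's implementation: the class FixedBandwidthMeanShift, its method names, the plain double[][] input, the Gaussian kernel with Euclidean distance, and the merge radius of bandwidth / 2 are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, JSAT-independent sketch: mean shift with a Gaussian kernel
 * and a bandwidth that is supplied by the caller and never re-estimated.
 */
final class FixedBandwidthMeanShift {

    private FixedBandwidthMeanShift() {}

    /** Moves each point to its density mode and returns one cluster label per point. */
    static int[] cluster(double[][] data, double bandwidth, double tol, int maxIter) {
        if (bandwidth <= 0) {
            throw new IllegalArgumentException("bandwidth must be positive");
        }
        List<double[]> modes = new ArrayList<>();
        int[] labels = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            double[] mode = seekMode(data[i], data, bandwidth, tol, maxIter);
            // Assumption: modes closer than bandwidth / 2 count as the same cluster.
            labels[i] = assignMode(mode, modes, bandwidth / 2);
        }
        return labels;
    }

    /** Iterates the Gaussian-weighted mean until the shift is smaller than tol. */
    static double[] seekMode(double[] start, double[][] data, double h, double tol, int maxIter) {
        double[] x = start.clone();
        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = new double[x.length];
            double weightSum = 0;
            for (double[] p : data) {
                double w = Math.exp(-squaredDistance(x, p) / (2 * h * h));
                weightSum += w;
                for (int d = 0; d < x.length; d++) {
                    next[d] += w * p[d];
                }
            }
            if (weightSum == 0) {
                break; // all weights underflowed; keep the current position
            }
            for (int d = 0; d < x.length; d++) {
                next[d] /= weightSum;
            }
            double moved = Math.sqrt(squaredDistance(x, next));
            x = next;
            if (moved < tol) {
                break;
            }
        }
        return x;
    }

    /** Returns the index of an existing mode within mergeRadius, or registers a new one. */
    static int assignMode(double[] mode, List<double[]> modes, double mergeRadius) {
        for (int c = 0; c < modes.size(); c++) {
            if (Math.sqrt(squaredDistance(mode, modes.get(c))) < mergeRadius) {
                return c;
            }
        }
        modes.add(mode);
        return modes.size() - 1;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

A call such as FixedBandwidthMeanShift.cluster(points, sigma, 1e-6, 200) uses the same sigma on every run, whatever data is passed in. That is the behaviour I would like to get from MeanShift when the KDE already carries a bandwidth.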

Any help appreciated - thank you.