RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

How to manually set centroids for KMeans? #178

Closed halimkun closed 3 years ago

halimkun commented 3 years ago

How can I manually set the centroids for K-means?

I've looked for this in the existing documentation, but it's not there. Sometimes users need to set the centroids manually, based on the available data.

After all, the data a user tests may be subject to fixed criteria, for example criteria set by their company.

andrewdalpino commented 3 years ago

Currently, the only way to set the centroids is to learn them. However, we could perhaps implement a Seeder that initializes the centroids to some known values. Here's the Plus Plus (K-means++) seeder as a reference for the API.

https://docs.rubixml.com/latest/clusterers/seeders/plus-plus.html

The seeder is configurable as the seventh parameter of K-means (seeders apply to other clusterers as well).

https://docs.rubixml.com/latest/clusterers/k-means.html

The new Seeder would take a list of known centroids and output k of them when asked to generate seeds. Would this solve your problem @halimkun?
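
As a rough illustration of that idea, a seeder of this kind might look like the sketch below. It is plain PHP with no dependency on the library: the class name `PresetSeederSketch` and its `seed(array $samples, int $k)` signature are hypothetical and only mimic the general shape of a seeder, so treat it as an assumption-laden sketch rather than the library's implementation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of a preset seeder (not the library's class).
 * It stores known centroids and hands back the first $k of them.
 */
final class PresetSeederSketch
{
    /** @var list<list<int|float>> */
    private array $centroids;

    private int $dimensions;

    /**
     * @param list<list<int|float>> $centroids known centroid coordinates
     */
    public function __construct(array $centroids)
    {
        $centroids = array_values($centroids);

        if ($centroids === []) {
            throw new InvalidArgumentException('At least one centroid is required.');
        }

        $this->dimensions = count($centroids[0]);

        foreach ($centroids as $centroid) {
            if (count($centroid) !== $this->dimensions) {
                throw new InvalidArgumentException('All centroids must have the same dimensionality.');
            }
        }

        $this->centroids = $centroids;
    }

    /**
     * Return the first $k presets after checking them against the samples.
     *
     * @param list<list<int|float>> $samples
     * @return list<list<int|float>>
     */
    public function seed(array $samples, int $k): array
    {
        if ($k < 1 || $k > count($this->centroids)) {
            throw new InvalidArgumentException("Cannot produce $k seeds from the given presets.");
        }

        foreach ($samples as $sample) {
            if (count($sample) !== $this->dimensions) {
                throw new InvalidArgumentException('Sample and centroid dimensionality must match.');
            }
        }

        return array_slice($this->centroids, 0, $k);
    }
}

$seeder = new PresetSeederSketch([[4, 2, 3], [2, 3, 2], [2, 1, 3]]);

$seeds = $seeder->seed([[1, 2, 3], [3, 3, 3]], 2); // [[4, 2, 3], [2, 3, 2]]
```

Rejecting a `$k` larger than the number of presets is one possible design choice here; a real seeder could handle that case differently.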

For reference: https://stackoverflow.com/questions/38355153/initial-centroids-for-scikit-learn-kmeans-clustering

Things to consider:

  1. Dimensionality of the centroids and dataset must match
  2. Seeding requires a call to train() first. In cases where the centroids need to stay static, setting epochs on K-means to 0 could work.
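
To picture what static centroids amount to, the sketch below assigns each sample to its nearest fixed centroid by Euclidean distance and never updates the centroids. The function names `euclideanDistance()` and `nearestCentroid()` are hypothetical helpers written in plain PHP for illustration; they are not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Euclidean distance between two points of equal dimensionality.
 *
 * @param list<int|float> $a
 * @param list<int|float> $b
 */
function euclideanDistance(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Sample and centroid dimensionality must match.');
    }

    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

/**
 * Index of the centroid nearest to the sample. The centroids are only read, never changed.
 *
 * @param list<int|float> $sample
 * @param list<list<int|float>> $centroids
 */
function nearestCentroid(array $sample, array $centroids): int
{
    if ($centroids === []) {
        throw new InvalidArgumentException('At least one centroid is required.');
    }

    $best = 0;
    $bestDistance = INF;

    foreach (array_values($centroids) as $index => $centroid) {
        $distance = euclideanDistance($sample, $centroid);

        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $index;
        }
    }

    return $best;
}

$centroids = [[4, 2, 3], [2, 3, 2], [2, 1, 3]];

$cluster = nearestCentroid([4, 2, 2], $centroids); // 0, the first centroid is closest
```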

andrewdalpino commented 3 years ago

OK, we implemented a Preset seeder in the 1.1 branch (see https://github.com/RubixML/ML/commit/5063f5fa7e5e32036a6932aca59008ed70876d48). You can test it with `composer require rubix/ml:"1.1.x-dev"` or wait for the release within a couple of weeks.

https://github.com/RubixML/ML/blob/1.1/docs/clusterers/seeders/preset.md

If you decide to test it, please provide us with your feedback. Thank you :)

halimkun commented 3 years ago

Wow, it's available now. Really cool, mate, because previously I added a few lines to your KMeans.php file, like this:

. . .
// Added to KMeans.php: lets the caller set the centroids by hand.
public function setCentroids(array $centr) : void
{
    $this->centroids = $centr;
}
. . .

and added a condition like this to the train() method:

// Seed only when no centroids were set by hand; otherwise keep the manual ones.
if (empty($this->centroids)) {
    $this->centroids = $this->seeder->seed($dataset, $this->k);
}

For now it works here, though I'm not sure whether it affects other parts of the code. To use it, just call it after instantiating the class:

$estimator = new KMeans(3, 128, 1000, 10.0, 10, new Euclidean(), new PlusPlus());

$estimator->setCentroids([
    [4, 2, 3],
    [2, 3, 2],
    [2, 1, 3],
]);

But since it's now officially available in the source, I'll switch to it. Thanks, mate.

andrewdalpino commented 3 years ago

Nice @halimkun, your solution looks good. From the library's perspective, we didn't want to encourage directly overwriting the centroids after training. Here is an example of how a solution would look using the new Seeder. Note that epochs is set to 0 so that only the seeds are used and they are never updated. If you wish to use the presets as a "starting point", you can of course train as normal after seeding.

use Rubix\ML\Clusterers\KMeans;
use Rubix\ML\Kernels\Distance\Euclidean;
use Rubix\ML\Clusterers\Seeders\Preset;

$centroids = [
    [4,2,3],
    [2,3,2],
    [2,1,3],
];

$estimator = new KMeans(3, 128, 0, 10.0, 10, new Euclidean(), new Preset($centroids));

$estimator->train($dataset); // If necessary, use a dummy sample here with the correct dimensionality
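
To make the difference between fixed presets and presets used as a "starting point" more concrete, here is a plain-PHP sketch of a single textbook K-means update: each sample is assigned to its nearest centroid, then every centroid moves to the mean of its assigned samples. The function `updateCentroidsOnce()` and the sample values are hypothetical, and this full-batch step is an illustration only, not the library's own training loop.

```php
<?php

declare(strict_types=1);

/**
 * One full-batch K-means update starting from the given centroids.
 * Illustration only; the input centroids are left untouched.
 *
 * @param list<list<int|float>> $samples
 * @param list<list<int|float>> $centroids
 * @return list<list<float>>
 */
function updateCentroidsOnce(array $samples, array $centroids): array
{
    $sums = array_fill(0, count($centroids), array_fill(0, count($centroids[0]), 0.0));
    $counts = array_fill(0, count($centroids), 0);

    foreach ($samples as $sample) {
        $best = 0;
        $bestDistance = INF;

        // Find the nearest centroid using squared Euclidean distance.
        foreach ($centroids as $index => $centroid) {
            $distance = 0.0;

            foreach ($centroid as $d => $value) {
                $distance += ($sample[$d] - $value) ** 2;
            }

            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = $index;
            }
        }

        // Accumulate the sample into the running sum of its cluster.
        foreach ($sample as $d => $value) {
            $sums[$best][$d] += $value;
        }

        ++$counts[$best];
    }

    $updated = [];

    foreach ($centroids as $index => $centroid) {
        // A centroid with no assigned samples stays where it was.
        $updated[] = $counts[$index] === 0
            ? array_map('floatval', $centroid)
            : array_map(fn (float $sum): float => $sum / $counts[$index], $sums[$index]);
    }

    return $updated;
}

$presets = [[4, 2, 3], [2, 3, 2], [2, 1, 3]];
$samples = [[4, 2, 2], [4, 2, 4], [2, 3, 3], [2, 1, 3]];

$afterOneStep = updateCentroidsOnce($samples, $presets);
// [[4.0, 2.0, 3.0], [2.0, 3.0, 3.0], [2.0, 1.0, 3.0]]
```

With zero updates the presets stay exactly as given. After one update in this toy example, only the second centroid moves, from [2, 3, 2] to [2, 3, 3].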