Open bdelespierre opened 3 years ago
It would be good to offer this as an option, because the added points also affect how the clusters are formed: clustering 100,000 users and clustering 100,100 users may give different results. So it would be nice to have both options (resuming a previous run or clustering from scratch) when using the library.
Yes. I would propose something like:
$algo = new Kmeans\Algorithm(new Kmeans\RandomInitialization());
$result = $algo->clusterize($points, $nbClusters);
$serialized = serialize($result);
// later...
$previousRun = unserialize($serialized);
$result = $previousRun->resume($newPoints);
looks good 👍
I've been thinking about a result object for Algorithm::clusterize. What do you think of this API?
<?php

namespace Bdelespierre\Kmeans\Interfaces;

interface ClusterizationResultInterface extends \Serializable
{
    public function hasReachedConvergence(): bool;

    /**
     * @return int<0, max>
     */
    public function iterationsCount(): int;

    public function getClusters(): ClusterCollectionInterface;

    public function resume(PointCollectionInterface $newPoints): self;
}
Sorry for the late reply. (I confirmed that it was committed to the PR.)
I think it's fine, but we'll need to do some more work to be confident about the interface design.
It's not implemented in #27. I plan to implement it later.
I believe it would be nice to be able to resume the algorithm's execution after it has completed. This would be useful as new points are added, so that previous iterations don't need to be re-run.
Example: I have clustered my 100 000 users into 5 clusters. Since the last clustering, 100 new users have been added. Most of them are probably already very close to the existing clusters' centroids. Hence, I should be able to resume clustering the same dataset PLUS the new users to save time.
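To make the idea concrete, here is a rough sketch of what resuming could look like: the usual assignment/update loop, but started from the previous run's centroids instead of a fresh initialization. This is not the library's code. The function name `resumeKmeans`, the use of plain arrays for points, and the squared Euclidean distance are all assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): runs k-means iterations
 * starting from existing centroids rather than a fresh initialization.
 *
 * @param list<list<float>> $points    old points plus the newly added ones
 * @param list<list<float>> $centroids centroids from the previous run
 * @return array{centroids: list<list<float>>, iterations: int}
 */
function resumeKmeans(array $points, array $centroids, int $maxIterations = 100): array
{
    for ($iteration = 1; $iteration <= $maxIterations; $iteration++) {
        // Assignment step: attach every point to its nearest centroid.
        $groups = array_fill(0, count($centroids), []);
        foreach ($points as $point) {
            $best = 0;
            $bestDistance = INF;
            foreach ($centroids as $index => $centroid) {
                $distance = 0.0;
                foreach ($point as $axis => $value) {
                    $distance += ($value - $centroid[$axis]) ** 2;
                }
                if ($distance < $bestDistance) {
                    $bestDistance = $distance;
                    $best = $index;
                }
            }
            $groups[$best][] = $point;
        }

        // Update step: move each centroid to the mean of its group.
        // An empty group keeps its previous centroid.
        $updated = $centroids;
        foreach ($groups as $index => $group) {
            if ($group === []) {
                continue;
            }
            $sum = array_fill(0, count($group[0]), 0.0);
            foreach ($group as $point) {
                foreach ($point as $axis => $value) {
                    $sum[$axis] += $value;
                }
            }
            $size = count($group);
            $updated[$index] = array_map(fn (float $total): float => $total / $size, $sum);
        }

        // Converged: this pass did not move any centroid.
        if ($updated === $centroids) {
            return ['centroids' => $centroids, 'iterations' => $iteration];
        }
        $centroids = $updated;
    }

    return ['centroids' => $centroids, 'iterations' => $maxIterations];
}

// Previous run converged on these centroids for the first four points.
$previousCentroids = [[0.0, 1.0], [10.0, 11.0]];
$points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]];

// One new point arrives, close to an existing centroid.
$points[] = [0.0, 1.0];

$result = resumeKmeans($points, $previousCentroids);
echo $result['iterations'], PHP_EOL; // prints 1: a single pass confirms convergence
```

When the new points land near existing centroids, as in the 100 new users case, the resumed run should need only a few passes. It is still not guaranteed to match what a run from scratch on the full dataset would produce.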