Some refactoring is needed, and we are more or less ready for these changes. I have put them all together here, but they can be split into separate issues later.
[ ] Refactor init
[ ] Decide on algorithm types (or "distance type can be different from X")
[ ] Document algorithms separately
[ ] Drop extra fields from result.
Below is a short description of these problems.
Refactor init.
Currently the init handling is redundant: we have both init and init_k, plus some convoluted logic for choosing between them. The correct solution is to use multiple dispatch and add a function create_seed that accepts the init argument (and all other necessary arguments). If init is a String or Symbol, it should fall through to smart_init; if it is Nothing, it should default to kmeans++; otherwise it should return a deepcopy of init.
All of this should happen in kmeans (before kmeans!), so that a duplicate copy is avoided.
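The dispatch rules above could be sketched roughly as follows. This is only an illustration under assumptions: the argument order of create_seed, the keyword rng, and the body of smart_init are all hypothetical (smart_init here is a stand-in that picks random columns, not the package's real seeding code).

```julia
using Random

# Stand-in for the package's seeding routine. The real smart_init may have a
# different signature; this stub ignores `method` and picks k distinct columns.
function smart_init(X::AbstractMatrix, k::Int, method; rng = Random.default_rng())
    idx = randperm(rng, size(X, 2))[1:k]
    return X[:, idx]
end

# init given by name (String or Symbol): delegate to the seeding routine
create_seed(init::Union{Symbol, AbstractString}, X, k; kwargs...) =
    smart_init(X, k, init; kwargs...)

# no init given: fall back to the default seeding, kmeans++
create_seed(::Nothing, X, k; kwargs...) = create_seed("kmeans++", X, k; kwargs...)

# anything else is treated as user-supplied centroids: copy once, so the
# caller's array is never mutated by the in-place algorithm
create_seed(init, X, k; kwargs...) = deepcopy(init)
```

With this in place, kmeans would call create_seed exactly once and hand the result to kmeans!, so no second copy is needed.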
Decide on algorithm types (distance may have a different type)
Currently we infer the distance type from the type of the design matrix. This can be wrong: for example, if the eltype of X is RGB or Complex, the distance can have a different type, usually Float64 or Float32.
This can be solved by making all algorithms parametric, for example Lloyd{Float64, Float64}, and we can define something like this:
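A minimal sketch of what such a parametric definition might look like. The meaning of the two parameters is an assumption (element type of the design matrix first, distance type second), and AbstractKMeansAlg and distance_type are hypothetical names used only for illustration.

```julia
abstract type AbstractKMeansAlg end

# Assumed layout: T is the eltype of the design matrix, D the distance type.
struct Lloyd{T, D <: AbstractFloat} <: AbstractKMeansAlg end

# Convenience constructors: for floating-point data the distance type can
# simply match the element type; Float64 is used when nothing is specified.
Lloyd{T}() where {T <: AbstractFloat} = Lloyd{T, T}()
Lloyd() = Lloyd{Float64, Float64}()

# The distance type is read off the algorithm, not inferred from the data.
distance_type(::Lloyd{T, D}) where {T, D} = D
```

For complex-valued data one would then write Lloyd{ComplexF64, Float64}(), which keeps the distance type explicit.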
This makes things somewhat more verbose and ties the algorithm to the design matrix type, but on the other hand it is more Julia-like.
Alternatively, since we currently infer everything from the matrices themselves, the distance type could be a kmeans argument. I think it can work, but it looks weird.
Better documentation
I think it would be better for users to find a separate documentation page that describes all algorithms and their usage, especially since we will soon add stochastic algorithms (coresets and minibatch). It can be organized as follows:
Drop extra fields from result
Currently the result has many redundant fields that are not used. I think they should not be there, since they can always be calculated from the data already in the result. This extra information should not be computed inside kmeans; there should be a separate set of utility functions that can be invoked if the need arises.
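As a rough illustration of the idea, derived quantities can live in small utility functions over a slimmed-down result. Everything named here (MinimalResult, cluster_counts, point_costs, total_cost) is hypothetical, and squared Euclidean distance is assumed as the cost.

```julia
# Hypothetical slim result: only what cannot be recomputed cheaply.
struct MinimalResult{C <: AbstractMatrix, A <: AbstractVector{Int}}
    centers::C        # one column per cluster
    assignments::A    # cluster index for each column of X
end

# Number of points assigned to each cluster
function cluster_counts(r::MinimalResult)
    counts = zeros(Int, size(r.centers, 2))
    for a in r.assignments
        counts[a] += 1
    end
    return counts
end

# Squared Euclidean distance from each point to its assigned center
function point_costs(r::MinimalResult, X::AbstractMatrix)
    return [sum(abs2, view(X, :, i) .- view(r.centers, :, r.assignments[i]))
            for i in axes(X, 2)]
end

# Total cost is just the sum of the per-point costs
total_cost(r::MinimalResult, X::AbstractMatrix) = sum(point_costs(r, X))
```

Callers who need counts or costs pay for them only when they ask, and kmeans itself stays lean.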