Closed PeterDSteinberg closed 3 years ago
Hi @TomAugspurger - We have Phase II funding to develop large matrix spectral clustering / embedding models. Please take a look at the models below and the linked implementations for numpy in pysptools
. We need to do the same kind of thing for dask. These four models are listed in the proposal, but feel free to spend some time looking around at other related models in pysptools
to see level of effort involved in supporting those. The work can end up in Elm
or dask-ml
, whichever you feel makes the most sense (for specific NASA funded tasks like this one we should make an Elm issue and its fine if the Elm issue gets solved by changes in another repo, e.g. dask-ml or xarray).
Completing this issue:
@PeterDSteinberg Do you know what the interest in approximate algorithms is, relative to exact solutions? https://www.csie.ntu.edu.tw/~cjlin/papers/psc08.pdf suggests that "traditional" spectral clustering algorithms don't scale very well, O(N^3) apparently?
I'll keep reading around to see what others have done.
Hi @TomAugspurger - Great paper - thanks for sharing.
I think that "traditional" spectral clustering algos are O(N^3) because the similarity matrix is calculation assumed to be O(N^2) and then an O(N^1) operation on top of that. Same thing can be said if you solve K-Means exactly with a dense full distance matrix (O^3) rather than through iterative recalculation of cluster means as an approximation (O^1 k number of iterations, I think). Anything that requires an exact dense distance/similarity matrix is going to be at least O(N^2).
Here are some ideas for simplifying the similarity matrix and order of the algo:
The specs above relate to the usage of spectral clustering in climate / satellite imagery analysis - not sure about how these algos are used in other disciplines.
Did anyone invent anything here:- re: pysptools?
This was implemented in dask-ml; see example at https://examples.pyviz.org/landsat_clustering/landuse_clustering.html
This was implemented in dask-ml; see example at https://examples.pyviz.org/landsat_clustering/landuse_clustering.html
Thanks! Trick now being the endmember part.
The objective is to develop at least four spectral clustering / embedding methods using
pysptools
as an rough prototype, converting the numpy data structures in that library for medium sized datasets to Dask data structures for large data parallelism. These were the four we mentioned in the proposal (not necessarily limiting our scope here):