ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Develop spectral clustering / embedding methods with dask arrays #213

Closed PeterDSteinberg closed 3 years ago

PeterDSteinberg commented 6 years ago

The objective is to develop at least four spectral clustering / embedding methods, using pysptools as a rough prototype and converting that library's numpy data structures, which suit medium-sized datasets, to Dask data structures for large-scale data parallelism. These are the four we mentioned in the proposal (not necessarily limiting our scope here):

PeterDSteinberg commented 6 years ago

Hi @TomAugspurger - We have Phase II funding to develop large-matrix spectral clustering / embedding models. Please take a look at the models below and the linked numpy implementations in pysptools; we need to do the same kind of thing for dask. These four models are listed in the proposal, but feel free to spend some time looking at other related models in pysptools to gauge the level of effort involved in supporting them. The work can end up in Elm or dask-ml, whichever you feel makes the most sense (for specific NASA-funded tasks like this one we should make an Elm issue, and it's fine if the Elm issue gets solved by changes in another repo, e.g. dask-ml or xarray).

Completing this issue:

TomAugspurger commented 6 years ago

@PeterDSteinberg Do you know what the interest in approximate algorithms is, relative to exact solutions? https://www.csie.ntu.edu.tw/~cjlin/papers/psc08.pdf suggests that "traditional" spectral clustering algorithms don't scale very well, O(N^3) apparently?

I'll keep reading around to see what others have done.
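To make the approximation idea concrete, here is a rough numpy sketch of the Nyström extension that many approximate spectral methods build on (the function names, parameters, and data here are my own illustration, not from pysptools or dask-ml): eigendecompose a small m x m block of the affinity matrix and extend those eigenvectors to all N points, so only an N x m slice of the full N x N matrix is ever formed.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Dense RBF affinity between rows of a (n x d) and rows of b (m x d).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(X, m=20, k=2, gamma=1.0, seed=0):
    # Approximate the top-k eigenvectors of the full N x N affinity matrix
    # from an N x m slice: m landmark columns instead of all N columns.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf(X, X[idx], gamma)                 # N x m block of the affinity matrix
    W = C[idx]                                # m x m affinity among the landmarks
    vals, vecs = np.linalg.eigh(W)            # small, cheap eigensolve
    vals, vecs = vals[::-1], vecs[:, ::-1]    # largest eigenvalues first
    # Extend the landmark eigenvectors to all N points (Nystrom extension).
    return C @ vecs[:, :k] / np.sqrt(np.abs(vals[:k]) + 1e-12)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
U = nystrom_embedding(X, m=20, k=2)
print(U.shape)  # (100, 2)
```

The win is that the per-point cost is O(m) rather than O(N), which is also why this style of approximation maps well onto chunked dask arrays: the N x m block can be built one chunk of rows at a time.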

PeterDSteinberg commented 6 years ago

Hi @TomAugspurger - Great paper - thanks for sharing.

I think that "traditional" spectral clustering algos are O(N^3) because the similarity matrix calculation is assumed to be O(N^2), with roughly an O(N) eigensolver-style operation on top of that. The same can be said if you solve K-Means exactly with a dense full distance matrix (O(N^3)) rather than through iterative recalculation of cluster means as an approximation (O(N) per iteration, times k iterations, I think). Anything that requires an exact dense distance/similarity matrix is going to be at least O(N^2).
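To make those orders concrete, here is a toy numpy sketch of the exact pipeline (my own illustration, not pysptools code): the dense similarity matrix is the O(N^2) piece, and the full eigensolve of the N x N Laplacian is the O(N^3) piece.

```python
import numpy as np

# Two well-separated blobs, so the spectral embedding should split them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
N = len(X)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-d2)                      # dense N x N similarity: O(N^2) memory
D = A.sum(axis=1)
L = np.diag(D) - A                   # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)       # full dense eigensolve: O(N^3) time
embedding = vecs[:, 1:3]             # skip the trivial constant eigenvector
print(embedding.shape)  # (60, 2)
```

For these two blobs the second eigenvector (the Fiedler vector) is roughly constant within each blob with opposite signs, which is what a downstream K-Means step would pick up.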

Here are some ideas for simplifying the similarity matrix and order of the algo:

The specs above relate to the usage of spectral clustering in climate / satellite imagery analysis - not sure about how these algos are used in other disciplines.
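One common simplification in this family is sparsifying the similarity graph; the sketch below is my own illustration of the general idea, not code from any library: keep only each point's k nearest neighbors, so the graph stores O(N*k) entries instead of O(N^2).

```python
import numpy as np

def knn_affinity(X, k=5, gamma=1.0):
    # Keep only each point's k nearest neighbors in the similarity graph.
    # This toy version still computes dense distances; a real implementation
    # would use a spatial index (e.g. a KD-tree) to avoid the O(N^2) step.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]   # skip self at position 0
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = np.exp(-gamma * d2[rows, nn.ravel()])
    return np.maximum(A, A.T)                 # symmetrize the graph

X = np.random.default_rng(0).normal(size=(40, 2))
A = knn_affinity(X, k=5)
print((A > 0).sum(axis=1).min())  # minimum node degree after symmetrizing
```

The resulting sparse Laplacian also opens up iterative eigensolvers (e.g. Lanczos-style methods), which only need matrix-vector products and so avoid the dense O(N^3) eigensolve.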

RichardScottOZ commented 3 years ago

Did anyone end up building anything here, re: pysptools?

jbednar commented 3 years ago

This was implemented in dask-ml; see example at https://examples.pyviz.org/landsat_clustering/landuse_clustering.html

RichardScottOZ commented 3 years ago

> This was implemented in dask-ml; see example at https://examples.pyviz.org/landsat_clustering/landuse_clustering.html

Thanks! Trick now being the endmember part.