aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Export nearest centroids for all data points #338

Open ylwu-amzn opened 2 years ago

ylwu-amzn commented 2 years ago

Refer to https://github.com/opensearch-project/ml-commons/pull/355#discussion_r911563139

Seems current RCF summarizer can't export nearest centroids for all data points. That means client side needs to recalculate nearest centroids with the same distance algorithm which looks like duplicate effort. As RCF summarizer has already calculated nearest centroids for all data points, how about we add some option to return all data points' centroids to avoid recalculation from client side?
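For reference, the per-point recalculation being discussed is just a nearest-centroid assignment. A minimal client-side sketch (Python/NumPy, assuming Euclidean distance; this is illustrative only and not part of the RCF library's API):

```python
import numpy as np

def assign_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid
    under Euclidean distance. This is the per-point work a client
    would duplicate if only the centroids are exported."""
    # Pairwise squared distances, shape (n_points, n_centroids),
    # via broadcasting; argmin over the centroid axis.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1]])
centroids = np.array([[0.0, 0.1], [5.0, 5.0]])
print(assign_nearest(points, centroids))  # → [0 0 1]
```

Note this assumes the same distance function the summarizer used; any mismatch there is exactly the "wrong calculation" concern raised below.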

sudiptoguha commented 2 years ago

This question rests on a faulty premise. First, the RCF summarizer does not compute nearest centroids for all data points. In fact, no advanced clustering algorithm (say, circa the late 1990's or later) does so either. I would recommend that new implementations post 2020 take a hard look at the two issues below:

a. Syntactic -- clustering is most successful as data reduction/summarization. That reduction can serve (i) computational efficiency (for the same reasons one uses histograms for 1-d data) or denoising, where say 1M points are reduced to 100 weighted points (almost a 10000x reduction), or (ii) conceptual efficiency (insight), where one takes a "10,000 feet" view of the data -- these are the 10 largest clusters in the data (look how they are changing over time, etc.).

b. Semantic -- assignment to the nearest centroid is limiting. It has been shown many times over that soft clustering has more desirable properties in information retrieval. Outputting only the centroids leaves open the possibility of soft clustering. "Cluster assignment" applications (as in logistics) often come with multiple constraints, and assign-to-nearest is often not the end solution. One could use an intermediate solution of cluster-and-assign-to-nearest as a "first cut" for some problem -- but the final assignment is an application-specific decision.
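To make the hard-versus-soft contrast concrete, here is a minimal soft-assignment sketch (Python/NumPy; purely illustrative and not an RCF API). Each point receives a probability weight over all centroids rather than a single hard label; hard assign-to-nearest is recovered as the temperature goes to zero:

```python
import numpy as np

def soft_assign(points, centroids, temperature=1.0):
    """Soft cluster membership: a softmax over negative squared
    Euclidean distances gives each point a distribution over
    centroids. Lower temperature concentrates mass on the nearest
    centroid (the hard-assignment limit)."""
    # Pairwise squared distances, shape (n_points, n_centroids).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

A point midway between two centroids gets roughly equal weight on both, information a hard assign-to-nearest output discards.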

Now, there are use cases for richer output from clustering. For example, a dendrogram corresponds to a hierarchy of clusters where users can zoom in/out or otherwise navigate the clusters without fixing the number of clusters in advance. But these are application dependent. They consume significant computational resources to build and should not be built by default for people who are interested in data reduction/summarization as in (a).
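As a rough illustration of why such hierarchies are costly to build by default, here is a naive single-linkage agglomerative sketch (Python/NumPy; illustrative only, not related to the RCF codebase). The recorded merges form the dendrogram a user could cut at any level, and the naive loop is O(n^3) in the number of points:

```python
import numpy as np

def agglomerative_merges(points):
    """Naive single-linkage agglomerative clustering: repeatedly
    merge the two closest clusters and record each merge
    (cluster_id_a, cluster_id_b, distance). The merge list encodes
    the full dendrogram."""
    clusters = {i: [i] for i in range(len(points))}  # id -> point indices
    merges = []
    next_id = len(points)  # merged clusters get fresh ids
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                # single linkage: minimum pairwise distance
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Production implementations (e.g. SciPy's `scipy.cluster.hierarchy.linkage`) are far more efficient, but even those carry costs well beyond computing a 100-point weighted summary.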

I understand that some (justifiably well regarded) common ML platforms have APIs that output hard cluster membership, similar to the output of a classification task. But clustering is not classification -- the distinction between unsupervised and supervised algorithms is more fundamental than APIs. For example, the "nearest-centroid" output is a batch concept and limits the use and benefits of streaming. If the specific output of a batch-oriented ML platform is desirable, why not ask people to use that batch-oriented platform?

RCF will continue to focus on streaming algorithms, or at a minimum, not take a step that precludes streaming algorithms. The use of summarize in RCF is to look at the predictions derived from the individual trees in the ensemble and surface them as a "10000 feet" observability insight as in (a)(ii) above.

ylwu-amzn commented 2 years ago

Thanks @sudiptoguha for the detailed explanation. It seems it's not reasonable to calculate every point's nearest centroid inside the RCF summary. Is it possible to provide some utility API in RCF to recalculate nearest centroids? I'm asking this to avoid incorrect calculation and duplicated effort on the client side.

sudiptoguha commented 2 years ago

"Is it possible to provide some utility API in RCF to recalculate nearest centroids?" -> No. I think that would bloat the RCF library, and it is scientifically less desirable, as I suggested in https://github.com/opensearch-project/ml-commons/issues/358

This is a decision the "client" of any ML should take seriously; trying to avoid making that decision has serious and potentially unnecessary performance costs.

I strongly recommend the reference "The on-line textbook: Information Theory, Inference, and Learning Algorithms" http://www.inference.phy.cam.ac.uk/mackay/itila/

Please check Chapter 20, "An Example Inference Task: Clustering".