elifzeng / Computory-Background

some knowledge about computer
5 stars 0 forks source link

Cluster Method OPTICS and DBSCAN #14

Closed elifzeng closed 1 year ago

elifzeng commented 2 years ago

DBSCAN
example file
official web OPTICS
example file
official web

elifzeng commented 2 years ago

DBSCAN wiki

Density-based spatial clustering of applications with noise
image

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample.

the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter)

the parameter eps is crucial to choose appropriately for the data set and distance function and usually cannot be left at the default value. It controls the local neighborhood of the points.

Notice

The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the same order. However, the results can differ when data is provided in a different order. (所以我需要打乱顺序并固定下来?其实也不用,查阅资料后显示,这种情况并不常见,且对整体的聚类结果影响不大,因为DBSCAN对核心点和噪音都是决定性的——wikipedia)
Second and more importantly, the clusters to which non-core samples are assigned can differ depending on the data order. This would happen when a non-core sample has a distance lower than eps to two core samples in different clusters.

Memory consumption

This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n^2 floats. To get around this, OPTICS can be tried.

elifzeng commented 2 years ago

DBSCAN implement

基于距离的聚类算法的聚类结果是球状的簇,当数据集中的聚类结果是非球状结构时,基于距离的聚类算法的聚类效果并不好。与基于距离的聚类算法不同的是,基于密度的聚类算法可以发现任意形状的聚类。在基于密度的聚类算法中,通过在数据集中寻找被低密度区域分离的高密度区域,将分离出的高密度区域作为一个独立的类别。

我将要使用的DBSCAN OR OPTICS就是基于密度聚类的算法,其core sample是一类样本,这种样本需满足,与距离小于eps的临近样本数量大于等于min_sample。
Notice:用于分析的representation应该是能量低的样本,而不一定是core sample。 此外,DBSCAN的metric默认使用的是欧几里得距离。在我的使用场景里,是原子数相同的同种pair间进行比较,所以和rmsd没有什么差别。

调参

中文参考

MinPts

eps

《Python机器学习算法》这本书上给出了一个计算公式,但是没有解释中间的原因,并不清楚理论依据是什么,算法如下:

def epsilon(data, MinPts):
    '''计算最佳半径
    input:  data(mat):训练数据
            MinPts(int):半径内的数据点的个数
    output: eps(float):半径
    '''
    m, n = np.shape(data)
    xMax = np.max(data, 0)
    xMin = np.min(data, 0)
    eps = ((np.prod(xMax - xMin) * MinPts * math.gamma(0.5 * n + 1)) / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)
    return eps

others in sklearn DBSCAN()

n_jobs=-1 CPU并行数,若值为-1,则用所有的CPU进行运算。

属性

elifzeng commented 2 years ago

OPTICS

DBSCAN算法中,有两个初始参数Eps(邻域半径)和minPts(Eps邻域最小点数)需要手动设置,并且聚类的结果对这两个参数的取值非常敏感,不同的取值将产生不同的聚类结果。为了克服DBSCAN算法这一缺点,提出了OPTICS算法(Ordering Points to identify the clustering structure),翻译过来就是,对点排序以此来确定簇结构。

The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance, and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership. If OPTICS is run with the default value of inf set for max_eps, then DBSCAN style cluster extraction can be performed repeatedly in linear time for any given eps value using the cluster_optics_dbscan method. Setting max_eps to a lower value will result in shorter run times, and can be thought of as the maximum neighborhood radius from each point to find other potential reachable points.

image

The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter xi. There are also other possibilities for analysis on the graph itself, such as generating hierarchical representations of the data through reachability-plot dendrograms, and the hierarchy of clusters detected by the algorithm can be accessed through the cluster_hierarchy_ parameter.

OPTICS还是需要给定epsminPts,但其细微变化对最终结果的影响不大。

Computational Complexity

Spatial indexing trees are used to avoid calculating the full distance matrix, and allow for efficient memory usage on large sets of samples. Different distance metrics can be supplied via the metric keyword.

For large datasets, similar (but not identical) results can be obtained via HDBSCAN. The HDBSCAN implementation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory scaling. For extremely large datasets that exhaust system memory using HDBSCAN, OPTICS will maintain (as opposed to ) memory scaling; however, tuning of the max_eps parameter will likely need to be used to give a solution in a reasonable amount of wall time.

调参

min_sample可以设为float 设定参考

Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).

max_eps 默认为无限大,但可以适当调小减少计算量。 我可以先每种pair随机选十分之一跑reachability plot,然后输出 reachability最大值作为max_eps metric默认是minkowski闵可夫斯基距离,一种衡量数值点之间距离的一种常见方法,p=2为欧几里得距离。这个参数和参数p按默认来就行。如果要用之前算好的rmsdMatrix,可以设定matric="precomputed",然后.fit(rmsdMatrix).

xi默认值为0.5

Determines the minimum steepness on the reachability plot that constitutes a cluster boundary. For example, an upwards point in the reachability plot is defined by the ratio from one point to its successor being at most 1-xi.

n_jobs 设为-1。

elifzeng commented 2 years ago

UMAP

conda install -c conda-forge umap-learn
import numpy as np
import umap

more to read

zhihu 欧式空间 拓扑
黎曼流形