Closed elifzeng closed 1 year ago
Density-based spatial clustering of applications with noise
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm,
min_samples
andeps
, which define formally what we mean when we say dense. Highermin_samples
or lowereps
indicate higher density necessary to form a cluster.we define a core sample as being a sample in the dataset such that there exist
min_samples
other samples within a distance ofeps
, which are defined as neighbors of the core sample.the parameter
min_samples
primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter)the parameter
eps
is crucial to choose appropriately for the data set and distance function and usually cannot be left at the default value. It controls the local neighborhood of the points.
The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the same order. However, the results can differ when data is provided in a different order. (所以我需要打乱顺序并固定下来?其实也不用,查阅资料后显示,这种情况并不常见,且对整体的聚类结果影响不大,因为DBSCAN对核心点和噪音都是决定性的——wikipedia)
Second and more importantly, the clusters to which non-core samples are assigned can differ depending on the data order. This would happen when a non-core sample has a distance lower than eps to two core samples in different clusters.
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
To get around this, OPTICS
can be tried.
基于距离的聚类算法的聚类结果是球状的簇,当数据集中的聚类结果是非球状结构时,基于距离的聚类算法的聚类效果并不好。与基于距离的聚类算法不同的是,基于密度的聚类算法可以发现任意形状的聚类。在基于密度的聚类算法中,通过在数据集中寻找被低密度区域分离的高密度区域,将分离出的高密度区域作为一个独立的类别。
我将要使用的DBSCAN
OR OPTICS
就是基于密度聚类的算法,其core sample是一类样本,这种样本需满足,与距离小于eps的临近样本数量大于等于min_sample。
Notice:用于分析的representation应该是能量低的样本,而不一定是core sample。 此外,DBSCAN
的metric默认使用的是欧几里得距离。在我的使用场景里,是原子数相同的同种pair间进行比较,所以和rmsd没有什么差别。
《Python机器学习算法》这本书上给出了一个计算公式,但是没有解释中间的原因,并不清楚理论依据是什么,算法如下:
def epsilon(data, MinPts):
'''计算最佳半径
input: data(mat):训练数据
MinPts(int):半径内的数据点的个数
output: eps(float):半径
'''
m, n = np.shape(data)
xMax = np.max(data, 0)
xMin = np.min(data, 0)
eps = ((np.prod(xMax - xMin) * MinPts * math.gamma(0.5 * n + 1)) / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)
return eps
n_jobs=-1
CPU并行数,若值为-1,则用所有的CPU进行运算。
core_sample_indices_
: 核心点的索引,因为labels_不能区分核心点还是边界点,所以需要用这个索引确定核心点components_
:训练样本的核心点labels_
:每个点所属集群的标签,-1代表噪声点DBSCAN算法中,有两个初始参数Eps(邻域半径)和minPts(Eps邻域最小点数)需要手动设置,并且聚类的结果对这两个参数的取值非常敏感,不同的取值将产生不同的聚类结果。为了克服DBSCAN算法这一缺点,提出了OPTICS算法(Ordering Points to identify the clustering structure),翻译过来就是,对点排序以此来确定簇结构。
The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN that relaxes the
eps
requirement from a single value to a value range. The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both areachability_
distance, and a spot within the clusterordering_
attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership. If OPTICS is run with the default value of inf set formax_eps
, then DBSCAN style cluster extraction can be performed repeatedly in linear time for any giveneps
value using thecluster_optics_dbscan
method. Settingmax_eps
to a lower value will result in shorter run times, and can be thought of as the maximum neighborhood radius from each point to find other potential reachable points.
The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter
xi
. There are also other possibilities for analysis on the graph itself, such as generating hierarchical representations of the data through reachability-plot dendrograms, and the hierarchy of clusters detected by the algorithm can be accessed through thecluster_hierarchy_
parameter.
OPTICS
还是需要给定eps
和minPts
,但其细微变化对最终结果的影响不大。
Spatial indexing trees are used to avoid calculating the full distance matrix, and allow for efficient memory usage on large sets of samples. Different distance metrics can be supplied via the
metric
keyword.For large datasets, similar (but not identical) results can be obtained via HDBSCAN. The HDBSCAN implementation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory scaling. For extremely large datasets that exhaust system memory using HDBSCAN, OPTICS will maintain (as opposed to ) memory scaling; however, tuning of the
max_eps
parameter will likely need to be used to give a solution in a reasonable amount of wall time.
min_sample
可以设为float 设定参考
Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).
max_eps
默认为无限大,但可以适当调小减少计算量。
我可以先每种pair随机选十分之一跑reachability plot,然后输出 reachability最大值作为max_eps
metric
默认是minkowski
,闵可夫斯基距离,一种衡量数值点之间距离的一种常见方法,p=2为欧几里得距离。这个参数和参数p
按默认来就行。如果要用之前算好的rmsdMatrix,可以设定matric="precomputed"
,然后.fit(rmsdMatrix)
.
xi
默认值为0.5
Determines the minimum steepness on the reachability plot that constitutes a cluster boundary. For example, an upwards point in the reachability plot is defined by the ratio from one point to its successor being at most 1-xi.
n_jobs
设为-1。
DBSCAN
example file
official web OPTICS
example file
official web