cagrikymk / JAX-ReaxFF

JAX-ReaxFF: A Gradient Based Framework for Extremely Fast Optimization of Reactive Force Fields
GNU General Public License v3.0
53 stars 22 forks

clustering.py #5

Open WangKehan573 opened 1 year ago

WangKehan573 commented 1 year ago

I noticed that in the implementation of the clustering algorithm, lines 82-84:

```python
for i in range(len(counts)):
    if counts[i] == 0:
        centroids[i,:] = 0.0
```

show that the number of initially selected centers may end up greater than the number of non-empty clusters. The root cause is that the definition of distance in the original paper does not satisfy the mathematical definition of a distance: mathematically, the distance from a point to itself must be 0, and this is not reflected in the code.
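To illustrate the point, here is a minimal sketch of why a padding-style cost used as a "distance" can violate d(x, x) = 0. The `padding_cost` form below is an assumption for illustration only, not the paper's exact formula (4):

```python
import numpy as onp

def padding_cost(x, center, count):
    # Hypothetical padding-style "distance": the cost of adding geometry x to a
    # cluster, measured as the enlarged per-dimension maxima times the new size.
    new_center = onp.maximum(center, x)
    return new_center.sum() * (count + 1)

x = onp.array([10.0, 20.0, 30.0])
# Even when x coincides with the cluster center, the "distance" is positive:
print(padding_cost(x, x, count=1))  # 120.0, not 0
```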

In addition, the order of the geometries affects the generation of the new center points, and thereby the next iterations of the clustering. The approach adopted in the original paper is that the input geometries are shuffled after each iteration for randomization.

The purpose of this clustering is the pre-calculation of interaction lists that remain static throughout the optimization (as described in the previous subsection), the grouping of input geometries with similar computational demands (explained below), and the alignment of the interaction lists of geometries in the same cluster (by padding as necessary) for efficient memory accesses. The essence of this clustering is to find a partition of the set of geometries that minimizes the total cost, i.e. the sum over clusters of the per-dimension maxima of a cluster's elements multiplied by the number of elements in that cluster.
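Under this reading of the objective, the cost of a partition could be computed as in the following sketch (the function and variable names here are mine, not from the repository):

```python
import numpy as onp

def partition_cost(X, labels, k):
    # Total padded size of the partition: within a cluster, every geometry is
    # padded up to the per-dimension maxima, so a cluster costs
    # (number of elements) * (sum of per-dimension maxima).
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members) > 0:
            total += len(members) * members.max(axis=0).sum()
    return total

X = onp.array([[4, 8], [5, 7], [100, 2]])
print(partition_cost(X, onp.array([0, 0, 1]), k=2))  # 2*(5+8) + 1*(100+2) = 128.0
```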

The approach I tried is to convert (n1, n2, n3, n4, n5) to (n2, n3, n4, n5·n1·n1), where n1, n2, n3, n4 and n5 are the numbers of atoms, 2-body interactions, 3-body interactions, 4-body interactions and periodic boxes, respectively.

Then k-medoids++ clustering is performed directly using the Manhattan distance, repeated several times, and the grouping with the smallest computational cost is selected. When the number of samples is large, this generally clusters in less time and yields a more robust result with a lower computational cost.

And as a traditional machine learning method, it satisfies the mathematical definition of a distance, which avoids duplicate new center points and the situation you want to avoid in lines 82-84.
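For reference, a minimal self-contained sketch of k-medoids clustering under the Manhattan distance (a generic implementation with random initialization, not necessarily the exact ++-seeded variant I used):

```python
import numpy as onp
from scipy.spatial.distance import cdist

def kmedoids_manhattan(X, k, max_iterations=100, seed=0):
    # Generic k-medoids under the L1 (Manhattan) metric: medoids are actual
    # data points, so d(x, x) = 0 holds by construction.
    rng = onp.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), k, replace=False)
    for _ in range(max_iterations):
        D = cdist(X, X[medoid_idx], metric='cityblock')
        labels = D.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = onp.where(labels == c)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing the total L1 distance to the rest.
            intra = cdist(X[members], X[members], metric='cityblock').sum(axis=0)
            new_idx[c] = members[intra.argmin()]
        if onp.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return labels, medoid_idx
```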

WangKehan573 commented 1 year ago

I tried your examples and found that your method performs better. However, the way the center is updated each time can be changed to

```python
max_centers[labels[i]] = onp.where(max_centers[labels[i]] > X[i], max_centers[labels[i]], X[i])
```

i.e. taking the maximum coordinate in each dimension within each cluster. This is independent of the input order of the configurations, and the result is faster and more robust, with no need to shuffle the order of the geometries. Also, the computational cost in the function modified_kmeans() should be computed with calc_cost(), not calculate_total_cost().
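If the per-point loop is dropped entirely, the same per-cluster maxima can be computed in one vectorized pass (a sketch; `onp` is numpy, as elsewhere in the code):

```python
import numpy as onp

X = onp.array([[4, 8], [5, 7], [100, 2]])   # (n, d) geometry descriptors
labels = onp.array([0, 0, 1])               # cluster assignment per geometry
k = 2

# ufunc.at performs an unbuffered in-place maximum on the assigned cluster row,
# so the result does not depend on the order of the geometries.
max_centers = onp.zeros((k, X.shape[1]))
onp.maximum.at(max_centers, labels, X)
print(max_centers)  # [[  5.   8.], [100.   2.]]
```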

cagrikymk commented 1 year ago

Hello @WangKehan573, first of all, thank you for providing such a detailed description to help improve the code. I have only tested the standard k-means algorithm for this clustering problem, and it doesn't work very well. I think I need to explore other options and see if I can simplify it. Thank you for trying the k-medoids algorithm and comparing the performance!

Yes, I will update the code to vectorize the center calculation.

However, I am not sure why that should eliminate the need for shuffling. Currently, the clustering algorithm depends on the order of the geometries, since the centers are updated after each label update, unlike the original k-means, where the centers are updated only after reassigning the labels of all geometries. So shuffling might not help the performance much, but I added it to the code to improve robustness.

I will perform new tests to investigate the effects of shuffling.

WangKehan573 commented 1 year ago

Many thanks for your answer. The reason shuffling is no longer needed is that if you first reassign the labels of all geometries according to the distance you define (you can set s_y = 0 in formula (4) of the paper in the code, so that distance(x, y) = 0 when x = y), you can then update the centers after all geometries have been distributed to their clusters. As a result, the centers are unique, because this matches what standard k-means does; it doesn't hurt the performance and it eliminates the need for shuffling.

The code I modified is as follows (do not update the centers in assign_labels(); do it in calc_cost()!):

```python
import numba
import numpy as onp

# calculate_cost_and_new_centers(), calc_cost() and calculate_total_cost()
# are the existing helpers in clustering.py.

@numba.njit
def assign_labels(X, centroids):
    # Only assign labels and track cluster sizes; the centers are
    # intentionally NOT updated here.
    counts = onp.zeros(len(centroids))
    labels = onp.zeros(len(X), dtype=onp.int32)

    for i in range(len(X)):
        costs, new_cents = calculate_cost_and_new_centers(X[i], centroids, counts)
        min_ind = onp.argmin(costs)
        labels[i] = min_ind
        counts[min_ind] += 1

    return labels, counts
```
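calc_cost() itself is not shown in this thread; for readers, here is a rough sketch consistent with how it is called below. The signature is taken from the call sites; the body and the bounded/nonbounded split are my assumptions:

```python
@numba.njit
def calc_cost(X, labels, k):
    # Recompute the per-cluster max-centers, counts, and alignment cost from
    # scratch for fixed labels (assumed form, matching the call sites below).
    n, d = X.shape
    centroids = onp.zeros((k, d))
    counts = onp.zeros(k)
    for i in range(n):
        c = labels[i]
        counts[c] += 1
        for j in range(d):
            if X[i, j] > centroids[c, j]:
                centroids[c, j] = X[i, j]
    part1 = 0.0  # cost from the bonded interaction lists (assumed split)
    part2 = 0.0  # cost from the remaining dimensions (assumed split)
    for c in range(k):
        part1 += counts[c] * centroids[c, :4].sum()
        part2 += counts[c] * centroids[c, 4:].sum()
    cost = part1 + part2
    return cost, part1, part2, centroids, counts
```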

```python
def modified_kmeans(systems, k=3, max_iterations=100, rep_count=10, print_mode=True):
    all_lists = []
    for s in systems:
        # 2-body, 3-body, 4-body, hbond count, image count, atom count
        my_list = [s.global_body_2_count, s.global_body_3_count,
                   s.global_body_4_count, s.global_hbond_count,
                   len(s.all_shift_comb), s.real_atom_count]
        all_lists.append(my_list)

    X = onp.array(all_lists)
    '''
    X: multidimensional data
    k: number of clusters
    max_iterations: maximum number of iterations before the clusters are fixed

    Steps:
    1. Convert the data to a numpy array
    2. Pick the indices of k random points without replacement
    3. Find the class (P) of each data point using the cost-based distance
    4. Stop when max_iterations is reached or the P matrix doesn't change

    Return:
    np.array: containing the class of each data point
    '''
    min_cost = 999999999999999999999
    # Baseline: every geometry is its own cluster, i.e. no alignment.
    idx = onp.arange(len(X))
    centroids = X[idx, :]
    counts = onp.ones(len(centroids))
    cost, part1, part2 = calculate_total_cost(X, counts, centroids)
    if print_mode:
        print("Cost without aligning:", cost)
        print("nonbounded:           ", part2)
        print("bounded:              ", part1)
    for r in range(rep_count):
        # Random restart: pick k distinct geometries as the initial centers.
        idx = onp.random.choice(len(X), k, replace=False)
        centroids = X[idx, :]

        P_labels, counts = assign_labels(X, centroids)
        cost, part1, part2, centroids, counts = calc_cost(X, P_labels, k)
        for iter_c in range(max_iterations):
            labels, counts = assign_labels(X, centroids)
            # Update the centers only after all geometries have been assigned
            # (note: use the fresh labels here, not the stale P_labels).
            cost, part1, part2, centroids, counts = calc_cost(X, labels, k)
            if iter_c != 0 and onp.array_equal(P_labels, labels):
                break
            P_labels = labels
        if cost < min_cost:
            min_cost = cost
            min_labels = P_labels
            min_centr = centroids
            min_counts = counts
            min_part1 = part1
            min_part2 = part2
    if print_mode:
        print("Number of clusters:   ", k)
        print("Cost after aligning:  ", min_cost)
        print("nonbounded:           ", min_part2)
        print("bounded:              ", min_part1)

    return min_labels, min_centr, min_counts, min_cost
```
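For completeness, a hypothetical invocation (the `systems` list comes from the JAX-ReaxFF training-set reader; only the signature shown above is assumed):

```python
# labels: cluster index per geometry; centr: per-cluster maxima used for padding
labels, centr, counts, cost = modified_kmeans(systems, k=10, max_iterations=100,
                                              rep_count=10, print_mode=True)
```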