Clustering - Githubissues

This PR adds a new class in data that prepares trajectories for training.

It first loads one or more trajectories and removes all atoms that are not part of a protein. If multiple trajectories are given, it joins them into a single trajectory.

The class can then subsample the trajectory with one of three methods:

Stride: calculates the step size needed to take n_cluster frames with constant spacing between them. If that step size does not give exactly n_cluster frames, it removes as many frames as needed to end up with n_cluster.
distance_cluster: clusters all frames based on their RMSD to all other frames, then finds the representative frame of each cluster, i.e. the frame with the highest similarity to all other frames in that cluster. The clustering method currently used is agglomerative clustering, which can easily be swapped for any clustering method that accepts a distance matrix.
pca_cluster: runs a principal component analysis over all frames of the trajectory, uses the first n components as input for KMeans clustering, and then finds the representative frames in the same way as distance_cluster.
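
To make the stride and representative-frame steps concrete, here is a minimal NumPy sketch. It is not the code from this PR: the function names `stride_indices` and `representative_frames` are hypothetical, dropping surplus frames from the end is an assumption (the description does not say which frames are removed), and "highest similarity" is interpreted as the smallest summed distance to the other cluster members.

```python
import numpy as np


def stride_indices(n_frames, n_cluster):
    """Pick n_cluster frame indices separated by a constant step."""
    if not 0 < n_cluster <= n_frames:
        raise ValueError("n_cluster must be between 1 and n_frames")
    # Largest integer step that still yields at least n_cluster frames.
    step = n_frames // n_cluster
    idx = np.arange(0, n_frames, step)
    # Assumption: surplus frames are dropped from the end.
    return idx[:n_cluster]


def representative_frames(dist, labels):
    """Return one frame index per cluster: the member with the
    smallest summed distance to the other members of its cluster."""
    dist = np.asarray(dist)
    labels = np.asarray(labels)
    reps = []
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        # Pairwise distances restricted to this cluster.
        sub = dist[np.ix_(members, members)]
        reps.append(int(members[sub.sum(axis=1).argmin()]))
    return sorted(reps)
```

`representative_frames` only needs a square distance matrix (for example pairwise RMSD, or Euclidean distances in PCA space) and one label per frame, so it works with the labels from either clustering method.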

Afterwards it saves a new topology file for the trajectory, the new trajectory as a dcd file, and a txt file containing the indices of the selected frames in the original trajectory.
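
As an illustration of the index file only, the sketch below writes the selected frame indices to a text file and reads them back. The file name `cluster_indices.txt`, the helper names, and the one-index-per-line layout are assumptions for this example; the PR may use a different layout.

```python
import os
import tempfile

import numpy as np


def save_frame_indices(path, indices):
    """Write the selected frame indices as plain text, one integer per line."""
    np.savetxt(path, np.asarray(indices, dtype=int), fmt="%d")


def load_frame_indices(path):
    """Read the indices back as a 1-D integer array."""
    return np.loadtxt(path, dtype=int, ndmin=1)


# Round trip in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cluster_indices.txt")
    save_frame_indices(path, [0, 3, 6])
    restored = load_frame_indices(path)
```

Keeping the original indices next to the subsampled trajectory makes it possible to trace every saved frame back to its position in the input.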

Degiacomi-Lab / molearn

Clustering #19