Adding Clustering via Extended Similarity Metrics

Amber-MD / cpptraj

Biomolecular simulation trajectory/data analysis.

Adding Clustering via Extended Similarity Metrics #1049

Open drroe opened 1 year ago

drroe commented 1 year ago

In collaboration with @ramirandaq @lexin-chen, expand the cluster analysis capabilities of cpptraj by adding clustering via extended similarity metrics (and more).

Some background reading:

https://link.springer.com/article/10.1186/s13321-021-00505-3

https://link.springer.com/article/10.1007/s10822-022-00444-7

ramirandaq commented 1 year ago

Here https://github.com/Amber-MD/cpptraj/pull/1051#event-10450499445 it says "Calculate extended comparison similarity values for each trajectory frame." Is this the complementary similarity used to then find medoids and outliers in the trajectory?

drroe commented 1 year ago

> Is this the complementary similarity used to then find medoids and outliers in the trajectory?

Yes - it's equivalent to the gen_sim_dict routine from src/tools/esim_modules.py in MDANCE.

ramirandaq commented 1 year ago

gen_sim_dict will take as an input a set of frames/conformations, and output a number (the extended similarity) for the whole set, not a number for every frame. To calculate the outliers and medoids, the function is calculate_comp_sim (in src/tools/bts.py). The complementary similarity does assign a number to every frame in a set, which can be used to rank the frames from high- to low-density.
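The distinction above (one number for the whole set versus one number per frame) can be illustrated with a minimal pure-Python sketch. It assumes the mean squared deviation (MSD) as the extended metric, computed from column sums, and toy two-coordinate "frames". The helper names `set_msd` and `complementary_msd` are hypothetical and are not the MDANCE or cpptraj API.

```python
def msd_from_sums(c_sum, sq_sum, n):
    """MSD over all ordered pairs of n frames, from per-coordinate sums.

    c_sum[k]  = sum of coordinate k over the frames
    sq_sum[k] = sum of squared coordinate k over the frames
    """
    return sum(2.0 * (n * sq - c * c) for c, sq in zip(c_sum, sq_sum)) / (n * n)


def column_sums(frames):
    """Return (c_sum, sq_sum) for a list of equal-length coordinate lists."""
    c_sum = [sum(col) for col in zip(*frames)]
    sq_sum = [sum(x * x for x in col) for col in zip(*frames)]
    return c_sum, sq_sum


def set_msd(frames):
    """One number for the whole set (hypothetical helper)."""
    c_sum, sq_sum = column_sums(frames)
    return msd_from_sums(c_sum, sq_sum, len(frames))


def complementary_msd(frames):
    """One number per frame: the MSD of the set with that frame removed."""
    n = len(frames)
    c_sum, sq_sum = column_sums(frames)
    values = []
    for frame in frames:
        # Subtract this frame's contribution instead of recomputing the sums.
        c_wo = [c - x for c, x in zip(c_sum, frame)]
        sq_wo = [s - x * x for s, x in zip(sq_sum, frame)]
        values.append(msd_from_sums(c_wo, sq_wo, n - 1))
    return values


# Toy data (made up for illustration): three nearby frames and one far away.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]

whole_set = set_msd(frames)       # a single value for the set
comp = complementary_msd(frames)  # one value per frame

# MSD is a dissimilarity: removing a central frame leaves a more spread-out
# set (high value), removing an outlier leaves a tighter set (low value).
medoid = max(range(len(frames)), key=comp.__getitem__)
outlier = min(range(len(frames)), key=comp.__getitem__)
ranking = sorted(range(len(frames)), key=comp.__getitem__, reverse=True)
```

With this toy data the frame at index 2 comes out as the medoid and the frame at index 3 as the outlier, and `ranking` orders the frames from high to low density. For a similarity index (rather than a dissimilarity such as MSD) the sense of the ordering would be reversed.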

drroe commented 1 year ago

> gen_sim_dict will take as an input a set of frames/conformations, and output a number (the extended similarity) for the whole set, not a number for every frame.

Yes, I understand that. Let me be more clear.

The ExtendedSimilarity::Comparison() function is most like gen_sim_dict. The ExtendedSimilarity::CalculateCompSim() function, which is what the extendedcomp command (Exec_ExtendedComparison class) uses under the hood, is more like calculate_comp_sim. Let me know if you have any more questions.

ramirandaq commented 1 year ago

Sounds great! The functionality in bts.py is a bit more general, because it handles both the extended indices and MSD through a common interface, but this is perfect.