Amber-MD / pytraj

Python interface of cpptraj
https://amber-md.github.io/pytraj
172 stars 38 forks source link

clustering: cpptraj vs sklearn #693

Closed hainm closed 9 years ago

hainm commented 9 years ago

matched

kmeans from sklearn

In [16]: from sklearn.cluster import k_means

In [17]: k_means?

In [18]: k_means(xyz, 3)
Out[18]:
(array([[ 14.23858333,  27.95998001,  17.09424305, ...,  16.28630257,
          25.05233765,  13.6071167 ],
        [ 14.5832448 ,  27.82231808,  17.40318394, ...,  16.74180794,
          24.76988983,  14.1815896 ],
        [ 14.65508294,  27.89365482,  18.1200428 , ...,  16.84584761,
          25.13659239,  14.6453898 ]]),
 array([2, 2, 2, 2, 1, 1, 0, 0, 0, 0], dtype=int32),
 11.786212114724094)

kmeans from cpptraj

In [19]: from pytraj.cluster import kmeans

In [20]: kmeans(traj, n_clusters=3)
#Clustering: 3 clusters 10 frames
#Cluster 0 has average-distance-to-centroid 5.643003
#Cluster 1 has average-distance-to-centroid 5.645873
#Cluster 2 has average-distance-to-centroid 4.141970
#DBI: 1.298084
#pSF: 3.236341
#Algorithm: Kmeans nclusters 3 maxit 100
#Representative frames: 9 5 1
Out[20]: array([2, 2, 1, 1, 1, 1, 0, 0, 0, 0], dtype=int32)
swails commented 9 years ago

Doesn't look like a match to me... Frame 1 belongs to the same cluster as Frame 3 according to sklearn, but not according to cpptraj. But you're also doing different analyses, aren't you? @drroe can correct me if I'm wrong, but cpptraj will automatically cluster a trajectory based on RMSD, right? This is effectively one feature per frame. Whereas with sklearn you just asked it to cluster the trajectories based on the Cartesian coordinates, which is 3N features per frame (!!).

An equivalent interpretation is the sklearn approach is clustering on the projection of each trajectory on the Cartesian basis set (which is just the identity matrix) for each frame for each "mode". A much better coordinate system to use for this is the basis of principal components. That way, you only need to use the projections of the first 2 to 5 modes on each frame, rather than all 3N modes, to get almost all the same information. And the fact that you'll have at most 5 features in your clustering algorithm (rather than, say 5000) means better cluster definitions IMO.

If you wanted to really compare them, you need to give sklearn the RMSD time series rather than the trajectory. Or give cpptraj 3N data sets -- the X, Y, and Z-coordinates of the trajectory for each atom.

hainm commented 9 years ago

@swails yeah, I just looked at first few and last few numbers :D

array([2, 2, 2, 2, 1, 1, 0, 0, 0, 0], dtype=int32) from sklearn

and Out[20]: array([2, 2, 1, 1, 1, 1, 0, 0, 0, 0], dtype=int32)

It turns out that for sklearn I used CA atoms while using all atoms for cpptraj. ha ha.

Yes, the results don't match.

That's why I just closed this issue before your comment. (should give notice)

hainm commented 9 years ago

But you're also doing different analyses, aren't you?

right, different mask and different metrics. apple and orange.

hainm commented 9 years ago

If you wanted to really compare them, you need to give sklearn the RMSD time series rather than the trajectory. Or give cpptraj 3N data sets -- the X, Y, and Z-coordinates of the trajectory for each atom.

I see.

drroe commented 9 years ago

On Thu, Aug 20, 2015 at 7:30 AM, Jason Swails notifications@github.com wrote:

If you wanted to really compare them, you need to give sklearn the RMSD time series rather than the trajectory.

Actually it would need the 2D pairwise RMS matrix, right?

— Reply to this email directly or view it on GitHub https://github.com/Amber-MD/pytraj/issues/693#issuecomment-133008334.


Daniel R. Roe, PhD Department of Medicinal Chemistry University of Utah 30 South 2000 East, Room 307 Salt Lake City, UT 84112-5820 http://home.chpc.utah.edu/~cheatham/ (801) 587-9652 (801) 585-6208 (Fax)

swails commented 9 years ago

Actually it would need the 2D pairwise RMS matrix, right?

Is that what cpptraj does?

hainm commented 9 years ago

Yes I think pairwise rmsd is what cpptraj does (with other metrics too). And this make more sense than rmsd time series for a single ref.

hainm commented 9 years ago

'Stole' from mdtraj tut

http://amber-md.github.io/pytraj/doc/build/html/tutorials/mdtraj_adapted.html

drroe commented 9 years ago

Yes - by default, if you do not choose a distance metric, the 'distance' between frames is the coordinate best-fit RMSD between those frames. Clustering on RMSD to a single reference could be useful in some cases, but since RMSD to a single reference is "fuzzier" the higher the value is you can get wildly different structures placed into the same cluster.

On Thu, Aug 20, 2015 at 8:05 AM, Jason Swails notifications@github.com wrote:

Actually it would need the 2D pairwise RMS matrix, right?

Is that what cpptraj does?

— Reply to this email directly or view it on GitHub https://github.com/Amber-MD/pytraj/issues/693#issuecomment-133020121.


Daniel R. Roe, PhD Department of Medicinal Chemistry University of Utah 30 South 2000 East, Room 307 Salt Lake City, UT 84112-5820 http://home.chpc.utah.edu/~cheatham/ (801) 587-9652 (801) 585-6208 (Fax)