Saving cluster centers with 'trj_filename'

bowman-lab / enspara

Modeling molecular ensembles with scalable data structures and parallel computing

https://enspara.readthedocs.io

GNU General Public License v3.0

33 stars 16 forks source link

Saving cluster centers with 'trj_filename' #205

Closed sbhakat closed 7 months ago

sbhakat commented 3 years ago

Is there a way to save the cluster centers with ''trj_filename' so that one can identifies which cluster center originates from which trajectory? I guess the following part of the script is doing the saving part

if type(topology) == str:
        centers_location = np.array(
            centers_location, dtype=[
                ('state', 'int'), ('conf', 'int'), ('traj_num', 'int'),
                ('frame', 'int'), ('trj_filename', np.str_, 500),
                ('output', np.str_, 500),
                ('topology', np.str_, 500)])
    else:
        centers_location = np.array(
            centers_location, dtype=[
                ('state', 'int'), ('conf', 'int'), ('traj_num', 'int'),
                ('frame', 'int'), ('trj_filename', np.str_, 500),
                ('output', np.str_, 500),
                ('topology', type(topology))])
    unique_trajs = np.unique(centers_location['traj_num'])

Is it as simple as to change unique_trajs = np.unique(centers_location['traj_num']) to unique_trajs = np.unique(centers_location['trj_filename']) or is there other tricks?

justinrporter commented 3 years ago

I assume you're talking about enspara/cluster/save_states.py?

The unique_trajs variable looks to me like it stores which trajectories have states that will get saved, not the names. I think that the "state number" (in the output path/to/output/State0-0.pdb) is the cluster center index. This is @mizimmer90 's code... maybe he wants to weigh in?

What are you trying to do? There may be an easier way!

Justin

sbhakat commented 3 years ago

Thanks @justinrporter . So I wanted 2000 cluster which I indicated by n_clusters = 2000 and the centers_masses are saved as state000000-00.pdb ... state001999-00.pdb (2000 of them zero indexed). I am using multiple trajectories which are named as traj_genheavy15000_.xtc .... traj_genheavy17999_1.xtc . I want to know which state00**-00.pdb comes from which traj_genheavy*.xtc ? It is enspara/cluster/save_states.py and I believe an analogous version of the save_states.py also exits with fast https://github.comcom/bowman-lab/fast/blob/master/msm_gen/save_states.py My script looks like

from fast.msm_gen import save_states as ss
dist_cutoff = 0.15
assignments = ra.load("./data/assignments.h5")
distances = ra.load("./data/distances.h5")
ss.save_states(assignments, distances, save_routine='masses',largest_center=dist_cutoff, n_confs=1, n_procs=64)

I think fast.msm_gen import save_states as ss is analogous to enspara save_states.py

justinrporter commented 3 years ago

I see. Your cluster centers should be all the frames where distances == 0 (try ra.where(distances == 0) to get the list of pairs (trajectory_id, frame_id)). You can then use cluster.util.load_frames to load them up into a single trajectory and then save that (I do as an .h5 file... this will be smaller on disk than pdbs and likely easier to work with anyway).

Not exactly what you asked, but maybe gets you where you want to go?