bowman-lab / enspara

Modeling molecular ensembles with scalable data structures and parallel computing
https://enspara.readthedocs.io
GNU General Public License v3.0
33 stars 16 forks

Pooled clustering followed by reassignment #199

Closed PeptideSimulator01 closed 5 months ago

PeptideSimulator01 commented 4 years ago

Dear all,

I am trying to cluster 50 trajectories each from 2 different peptides (both 14 residues long). I write only the C, CA, and N atoms to the trajectories so that I can cluster them together; accordingly, I also cluster based only on the RMSD of the C, CA, and N atoms. After clustering, I want to count how many times each cluster was visited by peptide 1 versus peptide 2, and hopefully see a different distribution.

What I do:

python cluster.py \
  --trajectories peptide1.h5 \
  --topology peptide1.pdb \
  --trajectories peptide2.h5 \
  --topology peptide2.pdb \
  --atoms '(name CA or name C or name N)' \
  --algorithm khybrid \
  --cluster-number 10 \
  --distances /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-distances.h5 \
  --center-features /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-centers.pickle \
  --assignments /home/psc/enspara/enspara/apps/fs-khybrid-clusters0020-assignments.h5

If I then trace back which frame from which peptide was used for each cluster, I get the impression that the first 5 clusters are built from the first 50 trajectories (peptide 1) and the last 5 clusters from the 50 trajectories of peptide 2. The centroid structures reflect this as well: the first 5 are peptide 1, the last 5 are peptide 2. Am I making a horrible mistake with this?

Any hint is appreciated, thanks for your effort.

The code I used:

import pandas as pd

df = pd.DataFrame(cl_ass)
peptide1 = df.iloc[0:50]    # first 50 trajectories: peptide 1
peptide2 = df.iloc[50:100]  # last 50 trajectories: peptide 2
count_peptide1 = pd.value_counts(peptide1.values.ravel())
print('peptide1:', count_peptide1)
count_peptide2 = pd.value_counts(peptide2.values.ravel())
print('peptide2:', count_peptide2)

import pickle

with open('fs-khybrid-clusters0020-centers.pickle', 'rb') as infile:
    centers = pickle.load(infile)

# save each of the 10 cluster centers as a PDB file: 0.pdb .. 9.pdb
for i, center in enumerate(centers):
    center.save_pdb(f'{i}.pdb')
Justin-J-Miller commented 5 months ago

Apologies for the delayed response. I'm sure you've moved on from this by now, but leaving this as a note for anyone else who may be curious.

One of the easiest ways to figure out where a cluster center comes from is to provide --center-indices indices.npy (or whatever you want the file to be named) as an input argument to the clustering app. The resulting file will be a 2D array of length n_clusters whose values are (trajectory #, frame #). To deconstruct this, glob the trajectories the same way the clustering app does and map the trajectory numbers in your center-indices array back to the trajectories of interest.
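As a rough sketch of that mapping (the array contents and file names here are made-up placeholders; in practice the indices come from np.load on your --center-indices file, and the trajectory list from sorted(glob(...)) with the same pattern you gave the app):

```python
import numpy as np

# Hypothetical stand-ins: in practice,
#   center_indices = np.load('indices.npy')           # shape (n_clusters, 2)
#   traj_files = sorted(glob('trajectories/*.h5'))    # same order as the app
center_indices = np.array([[0, 120],   # cluster 0: frame 120 of trajectory 0
                           [1, 37]])   # cluster 1: frame 37 of trajectory 1
traj_files = ['peptide1.h5', 'peptide2.h5']

# map each cluster center back to (source trajectory, frame)
center_locations = {
    cluster_id: (traj_files[traj_i], int(frame_i))
    for cluster_id, (traj_i, frame_i) in enumerate(center_indices)
}
print(center_locations)
```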

The easiest way to figure out which frames went to which cluster centers is to parse through the assignments file. In this case, assignments is a ragged array of shape (n_trajectories, n_frames), with each value being the cluster center that frame was assigned to. As before, you can match the assignments file trajectories to the original trajectories by comparing to the same glob pattern as the clustering app does.
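To illustrate with a toy assignments array (the real one is the ragged array stored in your --assignments h5 file; load it with enspara's ragged-array reader), counting how often each peptide's trajectories visit each cluster could look like:

```python
import numpy as np

# Toy stand-in for the assignments array: one row per trajectory, one
# value per frame, each value being the cluster that frame was assigned to.
assignments = np.array([[0, 0, 1],   # trajectory 0 (peptide 1)
                        [0, 1, 1],   # trajectory 1 (peptide 1)
                        [2, 2, 2],   # trajectory 2 (peptide 2)
                        [1, 2, 2]])  # trajectory 3 (peptide 2)

# Assume the first half of the trajectories belong to peptide 1 and the
# second half to peptide 2, matching the order the app globbed them in.
half = len(assignments) // 2
n_clusters = 3
counts_p1 = np.bincount(assignments[:half].ravel(), minlength=n_clusters)
counts_p2 = np.bincount(assignments[half:].ravel(), minlength=n_clusters)
print('peptide1 cluster counts:', counts_p1)
print('peptide2 cluster counts:', counts_p2)
```

If the two distributions differ substantially, that is the signal the original poster was looking for.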