Open EllaNguyen1711 opened 3 years ago
Hi! Thanks for taking the time to make a contribution!
We use enspara on F@H datasets all the time, so I'm surprised you ran in to trouble with this function.
I'm also surprised that changing it from np.hstack
to np.concatenate(..., axis=1)
fixed it, since (I thought, at least) those are the same!
Could you post a minimal example of the problem you were having?
Actually, changing it from np.hstack to np.concatenate(..., axis=1) is not enough, I had to delete a part of the code used for checking whether the input data had the shape 1d or not (from line 149 to 154). And then it was successful to build MSM for @.*** input data that I had.
[image: image.png]
So when you got the dataset that does not have the shape of (n_traj, n_frames) but only (n_traj,) due to different number frames in each trajectory. It would raise this error. Sorry that I have found that there is code for transforming from 2d to 1d in your code before using np.hstack.
However, it still raise an error for the case when we have trajectories of different number frames as input if codes from the line 149 to 154 is kept.
Sorry, I'm still having some trouble figuring out your situation. Because you responded with email, your image didn't make it into your post and GH censored the F@H dataset as "@.***" (thinking that it was an email address, I guess?).
I had to delete a part of the code used for checking whether the input data had the shape 1d
This change is not in your PR.
More importantly, I find it very strange that your data triggered the condition if len(assigns.shape) == 1
. The only time you assignments could reasonably be 1d (i.e. shape == (n_frames,)
rather than shape == (n_trjs, n_frames)
) is if you have only one trajectory.
I strongly suspect there is some issue with how you've loaded the data. Perhaps you have loaded your data concatenated, so that each trajectory is appended end-to-end with the last? If so, this is unlikely to be what you want, since it will create artifactual transitions between the last frames of trajectory n and the first few frames of trajectory n+1.
Cheers!
Thanks again for your reply. Maybe posting the error will help clarify the issue:
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
100%|█████████████████████████████████████████████████████████████████████████████| 1301/1301 [00:02<00:00, 620.61it/s]
MSM_building_05.py:85: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
cluster_assig = np.array(cluster_assig)
MSM_building_05.py:89: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
assig = np.array(assig)
(25500, 1)
MSM is being built ...
Traceback (most recent call last):
File "MSM_building_05.py", line 92, in <module>
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
File "MSM_building_05.py", line 61, in assigns_to_counts1
'The given assignments array has 1-dimensional shape %s. '
enspara.exception.DataInvalid: The given assignments array has 1-dimensional shape %s. Two dimensional shapes = (n_trj, n_frames) are expected. If this is really what you want, try using assignments.reshape(1, -1) to create a single-row 2d array.
I believe this happens due to the length of each trajectory in the array having a different number of frames. Perhaps you are right that I am loading the data in a way the program was not expecting, but, as far as I can tell, each trajectory has different lengths and that this should be the correct way to load it (for the reason you stated about having artifactual transitions). The numpy array, therefore, shows a shape of (n_frames, ) due to the second dimension having different shapes for each item.
Thanks for looking into this.
What code are you running that reaches this error? Can you post the contents of MSM_building_05.py
as well?
import mdtraj as md
import pyemma.coordinates as coor
import numpy as np
import pickle
import pyemma
import numbers
import os
import enspara
import h5py
import pandas as pd
import numpy as np
import logging
import scipy
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse.csgraph import connected_components
from enspara import exception
from enspara.msm import MSM, builders, transition_matrices
from enspara.msm.transition_matrices import _transitions_helper
import pyemma.plots as pyemma_plots
from pyemma.util.contexts import settings
def assigns_to_counts1(
assigns, lag_time, max_n_states=None, sliding_window=True):
"""Count transitions between states in a single trajectory.
Parameters
----------
assigns : array, shape=(traj_len, )
A 2-D array where each row is a trajectory consisting of a
sequence of state indices.
lag_time : int
The lag time (i.e. observation interval) for counting
transitions.
max_n_states : int, default=None
The number of states. This is useful for controlling the
dimensions of the transition count matrix in cases where the
input trajectory does not necessarily visit every state.
sliding_window : bool, default=True
Whether to use a sliding window for counting transitions or to
take every lag_time'th state.
Returns
-------
C : array, shape=(n_states, n_states)
A transition count matrix.
"""
if not isinstance(lag_time, numbers.Integral):
raise exception.DataInvalid(
"The lag time must be an integer. Got %s type %s." %
lag_time, type(lag_time))
if lag_time < 1:
raise exception.DataInvalid(
"Lag times must be be strictly greater than 0. Got '%s'." %
lag_time)
# if it's 1d, later stuff will fail
if len(assigns.shape) == 1:
raise exception.DataInvalid(
'The given assignments array has 1-dimensional shape %s. '
'Two dimensional shapes = (n_trj, n_frames) are expected. '
'If this is really what you want, try using '
'assignments.reshape(1, -1) to create a single-row 2d array.')
assigns = np.array([a[np.where(a != -1)] for a in assigns])
if max_n_states is None:
max_n_states = np.concatenate(assigns).max() + 1
transitions = [
_transitions_helper(
assign, lag_time=lag_time, sliding_window=sliding_window)
for assign in assigns]
# generate sparse matrix
mat_coords = np.hstack(transitions)
mat_data = np.ones(mat_coords.shape[1], dtype=int)
C = scipy.sparse.coo_matrix(
(mat_data, mat_coords), shape=(max_n_states, max_n_states))
return C
#Building MSM
msm_lag = 8
cluster_numbers = 5000
cluster_assig = coor.load('clustering/chi1_2_5000_trajs_n_stride_5.h5')
cluster_assig = np.array(cluster_assig)
assig = []
for frame in cluster_assig:
assig.append(frame.astype(int))
assig = np.array(assig)
print (assig[0].shape)
print ('MSM is being built ...')
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
prior_counts = 1/tcounts.shape[0]
tcounts = builders._apply_prior_counts(tcounts, prior_counts)
probs = builders._row_normalize(tcounts)
eq_probs_ = builders.eq_probs(probs)
print ('transition maxtrix: ', tcounts)
print ('transition probabilities: ', probs)
print ('equilibrium probabilities: ', eq_probs_)
np.save (f'MSM/tcounts_{cluster_numbers}_{msm_lag}.npy', tcounts)
np.save (f'MSM/tprobs_{cluster_numbers}_{msm_lag}.npy', probs)
np.save (f'MSM/populations_{cluster_numbers}_{msm_lag}.npy', eq_probs_)
print('MSM was built successfully')
Here you are!
Without sitting down with this code and it's inputs and debugging more carefully, I'm not sure what's going on.
Needless to say, this isn't how we intended the code to be used. It's just a convention, but mostly you shouldn't need to use the methods with names beginning with _
.
Building an MSM from your assignments should be as easy as shown here!
Oh, also, if you are looking to do prior counts, check out the example here! It uses the built-in "partial" method to wrap normalize
with the pertinent prior_counts
value.
This change allows loading trajectory assignment arrays of different shapes. Previously an error would be thrown by np.hstack if assignment arrays of different shapes were used, which is common when using folding@home data sets (for example).