bowman-lab / enspara

Modeling molecular ensembles with scalable data structures and parallel computing
https://enspara.readthedocs.io
GNU General Public License v3.0
33 stars 16 forks source link

Change np.hstack to np.concatenate #207

Open EllaNguyen1711 opened 3 years ago

EllaNguyen1711 commented 3 years ago

This change allows loading trajectory assignment arrays of different shapes. Previously an error would be thrown by np.hstack if assignment arrays of different shapes were used, which is common when using folding@home data sets (for example).

justinrporter commented 3 years ago

Hi! Thanks for taking the time to make a contribution!

We use enspara on F@H datasets all the time, so I'm surprised you ran in to trouble with this function.

I'm also surprised that changing it from np.hstack to np.concatenate(..., axis=1) fixed it, since (I thought, at least) those are the same!

Could you post a minimal example of the problem you were having?

EllaNguyen1711 commented 3 years ago

Actually, changing it from np.hstack to np.concatenate(..., axis=1) is not enough, I had to delete a part of the code used for checking whether the input data had the shape 1d or not (from line 149 to 154). And then it was successful to build MSM for @.*** input data that I had.

[image: image.png]

So when you got the dataset that does not have the shape of (n_traj, n_frames) but only (n_traj,) due to different number frames in each trajectory. It would raise this error. Sorry that I have found that there is code for transforming from 2d to 1d in your code before using np.hstack.

However, it still raise an error for the case when we have trajectories of different number frames as input if codes from the line 149 to 154 is kept.

justinrporter commented 3 years ago

Sorry, I'm still having some trouble figuring out your situation. Because you responded with email, your image didn't make it into your post and GH censored the F@H dataset as "@.***" (thinking that it was an email address, I guess?).

I had to delete a part of the code used for checking whether the input data had the shape 1d

This change is not in your PR.

More importantly, I find it very strange that your data triggered the condition if len(assigns.shape) == 1. The only time you assignments could reasonably be 1d (i.e. shape == (n_frames,) rather than shape == (n_trjs, n_frames)) is if you have only one trajectory.

I strongly suspect there is some issue with how you've loaded the data. Perhaps you have loaded your data concatenated, so that each trajectory is appended end-to-end with the last? If so, this is unlikely to be what you want, since it will create artifactual transitions between the last frames of trajectory n and the first few frames of trajectory n+1.

Cheers!

EllaNguyen1711 commented 3 years ago

Thanks again for your reply. Maybe posting the error will help clarify the issue:

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
100%|█████████████████████████████████████████████████████████████████████████████| 1301/1301 [00:02<00:00, 620.61it/s]
MSM_building_05.py:85: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  cluster_assig = np.array(cluster_assig)
MSM_building_05.py:89: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  assig = np.array(assig)
(25500, 1)
MSM is being built ...
Traceback (most recent call last):
  File "MSM_building_05.py", line 92, in <module>
    tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
  File "MSM_building_05.py", line 61, in assigns_to_counts1
    'The given assignments array has 1-dimensional shape %s. '
enspara.exception.DataInvalid: The given assignments array has 1-dimensional shape %s. Two dimensional shapes = (n_trj, n_frames) are expected. If this is really what you want, try using assignments.reshape(1, -1) to create a single-row 2d array.

I believe this happens due to the length of each trajectory in the array having a different number of frames. Perhaps you are right that I am loading the data in a way the program was not expecting, but, as far as I can tell, each trajectory has different lengths and that this should be the correct way to load it (for the reason you stated about having artifactual transitions). The numpy array, therefore, shows a shape of (n_frames, ) due to the second dimension having different shapes for each item.

Thanks for looking into this.

justinrporter commented 3 years ago

What code are you running that reaches this error? Can you post the contents of MSM_building_05.py as well?

EllaNguyen1711 commented 3 years ago
import mdtraj as md
import pyemma.coordinates as coor
import numpy as np
import pickle
import pyemma
import numbers
import os
import enspara
import h5py
import pandas as pd
import numpy as np
import logging
import scipy
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse.csgraph import connected_components
from enspara import exception
from enspara.msm import MSM, builders, transition_matrices
from enspara.msm.transition_matrices import _transitions_helper
import pyemma.plots as pyemma_plots
from pyemma.util.contexts import settings

def assigns_to_counts1(
        assigns, lag_time, max_n_states=None, sliding_window=True):
    """Count transitions between states in a single trajectory.
    Parameters
    ----------
    assigns : array, shape=(traj_len, )
        A 2-D array where each row is a trajectory consisting of a
        sequence of state indices.
    lag_time : int
        The lag time (i.e. observation interval) for counting
        transitions.
    max_n_states : int, default=None
        The number of states. This is useful for controlling the
        dimensions of the transition count matrix in cases where the
        input trajectory does not necessarily visit every state.
    sliding_window : bool, default=True
        Whether to use a sliding window for counting transitions or to
        take every lag_time'th state.
    Returns
    -------
    C :  array, shape=(n_states, n_states)
        A transition count matrix.
    """

    if not isinstance(lag_time, numbers.Integral):
        raise exception.DataInvalid(
            "The lag time must be an integer. Got %s type %s." %
            lag_time, type(lag_time))
    if lag_time < 1:
        raise exception.DataInvalid(
            "Lag times must be be strictly greater than 0. Got '%s'." %
            lag_time)       
    # if it's 1d, later stuff will fail
    if len(assigns.shape) == 1:
        raise exception.DataInvalid(
            'The given assignments array has 1-dimensional shape %s. '
            'Two dimensional shapes = (n_trj, n_frames) are expected. '
            'If this is really what you want, try using '
            'assignments.reshape(1, -1) to create a single-row 2d array.')
    assigns = np.array([a[np.where(a != -1)] for a in assigns])

    if max_n_states is None:
        max_n_states = np.concatenate(assigns).max() + 1

    transitions = [
        _transitions_helper(
            assign, lag_time=lag_time, sliding_window=sliding_window)
        for assign in assigns]
    # generate sparse matrix
    mat_coords = np.hstack(transitions)
    mat_data = np.ones(mat_coords.shape[1], dtype=int)
    C = scipy.sparse.coo_matrix(
        (mat_data, mat_coords), shape=(max_n_states, max_n_states))
    return C
#Building MSM
msm_lag = 8 
cluster_numbers = 5000
cluster_assig = coor.load('clustering/chi1_2_5000_trajs_n_stride_5.h5')
cluster_assig = np.array(cluster_assig)
assig = []
for frame in cluster_assig:
    assig.append(frame.astype(int))
assig = np.array(assig)
print (assig[0].shape)
print ('MSM is being built ...')
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
prior_counts = 1/tcounts.shape[0]
tcounts = builders._apply_prior_counts(tcounts, prior_counts)
probs = builders._row_normalize(tcounts)
eq_probs_ = builders.eq_probs(probs)

print ('transition maxtrix: ', tcounts)
print ('transition probabilities: ', probs)
print ('equilibrium probabilities: ', eq_probs_)

np.save (f'MSM/tcounts_{cluster_numbers}_{msm_lag}.npy', tcounts)
np.save (f'MSM/tprobs_{cluster_numbers}_{msm_lag}.npy', probs)
np.save (f'MSM/populations_{cluster_numbers}_{msm_lag}.npy', eq_probs_)

print('MSM was built successfully')

Here you are!

justinrporter commented 3 years ago

Without sitting down with this code and it's inputs and debugging more carefully, I'm not sure what's going on.

Needless to say, this isn't how we intended the code to be used. It's just a convention, but mostly you shouldn't need to use the methods with names beginning with _.

Building an MSM from your assignments should be as easy as shown here!

justinrporter commented 3 years ago

Oh, also, if you are looking to do prior counts, check out the example here! It uses the built-in "partial" method to wrap normalize with the pertinent prior_counts value.