Order of labels returned from parallel resampling

alexbw commented 11 years ago

I have a method to get the labels from a model, stored in self.hsmm_model

    # Assumes `import numpy as np` at module level.
    def _get_labels(self):
        if not self.parallel:
            # Serial case: a single states object holds every label.
            labels_ = self.hsmm_model.states_list[0].stateseq
        else:
            # data_ids = [s.data_id for s in self.hsmm_model.states_list]
            # Parallel case: concatenate the per-chunk state sequences,
            # relying on states_list being in the original data order.
            labels_ = np.hstack(
                [s.stateseq for s in self.hsmm_model.states_list])
        return labels_

You'll note that I commented out data_ids, because at some point, it became unnecessary to keep track of the data_ids when retrieving properties. Now it seems that's been reverted.

Here's why I think that: I'm showing you a representation of roughly 600K frames from the full OFA dataset. The x-axis is time; the y-axis of each plot is explained below.

The top row represents syllable usage; dark horizontal streaks mean heavy usage of one syllable. The middle row shows which mouse should be present in the dataset. Below that are ticks dividing the dataset into 8 even parts, marking where the data was split before being handed off to the IPython clients. You'll see that those ticks demarcate obvious breaks in the syllable usage plot. Those breaks should (from experience) fall along the mouse boundaries, not the arbitrary data boundaries. I think the labels are not being reassembled in the right order after the data is split up for the clients.
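For reference, a minimal sketch of a reassembly that does not depend on the order of states_list, assuming each states object still carries the data_id seen in the commented-out line and that those ids are integers assigned in split order. FakeStates and labels_in_data_order are hypothetical names for illustration, and plain lists stand in for the arrays that np.hstack would join:

```python
from itertools import chain


class FakeStates:
    """Hypothetical stand-in for a states object with data_id and stateseq."""

    def __init__(self, data_id, stateseq):
        self.data_id = data_id
        self.stateseq = stateseq


def labels_in_data_order(states_list):
    """Concatenate state sequences sorted by data_id, not by list position."""
    ordered = sorted(states_list, key=lambda s: s.data_id)
    return list(chain.from_iterable(s.stateseq for s in ordered))


# A states_list whose order was scrambled somewhere along the way.
scrambled = [
    FakeStates(2, [5, 5]),
    FakeStates(0, [1, 1]),
    FakeStates(1, [3, 3]),
]
labels = labels_in_data_order(scrambled)
```

Sorting on an explicit id costs little and makes the result independent of whatever order the list ends up in.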

Has anything changed in this part of the code?

alexbw commented 11 years ago

Here's a better picture where I also indicate individual mice, not just the strains:

mattjj commented 11 years ago

Yes, it definitely can shuffle the order of the states list; I forgot because we haven't looked at this code in a while.

This line is the culprit: https://github.com/mattjj/pyhsmm/blob/master/models.py#L220

I'm going to make a change to the pyhsmm code, then pull it through to this repo. The new semantics will be that states_list always stays in the order of add_data calls. Soon.

mattjj commented 11 years ago

I made two fixes.

First, I now only do the random selection if numtoresample is not set to its default ('all'), so there's no scrambling in the default case: https://github.com/mattjj/pyhsmm/blob/master/models.py#L221

Second, for the numtoresample != 'all' case, I just save the order of states_list at the start and then restore it: https://github.com/mattjj/pyhsmm/blob/master/models.py#L233
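The two fixes can be sketched roughly as follows. This is an illustration of the save-and-restore idea only, not the code in models.py; select_and_restore, its seed parameter, and the string placeholders for states objects are all hypothetical:

```python
import random


def select_and_restore(states_list, numtoresample="all", seed=None):
    """Hypothetical helper: pick sequences to resample without leaving
    states_list in a scrambled order."""
    if numtoresample == "all":
        # Default case: no random selection, so nothing can be scrambled.
        return list(states_list)

    original_order = list(states_list)    # save the order at the start
    rng = random.Random(seed)
    rng.shuffle(states_list)              # in-place shuffle (stand-in)
    chosen = states_list[:numtoresample]  # random subset to resample
    states_list[:] = original_order       # restore the saved order
    return chosen


states = ["seq0", "seq1", "seq2", "seq3"]
chosen = select_and_restore(states, numtoresample=2, seed=0)
```

After the call, states is back in its original order regardless of which two entries were chosen.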

I tested it by running examples/hsmm-parallel.py and checking that the data array hashes (their memory addresses) are always in the same order in states_list.
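A check along these lines could look like the sketch below. It assumes each states object exposes its data array as a .data attribute, and it uses id(), which in CPython is the object's memory address. FakeStates and data_identity_order are hypothetical names, and the resampling step is left as a placeholder comment:

```python
class FakeStates:
    """Hypothetical stand-in for a states object holding a data array."""

    def __init__(self, data):
        self.data = data


def data_identity_order(states_list):
    """Return the identity of each data object, in states_list order."""
    return [id(s.data) for s in states_list]


states_list = [FakeStates([float(i)] * 3) for i in range(4)]

before = data_identity_order(states_list)
# ... resampling iterations would run here ...
after = data_identity_order(states_list)
order_unchanged = before == after

# A reordered list is detected because the identities come out differently.
reordered_detected = data_identity_order(states_list[::-1]) != before
```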

So states_list will always stay in the add_data order now.

alexbw commented 11 years ago

The proof's in the plot:

Looks good to me.

dattalab / pyhsmm-library-models