dattalab / pyhsmm-library-models

library models built on top of pyhsmm

Memory surge (in sparse transition matrix code) #50

Closed alexbw closed 10 years ago

alexbw commented 10 years ago

Hit Ctrl-C in the parallel code; the interrupt landed within the state resampling. Going to break inside the serial version now. The problem is that if I don't hit Ctrl-C fast enough, the machine locks up.

^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/home/alexbw/Code/pyhsmm_library_models/real_data_plots/parallel-library-subhmms.py in <module>()
    142 for itr in progprint_xrange(num_iter,perline=1):
    143     print "About to enter resample_model_parallel"
--> 144     model.resample_model_parallel()
    145     print "Resampled model, now getting likelihoods"
    146     loglike = model.log_likelihood()/len(training_data)

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.pyc in resample_model_parallel(self, *args, **kwargs)
    492     def resample_model_parallel(self,*args,**kwargs):
    493         self.resample_dur_distns()
--> 494         super(HSMM,self).resample_model_parallel(*args,**kwargs)
    495
    496     def _get_parallel_kwargss(self,states_objs):

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.pyc in resample_model_parallel(self, numtoresample, temp)
    185         # actually resample the states
    186         self.states_list = self.resample_states_parallel(
--> 187                 states_to_resample,states_to_hold_out,temp=temp)
    188
    189         # add back the held-out states

/home/alexbw/Code/pyhsmm_library_models/library_subhmm_models.pyc in resample_states_parallel(self, states_to_resample, states_to_hold_out, temp)
    102                 [s._frozen_aBls[0] for s in states_to_resample],
    103                 kwargss=self._get_parallel_kwargss(states_to_resample),
--> 104                 engine_globals=dict(global_model=self,temp=temp))
    105
    106         for s, (big_stateseq,like) in zip(states_to_resample,raw):

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/parallel.pyc in map_on_each(fn, added_datas, kwargss, engine_globals)
     97     ars = [c[data_residency[data_id]].apply_async(_call,fn,data_id,**kwargs)
     98                     for data_id, data, kwargs in indata]
---> 99     dv.wait(ars)
    100     results = [ar.get() for ar in ars]
    101

/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.pyc in wait(self, jobs, timeout)

/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.pyc in sync_results(f, self, *args, **kwargs)
     61     self._in_sync_results = True
     62     try:
---> 63         ret = f(self, *args, **kwargs)
     64     finally:
     65         self._in_sync_results = False

/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.pyc in wait(self, jobs, timeout)
    278         if jobs is None:
    279             jobs = self.history
--> 280         return self.client.wait(jobs, timeout)
    281
    282     def abort(self, jobs=None, targets=None, block=None):

/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.pyc in wait(self, jobs, timeout)
   1078             if timeout >= 0 and ( time.time()-tic ) > timeout:
   1079                 break
-> 1080             time.sleep(1e-3)
   1081             self.spin()
   1082         return len(theids.intersection(self.outstanding)) == 0

KeyboardInterrupt:

alexbw commented 10 years ago

The interrupt above landed in dv.wait, where the client is only waiting on the engines, which indicates the memory problem is in the engines.
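To compare how fast memory grows from one resampling iteration to the next, the peak resident set size can be recorded after each step. The following is a minimal sketch, not code from this repo: `peak_rss_mib` and `run_with_memory_log` are hypothetical helper names, and `step` stands in for something like `model.resample_model`. It relies on the standard-library `resource` module, so it is Unix only.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024.0 ** 2 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def run_with_memory_log(step, num_iter):
    """Call step() num_iter times, recording peak RSS after each call."""
    readings = []
    for itr in range(num_iter):
        step()
        readings.append(peak_rss_mib())
        print("iteration %d: peak RSS %.1f MiB" % (itr, readings[-1]))
    return readings


if __name__ == "__main__":
    hoard = []  # stand-in workload that keeps allocating
    run_with_memory_log(lambda: hoard.append(bytearray(10 ** 7)), num_iter=5)
```

This only measures the calling process. In the parallel case each engine is a separate process, so the same function would have to run on the engines and report back for the numbers to be comparable.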

Also, I'm running in serial now. In serial, the memory creeps up more slowly, another indication that the memory allocation is happening in the engines. Here's where the interrupt landed in serial mode:

^C^C^C^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/home/alexbw/Code/pyhsmm_library_models/real_data_plots/parallel-library-subhmms.py in <module>()
    143     print "About to enter resample_model_parallel"
    144     # model.resample_model_parallel()
--> 145     model.resample_model()
    146     print "Resampled model, now getting likelihoods"
    147     loglike = model.log_likelihood()/len(training_data)

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.pyc in resample_model(self, **kwargs)
    469     def resample_model(self,**kwargs):
    470         self.resample_dur_distns()
--> 471         super(HSMM,self).resample_model(**kwargs)
    472
    473     def resample_dur_distns(self):

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.pyc in resample_model(self, temp)
    120         self.resample_trans_distn()
    121         self.resample_init_state_distn()
--> 122         self.resample_states(temp=temp)
    123
    124     def resample_obs_distns(self):

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.pyc in resample_states(self, temp)
    138     def resample_states(self,temp=None):
    139         for s in self.states_list:
--> 140             s.resample(temp=temp)
    141
    142     def copy_sample(self):

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/internals/states.pyc in resample(self, temp)
   1344         # TODO something with temperature
   1345         self._remove_substates_from_subHMMs()
-> 1346         alphan = self.messages_forwards_normalized()
   1347         self.sample_backwards_normalized(alphan)
   1348

/home/alexbw/Code/pyhsmm_library_models/pyhsmm/internals/states.pyc in messages_forwards_normalized(self)
   1324                 self.rs,self.ps,
   1325                 self.subhmm_trans_matrices,self.subhmm_pi_0s,
-> 1326                 self.aBls,self._alphan)
   1327
   1328         return self._alphan

KeyboardInterrupt:

alexbw commented 10 years ago

Evidence points to https://github.com/dattalab/pyhsmm/blob/subhmms/internals/subhmm_messages.cpp#L216

I added log statements around all the Python, Cython and C++ code. The log statement before that line is the last one to print before the hang. Now double-checking the inner loop and letting things run a bit further.
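For reference, bracketing a call with log lines only localizes a hang if each line reaches disk before the call starts. Here is a rough sketch of that idea on the Python side, assuming a plain log file; `bracket`, `unmatched_labels` and `demo` are hypothetical names, not part of pyhsmm. Each line is flushed and fsynced immediately, and the "exit" line is deliberately not written in a `finally` block, so an interrupted call leaves its "enter" line unmatched.

```python
import contextlib
import os
import tempfile
import time


def _write_now(log_file, message):
    """Write one timestamped line and force it to disk before returning."""
    log_file.write("%.3f %s\n" % (time.time(), message))
    log_file.flush()
    os.fsync(log_file.fileno())


@contextlib.contextmanager
def bracket(label, log_file):
    """Log entry and exit around a block of code."""
    _write_now(log_file, "enter " + label)
    # No try/finally on purpose: an exception or interrupt skips the exit line.
    yield
    _write_now(log_file, "exit " + label)


def unmatched_labels(path):
    """Return labels that have an 'enter' line but no later 'exit' line."""
    open_labels = []
    with open(path) as f:
        for line in f:
            _, kind, label = line.rstrip("\n").split(" ", 2)
            if kind == "enter":
                open_labels.append(label)
            elif label in open_labels:
                open_labels.remove(label)
    return open_labels


def demo():
    """Bracket one call that finishes and one that is interrupted."""
    path = os.path.join(tempfile.mkdtemp(), "trace.log")
    with open(path, "a") as log_file:
        with bracket("fast_step", log_file):
            pass
        try:
            with bracket("stuck_step", log_file):
                raise KeyboardInterrupt  # stands in for Ctrl-C during a hang
        except KeyboardInterrupt:
            pass
    return unmatched_labels(path)
```

The fsync on every line is slow, so this is something to wrap around a handful of suspect calls rather than an inner loop.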

mattjj commented 10 years ago

We got through this; it's about as memory-lean as it can be.

alexbw commented 10 years ago

After investigating, everything's working properly; these models are just huge. Closing.
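As a back-of-the-envelope check on "huge", here is a sketch of the arithmetic under one assumption that is not confirmed anywhere in this thread: that the normalized forward messages are held as a dense float64 array with one row per time step and one column per state. `dense_array_gib` is a hypothetical helper, and the sizes in the loop are made up for illustration.

```python
def dense_array_gib(num_timesteps, num_states, bytes_per_entry=8):
    """Size in GiB of a dense (num_timesteps x num_states) array."""
    return num_timesteps * num_states * bytes_per_entry / float(2 ** 30)


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the thread.
    for T, N in [(10 ** 5, 10 ** 3), (10 ** 6, 10 ** 3), (10 ** 6, 10 ** 4)]:
        print("T=%d, N=%d: %.2f GiB" % (T, N, dense_array_gib(T, N)))
```

If the assumption holds, the footprint scales linearly in both the sequence length and the total number of states, and in the parallel case every engine that holds such an array pays for it separately.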