dattalab / pyhsmm-library-models

library models built on top of pyhsmm

Parallel sampling in library_subhmm_models fails #32

Closed: alexbw closed this issue 11 years ago

alexbw commented 11 years ago
    191     kwargs = {}
    192     for key in info['kw_keys']:
--> 193         kwarg, kwarg_bufs = unserialize_object(kwarg_bufs, g)
    194         kwargs[key] = kwarg
    195     assert not kwarg_bufs, "Shouldn't be any kwarg bufs left over"
/home/abw11/anaconda/lib/python2.7/site-packages/IPython/kernel/zmq/serialize.pyc in unserialize_object(buffers, g)
    123         # a zmq message
    124         pobj = bytes(pobj)
--> 125     canned = pickle.loads(pobj)
    126     if istype(canned, sequence_types) and len(canned) < MAX_ITEMS:
    127         for c in canned:
ImportError: No module named library_subhmm_models

Can't seem to find the module upon unpickling, which is silly, because I tell it explicitly to import like this:

dviews = Client(profile='default')[:]
dviews.execute("import pyhsmm_library_models.library_subhmm_models as library_subhmm_models").get() # We slap on a get() to make sure it doesn't error
alexbw commented 11 years ago

The data added is fine, as far as I can tell.

alexbw commented 11 years ago

Now it's just hanging there, no CPU activity at all.

mattjj commented 11 years ago

Can you run it on jefferson? Or interactively on orchestra?

It works in ~mattjj/work/pyhsmm_library_models on jefferson

alexbw commented 11 years ago

I'm doing it interactively on Orchestra right now.

alexbw commented 11 years ago

I'll run it non-interactively to see.

alexbw commented 11 years ago

It might be a shared compilation issue. I'll look into how to get weave to point to /hsm/scratch1/abw11/tmp/
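
A sketch of one way to do that, assuming scipy.weave respects the PYTHONCOMPILED environment variable for its compile cache (and that the scratch path is writable):

import os

# must be set before weave compiles anything in this process / on the engines
os.environ['PYTHONCOMPILED'] = '/hsm/scratch1/abw11/tmp/'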

alexbw commented 11 years ago

It's hanging for a while, and then errors out. I do not recognize the traceback. I'm just starting a cluster the regular way, ipcluster start --n=8

I don't know what I'm doing wrong here, but it feels like I'm just not starting the cluster in the right way, or precompiling something, or something.

/home/alexbw/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:727: UserWarning: Found matplotlib configuration in ~/.matplotlib/. To conform with the XDG base directory standard, this configuration location has been deprecated on Linux, and the new location is now '/home/alexbw/.config'/matplotlib/. Please move your configuration there to ensure that matplotlib will continue to find it in the future.
  _get_xdg_config_dir())
Adding parallel training data
Beginning our resampling
Traceback (most recent call last):
  File "/home/alexbw/Code/pyhsmm_library_models/real_data_plots/parallel-library-subhmms.py", line 67, in <module>
    model.resample_model_parallel()
  File "/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.py", line 512, in resample_model_parallel
    super(HSMM,self).resample_model_parallel(*args,**kwargs)
  File "/home/alexbw/Code/pyhsmm_library_models/pyhsmm/models.py", line 230, in resample_model_parallel
    self.resample_states_parallel(temp=temp)
  File "/home/alexbw/Code/pyhsmm_library_models/library_subhmm_models.py", line 91, in resample_states_parallel
    engine_globals=dict(global_model=self,temp=temp),
  File "/home/alexbw/Code/pyhsmm_library_models/pyhsmm/parallel.py", line 100, in map_on_each
    results = [ar.get() for ar in ars]
  File "/home/alexbw/anaconda/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 126, in get
    raise self._exception
IPython.parallel.error.RemoteError: SystemError(error return without exception set)
mattjj commented 11 years ago

It definitely looks like a setup issue, since it's failing on the asyncresult get method. Can you verify that the engines' paths include /home/alexbw/Code/pyhsmm_library_models? If it does, maybe you could try a simple test that does the same thing, like

from IPython.parallel import Client

def foo(x):
    return x**2

ars = [dview.apply_async(foo,i) for i,dview in enumerate(Client())]
results = [ar.get() for ar in ars]

That's basically what parallel.py is doing.
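
To check the engines' paths, something along these lines should work (a sketch; it just asks every engine for its sys.path):

from IPython.parallel import Client

def engine_path():
    import sys
    return sys.path

dview = Client()[:]
# one sys.path list per engine; the repo directory should appear in each
paths = dview.apply_sync(engine_path)
print(all('/home/alexbw/Code/pyhsmm_library_models' in p for p in paths))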

alexbw commented 11 years ago

And you ran this yourself on Jefferson? I must be doing something wrong. I'll look into it soon.

mattjj commented 11 years ago

Yes of course! I ran it on my laptop and on jefferson.

Possible problems:

  1. some merge error (this morning I may not have merged things correctly)
  2. some ipython parallel setup error (maybe paths)

alexbw commented 11 years ago

Where do you set your engine paths?

mattjj commented 11 years ago

Since ipython includes the pwd in the path, I just start the (local) engines in that directory.

alexbw commented 11 years ago

Hrm I do that too

mattjj commented 11 years ago

So did that snippet work? It's a few comments up.

alexbw commented 11 years ago

Got sidetracked. Will try again this morning.

alexbw commented 11 years ago

The snippet works on Jefferson and Orchestra.

alexbw commented 11 years ago

I never specified which file was failing, which is my bad. correctness_tests/library-subhmms-parallel.py is working just fine right now. It's real_data_plots/parallel-library-subhmms.py that's giving me the trouble.

mattjj commented 11 years ago

Right now I'm pushing model and (per-engine) data sizes to see where things start failing.

With 6 engines on Jefferson crunching on 6 sequences of length 1000, Nmaxsuper=20, libsize=200, iterations look like this:

Beginning our resampling
.  [  1/50,   98.09sec avg, 4806.63sec ETA ]
.  [  2/50,   69.25sec avg, 3324.03sec ETA ]
.  [  3/50,   54.04sec avg, 2540.09sec ETA ]
.  [  4/50,   45.13sec avg, 2075.76sec ETA ]
.  [  5/50,   39.58sec avg, 1781.06sec ETA ]
.  [  6/50,   35.54sec avg, 1563.56sec ETA ]
.  [  7/50,   32.62sec avg, 1402.82sec ETA ]
.  [  8/50,   30.36sec avg, 1275.32sec ETA ]
.  [  9/50,   28.58sec avg, 1171.72sec ETA ]
.  [ 10/50,   27.12sec avg, 1084.65sec ETA ]
.  [ 11/50,   25.96sec avg, 1012.35sec ETA ]
.  [ 12/50,   24.99sec avg,  949.65sec ETA ]
.  [ 13/50,   24.18sec avg,  894.59sec ETA ]
.  [ 14/50,   23.45sec avg,  844.38sec ETA ]
.  [ 15/50,   22.85sec avg,  799.69sec ETA ]
.  [ 16/50,   22.29sec avg,  758.02sec ETA ]
.  [ 17/50,   21.82sec avg,  719.95sec ETA ]
.  [ 18/50,   21.40sec avg,  684.86sec ETA ]
.  [ 19/50,   21.03sec avg,  652.06sec ETA ]
.  [ 20/50,   20.67sec avg,  620.08sec ETA ]
.  [ 21/50,   20.38sec avg,  590.98sec ETA ]
.  [ 22/50,   20.10sec avg,  562.75sec ETA ]
.  [ 23/50,   19.86sec avg,  536.24sec ETA ]
.  [ 24/50,   19.62sec avg,  510.17sec ETA ]
.  [ 25/50,   19.41sec avg,  485.25sec ETA ]
.  [ 26/50,   19.20sec avg,  460.80sec ETA ]
.  [ 27/50,   19.01sec avg,  437.17sec ETA ]
.  [ 28/50,   18.86sec avg,  414.87sec ETA ]
.  [ 29/50,   18.73sec avg,  393.28sec ETA ]
.  [ 30/50,   18.57sec avg,  371.41sec ETA ]
.  [ 31/50,   18.44sec avg,  350.32sec ETA ]
.  [ 32/50,   18.31sec avg,  329.67sec ETA ]
.  [ 33/50,   18.19sec avg,  309.31sec ETA ]
.  [ 34/50,   18.09sec avg,  289.49sec ETA ]
.  [ 35/50,   17.99sec avg,  269.92sec ETA ]
.  [ 36/50,   17.90sec avg,  250.60sec ETA ]
.  [ 37/50,   17.82sec avg,  231.68sec ETA ]
.  [ 38/50,   17.74sec avg,  212.91sec ETA ]
.  [ 39/50,   17.68sec avg,  194.43sec ETA ]

With 6 sequences of length 2000, iterations look like this:

Beginning our resampling
.  [  1/50,  100.67sec avg, 4932.59sec ETA ]
.  [  2/50,   74.15sec avg, 3559.09sec ETA ]
.  [  3/50,   59.62sec avg, 2802.34sec ETA ]
.  [  4/50,   50.48sec avg, 2322.09sec ETA ]
.  [  5/50,   44.93sec avg, 2021.91sec ETA ]
.  [  6/50,   41.31sec avg, 1817.63sec ETA ]
.  [  7/50,   38.84sec avg, 1670.01sec ETA ]
.  [  8/50,   36.92sec avg, 1550.66sec ETA ]
.  [  9/50,   35.38sec avg, 1450.48sec ETA ]
.  [ 10/50,   34.21sec avg, 1368.39sec ETA ]
.  [ 11/50,   33.25sec avg, 1296.76sec ETA ]
.  [ 12/50,   32.44sec avg, 1232.79sec ETA ]
.  [ 13/50,   31.74sec avg, 1174.21sec ETA ]
.  [ 14/50,   31.15sec avg, 1121.48sec ETA ]
.  [ 15/50,   30.66sec avg, 1073.19sec ETA ]
.  [ 16/50,   30.22sec avg, 1027.45sec ETA ]
.  [ 17/50,   29.89sec avg,  986.28sec ETA ]
alexbw commented 11 years ago

Not double the time. Why?

mattjj commented 11 years ago

It is double time (approximately); check out the later iterations. It's slightly better than double time because of lower relative overhead, I'd guess.

mattjj commented 11 years ago

Here's 6 sequences of length 4000:

Beginning our resampling
.  [  1/50,  100.98sec avg, 4948.05sec ETA ]
WARNING: NegativeBinomialIntegerRVariantDuration: data has zero probability under the model, ignoring
.  [  2/50,   73.80sec avg, 3542.59sec ETA ]
.  [  3/50,   62.39sec avg, 2932.55sec ETA ]
.  [  4/50,   56.03sec avg, 2577.24sec ETA ]
.  [  5/50,   51.93sec avg, 2336.75sec ETA ]
.  [  6/50,   49.35sec avg, 2171.62sec ETA ]
.  [  7/50,   47.60sec avg, 2046.91sec ETA ]
.  [  8/50,   46.42sec avg, 1949.49sec ETA ]
.  [  9/50,   45.37sec avg, 1860.03sec ETA ]
.  [ 10/50,   44.59sec avg, 1783.68sec ETA ]
.  [ 11/50,   43.96sec avg, 1714.42sec ETA ]
.  [ 12/50,   43.39sec avg, 1648.99sec ETA ]
.  [ 13/50,   42.94sec avg, 1588.60sec ETA ]
.  [ 14/50,   42.55sec avg, 1531.78sec ETA ]
.  [ 15/50,   42.19sec avg, 1476.76sec ETA ]
.  [ 16/50,   41.92sec avg, 1425.43sec ETA ]
.  [ 17/50,   41.71sec avg, 1376.54sec ETA ]
.  [ 18/50,   41.51sec avg, 1328.24sec ETA ]
.  [ 19/50,   41.28sec avg, 1279.66sec ETA ]
.  [ 20/50,   41.09sec avg, 1232.59sec ETA ]
.  [ 21/50,   40.92sec avg, 1186.69sec ETA ]
.  [ 22/50,   40.75sec avg, 1141.13sec ETA ]
.  [ 23/50,   40.64sec avg, 1097.38sec ETA ]
.  [ 24/50,   40.51sec avg, 1053.32sec ETA ]
.  [ 25/50,   40.43sec avg, 1010.69sec ETA ]
.  [ 26/50,   40.33sec avg,  968.04sec ETA ]
.  [ 27/50,   40.22sec avg,  925.15sec ETA ]
.  [ 28/50,   40.15sec avg,  883.32sec ETA ]
.  [ 29/50,   40.07sec avg,  841.49sec ETA ]
.  [ 30/50,   40.01sec avg,  800.13sec ETA ]
.  [ 31/50,   39.93sec avg,  758.59sec ETA ]
.  [ 32/50,   39.87sec avg,  717.60sec ETA ]
.  [ 33/50,   39.79sec avg,  676.40sec ETA ]
.  [ 34/50,   39.72sec avg,  635.50sec ETA ]
.  [ 35/50,   39.64sec avg,  594.59sec ETA ]
.  [ 36/50,   39.58sec avg,  554.10sec ETA ]
.  [ 37/50,   39.52sec avg,  513.79sec ETA ]
.  [ 38/50,   39.47sec avg,  473.60sec ETA ]
.  [ 39/50,   39.40sec avg,  433.40sec ETA ]
.  [ 40/50,   39.37sec avg,  393.71sec ETA ]
.  [ 41/50,   39.83sec avg,  358.46sec ETA ]
.  [ 42/50,   39.89sec avg,  319.14sec ETA ]
.  [ 43/50,   40.26sec avg,  281.82sec ETA ]
.  [ 44/50,   40.41sec avg,  242.44sec ETA ]
.  [ 45/50,   40.78sec avg,  203.91sec ETA ]
.  [ 46/50,   40.75sec avg,  163.00sec ETA ]

I think it actually segfaulted right there, since it bailed out of ipython without printing anything. Wtf? Maybe within this data size range I can reproduce the problem consistently and get more info about it.

alexbw commented 11 years ago

This issue was originally about parallel resampling not starting at all. This is now a line of investigation about how large a data sequence can be on a single worker before it craps out. I'd recommend closing this, and reopening a fresh issue that's more akin to benchmarking than bug-fixing.

mattjj commented 11 years ago

Good idea, I'll do that after just a little more investigation of the possible jefferson segfault.

alexbw commented 11 years ago

To get job output in the filesystem that you can cat, you can use this bsub command to grab an interactive node:

bsub -q interactive -W 720 -o job.out -e job.err -R "rusage[mem=50000]" fish

To just get an email with the output, you can do

bsub -q interactive -W 720 -N -R "rusage[mem=50000]" fish

mattjj commented 11 years ago

This was just an ipython parallel memory buildup thing. The solution was to add this to the ipcontroller_config.py file:

c.HubFactory.db_class = 'NoDB'

To be safe, I also added every kind of 'purge' command I could find to pyhsmm's parallel.py. Those are supposed to have the same effect, but they didn't seem to do what the documentation says they should.
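
For reference, a sketch of the kind of purge call I mean (method names vary a bit across IPython versions; purge_results is the one I'm reasonably sure of):

from IPython.parallel import Client

client = Client()
# tell the hub to drop stored results so they don't accumulate in memory
client.purge_results('all')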