SpineML / SpineML_2_BRAHMS

Code generation scripts to run a SpineML neural model using BRAHMS on standard CPUs
http://spineml.github.io/simulators/BRAHMS/

Per-connection delays between populations cause Iceberg execution to fail silently #35

Closed: dbux closed this issue 7 years ago

dbux commented 7 years ago

When using a connection list on a projection between neural populations, setting a per-connection delay will cause execution of the model to silently and suddenly fail on Iceberg. Executing the same model using a global delay works as expected.

See commit 2f6c9b8 on branch weibull of abrg_local: The models 'Test_network-nogap-fixeddelay' and 'Test_network-nogap' are the same except that the former has fixed global delays for projections between populations.

Both models execute normally on my desktop machine. Running the first model on Iceberg completes as normal. Running the second model on Iceberg causes the run_brahms process to terminate prematurely with no error message; the log looks something like this:

```
W2 3.682291 EVENT_INIT_POSTCONNECT on D2_MSNs
W2 3.682322 ...OK
W2 3.682351 thread W2 IDLE...
C 3.682417 synchronizing with peers...
W1 3.682513 thread W1 IDLE...
W2 3.683275 thread W2 IDLE...
W1 3.683427 WORKER MAIN LOOP ENTER
W2 3.688032 WORKER MAIN LOOP ENTER
C 3.692110 enter main loop and release worker threads
```

sebjameswml commented 7 years ago

Hi Dave, can you find out if the versions of brahms and SpineML_2_BRAHMS are the same both on your desktop PC and on your Iceberg account?

dbux commented 7 years ago

I don't know if there's a command that returns the current version number, but I get an 'Already up-to-date' message on both machines when doing 'git pull' for all of SpineML_2_BRAHMS, SpineML_PreFlight and BRAHMS.
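For anyone else comparing versions this way: 'git pull' reporting "Already up-to-date" only means each machine is current with its own remote, not that the two working copies are on the same commit. A minimal sketch of a stricter check, using only standard git commands (demonstrated here in a throwaway repo; the same `git rev-parse` calls apply to the real SpineML_2_BRAHMS, SpineML_PreFlight and BRAHMS checkouts on each machine):

```shell
# Build a throwaway repo purely to demonstrate the commands.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Print the exact commit the working copy is on; run this in each
# checkout on both machines and compare the hashes.
git rev-parse HEAD          # full 40-character commit hash
git rev-parse --short HEAD  # abbreviated form
```

If the full hashes match on the desktop and on Iceberg, both machines really are running the same code.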

sebjameswml commented 7 years ago

Ok, that's not the issue then.

dbux commented 7 years ago

It turns out that either the connection list delay problem isn't the only one, or there's a different problem that's manifesting in different ways.

In the model Integrated_microcircuit_BG/Integrated_test on commit 6d9fda6 on branch weibull, I've stripped away everything apart from the MSN populations and a spike source input, and even with global fixed delays on all connections, the model won't finish running on Iceberg.

The problem seems to be in the connections between the D1 and D2 MSNs: removing any one of the D1-D1, D1-D2, D2-D1 or D2-D2 projections allows the model to run, so it's not a loop problem.

As far as I can tell, the only major difference between 'Test_network-nogap-fixeddelay' (which works) and 'Integrated_test' is the number of MSNs (and spike inputs), but I don't get any out-of-memory errors, and I've tried dramatically increasing the amount of memory requested by my script without any effect. This isn't the 85,000-neuron model either; this is the one that used to work fine and which has generated all my data so far.

dbux commented 7 years ago

I've also tried:

- Running the model with randomised global delays: it still terminates unexpectedly.
- Running the model with delays set at 0.1, 0.2, 0.3 and 0.5 ms to stagger the connections: the run_brahms process continues for a long time as if the model were running, but eventually quits without producing any data or any logfile whatsoever.

dbux commented 7 years ago

So! It turns out that this was all due to jobs running out of memory: something changed on Iceberg so that my memory allocation requests are no longer inherited by convert_script_s2b.

Using the -n option on convert_script_s2b now appears to start a process with the default memory allocation instead of whatever was allocated to the job that called it.
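For the record, a hypothetical job-script sketch of the workaround this implies: since the allocation is no longer inherited, state the memory request explicitly on the job that ends up running convert_script_s2b. Iceberg used Sun Grid Engine at the time; the resource names below (`rmem`, `h_rt`) are assumptions based on the Sheffield HPC documentation, not taken from this thread, so check them against the current cluster docs.

```shell
#!/bin/bash
# Hypothetical SGE job-script sketch (resource names are assumptions).
#$ -l rmem=8G        # real memory the job may use
#$ -l h_rt=08:00:00  # wall-clock limit

# Since convert_script_s2b no longer inherits the parent job's
# allocation, make sure the step that invokes it carries its own
# explicit memory request, as above.
convert_script_s2b -n ...
```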

sebjameswml commented 6 years ago

Hi Dave, thanks for closing this. I'm sorry I couldn't come over and assist with it; we're in the summer holidays and so things are pretty busy at home, and I've got an upcoming paper deadline...

dbux commented 6 years ago

No worries! It ended up not even being SpineML related anyway 😂

Good luck with the paper!