Building genn executable fails after a certain number of successful runs when parameter fitting

TheSalocin commented 4 years ago

I'm attempting to fit parameters for a network using the SNPE algorithm in delfi. So far it runs fine using just brian2 and brian2 with the "cpp_standalone" device. It also in principle runs using brian2genn, but after some number of iterations it fails when building the executable and gives the attached output. According to my device monitor it shouldn't be a memory issue.

brian2genn_errorMessage.txt

mstimberg commented 4 years ago

Thanks for the report. Unfortunately I can't quite see what is going on from the error message. Could you run it again with the debug option enabled? Either pass it to set_device('genn', ..., debug=True), or to device.build if you are calling that explicitly.

TheSalocin commented 4 years ago

Ok, I did that. Since I'm running it via jupyter, I get two outputs: some printed lines, and the actual python error message. Here are both:

brian2genn_errorMessage_printed.txt

brian2genn_python_errorMessage.txt

mstimberg commented 4 years ago

I'm still not quite sure what is going on. You are getting this only on later iterations, right? Do you reinitialize the device between runs?

device.reinit()
device.activate()

TheSalocin commented 4 years ago

Yes, when not reinitialising I get the message that multiple runs aren't supported. The error persists even when defining each object of the network anew during each step (and then running reinit and activate after the run command). I know it only occurs on later iterations since it outputs

running brian code generation ...
building genn executable ...
['/home/nicolas/genn/bin/genn-buildmodel.sh', '-i', '/home/nicolas/Desktop/WS_20-21/Levina_Lab:/home/nicolas/Desktop/WS_20-21/Levina_Lab/GeNNworkspace:/home/nicolas/Desktop/WS_20-21/Levina_Lab/GeNNworkspace/brianlib/randomkit', 'magicnetwork_model.cpp']
executing genn binary on GPU ...

and the output of some print commands I added to track progress, multiple times.

mstimberg commented 4 years ago

> The error persists even when defining each object of the network anew during each step (and then running reinit and activate after the run command).

I am a bit confused by the word "even" – if you use reinit and activate you have to redefine each object.

Could you maybe simplify your code into something simple that exhibits the same problem, i.e. fails after several iterations, that you could then share? Remote diagnostics are always difficult... I still suspect that it has to do with some inconsistency in the generated code. Maybe you could manually delete the GeNNworkspace directory between iterations?
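Deleting the directory between iterations could look roughly like the following. This is a minimal sketch using only the standard library: `clear_workspace` is a hypothetical helper name, and the directory name `GeNNworkspace` is taken from the build output above and assumed to be relative to the current working directory.

```python
import shutil
from pathlib import Path


def clear_workspace(workspace="GeNNworkspace"):
    """Delete the generated-code directory if it exists.

    Returns True if a directory was removed, False if there was nothing to do.
    """
    path = Path(workspace)
    if path.is_dir():
        # Remove the directory and everything below it.
        shutil.rmtree(path)
        return True
    return False
```

Calling `clear_workspace()` at the start of each fitting iteration, before the network objects are redefined, would rule out stale generated files as the cause.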

mstimberg commented 4 years ago

Not directly related to your issue. But if you have an example of your use of Delfi with Brian2 to share (does not have to be your actual code, could be a toy example), this would be a great addition to https://brian.discourse.group/c/showcase/8

TheSalocin commented 4 years ago

Sorry for the late response. Here is the simplest network in which I encounter this error: A single neuron that spikes at regular intervals, with SNPE automatically fitting refractory time and membrane time constant without any defined summary statistics.

Brian2SNPE_minimal.zip

If this is already enough example of how one can use delfi with brian2 I can of course share it on the discourse, otherwise I can upload something a bit more complex.

tnowotny commented 4 years ago

Did the manual deletion of GeNNworkspace make any difference? Beyond that I can't really think of anything that would persist between repeated runs, though I am a bit hazy on the version of GeNN that's involved here; there may be stuff that's compiled in the GeNN directory, as happened in older GeNN versions. I'll boldly involve @neworderofjamie for an opinion.

neworderofjamie commented 4 years ago

This is an interesting one! I can reproduce using the minimal model - after some iterations the model fails with an undefined reference to `_run_spikemonitor_codeobject()' linker error. Digging into the generated code, I see spikemonitor_codeobject.cpp and spikemonitor_codeobject_1.cpp in the code_objects directory, but the Makefile only links in spikemonitor_codeobject_1.cpp, whereas _run_spikemonitor_codeobject() is implemented in spikemonitor_codeobject.cpp. My understanding of how code objects are generated is hazy, but maybe this is helpful to @mstimberg or @tnowotny.
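For anyone who wants to repeat this check on their own workspace, here is a rough sketch that lists the code object sources the Makefile never mentions. It is not part of Brian2GeNN: `unreferenced_sources` is a hypothetical helper, and it assumes the layout described above, i.e. a `Makefile` in the workspace root that names `.cpp` files and a `code_objects` subdirectory next to it.

```python
import re
from pathlib import Path


def unreferenced_sources(workspace):
    """Return names of code_objects/*.cpp files that the Makefile never mentions."""
    workspace = Path(workspace)
    makefile_text = (workspace / "Makefile").read_text()
    # Collect the base names of every .cpp path that appears in the Makefile.
    referenced = {
        Path(match).name
        for match in re.findall(r"[\w./-]+\.cpp", makefile_text)
    }
    sources = sorted(p.name for p in (workspace / "code_objects").glob("*.cpp"))
    return [name for name in sources if name not in referenced]
```

In the failing case described above, this would be expected to report spikemonitor_codeobject.cpp, since only the `_1` variant is linked.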

mstimberg commented 4 years ago

Many thanks for the example @TheSalocin, and thanks for the analysis @neworderofjamie. I can reproduce the problem on my machine as well and will have a closer look soon.

> If this is already enough example of how one can use delfi with brian2 I can of course share it on the discourse, otherwise I can upload something a bit more complex.

This is already great, and I would be very happy if you could share it. It gives others a starting point; extending it to different models, more parameters, etc. seems to be rather straightforward. Maybe remove the use of brian2genn for now, though ;)

mstimberg commented 4 years ago

I think I figured it out, I have prepared a fix in the fix_codeobject_names branch. The issue has to do with the names of the "code objects" (the objects representing the code doing the actual computation bits). Most objects have a single code object, taking care of the actual computation. The name of this code object is usually object_name_codeobject, so in the example above the SpikeMonitor would have a code object spikemonitor_codeobject. Now, the generation of these code objects goes through Brian's general code generation framework, which ensures unique names for all objects. When deleting and recreating objects, it is possible that a SpikeMonitor is already deleted (so we can create a new SpikeMonitor of the same name), but its code object is still around (and we therefore append _1 to the new code object's name). The behaviour is non-deterministic since it depends on the invocation of the garbage collector.

All this is not a problem in general, since we don't care about these names; they are just used to name functions and the corresponding source files. The problem occurred because Brian2GeNN's code generation hard-coded the assumption that these objects are called object_name_codeobject, and compilation therefore fails as soon as such an object is called ..._codeobject_1 instead. In my fix, I replaced the hardcoded name with the actual name.
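To make the mechanism concrete, here is a toy model of it in plain Python. It is only an illustration and not Brian's actual implementation: `NameRegistry` and its methods are made up for this sketch, and the garbage collector is replaced by an explicit `release` call so that the order of events is visible.

```python
class NameRegistry:
    """Toy stand-in for a framework that hands out unique object names."""

    def __init__(self):
        self._in_use = set()

    def unique_name(self, base):
        # Append _1, _2, ... until the name is free.
        name, suffix = base, 0
        while name in self._in_use:
            suffix += 1
            name = f"{base}_{suffix}"
        self._in_use.add(name)
        return name

    def release(self, name):
        # Stands in for the moment an object is actually garbage collected.
        self._in_use.discard(name)


registry = NameRegistry()

# Iteration 1: a monitor and its code object are created.
monitor_1 = registry.unique_name("spikemonitor")
code_1 = registry.unique_name(monitor_1 + "_codeobject")

# Between iterations the monitor is collected, but its code object is not yet.
registry.release(monitor_1)

# Iteration 2: the monitor name is free again, the code object name is not.
monitor_2 = registry.unique_name("spikemonitor")
code_2 = registry.unique_name(monitor_2 + "_codeobject")

# What the hard-coded assumption would have written into the build files.
hardcoded_guess = monitor_2 + "_codeobject"

print(monitor_2, code_2, hardcoded_guess)
```

This prints `spikemonitor spikemonitor_codeobject_1 spikemonitor_codeobject`: the new monitor gets its old name back, its code object gets the `_1` suffix, and the name derived from the monitor no longer matches the code object's real name.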

This fixes the issue with the provided example for me, but hopefully @TheSalocin can confirm that it also fixes it for the more complex code. Also I did not run the full test suite yet, so it might actually break something...

TheSalocin commented 4 years ago

Yep, it runs through without problems now. Thank you

brian-team / brian2genn

Building genn executable fails after a certain number of successful runs when parameter fitting #120