Fix repeat runs - Githubissues

kernfel commented 2 years ago

This pull request attempts to fix an issue where the names of StateUpdaters, Thresholders etc. are suffixed with numeric indices, and are therefore not picked up by the hardcoded name checks in process_*_groups.

However, since this fix raises other hidden issues (see below), I can't recommend merging it just yet.

kernfel commented 2 years ago

So, the issue I was trying to address is the following. I have a moderately complex model where, by the inscrutable magic of brian2's naming schemes, in the second run, the NeuronGroup would be named as requested (e.g. neurons), but its subordinate objects (Thresholder, StateUpdater etc.) would have a numeric suffix (e.g. neurons_thresholder_1). Since process_neuron_groups is looking specifically for {NeuronGroup.name}_thresholder, it would not pick these suffixed objects up, and therefore not process them, nor add them to the code passed to GeNN.

So far, so good: I changed all remaining hardcoded names that I could find, reflected in the two commits above. Tests pass, job done, right?

Not so. Now that the models in the first and second run are almost completely identical, it seems that GeNN is not picking up the subtle differences in naming. To wit:

from brian2 import *
import brian2genn
import os
set_device('genn')

for i in range(2):
#    os.system('rm -rf GeNNworkspace')
    if i > 0:
        device.reinit()
        device.activate()

    G = NeuronGroup(1, model='dv/dt = -v/(1*ms) : 1',
                    threshold='rand()>.5', method='exact', name='neurons')
    S = Synapses(G, G, '', on_pre='v += 1')
    S.connect()
    spikemon = SpikeMonitor(G)
    statemon = StateMonitor(G, variables='v', record=[0])
    net = Network(G, S, spikemon, statemon)
    net.run(10*ms)

    print(spikemon.count, statemon.v[0, -1])

Running this code as-is, the first run (usually) passes just fine, but the second one raises during the final model compilation step:

~ ...snip: make output... ~
g++ -std=c++11 -Wno-write-strings -I. -Ibrianlib/randomkit -w -O3 -ffast-math -fno-finite-math-only -march=native -std=c++11  main.cpp  code_objects/statemonitor_1_codeobject.cpp  objects.cpp  code_objects/synapses_1_synapses_create_generator_codeobject.cpp  code_objects/synapses_1_pre_push_spikes.cpp  code_objects/spikemonitor_1_codeobject.cpp  brianlib/randomkit/randomkit.cc -o main -Lmagicnetwork_model_CODE -lrunner -Wl,-rpath magicnetwork_model_CODE 
~ ...snip: warnings... ~
In file included from main.cpp:19:
engine.cpp: In member function ‘void engine::run(double)’:
engine.cpp:45:19: error: ‘vneurongroup_1’ was not declared in this scope; did you mean ‘vneurongroup’?
   45 |       std::copy_n(vneurongroup_1, 1, brian::_array_neurongroup_1_v);
      |                   ^~~~~~~~~~~~~~
      |                   vneurongroup
engine.cpp:51:7: error: ‘pullneurongroup_1CurrentSpikesFromDevice’ was not declared in this scope; did you mean ‘pullneurongroupCurrentSpikesFromDevice’?
   51 |       pullneurongroup_1CurrentSpikesFromDevice();
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |       pullneurongroupCurrentSpikesFromDevice
engine.cpp:52:7: error: ‘pullneurongroup_1StateFromDevice’ was not declared in this scope; did you mean ‘pullneurongroupStateFromDevice’?
   52 |       pullneurongroup_1StateFromDevice();
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |       pullneurongroupStateFromDevice
main.cpp: In function ‘int main(int, char**)’:
main.cpp:73:30: error: ‘rowLengthsynapses_1’ was not declared in this scope; did you mean ‘rowLengthsynapses’?
   73 |                              rowLengthsynapses_1, indsynapses_1, maxRowLengthsynapses_1,
      |                              ^~~~~~~~~~~~~~~~~~~
      |                              rowLengthsynapses
main.cpp:73:51: error: ‘indsynapses_1’ was not declared in this scope; did you mean ‘indsynapses’?
   73 |                              rowLengthsynapses_1, indsynapses_1, maxRowLengthsynapses_1,
      |                                                   ^~~~~~~~~~~~~
      |                                                   indsynapses
main.cpp:73:66: error: ‘maxRowLengthsynapses_1’ was not declared in this scope; did you mean ‘maxRowLengthsynapses’?
   73 |                              rowLengthsynapses_1, indsynapses_1, maxRowLengthsynapses_1,
      |                                                                  ^~~~~~~~~~~~~~~~~~~~~~
      |                                                                  maxRowLengthsynapses
main.cpp:79:49: error: ‘ineurongroup_1’ was not declared in this scope; did you mean ‘ineurongroup’?
   79 |   std::copy_n(brian::_array_neurongroup_1_i, 1, ineurongroup_1);
      |                                                 ^~~~~~~~~~~~~~
      |                                                 ineurongroup
main.cpp:80:49: error: ‘vneurongroup_1’ was not declared in this scope; did you mean ‘vneurongroup’?
   80 |   std::copy_n(brian::_array_neurongroup_1_v, 1, vneurongroup_1);
      |                                                 ^~~~~~~~~~~~~~
      |                                                 vneurongroup
main.cpp:87:7: error: ‘_seedneurongroup_1’ was not declared in this scope; did you mean ‘_seedneurongroup’?
   87 |       _seedneurongroup_1[i]= (uint64_t) (rand()*0x0000FFFFFFFFFFFFLL);
      |       ^~~~~~~~~~~~~~~~~~
      |       _seedneurongroup
code_objects/spikemonitor_1_codeobject.cpp: In function ‘void _run_spikemonitor_1_codeobject()’:
code_objects/spikemonitor_1_codeobject.cpp:108:24: error: ‘spikeCount_neurongroup_1’ was not declared in this scope; did you mean ‘spikeCount_neurongroup’?
  108 |  int32_t _num_events = spikeCount_neurongroup_1;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~
      |                        spikeCount_neurongroup
code_objects/spikemonitor_1_codeobject.cpp:117:23: error: ‘spike_neurongroup_1’ was not declared in this scope; did you mean ‘spike_neurongroup’?
  117 |      const int _idx = spike_neurongroup_1[_j];
      |                       ^~~~~~~~~~~~~~~~~~~
      |                       spike_neurongroup
make: *** [Makefile:10: main] Error 1
ERROR      Brian 2 encountered an unexpected error. If you think this is a bug in Brian 2, please report this issue either to the discourse forum at <http://brian.discourse.group/>, or to the issue tracker at <https://github.com/brian-team/brian2/issues>. Please include this file with debug information in your report: /tmp/brian_debug_e2sgnrl0.log  Additionally, you can also include a copy of the script that was run, available at: /tmp/brian_script_rfh3a_al.py You can also include a copy of the redirected std stream outputs, available at '/tmp/brian_stdout_l9pglhv0.log' and '/tmp/brian_stderr_yv5s018d.log'. Thanks! [brian2]
Traceback (most recent call last):
  File "/home/felix/projects/lib/brian2genn/brian2genn/device.py", line 877, in build
    self.compile_source(debug, directory, use_GPU)
  File "/home/felix/projects/lib/brian2genn/brian2genn/device.py", line 1166, in compile_source
    check_call(["make"], cwd=directory, env=env)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make']' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/felix/projects/ikeda/scratch.py", line 24, in <module>
    net.run(10*ms)
  File "/home/felix/projects/lib/brian2/brian2/core/base.py", line 291, in device_override_decorated_function
    return getattr(curdev, name)(*args, **kwds)
  File "/home/felix/projects/lib/brian2genn/brian2genn/device.py", line 1844, in network_run
    super(GeNNDevice, self).network_run(net=net, duration=duration,
  File "/home/felix/projects/lib/brian2/brian2/devices/cpp_standalone/device.py", line 1502, in network_run
    self.build(direct_call=False, **self.build_options)
  File "/home/felix/projects/lib/brian2genn/brian2genn/device.py", line 879, in build
    raise RuntimeError(('Project compilation failed (Command {cmd} '
RuntimeError: Project compilation failed (Command ['make'] failed with error code 2).
See the output above (if any) for more details.

There seems to be a disconnect between what's provided to GeNN (code objects, magicnetwork_model.cpp etc. with suffixed neurongroup_1 etc.) and what comes out the other end (magicnetwork_model_CODE/* with suffix-less neurongroup). I've found that only the Makefile is actually touched, whereas all other files in the _CODE directory are left untouched in the second iteration. Of course, removing the GeNNworkspace directory fixes this, but that's hardly an ideal workaround.

Since this may well be a GeNN issue, I'm tagging @neworderofjamie here.

Unrelatedly, and without wanting to add to everyone's workload: Is it just me, or is brian2 master + brian2genn master + GeNN 4.6.0 completely broken?

neworderofjamie commented 2 years ago

Nice work @kernfel and definitely an interesting potential bug 😄 Can you try two things:

Does magicnetwork_model.cpp change between runs?
If you hack a -r onto the genn-buildmodel.sh command line, does the problem go away? (this bypasses GeNN's new caching system and forces a model rebuild)

kernfel commented 2 years ago

Does magicnetwork_model.cpp change between runs?

Yes: All instances of synapses are replaced with synapses_1 in the second iteration (as they should, following the code object naming scheme).

If you hack a -r onto the genn-buildmodel.sh command line, does the problem go away? (this bypasses GeNN's new caching system and forces a model rebuild)

Assuming you meant -f: Yes, that rescues it.

neworderofjamie commented 2 years ago

I did indeed mean -f but that is a smoking gun - will fix ASAP

kernfel commented 2 years ago

Ah, the perils of checking further things while writing a bug report. I notice the errors reported in the second post are a little more extensive than what you get with the code I provided. Removing the explicit NeuronGroup name in the code restores the posted error. This does not impinge on the bug, per se. (... although it suggests another solution: Just name all your objects?)

kernfel commented 2 years ago

... which actually works and I wish I'd realised that sooner. A nice catch-22:

Explicitly naming objects breaks repeat runs via brian2genn (fixed in this PR).
Not explicitly naming objects breaks repeat runs via GeNN.
Debugging with a mixture of named and unnamed objects breaks the spirit.

neworderofjamie commented 2 years ago

Can you try https://github.com/genn-team/genn/pull/495?

kernfel commented 2 years ago

Yup, that fixes it. Thank you so much!

neworderofjamie commented 2 years ago

Wonderful!

tnowotny commented 2 years ago

stupid question: Does this mean that if you name your populations no recompile is necessary/done and if you don't name them explicitly Brian 2 will rename and force a GeNN recompile? If this is the case one should recommend explicit naming ...? (apologies if I am having the completely wrong end of the stick)

kernfel commented 2 years ago

You're absolutely right, Thomas. Brian2 respects explicit names, but will number any unnamed items it gets. I don't think this makes a difference on the Brian2 end of the pipeline, but I'd agree that for b2g purposes, it would make sense to recommend explicit names throughout.

Incidentally, the subordinate objects whose automatic naming and numbering gave rise to this PR in the first place are only Brian-side, of course.

kernfel commented 2 years ago

Having used this successfully with Jamie's fix for a while now, and said fix being merged into GeNN master, I believe this is ready for prime time.

mstimberg commented 2 years ago

Hi everyone. Apologies, I did not really follow this issue until now, but I will have a look later today. I'll merge it right away if I have nothing to complain about :blush:

kernfel commented 2 years ago

No apologies necessary! I'm happy to contribute at a moderate pace, no complaints here.

mstimberg commented 2 years ago

Thanks for the PR, this all makes a lot of sense and not relying on the names should be much more robust. Just to give some background on the whole name machinery: as you saw, Brian will respect names that you set manually. It will only complain when you call build/run and you have given identical name to several objects. If you don't give names, it will automatically choose names, appending _1 etc. to make the names unique. The problem is that our current system requires names to be set when we create the object, so the mechanism for the unique naming keeps a list of all existing objects (using "weak references", so this list will not keep the objects alive) and picks suffixes accordingly. Unfortunately, this system has its limits when you run multiple simulations in a loop. E.g. with:

for _ in range(3):
  G = NeuronGroup(...)
  run(...)

The group will be named neurongroup in the first run, but when it creates a new group for the second run, the first one still exists so it has to name the new group neurongroup_1. Right after that line, the old group does not exist anymore, so the name neurongroup becomes available again. As a result, the third run will again name the object neurongroup, and so on. You can get around this with name='neurongroup', but there's no mechanism to manually set the name of sub-objects like Thresholder and so on, so these will still cycle through the names with and without a suffix. This is indeed problematic because it interferes with the caching mechanism, see also brian-team/brian2#680 . In the long run, I think the best solution would be to not require names at object creation time, and only generate automatic names when objects are added to a Network. As a short term solution, manually giving names is indeed the best approach (not only for GeNN's caching, but also for Brian's other modes). We should be able to fix the problem with the sub-objects quite easily: some sub-objects exists only once per object (e.g. a StateUpdater), so we can directly give it a fixed name based on the parent name – if the parent name is unique, then the sub-object name is unique as well. Other sub-object types can have multiple instances (e.g. Thresholder for custom events), but for them, we can make the name unique ourselves, e.g. by including the name of the event in the case of a Thresholder.

mstimberg commented 2 years ago

Thanks for the additional changes, going ahead with the merge :+1:

brian-team / brian2genn

Fix repeat runs #141