INCF / MUSIC

MUSIC, the MUltiSimulation Coordinator
GNU General Public License v3.0
37 stars 37 forks source link

[pymusic] Endless loop in `setup.runtime(..)` #35

Closed mhoff closed 7 years ago

mhoff commented 7 years ago

Hello everyone,

while debugging a more complex experiment using MUSIC I recognized that two of my pymusic-nodes simply never return from calling music_setup.runtime(..), whereas music_setup = music.Setup().

I have been able to reduce the code to a very minimal not working example. When I execute the music configuration attached to this issue via mpirun -np 1 music test.music nothing more than "before runtime" will get printed. The process appears to enter an endless loop when calling runtime on the setup object. In contrast, mpirun -np 1 ./test_node.py works fine.

I'm well aware that this trivial node combined with this trivial configuration might be a special case for MUSIC, but this issue also arises with a more complex setup and nodes which actually consume data via incoming connections.



Minimal example (download: test.zip):

test_node.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import music

music_setup = music.Setup()

print("before runtime")

runtime = music_setup.runtime(0.02)

print("after runtime -- not called")

for current_time in runtime:
    print(current_time)

test.music

[test]
  binary=./test_node.py
  args=
  np=1
uahic commented 7 years ago

At the first glance it looks like that starting MUSIC with a single process and no ports does not work when launched with mpiexec or mpirun. If I start it with ipython only, it works. Ill have a look at it

mdjurfeldt commented 7 years ago

Yes, unfortunately, the new MUSIC scheduler hangs if there is only one application.

This should of course be fixed.

Den 16 jan. 2017 16:19 skrev "Martin Schulze" notifications@github.com:

At the first glance it looks like that starting MUSIC with a single process and no ports does not work when launched with mpiexec or mpirun. If I start it with ipython only, it works. I first need to get Cython debugging working before I can dive into the problem.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-272889347, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCWZnamNjWZRXiVeGgj-ofIF_9Gofks5rS4psgaJpZM4Lkd4V .

uahic commented 7 years ago

@mdjurfeldt Do you know how to enable gdb to debug Cython code?

I found this: http://docs.cython.org/en/latest/src/userguide/debugging.html but there are multiple cythonize() function calls in the pymusic folder setup.py.in and I am not sure where to set the gdb flag

mhoff commented 7 years ago

It seems not to be the special case of only one application. Running two of the above nodes with -np 2 yields the same behaviour for both nodes.

[test1]
  binary=./test_node.py
  args=
  np=1

[test2]
  binary=./test_node.py
  args=
  np=1
uahic commented 7 years ago

@I think its a bug when you dont have any ports connecting these applications or maybe not any ports at all (note that you also have to actually create the ports in your applications)

mhoff commented 7 years ago

Seems likely. However, in my experiment I have a purely consuming node (2 incoming connections) which exhibits the same behaviour. But as it is a big node I can not say for certain that there is no misconfiguration.

mhoff commented 7 years ago

This is really weird. I reduced my implementation to a point at which all configured nodes & connections are working as intended. Then I add another node (pynest) with its three ports. Two of these ports are inputs which I simply connect to pre-existing outputs (which have been proven to be working in the minimal setup). The third port is an event output. I connect this port to another port on a pre-existing node which has not been used until that. In the code of the pre-existing node I do not publish the input. Now, when I run the whole thing, all nodes freeze in runtime(..) without an error message. Next, I add the publish code to make the connection proper and run it again. Now, all nodes run except for the pre-existing node for which I just added a connection.

uahic commented 7 years ago

Eventhough there is a method in the MUSIC interface to check whether a port is connected, NEST does not make use of it and always assumes ports to be connected. I will fix this soon (if it makes sense ... ) as I am refactoring all recording devices in NEST anyways (already started with that). Be prepared 8-)

mhoff commented 7 years ago

Many thanks. But still the bigger issue what happens after setting up the connection properly:

Next, I add the publish code to make the connection proper and run it again. Now, all nodes run except for the pre-existing node for which I just added a connection.

Suddenly, after hooking in another node, a node which has been working before does not return from runtime() anymore. I published and mapped the new event input port the same way I have set up another event input port on the same node which has been working before. Really weird.

mdjurfeldt commented 7 years ago

Can you provide the two examples?

Den 16 jan. 2017 21:30 skrev "Michael Hoff" notifications@github.com:

Many thanks. But still the bigger issue what happens after setting up the connection properly:

Next, I add the publish code to make the connection proper and run it again. Now, all nodes run except for the pre-existing node for which I just added a connection. Suddenly, after hooking in another node, a node which has been working before does not return from runtime() anymore. I published and mapped the new event input port the same way I have set up another event input port on the same node which has been working before. Really weird.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-272955383, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCbzJIdMvFFeUpZF5NN5pGp9_6Fq0ks5rS9MogaJpZM4Lkd4V .

mhoff commented 7 years ago

That would be difficult. The pynest node relies on a custom nest module (which in turn is responsible for creating one of the music ports). Maybe I can reproduce the faulty behavior with a more simple setup.

But before I do that, can you tell me why the above use case with two independent music nodes fails? I totally understand that one single node can be a special case for the music scheduler. And I certainly also understand that having simply two unconnected music nodes is a somehow related scenario in terms of the scheduler. But do both behaviors really emerge from the same origin?

mhoff commented 7 years ago

Is the order of nodes & connections in the music configuration relevant for the scheduler? While trying to produce a showcase for this issue I was moving around nodes and connections inside the configuration file. Now other nodes are not responding anymore. Attached to this comment you find my current configuration file. Everything above # 3 is the reduced version which was working before I changed the order in the configuration file. Now even the part up to # 1 is not working anymore (ros_sensor_adapter does not return from runtime..).


[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=10

[reward_node]
  binary=../../src/compute_node.py
  args=
  np=1

distance_provider.out -> reward_node.distance_in [1]

# 1

[plotter]
  np=1
  binary=../../src/plotter_node.py

reward_node.reward_out -> plotter.reward_in [1]

# 2

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.5
  ros_topic=/dvs/events
  message_type=EventArray
  sensor_update_rate=1

dvs.out -> plotter.pattern_in [200]

# 3

#[nest]
#  binary=../../src/network_node.py
#  args=
#  np=1

#dvs.out -> nest.pattern_in [200]
#nest.activity_out -> plotter.activity_in [20]

# 4

#reward_node.reward_out -> nest.reward_in [1]

# 5

#[decoder]
#  binary=linear_readout_decoder
#  args=
#  np=1
#  music_timestep=0.05
#  tau=0.03
#  weights_filename=activity_to_velocity_translation_weights.dat
#[command]
#  binary=ros_command_adapter
#  args=
#  np=1
#  music_timestep=0.05
#  ros_topic=/cmd_vel
#  message_mapping_filename=velocity_to_twist.dat
#  command_rate=20

#nest.activity_out -> decoder.in [20]
#decoder.out -> command.in [2]
mhoff commented 7 years ago

I have to retreat on my last statement. A little bug (stoptime = ... got missing) caused the weird behaviour with the order appearing to be relevant. The above configuration file is now working up to # 4. If I activate the connection reward_node.reward_out -> nest.reward_in [1], the plotter_node will not return from runtime(..) anymore. This is weird as this node is not participating in this specific connection.

mdjurfeldt commented 7 years ago

Dear Michael,

I will have to investigate this. Unfortunately, I can do this first during the weekend at the earliest.

Obviously, this is not how MUSIC should behave and we will fix it!

Best regards, Mikael

On Wed, Jan 18, 2017 at 2:09 PM, Michael Hoff notifications@github.com wrote:

I have to retreat on my last statement. A little bug (stoptime = ... got missing) caused the weird behaviour with the order appearing to be relevant. The above configuration file is now working up to # 4. If I activate the connection reward_node.reward_out -> nest.reward_in [1], the plotter_node will not return from runtime(..) anymore. This is weird as this node is not participating in this specific connection.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-273470989, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCf68OBRymljWFCL0W6YiAk9X2lsQks5rTg8QgaJpZM4Lkd4V .

mhoff commented 7 years ago

@mdjurfeldt Thank you very much for your effort!

mhoff commented 7 years ago

Maybe this helps in the investigation: When running the minified experiment I experience very inconsistent performance over time, even though all nodes appear to have roughly constant workload. screenshot In the above graphics you can see how the reward (negative proportional to distance) changes based on the monotonic movement of an object through space. The object passes through the center of the world which maximizes the reward and decreases afterwards until being reset to the starting position. The distance_provider is a ROS/MUSIC node which reads the distance from a real-valued ROS topic with frequency 1000 and transfers this into the MUSIC world. The reward_node MUSIC node in place reads the distance and emits the reward. The plotter MUSIC node then gathers this signal and draws it as seen above. As you can see, the reward signal is not as linear as it should be. Looking into the execution logs shows the following for correctly sampled partitions of the signal:

reward: ts = 1.78, last_cycle_time = 8.51154e-05
plotter: ts = 1.68, last_cycle_time = 0.000126839
reward: ts = 1.8, last_cycle_time = 0.000105858
plotter: ts = 1.7, last_cycle_time = 0.000150919
reward: ts = 1.82, last_cycle_time = 9.5129e-05
plotter: ts = 1.72, last_cycle_time = 0.000129938
reward: ts = 1.84, last_cycle_time = 9.98974e-05
plotter: ts = 1.74, last_cycle_time = 0.000133991
reward: ts = 1.86, last_cycle_time = 7.82013e-05
plotter: ts = 1.76, last_cycle_time = 0.000138998
reward: ts = 1.88, last_cycle_time = 0.000101089
plotter: ts = 1.78, last_cycle_time = 0.000138044
reward: ts = 1.9, last_cycle_time = 0.000106096

As both MUSIC nodes share the same timestep this is also the expected output. However, when simulation performance drops and signal sampling gets incorrect you can see the following:

reward: ts = 4.72, last_cycle_time = 3.79086e-05
reward: ts = 4.74, last_cycle_time = 0.000125885
reward: ts = 4.76, last_cycle_time = 3.79086e-05
reward: ts = 4.78, last_cycle_time = 3.91006e-05
reward: ts = 4.8, last_cycle_time = 3.8147e-05
plotter: ts = 4.62, last_cycle_time = 0.000164032
plotter: ts = 4.64, last_cycle_time = 0.000178099
plotter: ts = 4.66, last_cycle_time = 0.000163794
plotter: ts = 4.68, last_cycle_time = 0.000164032
plotter: ts = 4.7, last_cycle_time = 0.000169039
reward: ts = 4.82, last_cycle_time = 3.91006e-05
reward: ts = 4.84, last_cycle_time = 0.000127077
reward: ts = 4.86, last_cycle_time = 3.79086e-05
reward: ts = 4.88, last_cycle_time = 3.8147e-05
reward: ts = 4.9, last_cycle_time = 3.91006e-05
plotter: ts = 4.72, last_cycle_time = 0.000164986
plotter: ts = 4.74, last_cycle_time = 0.00016284
plotter: ts = 4.76, last_cycle_time = 0.000164986
plotter: ts = 4.78, last_cycle_time = 0.000165939
plotter: ts = 4.8, last_cycle_time = 0.000185966
reward: ts = 4.92, last_cycle_time = 3.8147e-05
reward: ts = 4.94, last_cycle_time = 0.000123024
reward: ts = 4.96, last_cycle_time = 3.79086e-05
reward: ts = 4.98, last_cycle_time = 3.88622e-05
reward: ts = 5.0, last_cycle_time = 3.8147e-05
plotter: ts = 4.82, last_cycle_time = 0.000179052
plotter: ts = 4.84, last_cycle_time = 0.000163078
plotter: ts = 4.86, last_cycle_time = 0.000141859
plotter: ts = 4.88, last_cycle_time = 0.0001888

Still, both nodes execute with the same timestep. I know that this might be a console issue gathering the logs from the different processes. But still, this behaviour is clearly reproducable and correlates with the data in the graph. So I presume this could also be a scheduler issue.


Second thing I wanted to add is an error which just occurred for the first time in many many simulation runs. I know this is more related to NEST, but still maybe it helps in some way.

[figipc156:11872] *** Process received signal ***
[figipc156:11872] Signal: Segmentation fault (11)
[figipc156:11872] Signal code: Address not mapped (1)
[figipc156:11872] Failing at address: 0xffffffffe5663870
[figipc156:11872] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7ff0f2d508d0]
[figipc156:11872] [ 1] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest17MusicEventHandlerclEdN5MUSIC11GlobalIndexE+0xc) [0x7ff0ee793f4c]
[figipc156:11872] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC28EventInputSubconnectorGlobal7receiveEPci+0x86) [0x7ff0ecff1736]
[figipc156:11872] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC19EventInputConnector4tickEv+0x1c0) [0x7ff0ecff93d0]
[figipc156:11872] [ 4] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC12UnicommAgent4tickERNS_5ClockE+0x4a) [0x7ff0ed00beda]
[figipc156:11872] [ 5] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC9Scheduler4tickERNS_5ClockE+0x24) [0x7ff0ed012324]
[figipc156:11872] [ 6] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime4tickEv+0x6c) [0x7ff0ecfbd45c]
[figipc156:11872] [ 7] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest12Communicator18advance_music_timeEl+0x25) [0x7ff0ee721765]
[figipc156:11872] [ 8] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x61ba9b) [0x7ff0ee77ea9b]
[figipc156:11872] [ 9] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x3f) [0x7ff0ec2557df]
[figipc156:11872] [10] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler6updateEv+0x1d6) [0x7ff0ee77b8e6]
[figipc156:11872] [11] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler6resumeEv+0x4e) [0x7ff0ee77bd9e]
[figipc156:11872] [12] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN4nest9Scheduler8simulateERKNS_4TimeE+0x301) [0x7ff0ee783151]
[figipc156:11872] [13] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZNK4nest10NestModule16SimulateFunction7executeEP14SLIInterpreter+0x2d3) [0x7ff0ee749543]
[figipc156:11872] [14] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN14SLIInterpreter8execute_Em+0x255) [0x7ff0ee7c0e25]
[figipc156:11872] [15] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(_ZN14SLIInterpreter7executeERKSs+0x135) [0x7ff0ee7c13a5]
[figipc156:11872] [16] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x2e1545) [0x7ff0ee444545]
[figipc156:11872] [17] /home/hoff/.local/lib/python2.7/site-packages/nest/pynestkernel.so(+0x2e1073) [0x7ff0ee444073]
[figipc156:11872] [18] /usr/bin/python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:11872] [19] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [20] /usr/bin/python() [0x4e5fe8]
[figipc156:11872] [21] /usr/bin/python(PyEval_EvalFrameEx+0x28bb) [0x4cc36b]
[figipc156:11872] [22] /usr/bin/python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:11872] [23] /usr/bin/python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1]
[figipc156:11872] [24] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [25] /usr/bin/python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:11872] [26] /usr/bin/python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:11872] [27] /usr/bin/python() [0x5030ef]
[figipc156:11872] [28] /usr/bin/python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:11872] [29] /usr/bin/python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:11872] *** End of error message ***
mhoff commented 7 years ago

If I can help you with anything please don't hesitate to let me know. Also if you have any idea about the potential source of the problem, maybe I can assist in the investigation if I know where to look.

mhoff commented 7 years ago

I found another weird behaviour. When trying to feed nest.reward_in with data of other ports (which by the way does not change the above behaviour) I experienced reproducible SegFaults for publishing unconnected ContOutputs.

Based on the above version which is working without the questionable connection, I add

music_setup.publishContOutput("sim_time_out")

to plotter_node (before runtime()) without connecting this port in the music configuration (see configuration file below). Now, I experience the following when I try to execute the modified setup:

[figipc156:25236] *** Process received signal ***
[figipc156:25236] Signal: Segmentation fault (11)
[figipc156:25236] Signal code: Address not mapped (1)
[figipc156:25236] Failing at address: 0x59
[figipc156:25236] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7fc21e9fb8d0]
[figipc156:25236] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup19maybePostponedSetupEv+0) [0x7fc21caf4850]
[figipc156:25236] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC4PortC1EPNS_5SetupESs+0x46) [0x7fc21cb11926]
[figipc156:25236] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup17publishContOutputESs+0x46) [0x7fc21caf38a6]
[figipc156:25236] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17e14) [0x7fc21d58fe14]
[figipc156:25236] [ 5] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:25236] [ 6] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 7] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 8] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:25236] [ 9] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:25236] [10] python() [0x5030ef]
[figipc156:25236] [11] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:25236] [12] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:25236] [13] python(Py_Main+0x562) [0x4982f2]
[figipc156:25236] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc21dd3fb45]
[figipc156:25236] [15] python() [0x497ca0]
[figipc156:25236] *** End of error message ***

This error appears in publishContOutput and is reliably reproducible. Publishing unconnected ContOutputs works perfectly for trivial test cases, but in this setup it clearly fails.


stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000

[reward_gen]
  binary=python/reward_node.py
  args=
  np=1

distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/camera/dvs/events
  message_type=EventArray
  sensor_update_rate=60

dvs.out -> plotter.pattern_in [200]

[nest]
  binary=python/network_node.py
  args=
  np=1

dvs.out -> nest.pattern_in [200]
nest.activity_out -> plotter.activity_in [20]

[decoder]
  binary=linear_readout_decoder
  args=
  np=1
  music_timestep=0.05
  tau=0.03
  weights_filename=res/activity_to_velocity_translation_weights.dat
[command_gen]
  binary=ros_command_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/cmd_vel
  message_mapping_filename=res/velocity_to_twist.dat
  command_rate=20

nest.activity_out -> decoder.in [20]
decoder.out -> command_gen.in [2]
mdjurfeldt commented 7 years ago

Hi Michael,

I noticed first now that the segfault is in the pthreads library. The current version of MUSIC is not thread aware and should only be called by a single thread. Do you in any way use thread concurrency in your setup?

Den 25 jan 2017 17:42 skrev "Michael Hoff" notifications@github.com:

I found another weird behaviour. When trying to feed nest.reward_in with data of other ports (which by the way does not change the above behaviour) I experienced reproducible SegFaults for publishing unconnected ContOutputs.

Based on the above version which is working without the questionable connection, I add

music_setup.publishContOutput("sim_time_out")

to plotter_node (before runtime()) without connecting this port in the music configuration (see configuration file below). Now, I experience the following when I try to execute the modified setup:

[figipc156:25236] Process received signal [figipc156:25236] Signal: Segmentation fault (11) [figipc156:25236] Signal code: Address not mapped (1) [figipc156:25236] Failing at address: 0x59 [figipc156:25236] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7fc21e9fb8d0] [figipc156:25236] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup19maybePostponedSetupEv+0) [0x7fc21caf4850] [figipc156:25236] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC4PortC1EPNS_5SetupESs+0x46) [0x7fc21cb11926] [figipc156:25236] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC5Setup17publishContOutputESs+0x46) [0x7fc21caf38a6] [figipc156:25236] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17e14) [0x7fc21d58fe14] [figipc156:25236] [ 5] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe] [figipc156:25236] [ 6] python(PyEval_EvalFrameEx+0xae2) [0x4ca592] [figipc156:25236] [ 7] python(PyEval_EvalFrameEx+0xae2) [0x4ca592] [figipc156:25236] [ 8] python(PyEval_EvalFrameEx+0xae2) [0x4ca592] [figipc156:25236] [ 9] python(PyEval_EvalCodeEx+0x411) [0x4c87a1] [figipc156:25236] [10] python() [0x5030ef] [figipc156:25236] [11] python(PyRun_FileExFlags+0x82) [0x4f8c72] [figipc156:25236] [12] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77] [figipc156:25236] [13] python(Py_Main+0x562) [0x4982f2] [figipc156:25236] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc21dd3fb45] [figipc156:25236] [15] python() [0x497ca0] [figipc156:25236] End of error message

This error appears in publishContOutput and is reliably reproducible. Publishing unconnected ContOutputs works perfectly for trivial test cases, but in this setup it clearly fails.

stoptime=100000.0

[distance_provider] binary=ros_sensor_adapter args= np=1 music_timestep=0.01 ros_topic=/distance/to_center message_type=FloatArray sensor_update_rate=1000

[reward_gen] binary=python/reward_node.py args= np=1

distance_provider.out -> reward_gen.distance_in [1]

[plotter] np=1 binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

[dvs] binary=ros_event_sensor_adapter args= np=1 music_timestep=0.01 ros_topic=/camera/dvs/events message_type=EventArray sensor_update_rate=60

dvs.out -> plotter.pattern_in [200]

[nest] binary=python/network_node.py args= np=1

dvs.out -> nest.pattern_in [200] nest.activity_out -> plotter.activity_in [20]

[decoder] binary=linear_readout_decoder args= np=1 music_timestep=0.05 tau=0.03 weights_filename=res/activity_to_velocity_translation_weights.dat [command_gen] binary=ros_command_adapter args= np=1 music_timestep=0.05 ros_topic=/cmd_vel message_mapping_filename=res/velocity_to_twist.dat command_rate=20

nest.activity_out -> decoder.in [20] decoder.out -> command_gen.in [2]

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-275160589, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCaCWfRTRqXBwlE5EEH1EhwMp6-Njks5rV3t5gaJpZM4Lkd4V .

mhoff commented 7 years ago

The nest node uses the standard NEST-inherent thread-based parallelization (e.g., nest.SetKernelStatus({'local_num_threads': 5}). The nodes plotter and reward_gen are standard pymusic nodes without parallelity (except for some MPI communication, which is currently disabled). The ros_music_adapters use may use one thread per node to bridge between MUSIC and ROS.

But I think I have experienced this kind of fault ("Address not mapped" originating from libpthread.so) already months ago, long before I was using the ros_music_adapters at all. Do you have any idea how pthreads could get into my workflow? I have even tested the network_node with local_num_threads being not set, which did not change a thing.


Update: I can reproduce this error with one pymusic and one pynest node or with one pymusic and one ros_music_adapter node. That means it can not be the parallelization of NEST, nor the threading of the ros_music_adapter. But it can also not be the pymusic node, which is not using threading or any other form of parallelity at all.

mdjurfeldt commented 7 years ago

Neither NEST nor the ROS interface should be a problem. Could you give me the versions of your OS, MPI and NEST?

Den 25 jan. 2017 18:21 skrev "Michael Hoff" notifications@github.com:

The nest node uses the standard NEST-inherent thread-based parallelization (e.g., nest.SetKernelStatus({'local_num_threads': 5}). The nodesplotterandreward_gen` are standard pymusic nodes without parallelity (except for some MPI communication, which is currently disabled). The ros_music_adapters use may use one thread per node to bridge between MUSIC and ROS.

But I think I have experienced this kind of fault ("Address not mapped" originating from libpthread.so) already months ago, long before I was using the ros_music_adapters at all. Do you have any idea how pthreads could get into my workflow? I have even tested the network_node with local_num_threads = 0, which did not change a thing.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-275172276, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCY1ngmGobJlFPY0Pp6mLtOXTGvVLks5rV4SUgaJpZM4Lkd4V .

mhoff commented 7 years ago

As of music, I have tested against the current INCF version (8e0a609b29835be604ae556c1592aad9b4be1827) and the current version of the fork by Philipp Weidel (weidel-p/MUSIC@dd96d32670f3156ebbd75b4fd8e6c196c23850e4).

mhoff commented 7 years ago

When building music, ./configure outputs the following:

[...]
checking which MPI system we think we are using... SYSGUESS=openmpi
checking MPI_CXXFLAGS... -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread
checking MPI_CFLAGS... -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread
checking MPI_LDFLAGS... -pthread -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc
[...]

I tried ./configure MPI_CXXFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_CFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_LDFLAGS="-L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc" to get rid of pthreads. After, I installed MUSIC again and rebuilt all ros_music_adapters. Then, I ran the testcase with pymusic+ros_music_adapter. But still, "Address not mapped" in libpthread.so.

ldd `which music` still contains pthread:

        linux-vdso.so.1 (0x00007fff9f9b7000)
        libmusic.so.1 => /home/hoff/.local/lib/libmusic.so.1 (0x00007f8c0f263000)
        libmpi_cxx.so.1 => /usr/lib/libmpi_cxx.so.1 (0x00007f8c0f048000)
        libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007f8c0ecb6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f8c0eab2000)
        libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f8c0e86d000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8c0e562000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8c0e261000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8c0e04b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8c0de2e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8c0da83000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f8c0d87b000)
        libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f8c0d663000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f8c0d460000)
        libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f8c0d256000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8c0f559000)
        libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f8c0d04b000)

Continuing the crusade..

In a clean clone of MUSIC:

$ grep -r pthread . | grep -v ros/

./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);

$ ./autogen.sh && grep -r pthread . | grep -v ros/

[... autogen ...]
./ltmain.sh:       # Do not include libc_r directly, use -pthread flag.
./ltmain.sh:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./ltmain.sh:    -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/Makefile.in:viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);

$ ./configure --prefix=/home/hoff/.local MPI_CXXFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_CFLAGS="-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi" MPI_LDFLAGS="-L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc && grep -r pthread . | grep -v ros/

[... configure ...]
./ltmain.sh:       # Do not include libc_r directly, use -pthread flag.
./ltmain.sh:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./ltmain.sh:    -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./utils/VisualiseNeurons.h:#include <pthread.h>
./utils/VisualiseNeurons.h:  pthread_t tickThreadID_;
./utils/Makefile.am: viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/Makefile:viewevents_LDADD = $(top_builddir)/src/libmusic.la -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc  -lglut -lpthread
./utils/Makefile.in:viewevents_LDADD = $(top_builddir)/src/libmusic.la @MPI_LDFLAGS@  -lglut -lpthread
./utils/VisualiseNeurons.cpp:    pthread_create(&tickThreadID_, NULL, runMusic, &synchFlag_);
./utils/VisualiseNeurons.cpp:    pthread_join(tickThreadID_,&exitStatus);
./libtool:     # Do not include libc_r directly, use -pthread flag.
./libtool:      -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \
./libtool:  -mt|-mthreads|-kthread|-Kthread|-pthread|-pthreads|--thread-safe \

$ make -j 12 && grep -r pthread . | grep -v ros/

Now, many many files and objects contain pthread. See grep result.

After install, $ ldd `which music`:

    linux-vdso.so.1 (0x00007ffe52bed000)
    libmusic.so.1 => /home/hoff/.local/lib/libmusic.so.1 (0x00007f04700fd000)
    libmpi_cxx.so.1 => /usr/lib/libmpi_cxx.so.1 (0x00007f046fee2000)
    libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007f046fb50000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f046f94c000)
    libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f046f707000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f046f3fc000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f046f0fb000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f046eee5000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f046eb3a000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f046e932000)
    libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f046e71a000)
    libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f046e517000)
    libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f046e30d000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f046e0f0000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f04703f3000)
    libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f046dee5000)

I am not able to produce a music binary without containing pthread. Presumably, I misunderstood you and the music binary containing references to pthread is totally fine.

mhoff commented 7 years ago

I found the source of this SegFault problem. I accidentally inserted the code for the test output after runtime was called. That lead to pymusic calling Setup::publishContOutput with a setup object being the null pointer. Maybe pymusic should be aware of that issue and report an error message instead.


I reworked my code to disallow the above behaviour in the future. Now, I run into another kind of faulty behaviour. A very simple configuration:

stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000
[reward_gen]
  binary=python/reward_node.py
  args=
  np=1
distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

The above part with the ros_music_adapter and the reward_gen node works very reliable. With the plotter node I am experimenting right now. When I add an unconnected EventOutput to the plotter node (while setup is still valid..) I experience two different kinds of error messages.

First of all, this is how I added the Output to the code which has been working before 100%:

event_out_proxy = music_setup.publishEventOutput("test_event_out")
print("event_out_proxy connection status {}".format(event_out_proxy.isConnected()))

Connection status is False, because there is no connection in the music configuration, which is correct. Error type 1:

event_out_proxy connection status False
1485433258 | __init__ | WARNING  | Output port sim_time_out is not connected
1485433258 | __init__ | WARNING  | Input port test_cont_in is not connected
[figipc156:03614] *** Process received signal ***
[figipc156:03614] Signal: Segmentation fault (11)
[figipc156:03614] Signal code: Invalid permissions (2)
[figipc156:03614] Failing at address: 0x7f8109db56b8
[figipc156:03614] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f810a6ed8d0]
[figipc156:03614] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a56b8) [0x7f8109db56b8]
[figipc156:03614] *** End of error message ***

Error type 2:

event_out_proxy connection status False
1485433289 | node | INFO     | Dropping to runtime with timestep 0.02...
1485433289 | __init__ | WARNING  | Output port sim_time_out is not connected
1485433289 | __init__ | WARNING  | Input port test_cont_in is not connected
[figipc156:03655] *** Process received signal ***
[figipc156:03655] Signal: Segmentation fault (11)
[figipc156:03655] Signal code: Address not mapped (1)
[figipc156:03655] Failing at address: 0x30
[figipc156:03655] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7efd780768d0]
[figipc156:03655] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC15EventOutputPort10buildTableEv+0x29) [0x7efd7618b2f9]
[figipc156:03655] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime11buildTablesEPNS_5SetupE+0x24) [0x7efd76169334]
[figipc156:03655] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7RuntimeC1EPNS_5SetupEd+0x2cc) [0x7efd76169dbc]
[figipc156:03655] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x1dd1f) [0x7efd76c10d1f]
[figipc156:03655] [ 5] python() [0x4ba865]
[figipc156:03655] [ 6] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17a8c) [0x7efd76c0aa8c]
[figipc156:03655] [ 7] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe]
[figipc156:03655] [ 8] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [ 9] python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1]
[figipc156:03655] [10] python(PyEval_EvalFrameEx+0xae2) [0x4ca592]
[figipc156:03655] [11] python(PyEval_EvalCodeEx+0x411) [0x4c87a1]
[figipc156:03655] [12] python() [0x5030ef]
[figipc156:03655] [13] python(PyRun_FileExFlags+0x82) [0x4f8c72]
[figipc156:03655] [14] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77]
[figipc156:03655] [15] python(Py_Main+0x562) [0x4982f2]
[figipc156:03655] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7efd773bab45]
[figipc156:03655] [17] python() [0x497ca0]
[figipc156:03655] *** End of error message ***

That means, even though the proxy is correctly classified as unconnected something in MUSIC is heavily affected. Note that removing these two lines leads to perfectly correct behaviour again.

mdjurfeldt commented 7 years ago

Thanks, Michael.

This is why it is so good to have a small example replicating the problem. There's no way I could have found the segfault problem had I started to debug it.

You are right that there should be more helpful error messages in MUSIC. This is on the agenda.

Do you have a smaller example replicating the new problem below?

Den 26 jan. 2017 13:25 skrev "Michael Hoff" notifications@github.com:

I found the source of this SegFault problem. I accidentally inserted the code for the test output after runtime was called. That lead to pymusic calling Setup::publishContOutput with a setup object being the null pointer. Maybe pymusic should be aware of that issue and report an error message instead.

I reworked my code to disallow the above behaviour in the future. Now, I run into another kind of faulty behaviour. A very simple configuration:

stoptime=100000.0

[distance_provider] binary=ros_sensor_adapter args= np=1 music_timestep=0.01 ros_topic=/distance/to_center message_type=FloatArray sensor_update_rate=1000 [reward_gen] binary=python/reward_node.py args= np=1 distance_provider.out -> reward_gen.distance_in [1]

[plotter] np=1 binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

The above part with the ros_music_adapter and the reward_gen node works very reliable. With the plotter node I am experimenting right now. When I add an unconnected EventOutput to the plotter node (while setup is still valid..) I experience two different kinds of error messages.

First of all, this is how I added the Output to the code which has been working before 100%:

event_out_proxy = music_setup.publishEventOutput("test_event_out") print("event_out_proxy connection status {}".format(event_out_proxy.isConnected()))

Connection status is False, because there is no connection in the music configuration, which is correct. Error type 1:

event_out_proxy connection status False 1485433258 | init | WARNING | Output port sim_time_out is not connected 1485433258 | init | WARNING | Input port test_cont_in is not connected [figipc156:03614] Process received signal [figipc156:03614] Signal: Segmentation fault (11) [figipc156:03614] Signal code: Invalid permissions (2) [figipc156:03614] Failing at address: 0x7f8109db56b8 [figipc156:03614] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f810a6ed8d0] [figipc156:03614] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a56b8) [0x7f8109db56b8] [figipc156:03614] End of error message

Error type 2:

event_out_proxy connection status False 1485433289 | node | INFO | Dropping to runtime with timestep 0.02... 1485433289 | init | WARNING | Output port sim_time_out is not connected 1485433289 | init | WARNING | Input port test_cont_in is not connected [figipc156:03655] Process received signal [figipc156:03655] Signal: Segmentation fault (11) [figipc156:03655] Signal code: Address not mapped (1) [figipc156:03655] Failing at address: 0x30 [figipc156:03655] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7efd780768d0] [figipc156:03655] [ 1] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC15EventOutputPort10buildTableEv+0x29) [0x7efd7618b2f9] [figipc156:03655] [ 2] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7Runtime11buildTablesEPNS_5SetupE+0x24) [0x7efd76169334] [figipc156:03655] [ 3] /home/hoff/.local/lib/libmusic.so.1(_ZN5MUSIC7RuntimeC1EPNS_5SetupEd+0x2cc) [0x7efd76169dbc] [figipc156:03655] [ 4] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x1dd1f) [0x7efd76c10d1f] [figipc156:03655] [ 5] python() [0x4ba865] [figipc156:03655] [ 6] /home/hoff/.local/lib/python2.7/site-packages/music/pymusic.so(+0x17a8c) [0x7efd76c0aa8c] [figipc156:03655] [ 7] python(PyEval_EvalFrameEx+0x130e) [0x4cadbe] [figipc156:03655] [ 8] python(PyEval_EvalCodeEx+0x411) [0x4c87a1] [figipc156:03655] [ 9] python(PyEval_EvalFrameEx+0x1e11) [0x4cb8c1] [figipc156:03655] [10] python(PyEval_EvalFrameEx+0xae2) [0x4ca592] [figipc156:03655] [11] python(PyEval_EvalCodeEx+0x411) [0x4c87a1] [figipc156:03655] [12] python() [0x5030ef] [figipc156:03655] [13] python(PyRun_FileExFlags+0x82) [0x4f8c72] [figipc156:03655] [14] python(PyRun_SimpleFileExFlags+0x197) [0x4f7d77] [figipc156:03655] [15] python(Py_Main+0x562) [0x4982f2] [figipc156:03655] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7efd773bab45] [figipc156:03655] [17] python() [0x497ca0] [figipc156:03655] End of error message

That means, even though the proxy is correctly classified as unconnected something in MUSIC is heavily affected. Note that removing these two lines leads to perfectly correct behaviour again.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/INCF/MUSIC/issues/35#issuecomment-275376569, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcfCWbsB-Viahq3oJ8llRDHnCeT0fC1ks5rWJC8gaJpZM4Lkd4V .

mhoff commented 7 years ago

I built a very small test case (test.zip). I am actually surprised that this does not work. Maybe I just forgot a parameter or something.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import music

setup = music.Setup()

event_out = setup.publishEventOutput("event_out_1")
print("connected = {}".format(event_out.isConnected()))

times = setup.runtime(0.02)

for time in times:
    print(time)
[node1]
    binary=./node.py
    np=1
    args=

mpirun -np 1 music conf.music

connected = False
[figipc156:05420] *** Process received signal ***
[figipc156:05420] Signal: Segmentation fault (11)
[figipc156:05420] Signal code: Invalid permissions (2)
[figipc156:05420] Failing at address: 0x7f6846669718
[figipc156:05420] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7f6846fa18d0]
[figipc156:05420] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3a5718) [0x7f6846669718]
[figipc156:05420] *** End of error message ***

Note: Yes, again this is only one node, but this time there is a clear and deterministic error. Also, I can reproduce this with two nodes, connected with two continuous ports, plus this unconnected event output. The same behaviour occurs.

mhoff commented 7 years ago

I am getting closer to the original problem regarding the single connection rendering one node unusuable. At the bottom you find my complete setup. The problematic part is:

[...]
reward_gen.reward_out -> plotter.test_cont_in [1]
plotter.test_cont_out -> nest.reward_in [1]
#reward_gen.reward_out -> nest.reward_in [1]
[...]

If I use the two connections, which are currently active, the whole experiment is working. But if I instead use the direct connection, with the real reward data, the plotter node stops after a few timesteps.

In comparison:

htop shows the process is still running and working on 100% CPU.


stoptime=100000.0

[distance_provider]
  binary=ros_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/distance/to_center
  message_type=FloatArray
  sensor_update_rate=1000

[reward_gen]
  binary=python/reward_node.py
  args=
  np=1

distance_provider.out -> reward_gen.distance_in [1]

[plotter]
  np=1
  binary=python/plotter_node.py

reward_gen.reward_out -> plotter.reward_in [1]

[dvs]
  binary=ros_event_sensor_adapter
  args=
  np=1
  music_timestep=0.01
  ros_topic=/camera/dvs/events
  message_type=EventArray
  sensor_update_rate=60

dvs.out -> plotter.pattern_in [200]

[nest]
  binary=python/network_node.py
  args=
  np=1

dvs.out -> nest.pattern_in [200]
nest.activity_out -> plotter.activity_in [20]

# TODO
reward_gen.reward_out -> plotter.test_cont_in [1]
plotter.test_cont_out -> nest.reward_in [1]
#reward_gen.reward_out -> nest.reward_in [1]

[decoder]
  binary=linear_readout_decoder
  args=
  np=1
  music_timestep=0.05
  tau=0.03
  weights_filename=res/activity_to_velocity_translation_weights.dat
[command_gen]
  binary=ros_command_adapter
  args=
  np=1
  music_timestep=0.05
  ros_topic=/cmd_vel
  message_mapping_filename=res/velocity_to_twist.dat
  command_rate=20

nest.activity_out -> decoder.in [20]
decoder.out -> command_gen.in [2]
mhoff commented 7 years ago

This shows the blocking situation with the direct connection between nest and reward_gen.

music_connection_hang_1

mhoff commented 7 years ago

We could track down the problem to a very minimal working example. On some architectures this produces a significant simulation time offset between two nodes with a unidirectional connection, whereas the connection is configured to have nor latency or buffering. Other architectures might be able to execute this minimal example, but fail when are more complex setup is used. (I can provide such an example on request).

@mdjurfeldt suggests --disable-isend when configuring the build to circumvent this faulty behaviour. I can confirm that this eliminates the issue. Just dropping this here to document this.

mdjurfeldt commented 7 years ago

Since the problem @mhoff had is resolved by a workaround, I have condensed the MUSIC-related problem in the new issue #37 and am closing this issue.