eclipse-zenoh / zenoh-pico

Eclipse zenoh for pico devices

[Bug] Failed to obtain RX buffer #595

Open alexzeit opened 4 weeks ago

alexzeit commented 4 weeks ago

Describe the bug

When the MCU running zenoh-pico is restarted, I sporadically get "<err> eth_stm32_hal: Failed to obtain RX buffer" and the subscriber stops receiving messages. This happens only if a publisher (e.g. Python) is already running, i.e. it was started before the MCU with zenoh-pico. An additional delay (e.g. 10 s) before z_declare_subscriber does not help. Peer mode is used.

To reproduce

  1. configure a Python publisher and subscriber in peer mode with interval=2ms
  2. configure a zenoh-pico subscriber and publisher in peer mode with interval=1ms
  3. start python publisher and subscriber
  4. start zenoh-pico publisher and subscriber
  5. restart mcu with zenoh-pico

System info

jean-roland commented 3 weeks ago

Thanks for the report @alexzeit, will try to reproduce on my F767ZI.

jean-roland commented 2 weeks ago

Just to clarify: you have a Python publisher sending a message every 2 ms to a pico subscriber, and a pico publisher sending a message every 1 ms to a Python subscriber, all in peer mode. Do you have two boards, or are the publisher and subscriber on the same Nucleo?

Actually, it would probably be easier if you could send me the project files you used.

alexzeit commented 2 weeks ago

Hi Jean-Roland, yes, but I have observed the same behaviour with a C++ pub/sub at 1 ms in peer mode. We have one board, where the publisher and subscriber run in separate threads of Zephyr RTOS.

jean-roland commented 2 weeks ago

Alright, so it seems the error message is produced by Zephyr when it runs out of RX buffers to store incoming messages.

My guess is that it breaks the connection, and since we do not yet have connectivity event support (see Issue #333), the only option is to restart the node.

Alternatively, you can try increasing the number of RX buffers, which should reduce the occurrence rate. See https://docs.zephyrproject.org/2.7.5/reference/kconfig/CONFIG_NET_BUF_RX_COUNT.html and https://docs.zephyrproject.org/2.7.5/reference/kconfig/CONFIG_NET_PKT_RX_COUNT.html

That also means pico has a hard time keeping up with this message rate, and as we discussed before we're going to look into performance after the 1.0 release.
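
For reference, the two Kconfig options linked above are normally set in the application's prj.conf. The fragment below is only a sketch: the values are placeholders chosen for illustration, not settings tested or recommended in this thread, and suitable numbers depend on available RAM and traffic.

```conf
# prj.conf (illustrative values, assumptions rather than tested settings)
# Number of network packet descriptors available for reception
CONFIG_NET_PKT_RX_COUNT=32
# Number of network data buffers available for reception
CONFIG_NET_BUF_RX_COUNT=64
```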

alexzeit commented 1 week ago

Yes, the message seems to come from Zephyr, but it is caused by the zenoh core. I think the issue is that zenoh starts the Ethernet receiver, but it takes time until it starts consuming bytes from the Ethernet RX buffer. In the other case, where the Python publisher is not running during zenoh startup, the issue does not happen. I have tried increasing the RX buffers, but this did not solve the problem.
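
The hypothesis above can be illustrated with a toy model in C. This is not Zephyr's eth_stm32_hal driver and not zenoh-pico code. The function name simulate_drops and all of its parameters are hypothetical, and the model assumes a fixed pool of RX buffers, frames arriving at a fixed interval, and a consumer that only starts draining after a startup delay. It only shows how frames can be dropped before the consumer is running; it does not explain why the subscriber stays silent afterwards.

```c
/*
 * Toy model (assumption, not Zephyr or zenoh-pico code): count frames that
 * are dropped because no RX buffer is free.
 *
 * pool_size         number of RX buffers in the pool
 * arrival_ms        a frame arrives every arrival_ms milliseconds
 * consumer_start_ms time at which the consumer starts draining buffers
 * service_ms        once running, the consumer frees one buffer every service_ms
 * duration_ms       total simulated time, in 1 ms ticks
 *
 * Returns the number of dropped frames, or -1 on invalid parameters.
 */
int simulate_drops(int pool_size, int arrival_ms, int consumer_start_ms,
                   int service_ms, int duration_ms) {
    if (pool_size < 0 || arrival_ms <= 0 || service_ms <= 0) {
        return -1;
    }

    int in_use = 0;   /* buffers currently holding an unread frame */
    int dropped = 0;  /* frames that found the pool exhausted */

    for (int t = 0; t < duration_ms; t++) {
        /* The consumer releases one buffer per service interval once started. */
        if (t >= consumer_start_ms &&
            (t - consumer_start_ms) % service_ms == 0 && in_use > 0) {
            in_use--;
        }
        /* An arriving frame needs a free buffer, otherwise it is dropped. */
        if (t % arrival_ms == 0) {
            if (in_use < pool_size) {
                in_use++;
            } else {
                dropped++;
            }
        }
    }
    return dropped;
}
```

In this model, simulate_drops(4, 2, 0, 1, 100) returns 0 because the consumer is already running when the first frame arrives, while simulate_drops(4, 2, 20, 1, 100) returns 6 because the pool fills during the 20 ms startup gap. Doubling the pool to 8 reduces the drops to 2 but does not remove them, since the number of frames arriving during the gap still exceeds the pool size.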

jean-roland commented 1 week ago

So I tried reproducing the issue on my board with a pub/sub at a 1 ms interval, without success (or failure?). Is it possible for you to send me the files you used for the board and PC?