eclipse-zenoh / zenoh-pico

Eclipse zenoh for pico devices

[Bug] _zn_read_exact_tcp in network.c might block indefinitely #95

Open fcgdam opened 2 years ago

fcgdam commented 2 years ago

Describe the bug

I'm doing some tests with zenoh-pico and Zephyr, using OpenThread as the network transport.

With OpenThread on Zephyr, the nodes use IPv6 only.

Anyway, after some debugging the case is simple:

The zenoh daemon is running on a machine on a local network. The OpenThread node can access that machine.

    ./zenohd --version
    [2022-07-13T14:18:29Z INFO zenohd] zenohd v0.6.0-dev-283-g12bc4744 built with rustc 1.61.0 (Arch Linux rust 1:1.61.0-1)
    The zenoh router v0.6.0-dev-283-g12bc4744 built with rustc 1.61.0 (Arch Linux rust 1:1.61.0-1)

Zephyr version is 2.7.1 with the associated OpenThread package.

When the OpenThread node connects to zenohd, it is able to reach it, and I think zenohd responds since I see activity in its logs. However, if the node receives anything, it is unable to read it.

This means the read times out at https://github.com/eclipse-zenoh/zenoh-pico/blob/446d1ec94ccb5bb9500efdae2d44b937895b1d8c/src/system/zephyr/network.c#L104 and returns 0 bytes read.

Because of this, the calling function at https://github.com/eclipse-zenoh/zenoh-pico/blob/446d1ec94ccb5bb9500efdae2d44b937895b1d8c/src/system/zephyr/network.c#L107 enters an infinite loop and hangs the microcontroller: the return value is zero, and the read never returns any other value that would satisfy the exit condition of the do {} loop.

From the zenohd machine, the OpenThread node running the code is reachable by ping, and any services running on the node can access the network and receive responses.

This means there might also be an issue with the Zephyr/OpenThread/zenoh-pico combination, since it doesn't work (at least for me), but first I just wanted to point out what might be an issue in this code section.

I've added the following lines to _zn_read_exact_tcp:

    rb = _zn_read_tcp(sock, ptr, n);
    // Exit if read returns 0 bytes and there is still data to read. Temporary fix.
    if ( (n!=0) && (rb==0)) // read failure
        return 0;

This allows the failure to propagate "up" so that the connection failure is reported to the main code.
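The guard described above can be sketched in isolation as follows. This is a minimal standalone illustration, not zenoh-pico code: `read_fn`, `read_exact`, `chunk_src`, and `chunk_read` are hypothetical names, and the reader callback is assumed to return 0 when a read times out or the peer closes the connection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader callback: returns the number of bytes read,
 * 0 when nothing could be read (assumed: timeout or closed peer),
 * or a negative value on error. */
typedef int (*read_fn)(void *ctx, uint8_t *buf, size_t len);

/* Read exactly len bytes. Gives up as soon as a read makes no progress,
 * so a reader that keeps returning 0 cannot trap the caller in a loop. */
int read_exact(read_fn rd, void *ctx, uint8_t *buf, size_t len)
{
    assert(rd != NULL && buf != NULL);

    size_t total = 0;
    while (total < len) {
        int rb = rd(ctx, buf + total, len - total);
        if (rb <= 0) {
            /* No progress: hand the failure back to the caller. */
            return rb;
        }
        total += (size_t)rb;
    }
    return (int)total;
}

/* Stand-in data source for experiments: hands out data in small chunks,
 * then returns 0 once it runs dry (simulating a read that times out). */
struct chunk_src {
    const uint8_t *data;
    size_t left;
    size_t chunk;
};

static int chunk_read(void *ctx, uint8_t *buf, size_t len)
{
    struct chunk_src *s = ctx;
    size_t n = s->left < s->chunk ? s->left : s->chunk;
    if (n > len) {
        n = len;
    }
    memcpy(buf, s->data, n);
    s->data += n;
    s->left -= n;
    return (int)n;
}
```

With a source that holds fewer bytes than requested, `read_exact` returns 0 instead of spinning, which is the behaviour the temporary fix is after.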

From the zenohd side I get:

    [2022-07-13T15:10:12Z DEBUG zenoh_link_tcp::unicast] Accepted TCP connection on [fd11:1111:1122:2222:c2d8:7ba9:9e6d:5d3f]:7447: [fdbf:fcc1:5c19:1:43bc:bf3f:5a83:9d2]:50602
    [2022-07-13T15:10:22Z DEBUG zenoh_transport::unicast::manager] future has timed out

where the fdbf:.. address is the node address.

To reproduce

A bit difficult, since it requires setting up an OpenThread network and a border router, and having a node available. I'll try to help in whatever way I can.

System info

cguimaraes commented 2 years ago

We have successfully tested a similar setup in the past (i.e., Zephyr + OpenThread + zenoh-pico) and, if I recall correctly, we have some users that also tried it without issues.

In any case, from the partial log you provided, it seems that the INIT/OPEN procedure used to establish a Zenoh session is stalling. The client node (the board in your setup, I believe) is not receiving the INIT Ack from zenohd, and so fails to open the session due to a timeout. In fact, _zn_read_exact_tcp / _zn_read_tcp do time out, allowing the code to report the connection failure to the upper layers. The code will keep looping only if timeouts are explicitly disabled. What happens upon the timeout depends on which state the Zenoh session is in.
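For readers unfamiliar with socket receive timeouts, the general mechanism can be sketched with plain POSIX sockets. This is an illustration only and makes no claim about how zenoh-pico's Zephyr port configures or interprets timeouts: `set_recv_timeout_ms` and `recv_timed_out` are hypothetical helper names. On POSIX systems, a timeout of zero means recv() never times out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Set a receive timeout, in milliseconds, on a socket.
 * Returns 0 on success and -1 on error (errno is set by setsockopt).
 * A value of 0 disables the timeout, so recv() may block indefinitely. */
int set_recv_timeout_ms(int fd, long ms)
{
    assert(ms >= 0);

    struct timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Returns 1 if a recv() result indicates that the receive timeout expired.
 * rc is the return value of recv(), err the errno captured right after it. */
int recv_timed_out(ssize_t rc, int err)
{
    return rc < 0 && (err == EAGAIN || err == EWOULDBLOCK);
}
```

A caller would apply `set_recv_timeout_ms` once after creating the socket, then check `recv_timed_out(rc, errno)` after each recv() to tell an expired timeout apart from other failures.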

If you can provide us with the configuration files for each Zephyr and Linux node, as well as a subset of failing code, we can try to replicate it with the hardware we own.

fcgdam commented 2 years ago

Hi:

Since you might have the necessary hardware, I've uploaded the sample repository that is failing to work at https://github.com/fcgdam/ot_zenoh.

This way, all the necessary dependencies, files, and code are available.