[Bug] Subscriber-API does not return a failure, but subscriber is not working

eeas-joas commented 1 year ago

Describe the bug

We have zenoh-pico on STM32H7 with Zephyr. The Agent runs on a different system (Zenoh-ROS2-Bridge).

When the STM32H7-Board is reset (without resetting the agent!), the Zenoh-communication is established, following the examples (z_open, z_declare_subscriber, z_declare_publisher, ..).

All the Zenoh-API-calls do not return an error, but the following situations can occur:

subscribers, publishers works as expected
subscribers do not work
subscribers and publishers do not work and tasks with publisher-functions are frozen (assuming, that Zenoh-API-calls do not return)

A parallel started Zenoh-client on a Linux-system (Zenoh-python) works properly, so we assume, the failure is not coming from Zenoh-Agent.

To reproduce

open zenoh
declare subscribers
declare publishers
restart system, do not restart Zenoh-Agent in between

System info

STM32H723ZG
Zephyr via Platform.io
zenoh-pico 0.7.0

cguimaraes commented 1 year ago

I don't have exactly that board but I will be testing with both a STM32F429ZI and STM32F767ZI, since I am not being able to replicate your issue.

In the meantime, I am doing some tests in a Unix environment against a Zenoh router and it seems to be working correctly. Here is a snippet of the code:

void data_handler(const z_sample_t *sample, void *ctx) {
    (void)(ctx);
    char *keystr = z_keyexpr_to_string(sample->keyexpr);
    printf(">> [Subscriber] Received ('%s': '%.*s')\n", keystr, (int)sample->payload.len, sample->payload.start);
    z_drop(z_move(keystr));
}

int main(int argc, char **argv) {
    const char *keyexpr_sub = "demo/example/b";
    const char *keyexpr_pub = "demo/example/a";
    const char *value_2 = "pub-from-pico";
    const char *mode = "client";
    char *locator = NULL;

    z_owned_config_t config = z_config_default();
    zp_config_insert(z_loan(config), Z_CONFIG_MODE_KEY, z_string_make(mode));
    if (locator != NULL) {
        zp_config_insert(z_loan(config), Z_CONFIG_PEER_KEY, z_string_make(locator));
    }

    printf("Opening session...\n");
    z_owned_session_t s = z_open(z_move(config));
    if (!z_check(s)) {
        printf("Unable to open session!\n");
        return -1;
    }

    // Start read and lease tasks for zenoh-pico
    if (zp_start_read_task(z_loan(s), NULL) < 0 || zp_start_lease_task(z_loan(s), NULL) < 0) {
        printf("Unable to start read and lease tasks");
        return -1;
    }

    z_owned_closure_sample_t callback = z_closure(data_handler);
    printf("Declaring Subscriber on '%s'...\n", keyexpr_sub);
    z_owned_subscriber_t sub = z_declare_subscriber(z_loan(s), z_keyexpr(keyexpr_sub), z_move(callback), NULL);
    if (!z_check(sub)) {
        printf("Unable to declare subscriber.\n");
        return -1;
    }

    printf("Declaring publisher for '%s'...\n", keyexpr_pub);
    z_owned_publisher_t pub = z_declare_publisher(z_loan(s), z_keyexpr(keyexpr_pub), NULL);
    if (!z_check(pub)) {
        printf("Unable to declare publisher for key expression!\n");
        return -1;
    }

    char *buf = (char *)malloc(256);
    for (int idx = 0; 1; ++idx) {
        sleep(1);
        snprintf(buf, 256, "[%4d] %s", idx, value_2);
        printf("Putting Data ('%s': '%s')...\n", keyexpr_pub, buf);

        z_publisher_put_options_t options = z_publisher_put_options_default();
        options.encoding = z_encoding(Z_ENCODING_PREFIX_TEXT_PLAIN, NULL);
        z_publisher_put(z_loan(pub), (const uint8_t *)buf, strlen(buf), &options);
    }

    printf("Enter 'q' to quit...\n");
    char c = '\0';
    while (c != 'q') {
        fflush(stdin);
        scanf("%c", &c);
    }

    z_undeclare_subscriber(z_move(sub));

    // Stop read and lease tasks for zenoh-pico
    zp_stop_read_task(z_loan(s));
    zp_stop_lease_task(z_loan(s));

    z_close(z_move(s));

    return 0;
}

Note that with Zenoh-Pico you need to explicitly launch the read and lease tasks (multi-thread) or to spin at your own pace (single-thread). I am assuming that you are doing it since you mention that subscribers sometimes they work.

If you can share a snippet of your code and/or logs (both from your application using Zenoh-Pico and from the Zenoh-ROS2-Bridge), it would be a great help to try to identify the issue.

cguimaraes commented 1 year ago

I might have an idea of what is happening on your scenario. TL;DR: your device is getting out of memory. I will have a further check on why the session is not being closed due to out of memory errors.

The current version of Zenoh protocol needs to be extended with additional capability negotiations during the session establishment to adapt the communication according to each other capabilities. Several improvements in this respect will come with an improved version of the protocol (expected to Q2 2023 according to the public roadmap). This will be especially critical to address the resource constrained capabilities of the microcontrollers.

Until then, there are a couple of things you can do as a workaround to your issue.

Decrease the default size for buffers. Usually, this is enough in most of the cases:

/**
 * Defaulf maximum batch size possible to be received.
 */
#ifndef Z_BATCH_SIZE_RX
#define Z_BATCH_SIZE_RX \
    65535  // Warning: changing this value can break the communication
           //          with zenohd in the current protocol version.
           //          In the future, it will be possible to negotiate such value.
           // Change it at your own risk.
#endif

/**
 * Defaulf maximum batch size possible to be sent.
 */
#ifndef Z_BATCH_SIZE_TX
#define Z_BATCH_SIZE_TX 65535
#endif

/**
 * Defaulf maximum size for fragmented messages.
 */
#ifndef Z_FRAG_MAX_SIZE
#define Z_FRAG_MAX_SIZE 300000
#endif

Increase the value of CONFIG_HEAP_MEM_POOL_SIZE in zephyr/prj.conf. This is the memory pool used by Zenoh for its dynamic allocations.

Let us know if these workarounds were able to solve your problem.

eeas-joas commented 1 year ago

Hi Carlos, thanks for your response and sorry for the late reply. But we have made additional tests.

I don't think it's linked to memory/heap. We were already struggling with mem and reduced the Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 some time ago. We were also facing an issue with a missing free/z_free, but this is already fixed.

We have the following situation:

STM32H7 with zenoh-pico
Jetson TX2 with Zenoh-DDS-Bridge
5 subscribers on STM32H7
10 publishers on STM32H7 (100ms, 1000ms-task) After restart of STM32H7 the subscriber is not working (eg. 5 times out of 10 restarts), sometimes the publisher-functions are not returning and the tasks hangs

After replacing the TX2 with another system (BB-AI-64) , we run only the Zenoh-agent on BB-AI-64 via zenoh-python:

STM32H7 with zenoh-pico
BB-AI-64 with zenoh-agent
5 publishers on BB-AI-64
5 subscribers on STM32H7
10 publishers on STM32H7 In this environment, the communication runs perfect.

When I use the BB-AI-64 as a subscriber of TX2-zenoh-DDS-bridge, it works perfect.

So in other words:

STM32H7 with zenoh-pico <=> TX2 with zenoh-DDS-Bridge => we are facing issues
STM32H7 with zenoh-pico <=> BB-AI-64 with zenoh-Agent => works
BB-AI64 with zenoh-python <=> TX2 with zenoh-DDS-Bridge => works

We will check this week the TX2-side (check of zenoh-DDS-bridge-version, replace combined bridge by zenoh-agent and separate DDS-bridge).

cguimaraes commented 1 year ago

Thanks for the detailed description of your experiments. It gives us a bit more insights on what might be happening. I am still inclined for a out of memory issue, but now wondering if it might be in a different place.

We had a similar issue while using the Zenoh-DDS-Bridge with a Hussarion robot (https://husarion.com/), where the amount of resources being subscribed/published allied with the fact the Zenoh router was forwarding all the resource declarations was enough to exhaust the memory in our Zenoh-Pico nodes. This was already signaled to the Zenoh team, and we are looking on how to make it more resource constrained friendly.

There are two things we can try to do to debug it: 1) You can do a tcpdump or wireshark capture in your Zenoh-DDS-bridge node and we can try to manually assess the memory overhead (in case you are not following, we just officially made available a Wireshark dissector for Zenoh: https://zenoh.io/blog/2023-01-17-zenoh-wireshark/) 2) I can point you to some parts of the code that you can comment just to see if the problem persist. The issue is that Zenoh communication will not properly work with these lines comment...but it will be handy to find or confirm the root cause of this problem.

From here, we can see what actions to take. In the meantime, I am adding some extra memory checks to trigger an out of memory error / warning in case it happens.

eeas-joas commented 1 year ago

Hello @cguimaraes, thanks a lot for your remarks, especially the "forwarding all the resource declarations was enough to exhaust"

We reduced Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 bytes..

Is it possible that this buffers are exhausted during communication-initialization-phase? .. and there is a difference between init-phase of Zenoh-router vs. Zenoh-DDS-bridge (which is plausible when I follow your statement)?

Because I give it a try and increased buffers Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 2048 bytes and now it seems to work..

I will test more in detail.

In any case: your help was really useful. Thank you very much!

cguimaraes commented 1 year ago

You insights are also very useful and I will dig on this. And discuss with the team on what can be done in the short term aligned with the router implementation to mitigate this issue.

Reducing the Z_BATCH_SIZE_RX might have undefined behavior because we cannot control the size of the batch on the router side (it will be possible to negotiate this in the next version of the protocol). But as of today we need to define it carefully...the same does not happen on the Z_BATCH_SIZE_TX because you might have a way to estimate how much your Zenoh-Pico application will required to send its payload data (or to do fragmentation otherwise).

This being said, Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX can be defined with different values. My advise then is that you reduce the TX to a value slightly bigger than your biggest payload data. And what you save give it to the RX.

eeas-joas commented 1 year ago

Thanks for your remarks.

I made several additional tests with a lot of STM32H7-startups and the software works now stable. Next step will be the replacement of Zephyr by freeRTOS and lwip. But this is another story.

cguimaraes commented 1 year ago

Good to hear that your setup is not stable. I will keep this issue open until it is fully fixed by the upcoming Zenoh Protocol.

Regarding the FreeRTOS port, we have an open request here: https://github.com/eclipse-zenoh/zenoh-pico/issues/129 Feel free to include any additional requirements you might have on your side.

cguimaraes commented 1 year ago

This will be addressed by when https://github.com/eclipse-zenoh/roadmap/issues/2 .

cguimaraes commented 1 year ago

@p-avital can you have a look at this issue please? It was a pending a confirmation after the latest protocol merge

eclipse-zenoh / zenoh-pico