eclipse-zenoh / zenoh-pico

Eclipse zenoh for pico devices

[Bug] 40ms latency spikes every few seconds #566

Closed alexzeit closed 4 weeks ago

alexzeit commented 1 month ago

Describe the bug

I publish a double value periodically every 10 ms, but on the subscriber side the time difference jumps to about 40 ms once every few seconds. Normally it is 10 ms, matching the rate of the zenoh-pico publisher.

Tested with a Nucleo-144 H753ZI board, Zephyr RTOS 3.7-rc3.

The subscriber runs on an NXP RDB3 board with real-time Ubuntu, on an isolated core with RT priority RR 10.

Compared with a plain UDP socket between the Nucleo and the NXP, no issues were found: the communication is stable for TX periods down to 1 ms, with a latency of 740 µs.

A try with the real-time priority option did not make any improvement:

z_publisher_options_t opts = z_publisher_options_default();
opts.priority = Z_PRIORITY_REAL_TIME;
pub = z_declare_publisher(z_loan(s), z_keyexpr(PUB_KEYEXPR), &opts);

To reproduce

1. Start zenohd.
2. Start the subscriber on Linux, e.g. z_sub.py.
3. Start the zenoh-pico publisher on the uC board.

System info

Zephyr 3.7-rc3 on Nucleo H753ZI:

z_publisher_put(z_loan(pub), (const uint8_t *)&double_value, sizeof(double), NULL);

Ubuntu 22.04 with RT patch and CPU shielding:

taskset -c 2-7 chrt -r 10 python z_sub.py

sub = session.declare_subscriber(key, listener, reliability=Reliability.RELIABLE())
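
For context, the publish loop on the Nucleo side is roughly as follows (a simplified sketch following the 0.11-style zenoh-pico API used in the snippets above; the key expression value, session setup and error handling are placeholders rather than the exact application code):

```c
#include <zephyr/kernel.h>
#include <zenoh-pico.h>

#define PUB_KEYEXPR "demo/example/double"  /* placeholder key expression */

void publisher_loop(void)
{
    /* Open the session (mode/locator are set via zp_config_insert in the real code). */
    z_owned_config_t config = z_config_default();
    z_owned_session_t s = z_open(z_move(config));

    /* zenoh-pico needs its background read and lease tasks started explicitly. */
    zp_start_read_task(z_loan(s), NULL);
    zp_start_lease_task(z_loan(s), NULL);

    z_owned_publisher_t pub = z_declare_publisher(z_loan(s), z_keyexpr(PUB_KEYEXPR), NULL);

    double double_value = 0.0;
    while (1) {
        z_publisher_put(z_loan(pub), (const uint8_t *)&double_value, sizeof(double), NULL);
        double_value += 0.1;
        k_msleep(10);  /* 10 ms publishing period */
    }
}
```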
sashacmc commented 1 month ago

Hi @alexzeit, thanks for the bug report, we will take a look.

Mallets commented 1 month ago

Hi @alexzeit , can you please confirm which Zenoh version you are using?

Please note that opts.priority = Z_PRIORITY_REAL_TIME; does nothing on pico itself, since zenoh-pico does not support QoS enforcement. QoS is enforced on the next router if QoS is enabled there (which it is by default). In other words, opts.priority is a marker for network prioritization that is carried over every hop, and a proper scheduling strategy is selected where available.

Mallets commented 1 month ago

Moreover, have you tried with a Rust subscriber instead of a Python one? Do you observe the same latency spikes? This would help us narrow down the potential issue.

alexzeit commented 1 month ago

Hi Mallets, I tried the C++ wrapper and see the same behaviour with latency spikes. I have never programmed in Rust, sorry. I have traced the Nucleo; z_publisher_put is triggered every 10 ms.

Mallets commented 1 month ago

OK, if you also see it with C++ we can already exclude the Python interpreter/garbage collector from the picture :)

alexzeit commented 1 month ago

I have upgraded from 0.11.0 to 1.0.0.6 and nothing changed; it shows the same behaviour. In addition, I have tested the publisher with a 1 ms sleep: the number of dt spikes increased to 40 spikes per second, but the max dt dropped to 19 ms. It looks like one sample is received with a big delay of 19 ms, and then the next 20 samples are received with a very small delay of about 20 µs.

I have also played with the priority of the thread from which the publisher is triggered, but that resulted in no change.

Could you explain how zenoh-pico handles publishing? Does it run in a separate thread? If yes, how is the priority of this thread set in Zephyr?

jean-roland commented 1 month ago

Hello alexzeit,

First, could you confirm whether you're running in client mode with a router or in peer mode? In client mode, the spikes might come from the router.

Publishing does not run in a separate thread; the message should be sent as soon as your application calls the function (as soon as the encoding is done, that is).
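
If you want to double-check this on the Zephyr side: since the put runs in the calling thread, you can time the call itself. A rough sketch using standard Zephyr timing APIs (variable names taken from your snippet); just a quick check, since printing on every call will itself add jitter:

```c
/* Measure how long the synchronous publish call takes in the caller's thread. */
int64_t t0 = k_uptime_ticks();
z_publisher_put(z_loan(pub), (const uint8_t *)&double_value, sizeof(double), NULL);
int64_t t1 = k_uptime_ticks();
printk("z_publisher_put took %llu us\n", k_ticks_to_us_floor64((uint64_t)(t1 - t0)));
```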

Do you see the spikes in a Wireshark capture as well?

alexzeit commented 1 month ago

Hi Jean, it is in client mode, using the router zenohd v1.0.0-alpha.6 built with rustc 1.75.0 (82e1608df 2023-12-21). My next test will be peer mode. No, I have not checked it with Wireshark yet; I have to check this.

alexzeit commented 1 month ago

Hi Jean. How can I activate peer mode? The zenoh-pico publisher is running on the Nucleo board with IP 192.168.88.112, and the Python subscriber on the NXP with IP 192.168.88.107.

So I have set the following in the Zephyr prj.conf:

CONFIG_NET_CONFIG_MY_IPV4_GW="192.168.88.1"
CONFIG_NET_CONFIG_PEER_IPV4_ADDR="192.168.88.107"

and changed #define CLIENT_OR_PEER 1 in main.c

On the NXP I started the subscriber with python z_sub.py -m peer -e tcp/192.168.88.112:7447

but I see no connection.

jean-roland commented 1 month ago

You need to use a UDP multicast group for peer-to-peer. In the Zephyr example you should set CLIENT_OR_PEER to 1, and for your Python subscriber:

z_sub.py -m peer -l udp/224.0.0.225:7447#iface=<the interface name>
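
On the zenoh-pico side, the Zephyr example ends up configuring the session roughly like this (a sketch from memory of the 0.11-style example, so details may differ; the multicast locator and an optional #iface= fragment depend on your network setup):

```c
/* CLIENT_OR_PEER selects the session mode in the example. */
#if CLIENT_OR_PEER == 0
#define MODE "client"
#define LOCATOR ""                      /* empty: scout for a router */
#else
#define MODE "peer"
#define LOCATOR "udp/224.0.0.225:7447"  /* UDP multicast group for peer-to-peer */
#endif

z_owned_config_t config = z_config_default();
zp_config_insert(z_loan(config), Z_CONFIG_MODE_KEY, z_string_make(MODE));
if (LOCATOR[0] != '\0') {
    zp_config_insert(z_loan(config), Z_CONFIG_CONNECT_KEY, z_string_make(LOCATOR));
}
```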

alexzeit commented 1 month ago

Here is the trace:

"649","0.786924369","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","151","52796 → 7447 [PSH, ACK] Seq=24179 Ack=1 Win=3413 Len=77"
"650","0.786990973","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=1 Ack=24256 Win=65535 Len=0"
"651","0.789431687","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","151","52796 → 7447 [PSH, ACK] Seq=24256 Ack=1 Win=3413 Len=77"
"652","0.789498290","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=1 Ack=24333 Win=65535 Len=0"
"653","0.790654945","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","77","7447 → 52796 [PSH, ACK] Seq=1 Ack=24333 Win=65535 Len=3"

Frame 653: 77 bytes on wire (616 bits), 77 bytes captured (616 bits) on interface pfe0, id 0
Ethernet II, Src: Gigalink_be:e7:72 (00:01:be:be:e7:72), Dst: STMicroelect_35:f1:9c (00:80:e1:35:f1:9c)
Internet Protocol Version 6, Src: fe80::201:beff:febe:e772, Dst: fe80::280:e1ff:fe35:f19c
Transmission Control Protocol, Src Port: 7447, Dst Port: 52796, Seq: 1, Ack: 24333, Len: 3
Data (3 bytes)
Data: 010004
[Length: 3]

"654","0.790945758","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","74","52796 → 7447 [ACK] Seq=24333 Ack=4 Win=3410 Len=0"
"655","0.791937005","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","151","52796 → 7447 [PSH, ACK] Seq=24333 Ack=4 Win=3413 Len=77"
"659","0.832289101","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=4 Ack=24410 Win=65535 Len=0"
"660","0.833102539","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","1383","52796 → 7447 [PSH, ACK] Seq=24410 Ack=4 Win=3413 Len=1309"
"661","0.833118540","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=4 Ack=25719 Win=65535 Len=0"
"662","0.833875975","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","151","52796 → 7447 [PSH, ACK] Seq=25719 Ack=4 Win=3413 Len=77"
"663","0.833942178","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=4 Ack=25796 Win=65535 Len=0"
"664","0.835934072","fe80::280:e1ff:fe35:f19c","fe80::201:beff:febe:e772","TCP","151","52796 → 7447 [PSH, ACK] Seq=25796 Ack=4 Win=3413 Len=77"
"665","0.836001675","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","74","7447 → 52796 [ACK] Seq=4 Ack=25873 Win=65535 Len=0"

At 0.833102539 zenoh-pico sent several messages at once: normally the length is 77 bytes, but here it is 1309 bytes, containing 17 messages.

alexzeit commented 1 month ago

Every time before this blocking happened, the subscriber sent:

"653","0.790654945","fe80::201:beff:febe:e772","fe80::280:e1ff:fe35:f19c","TCP","77","7447 → 52796 [PSH, ACK] Seq=1 Ack=24333 Win=65535 Len=3"

Frame 653: 77 bytes on wire (616 bits), 77 bytes captured (616 bits) on interface pfe0, id 0
Ethernet II, Src: Gigalink_be:e7:72 (00:01:be:be:e7:72), Dst: STMicroelect_35:f1:9c (00:80:e1:35:f1:9c)
Internet Protocol Version 6, Src: fe80::201:beff:febe:e772, Dst: fe80::280:e1ff:fe35:f19c
Transmission Control Protocol, Src Port: 7447, Dst Port: 52796, Seq: 1, Ack: 24333, Len: 3
Data (3 bytes)
Data: 010004

What does it mean?

jean-roland commented 1 month ago

That's pretty weird. We do have a batching mechanism to coalesce multiple messages into a single packet, but I'm pretty sure it's not implemented on pico.

We have a Zenoh plugin for Wireshark if you want more insight into the messages, but it might be a Zenoh [KEEP_ALIVE] packet from the router to maintain the connection with the publisher. These are sent every few seconds.

Similarly, the publisher will send Zenoh [KEEP_ALIVE] towards the router as well every few seconds.

jean-roland commented 1 month ago

Well, now that I've said that, it feels like the [KEEP_ALIVE] mechanism might be a culprit for your latency spikes. In the zenoh-pico config (include/zenoh-pico/config.h) you have two values governing the timing:

#define Z_TRANSPORT_LEASE 10000

#define Z_TRANSPORT_LEASE_EXPIRE_FACTOR 3

Basically, the keep-alive send period is the first value divided by the second. You could try to reduce the factor to 1 and see if it impacts the spikes.
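
For illustration (my own back-of-the-envelope numbers): with the defaults that gives 10000 ms / 3 ≈ 3.3 s between keep-alives; with a factor of 1 the keep-alive period becomes the full 10 s lease:

```c
/* include/zenoh-pico/config.h -- keep-alive period ≈ Z_TRANSPORT_LEASE / Z_TRANSPORT_LEASE_EXPIRE_FACTOR */
/* defaults: 10000 / 3 ≈ 3333 ms between keep-alives; with a factor of 1 it becomes ~10000 ms            */
#define Z_TRANSPORT_LEASE 10000
#define Z_TRANSPORT_LEASE_EXPIRE_FACTOR 1
```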

alexzeit commented 1 month ago

I have changed the FACTOR to 1 but the spikes are still 40ms:

Sample Count: 20704 (big_delay_count: 1) at 30.849295 angle: 0.0 Max Delay: 41568.04084777832 microseconds
Sample Count: 21104 (big_delay_count: 0) at 31.849339 angle: 0.0 Max Delay: 2595.663070678711 microseconds
Sample Count: 21504 (big_delay_count: 0) at 32.849338 angle: 0.0 Max Delay: 2581.8347930908203 microseconds
Sample Count: 21905 (big_delay_count: 1) at 33.849341 angle: 0.0 Max Delay: 41498.661041259766 microseconds
Sample Count: 22305 (big_delay_count: 0) at 34.849340 angle: 0.0 Max Delay: 2598.5240936279297 microseconds

Here is the test with peer mode. It looks better, but there is one big delay of about 12 ms roughly every 12 s:

Sample Count: 5463 (big_delay_count: 0) at 39.212251 angle: 0.0 Max Delay: 2662.65869140625 microseconds
Sample Count: 5880 (big_delay_count: 0) at 40.212226 angle: 0.0 Max Delay: 2429.0084838867188 microseconds
Sample Count: 6297 (big_delay_count: 0) at 41.212263 angle: 0.0 Max Delay: 2569.6754455566406 microseconds
Sample Count: 6713 (big_delay_count: 0) at 42.212222 angle: 0.0 Max Delay: 2652.6451110839844 microseconds
Sample Count: 7130 (big_delay_count: 0) at 43.212246 angle: 0.0 Max Delay: 2493.143081665039 microseconds
Sample Count: 7547 (big_delay_count: 0) at 44.212218 angle: 0.0 Max Delay: 2462.3870849609375 microseconds
Sample Count: 7963 (big_delay_count: 0) at 45.212248 angle: 0.0 Max Delay: 2424.2401123046875 microseconds
Sample Count: 8379 (big_delay_count: 0) at 46.212221 angle: 0.0 Max Delay: 2677.2022247314453 microseconds
Sample Count: 8796 (big_delay_count: 0) at 47.212218 angle: 0.0 Max Delay: 2496.480941772461 microseconds
Sample Count: 9212 (big_delay_count: 0) at 48.212227 angle: 0.0 Max Delay: 2618.7896728515625 microseconds
Sample Count: 9629 (big_delay_count: 0) at 49.212261 angle: 0.0 Max Delay: 2661.466598510742 microseconds
Sample Count: 10041 (big_delay_count: 0) at 50.212222 angle: 0.0 Max Delay: 11929.750442504883 microseconds
Sample Count: 10457 (big_delay_count: 0) at 51.212234 angle: 0.0 Max Delay: 2614.9749755859375 microseconds
Sample Count: 10874 (big_delay_count: 0) at 52.212218 angle: 0.0 Max Delay: 2516.9849395751953 microseconds
Sample Count: 11290 (big_delay_count: 0) at 53.212269 angle: 0.0 Max Delay: 2420.663833618164 microseconds
Sample Count: 11706 (big_delay_count: 0) at 54.212225 angle: 0.0 Max Delay: 2424.955368041992 microseconds
Sample Count: 12123 (big_delay_count: 0) at 55.212227 angle: 0.0 Max Delay: 2475.5001068115234 microseconds
Sample Count: 12539 (big_delay_count: 0) at 56.212221 angle: 0.0 Max Delay: 2525.5680084228516 microseconds
Sample Count: 12956 (big_delay_count: 0) at 57.212270 angle: 0.0 Max Delay: 2424.7169494628906 microseconds
Sample Count: 13372 (big_delay_count: 0) at 58.212222 angle: 0.0 Max Delay: 2414.4649505615234 microseconds
Sample Count: 13788 (big_delay_count: 0) at 59.212223 angle: 0.0 Max Delay: 2666.2349700927734 microseconds
Sample Count: 14205 (big_delay_count: 0) at 00.212223 angle: 0.0 Max Delay: 2564.668655395508 microseconds
Sample Count: 14622 (big_delay_count: 0) at 01.212261 angle: 0.0 Max Delay: 2624.034881591797 microseconds
Sample Count: 15034 (big_delay_count: 0) at 02.212227 angle: 0.0 Max Delay: 11926.651000976562 microseconds

jean-roland commented 1 month ago

For the client: changing the factor shouldn't improve the delay spike, only influence its frequency/period. Do you notice any change in this regard?

For peer mode: do you have a corresponding Wireshark capture? Is there anything weird around the delayed packet? Peer mode periodically sends [JOIN] packets every few seconds, so if that were the cause the spike period would be lower than 12 seconds.

alexzeit commented 1 month ago

Yes, I have checked in Wireshark and the 10 ms delay is there, but this is my fault: I used a printf for debugging every 5000 messages. I commented it out and now it works fine with peer mode. So the problem might be on the Zenoh router side, if I got it right.

jean-roland commented 1 month ago

That is a relief for peer mode.

Can you confirm that the spikes in client mode always correspond to a batching of your messages by the router? If that's the case, we can ask the router people if there's a way to configure this out.

alexzeit commented 1 month ago

Yes, I can confirm this. I just tested with a 1 ms publishing period in peer mode and it works fine. In my opinion the CPU load is a bit high: the zenoh-pico publisher uses 32% versus 5% for a UDP socket on the Nucleo H753ZI (480 MHz). I hope this can be improved in the next version.
Yes, you can close the issue for zenoh-pico. Many thanks for helping us; I hope the router people can help here further :)

Mallets commented 4 weeks ago

On the dev/1.0.0 version you can disable batching on the router by setting the batching parameter in the configuration.

alexzeit commented 3 weeks ago

Hi, I set batching to false and the spikes have disappeared, but during the start of communication the latency for most packets rose up to 0.8 s. After 10 s the latency stabilised at 1.7 ms. Only if I reduce the batch size does it start with 2 ms latency.