nodakai opened this issue 1 month ago
@nodakai I think if you change the code in the subscriber to:
// while let NodeEvent::Tick = node.wait(CYCLE_TIME) {
//     while let Some(sample) = subscriber.receive()? {
while let NodeEvent::Tick = node.wait(Duration::ZERO) {
    if let Some(sample) = subscriber.receive()? {
        let tr = current_time();
        let dt = tr - sample.i;
        println!("received: {:?} delay: {:.1} us", *sample, dt as f64 * 1e-3);
    }
}
It would fix the issue.
If you do not want to perform a busy wait, you can combine the publish-subscribe service with an event service and fire an event after the publisher has sent the message. On the subscriber side, you wait on a listener until you have received the event and then receive your sample on the subscriber. You can find an event example here: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/examples/rust/event
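For reference, here is a minimal sketch of that pattern, loosely based on the linked event example. The service name "latency-event" and the one-second timeout are made up for illustration, and the exact builder methods may differ between iceoryx2 versions:

use core::time::Duration;
use iceoryx2::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let node = NodeBuilder::new().create::<ipc::Service>()?;
    let event = node
        .service_builder(&"latency-event".try_into()?)
        .event()
        .open_or_create()?;

    // publisher process: fire a notification right after publishing the sample
    let notifier = event.notifier_builder().create()?;
    notifier.notify()?;

    // subscriber process: block until the notification arrives, then call
    // subscriber.receive() to pick up the sample
    let listener = event.listener_builder().create()?;
    listener.timed_wait_one(Duration::from_secs(1))?;
    Ok(())
}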
Hmm, applying the suggested change (Duration::ZERO only) reduced the latency to 20--50 us without consuming 100% of a CPU core. So it appears that iceoryx2's implementation does not rely on classical busy looping on an atomic variable in shared memory? Are there any docs on which synchronization primitives are used for each operation mode?
It's still much slower than these numbers, so I'll have to continue investigating: https://github.com/eclipse-iceoryx/iceoryx2/blob/main/internal/plots/benchmark_mechanism.svg
@nodakai another problem can be this call: libc::clock_gettime(libc::CLOCK_REALTIME, &mut ts). Depending on which underlying clock source it uses, either HPET or TSC, there can be a huge performance impact; @elBoberido knows more here.
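On Linux you can check which clock source the kernel currently uses by reading sysfs; a minimal sketch (the sysfs path is standard, reading it from Rust like this is just one way to do it):

// Reads the kernel's active clock source (Linux only). "tsc" is usually
// cheap to query, while "hpet" can make every clock_gettime call expensive.
fn main() -> std::io::Result<()> {
    let src = std::fs::read_to_string(
        "/sys/devices/system/clocksource/clocksource0/current_clocksource",
    )?;
    println!("current clock source: {}", src.trim());
    Ok(())
}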
We encountered the same problem in our benchmarks. The solution was, instead of sending one sample and measuring the time from A to B, to use a setup where process A sends a sample to process B and, as soon as process B has received it, B sends a sample back to A. We repeat this a million times and then take the total time divided by the number of repetitions times 2 (since the sample travels back and forth). This is a common method called a ping-pong benchmark.
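In code, the measurement boils down to the following sketch; the round-trip body is elided and ITERATIONS is an arbitrary choice:

use std::time::Instant;

const ITERATIONS: u64 = 1_000_000;

fn main() {
    let start = Instant::now();
    for _ in 0..ITERATIONS {
        // A -> B, then B -> A (the actual iceoryx2 round trip goes here)
    }
    let total = start.elapsed();
    // each iteration contains two one-way trips, hence the factor of 2
    let one_way_ns = total.as_nanos() as f64 / (ITERATIONS as f64 * 2.0);
    println!("mean one-way latency: {one_way_ns:.1} ns");
}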
@nodakai Oh, could you adjust your code to:
// while let NodeEvent::Tick = node.wait(CYCLE_TIME) {
//     while let Some(sample) = subscriber.receive()? {
loop {
    if let Some(sample) = subscriber.receive()? {
        let tr = current_time();
        let dt = tr - sample.i;
        println!("received: {:?} delay: {:.1} us", *sample, dt as f64 * 1e-3);
    }
}
Since you are working on an aarch64 target, you should be able to expect a single-digit microsecond latency, see: https://raw.githubusercontent.com/eclipse-iceoryx/iceoryx2/refs/heads/main/internal/plots/benchmark_architecture.svg
For the Raspberry Pi 4B we achieved a latency of ~800 ns.
Removing node.wait(...) reduced the latency further, to 3--5 us, at the cost of consuming 100% of a CPU core (it does look like busy looping now).
It still lags noticeably behind your Rasp Pi result. I know I shouldn't expect top performance from a shared, HT-enabled cloud machine, but 0.8 us vs 5 us seems weird.
Was your Rasp Pi result obtained with busy looping (100 % CPU core consumption)?
I already verified that the following test yields ~50 ns on the target Linux machine, so I'm not worried about the accuracy/overhead of clock_gettime(CLOCK_REALTIME); it should be accelerated by the vDSO: https://man7.org/linux/man-pages/man7/vdso.7.html#ARCHITECTURE-SPECIFIC_NOTES
loop {
    let t0 = current_time();
    let t1 = current_time();
    println!("t1 - t0 = {}", t1 - t0);
    std::thread::sleep(CYCLE_TIME);
}
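For completeness, current_time() is not shown in the thread; a plausible definition, assuming the libc crate and CLOCK_REALTIME in nanoseconds as used above, would be:

// Hypothetical definition of current_time(): CLOCK_REALTIME in nanoseconds.
// On Linux this is normally serviced via the vDSO, i.e. without a syscall.
fn current_time() -> u64 {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    unsafe { libc::clock_gettime(libc::CLOCK_REALTIME, &mut ts) };
    ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64
}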
@nodakai I hesitate to ask, but are you building in release mode? On my system, debug builds are ~6.5 times slower than release builds.
@nodakai
Was your Rasp Pi result obtained with busy looping (100 % CPU core consumption)?
Yes, our benchmarks are part of the repo and you can run them yourself: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/benchmarks
cargo run --bin benchmark-publish-subscribe --release -- --bench-all
runs a ping-pong style pub-sub benchmark with two busy-spin loops.

cargo run --bin benchmark-event --release -- --bench-all
runs a ping-pong style event benchmark with two loops waiting for a notification; here the threads perform a syscall under the hood to wait for the notification to arrive and then fire a response.

It still lags noticeably behind your Rasp Pi result. I know I shouldn't expect top performance from a shared, HT-enabled cloud machine, but 0.8 us vs 5 us seems weird.
We have never run our benchmark suite on such a machine, but it would be good to add one. One idea is that the hardware is emulated and you pay for it with latency, but this is just a wild guess.
but are you building in release mode?
Of course, I'm using --release.
Yes, our benchmarks are part of the repo and you can run them yourself: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/benchmarks
Thanks, I'll give it a try. However, it seems that there’s quite a bit of parameter tuning involved. If the examples in the README aren’t reflective of typical usage, there might be room for improvement in the API design.
Thanks, I'll give it a try. However, it seems that there’s quite a bit of parameter tuning involved. If the examples in the README aren’t reflective of typical usage, there might be room for improvement in the API design.
You could divide the parameter tuning into two fields: deployment and iceoryx2.
For the deployment, we used the POSIX thread feature to pin a thread to a specific CPU core and to give it a high priority, which reduces jitter from your OS.
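As an illustration (not necessarily the exact code used in the benchmarks), pinning the current thread to a core and raising its priority on Linux could look like this; core index 0 and priority 90 are arbitrary, and raising the priority requires elevated privileges (CAP_SYS_NICE):

// Sketch: pin the current thread to CPU core 0 and switch it to SCHED_FIFO
// with a high static priority; error handling omitted for brevity.
fn pin_and_prioritize() {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_SET(0, &mut set);
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);

        let param = libc::sched_param { sched_priority: 90 };
        libc::pthread_setschedparam(libc::pthread_self(), libc::SCHED_FIFO, &param);
    }
}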
For iceoryx2 it is essentially: the less, the better (at least for most parameters). If you have a setup with 1 publisher and 1000 subscribers, the latency will go up; the same goes for other parameters like history size. In the future we need to provide benchmarks for such setups as well, but the first benchmark focuses on a minimal use case with one publisher and one subscriber. I expect only a few extra nanoseconds when you increase these numbers significantly, but this needs to be proven by a benchmark. See the sketch below for where these parameters are set.
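To make that concrete, those limits are configured when the service is created. A sketch, assuming the current publish-subscribe builder API (method names like max_subscribers and history_size exist in recent iceoryx2 releases, but check your version; the service name is made up):

use iceoryx2::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let node = NodeBuilder::new().create::<ipc::Service>()?;
    // keeping these limits minimal is what the latency numbers above assume
    let service = node
        .service_builder(&"latency-test".try_into()?)
        .publish_subscribe::<u64>()
        .max_publishers(1)
        .max_subscribers(1)
        .history_size(0)
        .subscriber_max_buffer_size(1)
        .open_or_create()?;
    let _subscriber = service.subscriber_builder().create()?;
    Ok(())
}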
Required information
Operating system:
Linux db 6.1.0-25-cloud-arm64 #1 SMP Debian 6.1.106-3 (2024-08-26) aarch64 GNU/Linux
Rust version:
rustc 1.81.0 (eeb90cda1 2024-09-04)
Cargo version:
cargo 1.81.0 (2dbb1af80 2024-08-20)
iceoryx2 version:
Detailed log output:
Observed result or behaviour:
received: Msg { data: 1234, i: 1728059776063045068 } delay: 289894.3 us
Expected result or behaviour:
single-digit us latency
Conditions where it occurred / Performed steps: