eclipse-iceoryx / iceoryx2

Eclipse iceoryx2™ - true zero-copy inter-process-communication in pure Rust
https://iceoryx.io
Apache License 2.0
1.03k stars 40 forks source link

Pub/sub example creates unkillable process with 100% core usage and no output when switched to use string #436

Closed Cypher1 closed 1 month ago

Cypher1 commented 1 month ago

r.e. FAQ

I've seen the 'dynamically sized' portion of the FAQ. This is not about the ability to transmit variably sized data types. This issue is about how badly iceoryx handles dynamically sized data. Ideally this would be prevented at compile time, but at minimum the program should respond to SIGKILL/SIGTERM and should probably crash when failing to handle dynamically sized data, rather than spinning at 100% core usage.

Required information

Operating system:

Rust version: rustc 1.80.1 (3f5fd8dd4 2024-08-06)

Cargo version: cargo 1.80.1 (376290515 2024-07-16)

iceoryx2 version: v0.4.1

Detailed log output: I think this is a) easy to reproduce and b) is worth reproducing, so I have not collected log data myself.

Observed result or behaviour: ./sub runs at 100% core usage and does not respond to attempts to kill it (e.g. SIGKILL, SIGTERM). It has to be SIGSTOP'd and abandoned.

pub.rs

use std::process::ExitCode;
use core::time::Duration;
use iceoryx2::prelude::*;

const CYCLE_TIME: Duration = Duration::from_secs(1);

fn main() -> ExitCode {
    let mut count = 0;
    let node = NodeBuilder::new().create::<ipc::Service>().expect("1");

    // create our port factory by creating or opening the service
    let service = node.service_builder(&"My/Funk/ServiceName4".try_into().expect("2"))
        .publish_subscribe::<String>()
        .open_or_create().expect("3");

    let publisher = service.publisher_builder().create().expect("4");

    while let NodeEvent::Tick = node.wait(CYCLE_TIME) {
        let sample = publisher.loan_uninit().expect("5");
        let sample = sample.write_payload(format!("{:?}", count));
        count += 1;
        sample.send().expect("6");
    }
    ExitCode::SUCCESS
}

sub.rs

use std::process::ExitCode;
use core::time::Duration;
use iceoryx2::prelude::*;

const CYCLE_TIME: Duration = Duration::from_secs(1);

fn main() -> ExitCode {
    let node = NodeBuilder::new().create::<ipc::Service>().expect("1");

    // create our port factory by creating or opening the service
    let service = node.service_builder(&"My/Funk/ServiceName4".try_into().expect("2"))
        .publish_subscribe::<String>()
        .open_or_create().expect("3");

    let subscriber = service.subscriber_builder().create().expect("4");

    while let NodeEvent::Tick = node.wait(CYCLE_TIME) {
        while let Some(sample) = subscriber.receive().expect("5") {
            println!("received: {:?}", *sample);
        }
    }
    ExitCode::SUCCESS
}

Expected result or behaviour: The following alternatives would be acceptable.

Conditions where it occurred / Performed steps: Build the above files and attempt to run them in any order.

Cypher1 commented 1 month ago

Also, I suspect that the sub process is performing a buffer overflow and reading from memory not associated with the sender. I've not spent the time to actually prove this but it would explain the high CPU usage.

elfenpiff commented 1 month ago

@Cypher1 Thanks for the bug report!

You cannot send a String with iceoryx2. The reason is that a String uses heap memory and contains a pointer. iceoryx2 supports only self-contained data types. We already provide some containers that are shared memory compatible, in your case, take a look at the complex data types example: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/examples/rust/complex_data_types

Here, we demonstrate how to use a FixedSizeByteString and a FixedSizeByteString as payload data.

Ideally this would be prevented at compile time,

With v0.5 we will introduce a macro that prevents users from using incompatible types. Also, we will add this to the documentation so that no one else will run into this issue before v0.5!

but at minimum the program should respond to SIGKILL/SIGTERM and should probably crash when failing to handle dynamically sized data, rather than spinning at 100% core usage.

You uncovered a bug in our signal handler caused by a SEGFAULT by using the string. This should definitely not happen and I completely agree. We will look into it and fix it.

Cypher1 commented 1 month ago

Great to hear! Thanks

Just btw, on early reflection (and I haven't spent much time on it yet) I don't think a macro should be needed for the safety I'm describing. I think it would be simpler (and more 'Rust') to provide a trait like SelfContained and implement it for all primitives and maybe even Copy types. Whatever is happening during the serialization / copying into shared memory step is not working for heap data and this is likely unsafe.

Anyway, sounds like you have it in hand, thanks for the speedy response

elfenpiff commented 1 month ago

@Cypher1

I added the type restrictions and how to define payload types for iceoryx to all example documentations and the FAQ. Furthermore, with #448 the signal handling should now be correct - so if you let your String example run again, the subscriber should terminate when the SIGSEGV is received.

Just btw, on early reflection (and I haven't spent much time on it yet) I don't think a macro should be needed for the safety I'm describing. I think it would be simpler (and more 'Rust') to provide a trait like SelfContained and implement it for all primitives and maybe even Copy types.

Yes, this was our idea. To have a proc macro with a corresponding trait like SelfContained or ShmSend which is implemented for all primitive types. So you just have to annotate your struct with #[ShmSend] and if one member does not implement the trait you get a compile failure.

Whatever is happening during the serialization / copying into shared memory step is not working for heap data and this is likely unsafe.

One of the benefits of zero copy communication is that you do not need to serialize the data at all. You can use the API so that you do not need to perform a single copy. The workflow should be like acquire memory with Publisher::loan() provide this memory to your thing that produces data - for instance V4L2 API has a call so that the camera writes the frame directly into the provided memory - and then you send it out to the next process. So you do not have a single copy in this pipeline and then iceoryx2 becomes incredibly fast.

elfenpiff commented 1 month ago

448 and #447 are merged and solving the issue. @Cypher1 if there is still an open problem please feel free to reopen this issue or create a new one.

Cypher1 commented 1 month ago

Thanks! That should do it (aside from the ongoing questions around detecting safe vs unsafe types [relocatable I think would be the requirement] types at compile time). I think I'm going to have to use flatbuffer or cap'n proto (I have dynamic message size requirements and want to support communications between machines, so I can't get the perf improvements from iceoryx) but I'm really glad you all are working on it.

elfenpiff commented 1 month ago

@Cypher1 Until the end of this year, we will focus on making the allocation in the publisher as dynamic as possible so that the user will no longer need to set the worst-case size. When this is done (and request-response, a lot of users are waiting for it) we will look into relocatable types then you shouldn't need to serialize data anymore which should give you another performance boost.

Cypher1 commented 1 month ago

Exciting to hear, Thanks!