eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0

iox-roudi throws POPO__CHUNK_LOCKING_ERROR when killing a process mid-publish #2304

Closed hrudhansh closed 6 days ago

hrudhansh commented 1 week ago

Required information

Operating system: Ubuntu 24.04 LTS

Compiler version: 12.3.0

Eclipse iceoryx version: b2cd72bdc789bcf7601cb112c6078c47d533d798

Observed result or behaviour: Killing an application that is in the middle of a publish 'critical section' causes POPO__CHUNK_LOCKING_ERROR in iox-roudi.

Expected result or behaviour: When the destructor is called, it abruptly stops the publish, leaves the 'critical section', and the process exits gracefully.

Conditions where it occurred / Performed steps: To reproduce -

  1. Run a pub-sub with no delay in between publishes.
  2. Register a SIGINT signal handler in your main like signal(SIGINT, SignalHandler);
  3. Upon ctrl+c on the pub process, you would see it stall.
  4. Upon ctrl+c again you would see the above error in iox-roudi.

Additional helpful information

On my end, I ran gdb on the pub process with -exec handle SIGINT nostop & -exec handle SIGINT pass, set a breakpoint on the exit(sigint); and called pkill -SIGINT publisher in a separate terminal. I noticed:

  1. In cases where it fails, the signal handler seems to be called in the middle of a publish 'critical section'.
  2. Once this happens ^, the main loop seems to just be spinning.
  3. In cases where it fails, there is also another 'KeepAlive' background thread running.

So I assume what is happening is -

Publish thread starts critical section > triggers an 'is_started' state change in background thread > sends ack back to publish > publish moves ahead > publish is interrupted > background thread is waiting for an 'is_ended' trigger > it never gets it so keeps waiting > publish thread also waiting for background thread to ack 'is_ended'

hrudhansh commented 1 week ago

This was posted originally in this issue #2193

elBoberido commented 1 week ago

@hrudhansh do you have a minimal example which triggers the problem? Ideally targeting the iceoryx main branch.

If you look at our examples, they also register a signal handler and have no problem with ctrl+c. They use the signal handler either implicitly, via iox::waitForTerminationRequest(); and while (!iox::hasTerminationRequested()), or explicitly, with iox::registerSignalHandler.
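
The loop-condition pattern can be sketched with the standard library alone. This is a minimal illustration and not iceoryx code: g_terminationRequested, onSignal, terminationRequested, FakePublisher and runPublishLoop are hypothetical names standing in for the flag behind iox::hasTerminationRequested() and for a real publisher. The handler only records the request, so the loop finishes its current publish and the publisher's destructor runs normally.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical stand-in for the state behind iox::hasTerminationRequested().
// volatile std::sig_atomic_t is a type a signal handler may safely write.
static volatile std::sig_atomic_t g_terminationRequested = 0;

// The handler only records the request; it never calls exit().
extern "C" void onSignal(int) { g_terminationRequested = 1; }

bool terminationRequested() { return g_terminationRequested != 0; }

// Hypothetical publisher stand-in that reports whether its destructor ran.
struct FakePublisher {
    bool& destroyed;
    int published = 0;
    explicit FakePublisher(bool& flag) : destroyed(flag) {}
    ~FakePublisher() { destroyed = true; }
    void publish() { ++published; }
};

// Publishes without delay until termination is requested and returns the
// number of published samples. For demonstration only, the function raises
// SIGINT itself after `raiseAfter` publishes instead of waiting for ctrl+c.
int runPublishLoop(int raiseAfter, bool& publisherDestroyed) {
    assert(raiseAfter > 0);
    g_terminationRequested = 0;
    std::signal(SIGINT, onSignal);
    int count = 0;
    {
        FakePublisher pub(publisherDestroyed);
        while (!terminationRequested()) {
            pub.publish(); // never interrupted half-way by an exit()
            if (pub.published == raiseAfter) {
                std::raise(SIGINT); // handler runs before raise() returns
            }
        }
        count = pub.published;
    } // pub is destroyed here, after the loop has ended cleanly
    std::signal(SIGINT, SIG_DFL);
    return count;
}
```

Calling runPublishLoop(3, destroyed) with a local bool destroyed publishes three times, leaves the loop once the flag is set, and sets destroyed to true on the way out.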

hrudhansh commented 1 week ago

@elBoberido You are correct! Adding "while (!iox::hasTerminationRequested())" seems to make the issue go away.

So the issue was essentially:

But this is great, I will potentially just add it in front of every publish call if the overhead isn't too high. Works every time now, thank you!

elBoberido commented 1 week ago

@hrudhansh you don't need to add it before every publish call. I guess you will have a loop where you publish, or something similar. Just add it as part of the loop condition. Alternatively, if you are blocking in the main thread, it might also be sufficient to just have the iox::waitForTerminationRequest(); call there.
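
The second variant, where the main thread blocks while another thread publishes, might look roughly like the following standard-library sketch. All names here (g_stop, onStopSignal, waitForStop, runWorkerUntilSignal) are hypothetical, and waitForStop is only a polling stand-in for iox::waitForTerminationRequest(); how iceoryx implements the wait is not shown in this thread. The sketch assumes std::atomic<bool> is lock-free on the target platform, which is what makes it safe to write from a signal handler.

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

// Hypothetical stop flag shared by the signal handler and the worker thread.
// Assumption: std::atomic<bool> is lock-free here, so the handler may write it.
static std::atomic<bool> g_stop{false};

extern "C" void onStopSignal(int) { g_stop.store(true); }

// Polling stand-in for a blocking wait such as iox::waitForTerminationRequest().
void waitForStop() {
    while (!g_stop.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Starts a worker that "publishes" in a tight loop, blocks the calling thread
// until the stop signal arrives, then joins the worker so everything is torn
// down in order. Returns the number of simulated publishes.
long runWorkerUntilSignal() {
    g_stop.store(false);
    std::signal(SIGINT, onStopSignal);

    std::atomic<long> published{0};
    std::thread worker([&published] {
        while (!g_stop.load()) {
            ++published; // stands in for a publish call
        }
    });

    // Demonstration only: wait for the first publish, then raise SIGINT
    // from this thread instead of waiting for ctrl+c.
    while (published.load() == 0) {
        std::this_thread::yield();
    }
    std::raise(SIGINT);

    waitForStop();  // the main thread blocks here in a real application
    worker.join();  // the worker saw the flag and left its loop
    std::signal(SIGINT, SIG_DFL);
    return published.load();
}
```

The point of the design is that the signal handler does nothing except set the flag; the publishing loop notices it between publishes, and the join guarantees the worker has stopped before any objects it uses are destroyed.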

If you are able to post a minimal example of your code, I might be able to tell you the ideal solution for iceoryx. The important thing is to handle the shutdown in a way to let all the destructors run.

hrudhansh commented 6 days ago

So I am essentially making an opinionated wrapper library around Iceoryx for exactly our use-case. One of the "philosophies" of this library is having a very small footprint in our codebase. So ideally the flow is - bring in the header > instantiate > call publish... everything else is taken care of for you. So while I don't have a fixed minimal example, in this case I was just trying to push the boundaries by calling publish with no delays, and see how it holds up. It holds well btw! I did not miss a single message on the sub side once the Options are set correctly.

But I see your point - better to optimize around the whole publish loop instead of every publish call.

elBoberido commented 6 days ago

This example might be interesting for you: https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_examples/request_response/client_cxx_waitset.cpp

It shows that you basically just have to register a signal handler and then notify your event loops to stop the execution.