eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0
1.57k stars · 373 forks

Race condition between ProbeProcessAliveWithSigTerm in RouDi and SignalWatcher singleton in applications #2308

Open · gpalmer-latai opened 6 days ago

gpalmer-latai commented 6 days ago

Required information

When shutting down applications, RouDi periodically sends SIGTERM signals. Usually these are all eaten by the SignalWatcher in applications, which simply sets a flag that a SIGTERM was received.

But there is a very small window, right after the SignalWatcher's destructor has run and reset the signal handler but before the process is actually dead, in which a SIGTERM triggers the default signal handler and terminates the program with a non-zero exit code.

This is pretty annoying when monitoring applications for non-zero exit codes, because every once in a while you'll get the SIGTERM exit code (usually 128 + 15 = 143, though in my environment it seems to be just 15).

Operating system: NVidia DriveOS 6.0.6

Compiler version: Can't recall off the top of my head - some version of gcc for arm64 CPU

Eclipse iceoryx version: Fork based off of 8a5e08348a8fe830cdb1d92bf5b299a3e4bc8282 (upstream main as of 2024.11.1)

Observed result or behaviour: Occasionally applications have exit code 15, even though all signs from RouDi and the application indicate that it exited gracefully

Expected result or behaviour: Exit code 0 always. Preferably we would monitor whether processes are alive in some other way besides spamming SIGTERM. I believe the idea of a file lock was mentioned w.r.t. iceoryx2?

Conditions where it occurred / Performed steps: Have an application which connects to RouDi and uses the waitForTermination() signal handler method. Take a non-trivial amount of time to shut down, such that RouDi sends several follow-up SIGTERMs after the first. Be unlucky and have the signal watcher partially destroyed when receiving one of those SIGTERMs.

Additional helpful information

elBoberido commented 5 days ago

@gpalmer-latai there is a reason I don't like singletons.

I think I have both a workaround and a solution. Since the destruction order of static variables is the reverse of their construction order, you could just call hasTerminationRequested before initRuntime.

Unfortunately, we need to reset the signal handler in the dtor; otherwise a signal raised after the dtor would access a destroyed semaphore. But it should be possible to just register an empty function as the signal handler. AFAIK this should prevent immediate termination, and since the dtor of the SignalWatcher runs after main, it shouldn't have a negative effect. @elfenpiff what do you think?

There is a third solution, but it is not yet fully workable for you. The experimental posh API does not use a singleton for the runtime but a local variable. With this, it is also guaranteed that the unregistering from RouDi is finished before the dtor of the static SignalWatcher runs. Unfortunately, not all endpoints have been ported yet; request-response and the listener are not yet available.

elfenpiff commented 5 days ago

@elBoberido @gpalmer-latai

> Since the destruction order of static variables is in reverse order of their construction, you could just call hasTerminationRequested before initRuntime.

Within one translation unit you are right, but with multiple translation units this is no longer guaranteed. In this use case it should work, but I wouldn't recommend it for production code. However, it is perfect as a quick intermediate solution to remove the annoying failures, and it gives us time to solve this issue cleanly.

> But it should be possible to just register an empty function as signal handler. AFAIK, this should prevent immediate termination and since the dtor of the SignalWatcher runs after main, it shouldn't have a negative effect.

It would render the SignalWatcher useless in all other cases, but I agree that something like this is the only viable solution when we have a singleton runtime. I am pointing this out because I think I have used the SignalWatcher before to capture signals like SIGABRT when the shared memory is set to zero, in order to provide useful error output to the user.

Also, I would not fix a design mistake we made with the singleton by adding another mistake to the SignalWatcher, which would then have surprising behavior. The contract is clear: register a signal handler, and when the SignalWatcher goes out of scope, it restores the previous signal catching calls. Another instance may have overridden some kind of signal handler, and with this change we would break that.

> There is a third solution but it is not yet fully workable for you. The experimental posh API does not use a singleton for the runtime but a local variable. With this, it is also guaranteed that the unregistering from RouDi is finished before the dtor of the static SignalWatcher runs. Unfortunately, not all endpoints are already ported and request-response and the listener are not yet available.

I would recommend this path. As far as I remember, @elBoberido, you mentioned that less than a week would have to be invested to finish the feature set.

Otherwise, it feels like we are trying to reduce bugs/technical debt by introducing even more technical debt.

elBoberido commented 5 days ago

@elfenpiff when my memories of creating the API for the Publisher, Subscriber, and WaitSet were fresh, it would have taken me about two days, but now I guess it will take at least one more day to get back into the details.

gpalmer-latai commented 5 days ago

Thanks for the feedback. I'm going to try the suggested workaround for now, and I agree the proper fix is to de-singleton the runtime. If I ever get the resources to do so, I might help contribute to this (after I get through my backlog of other things I want to upstream).

elBoberido commented 5 days ago

In this case I could at least guide you a little bit.