Open niclar opened 9 months ago
@niclar this is related to #325. For the history feature there is a mutex involved. RouDi locks the mutex and can insert the samples into the queue without the interference of the publisher. This mutex is only contested when subscribers are added or removed but nevertheless, the publisher holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem but unfortunately, the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix the problem. Other things came up and the issue never really happened in our setup.
This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like Application foo not responding (last response 1509 milliseconds ago) --> removing it
. If the monitoring is turned off, then RouDi does not detect the dead application immediately but only when the application tries to re-register. But then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.
Looking at the logs again it looks like roudi incorrectly assumed 4 processes died prior. Don't know why it would assume this though. -Is there a starvation issue in the detection ?
@niclar this is related to #1361. I already improved the situation on master
-Looks like that fix is/was (2023-11-01) already in the code running (2023-11-23)
Oh, I just saw v2.0.x and thought it came from the release branch.
Is the system time changed abruptly so that jumps can occur?
Other scenarios that lead to this error might be when the threads which update the heartbeat value are not scheduled.
A workaround might be to massively increase the PROCESS_KEEP_ALIVE_TIMEOUT
. Unfortunately it can currently not be changed via a cmake paremeter but that would not be difficult to add.
These thread(s) are part of the publisher(s) & not roudi right ? -I reckon they might have been starved in this instance. Some deadline monotonic scheduling or similar would be nice to be able to jack in here..
It is a thread of the runtime. I think it should not be too difficult to add a updateHeartbeat
method to the runtime. One could then call it manually.
This morning we encountered this fatal error again during (post) the morning restart routine of some iceoryx client processes.
Do you have more information, e.g. was the load during that time quite high?
cpu core utilization is low overall with roudi running on a dedicated core interrupt isolated. we received this anew the other day seemingly "spontanously" a few hours after a mid day restart of ~20 roudi client processes.
Do you have some logs of spikes in the CPU utilization when the issue occurred?
...we've added core utilization logging now for next time it happens...
@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history? When on connecting a subscriber to a publisher no history has to be delivered, there should be also no need for a mutex.
@niclar Do you require the history when a subscriber connects?
Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible) but then it is possible that a user receives the same sample twice. But only when connecting to a new publisher (and this can maybe filtered out with sequence numbers?!)
@elfenpiff, good news, we do not. All pub/sub are instantiated with historyCapacity=0
and historyRequest=0
respectively.
It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.
Furthermore, the POPO__CHUNK_LOCKING_ERROR
is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.
@elBoberido
Furthermore, the
POPO__CHUNK_LOCKING_ERROR
is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.
This we can mitigate with turning monitoring off. But a better alternative would be to have a liveliness QoS that enforces some contracts like that a subscriber as to collect a sample latest after X seconds and the publisher must publisher every Y seconds a sample. But this would require some time-consuming refactoring.
It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.
Yes, you are right. But this could be solved by bringing loffli into the play and a lock-free optional where the atomic signals that the thing is set or not. This is also some major refactoring but I think it is solvable.
Well, the monitoring was explicitly turned on as far as I know :sweat_smile:
@niclar did you encounter this problem again?
In the meantime I'm a bit hesitant on exposing the heartbeat triggering as public API. An alternative would be to increase the timeout for the detection of died processes. Currently it is about 1.5 seconds and the heartbeat should be sent every 300ms. If this happens again, can you check the RouDi log output for messages like Application foo not responding (last response #### milliseconds ago) --> removing it
and report the amount of milliseconds?
As a matter of fact we did friday evening;
^[[0;90m2024-04-12 18:35:37.238 ^[[0;1;93m[Warn ]^[[m: Application marketview.prd not responding (last response 1511 milliseconds ago) --> removing it ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-bf50be8845.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;97;41m[Fatal]^[[m: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: ICEORYX error! POPO__CHUNK_LOCKING_ERROR libc++abi: terminating
Okay, the deadline was missed only by a little bit. For some reasons the runtime thread which sends the heartbeat was not triggered for a long time. Is this on Windows or on Linux?
As a quick workaround I would suggest to try to change the 5
in this line in iceoryx_posh_types.hpp
to 10
constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL; // > PROCESS_KEEP_ALIVE_INTERVAL
It does not fix the problem but should make it much more unrealistic. You could increase the time even more. With e.g. 200
it would be one minute until RouDi would assume a process is dead. The 1.5 seconds came from a time where RouDi did not clean up the resources when a crashed application tried to re-register and therefore we kept it low in order to not have to wait to long. But now it could be indeed set to 1 minute which would not be too bad to clean the remainders of a crashed process up but should virtually prevent killing running applications.
The more I think about it, the more I come to the conclusion to just increase this timeout to eliminate this source for leaving a corrupt mutex behind.
This is on linux, dedicated cores. -I applied the workaround and will get back if we encounter it again.
Thanks!
Still encountering the issue
2024-04-22 11:32:58.571 [Warn ]: Application X not responding (last response 3046 milliseconds ago) --> removing it 2024-04-22 11:32:58.579 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-2ee4ec5b88.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died 2024-04-22 11:32:58.579 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() 2024-04-22 11:32:58.579 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. 2024-04-22 11:32:58.579 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
This is weird, especially in combination with the high socket usage from #2268.
Just to exclude other possibilities. You are using the plain upstream version from the commit mentioned in the description and not a custom branch with that commit cherry-picked?
It might very well be related to the IPC queues yes.
-Yes, the only difference being the attached "patches"; increased_process_keep_alive_intervall.patch service_description.patch
Build options; -DINTROSPECTION=ON -DIOX_MAX_PUBLISHERS=4096 -DIOX_MAX_SUBSCRIBERS=4096 -DIOX_MAX_SUBSCRIBERS_PER_PUBLISHER=64 -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16 -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=256 -DIOX_MAX_CLIENTS_PER_SERVER=512 -DIOX_MAX_NUMBER_OF_NOTIFIERS=512
Roudi conf;
iox::config::CmdLineArgs_t cmdLineArgs; cmdLineArgs.run = conf.net.enabled; cmdLineArgs.logLevel = iox::log::LogLevel::INFO; cmdLineArgs.monitoringMode = iox::roudi::MonitoringMode::ON; cmdLineArgs.compatibilityCheckLevel = iox::version::CompatibilityCheckLevel::PATCH; //cmdLineArgs.uniqueRouDiId cmdLineArgs.processTerminationDelay = iox::roudi::PROCESS_DEFAULT_TERMINATION_DELAY; cmdLineArgs.processKillDelay = iox::roudi::PROCESS_DEFAULT_KILL_DELAY; //cmdLineArgs.configFilePath;
We'll update to head in a day or two and see how that plays out...
Those patches should not have any negative effect on the heartbeat. Do you run the iceoryx tests after the build? Since you changed some of the compile time constants it is strongly recommended to run the tests. I think we need to put some emphasis on this in the documentation.
Oh, and could you increase the timeout to to 60_s
. If it still fails with such a long timeout, then there is something extremely fishy.
It seems like HEAD might have solved the ipc queue and this cleanup issue , let us run it a while longer, to be sure
Yeah, I somehow have the feeling that you still had a version which sent the heartbeat via the unix domain socket.
I managed to get this today again on f00cfe05c0048301bb72ade454771891696ee24c (same settings as above/prior, but now on clang 18.1.4);
^[[0;90m2024-04-23 11:44:43.001 ^[[0;1;93m[Warn ]^[[m: Application X not responding (last response 1550 milliseconds ago) --> removing it ^[[0;90m2024-04-23 11:44:43.006 ^[[0;1;31;103m[Error]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_hoofs/posix/sync/source/mutex.cpp:325 { expected<void, MutexLockError> iox::mutex::lock() -> iox_pthread_mutex_lock } ::: [ 130 ] Owner died ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;31;103m[Error]^[[m: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [Fatal Error] [POPO__CHUNK_LOCKING_ERROR (code = 61)] in module [iceoryx_posh (id = 2)] ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [PANIC]
I'll try setting the timeout to a minute
we still encounter it with the 1min delay (I let the client process saturate its core doing mostly iceoryx "server" req/reps then kill -9);
2024-04-23 16:10:35.281 [Warn ]: Application X not responding (last response 1501 milliseconds ago) --> removing it 2024-04-23 16:10:35.286 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_hoofs/posix/sync/source/mutex.cpp:325 { expected<void, MutexLockError> iox::mutex::lock() -> iox_pthread_mutex_lock } ::: [ 130 ] Owner died 2024-04-23 16:10:35.286 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() 2024-04-23 16:10:35.286 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. 2024-04-23 16:10:35.286 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [Fatal Error] [POPO__CHUNK_LOCKING_ERROR (code = 61)] in module [iceoryx_posh (id = 2)] 2024-04-23 16:10:35.286 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [PANIC]
The log does not fit the 1minute timeout. It still terminates the app after 1.5 seconds.
1 minute timeout is set/compiled however. No idea why this happens..
constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 200 * PROCESS_KEEP_ALIVE_INTERVAL;
Can you set it to 60_s
?
Oh, since you are using the main branch, can you also run RouDi with -x commitId
to ensure that all applications use the same iceoryx version. If you want to be more strict you could also deactivate reproducible builds and use -x buildDate
. The default is patch
which allows all applications with the same patch level to connect to RouDi.
@niclar Have you managed to solve the issue? If so, how? We are getting the same exact problem, currently trying the options listed here but no luck so far. Will update to latest later tonight and let it run for a few days. Currently running: f602986bc77fa9b73aefa9a86dc0af15fa754700
Having the same issue here aswell, on commit f6dc4e1 (latest). @niclar @elBoberido any way to make it less frequent or solve it completely?
@rbdm-qnt if the monitoring is turned off, then this will not happen but in the case the application crashes, the resources are only cleaned up when RouDi shuts down or when the application re-registers.
Can you try the following changes in iceoryx_posh_types.hpp
-- constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL;
++ constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 60_s;
My guess is that this issue is related to jumps in the system time like mentioned further above. I was just able to trigger the issue by changing the system time with timedatectl
. I think the problem is that we rely on the timestamp from the last heartbeat of the application. If there are jumps in the system time, the heartbeat thread of the application as well as the monitoring thread of RouDi will be woken up and now there is a race between the application sending the heartbeat with the updated timestamp and RouDi checking for the liveliness of the applications. If RouDi wins and the jump was into the future, it assumes that the application did not send the heartbeat for too long and therefore is in a degraded state.
Please check if your system time jumps and if increasing the PROCESS_KEEP_ALIVE_TIMEOUT
timeout reduces the frequency.
The proper solution would probably be to use a counter instead of a timestamp and then we could have a heuristic like when the counter was not incremented for 5 consecutive times RouDi checks for the liveliness, the application can be considered to be in a degraded state.
@rbdm-qnt @PMSRGroup @niclar did you have a chance to check whether my theory with the jumps in system time is correct and the workaround works?
@elBoberido Can the timestamp used by the heartbeat be the system boot time, so that it is not affected by the system time
@passchaos are you also experiencing this issue?
I just had a look at the code and we are already using CLOCK_MONOTONIC
, which is related to the boot time on linux. Unfortunately this is also affected by adjustments performed by adjtime
and NTP
. There is CLOCK_MONOTONIC_RAW
which should not be affected but I can also trigger the error with that clock source by performing these steps.
sudo timedatectl set-ntp false
sudo timedatectl set-time 18:30:00
sudo timedatectl set-true
At least in one of three runs, RouDi assumes the application did not send the heartbeat.
The workaround would be to increase the PROCESS_KEEP_ALIVE_TIMEOUT
to a value larger than the jumps in the clock.
A proper fix would be to rework the heartbeat handling and I also have an idea how to do it but I'm currently quite busy with other topics.
@elBoberido I have found the cause of the problem, and it is not roudi's problem, but the problem of the sender sending the keepalive signal
The code https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_hoofs/time/source/duration.cpp#L55 here use the CLOCK_REALTIME. On the other hand, older version, such as v2.0.6, also use CLOCK_REALTIME in the timer sources, see: https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L243 and https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L436
@passchaos the code in duration.cpp
relates only to conversions from Duration
to timespec
and is not related to the heartbeat.
The heartbeat uses CLOCK_MONOTONIC
https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_posh/source/runtime/heartbeat.cpp#L50-L60
@elBoberido Well, I know this is the logic of the heartbeat, and each process involved in the communication sends a Keepalive signal to roudi, but because each process uses a timer to send a Keepalive signal, the timer uses a time source that is not monotonic (see my previous link). Therefore, if the system time is set back to the previous time, the timer considers that the time is not up and does not send a Keepalive signal to roudi. When I change to monotonic, the loss of heartbeat due to time adjustment no longer occurs.
@passchaos ah, right. The thread is waits on a semaphore and that causes the issue. The problem is that sem_timedwait
requires CLOCK_REALTIME
, see also https://www.man7.org/linux/man-pages/man3/sem_timedwait.3p.html.
Have you checked whether one of your cores runs at 100% after the change? I assume that the heartbeat thread does not sleep anymore and returns immediately from the sem_timedwait
call.
@elBoberido Yeah, I have also noticed this problem, it is a sad thing
This issue is showing for me 7/10 times upon stopping publisher process.
How to recreate?
signal(SIGINT, SignalHandler);
Hopefully this helps you recreate/debug this on your end.
On my end, I ran gdb on the pub process with -exec handle SIGINT nostop
& -exec handle SIGINT pass
, a breakpoint on the exit(sigint);
and called pkill -SIGINT publisher
in a separate terminal. I noticed:
How do I stop/interrupt the publisher thread gracefully? It seems like just letting the de-constructor call is not enough. How do I stop the KeepAlive thread gracefully?
Not advocating a quick fix, but I wonder if we can put an atomic bool somewhere in the publish thread that we can call to abruptly stop what it's doing, and tear down.
Note: This test was done on commit: b2cd72bdc789bcf7601cb112c6078c47d533d798 which is not latest. But the same issue persists on latest.
@hrudhansh I think the root of your problem is a different one than from this issue. It just manifests in the same error code. Can you please create a separate issue for your problem?
@niclar @PMSRGroup @rbdm-qnt @passchaos shall we schedule a community meeting to discuss the problem, solution approaches and how to distribute the work?
Maybe I'm interested as well. I haven't enabled monitoring but I encounter probably the same error case when shutting down RouDi - likely because some process has crashed at some point while holding the mutex.
Hi, we just encountered a fatal crash in roudi (first time on b18411364d1cac72422c53a4b254fad0c9061204 since it came out), roudi logs below.
Maybe an iceoryx "client" process died prior, but no iceoryx logging in them or the syslog indicates this. And this should not crash roudi of course.. -Do you maybe have any pointers as to why we encountered this and how it should be remedied ?
(We're running 21 iceoryx client processes on the box, pinned and isolated, ~1500 publishers, and roudi on highest prio)
Thanks
Required information
Operating system: Ubuntu 22.04.2 LTS
Compiler version: Clang 17.0.5
Eclipse iceoryx version: b18411364d1cac72422c53a4b254fad0c9061204 (v2.0.x, 20231121)
Observed result or behaviour:
Edit by elBoberido:
Related issues
325
2260