eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0
1.57k stars 373 forks source link

mutex owner died -> POPO__CHUNK_LOCKING_ERROR #2193

Open niclar opened 4 months ago

niclar commented 4 months ago

Hi, we just encountered a fatal crash in roudi (first time on b18411364d1cac72422c53a4b254fad0c9061204 since it came out), roudi logs below.

Maybe an iceoryx "client" process died prior, but no iceoryx logging in them or the syslog indicates this. And this should not crash roudi of course.. -Do you maybe have any pointers as to why we encountered this and how it should be remedied ?

(We're running 21 iceoryx client processes on the box, pinned and isolated, ~1500 publishers, and roudi on highest prio)

Thanks

Required information

Operating system: Ubuntu 22.04.2 LTS

Compiler version: Clang 17.0.5

Eclipse iceoryx version: b18411364d1cac72422c53a4b254fad0c9061204 (v2.0.x, 20231121)

Observed result or behaviour:

^[[0;90m2024-02-14 07:06:02.846 ^[[0;1;31;103m[Error]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-f53d20e6d8.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock }  :::  [ 130 ]  Owner died
^[[0;90m2024-02-14 07:06:02.846 ^[[0;1;31;103m[Error]^[[m: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
^[[0;90m2024-02-14 07:06:02.846 ^[[0;1;97;41m[Fatal]^[[m: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
^[[0;90m2024-02-14 07:06:02.846 ^[[0;1;31;103m[Error]^[[m: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
libc++abi: terminating

Edit by elBoberido:

Related issues

elBoberido commented 4 months ago

@niclar this is related to #325. For the history feature there is a mutex involved. RouDi locks the mutex and can insert the samples into the queue without the interference of the publisher. This mutex is only contested when subscribers are added or removed but nevertheless, the publisher holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem but unfortunately, the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix the problem. Other things came up and the issue never really happened in our setup.

This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like Application foo not responding (last response 1509 milliseconds ago) --> removing it. If the monitoring is turned off, then RouDi does not detect the dead application immediately but only when the application tries to re-register. But then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.

niclar commented 4 months ago

Looking at the logs again it looks like roudi incorrectly assumed 4 processes died prior. Don't know why it would assume this though. -Is there a starvation issue in the detection ?

elBoberido commented 4 months ago

@niclar this is related to #1361. I already improved the situation on master

niclar commented 4 months ago

-Looks like that fix is/was (2023-11-01) already in the code running (2023-11-23)

elBoberido commented 4 months ago

Oh, I just saw v2.0.x and thought it came from the release branch.

Is the system time changed abruptly so that jumps can occur?

Other scenarios that lead to this error might be when the threads which update the heartbeat value are not scheduled.

A workaround might be to massively increase the PROCESS_KEEP_ALIVE_TIMEOUT. Unfortunately it can currently not be changed via a cmake paremeter but that would not be difficult to add.

niclar commented 4 months ago

These thread(s) are part of the publisher(s) & not roudi right ? -I reckon they might have been starved in this instance. Some deadline monotonic scheduling or similar would be nice to be able to jack in here..

elBoberido commented 4 months ago

It is a thread of the runtime. I think it should not be too difficult to add a updateHeartbeat method to the runtime. One could then call it manually.

niclar commented 4 months ago

This morning we encountered this fatal error again during (post) the morning restart routine of some iceoryx client processes.

elBoberido commented 4 months ago

Do you have more information, e.g. was the load during that time quite high?

niclar commented 3 months ago

cpu core utilization is low overall with roudi running on a dedicated core interrupt isolated. we received this anew the other day seemingly "spontanously" a few hours after a mid day restart of ~20 roudi client processes.

elBoberido commented 3 months ago

Do you have some logs of spikes in the CPU utilization when the issue occurred?

niclar commented 3 months ago

...we've added core utilization logging now for next time it happens...

elfenpiff commented 3 months ago

@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history? When on connecting a subscriber to a publisher no history has to be delivered, there should be also no need for a mutex.

@niclar Do you require the history when a subscriber connects?

Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible) but then it is possible that a user receives the same sample twice. But only when connecting to a new publisher (and this can maybe filtered out with sequence numbers?!)

niclar commented 3 months ago

@elfenpiff, good news, we do not. All pub/sub are instantiated with historyCapacity=0 and historyRequest=0 respectively.

elBoberido commented 3 months ago

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

elfenpiff commented 3 months ago

@elBoberido

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

This we can mitigate with turning monitoring off. But a better alternative would be to have a liveliness QoS that enforces some contracts like that a subscriber as to collect a sample latest after X seconds and the publisher must publisher every Y seconds a sample. But this would require some time-consuming refactoring.

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Yes, you are right. But this could be solved by bringing loffli into the play and a lock-free optional where the atomic signals that the thing is set or not. This is also some major refactoring but I think it is solvable.

elBoberido commented 3 months ago

Well, the monitoring was explicitly turned on as far as I know :sweat_smile:

elBoberido commented 2 months ago

@niclar did you encounter this problem again?

In the meantime I'm a bit hesitant on exposing the heartbeat triggering as public API. An alternative would be to increase the timeout for the detection of died processes. Currently it is about 1.5 seconds and the heartbeat should be sent every 300ms. If this happens again, can you check the RouDi log output for messages like Application foo not responding (last response #### milliseconds ago) --> removing it and report the amount of milliseconds?

niclar commented 2 months ago

As a matter of fact we did friday evening;

^[[0;90m2024-04-12 18:35:37.238 ^[[0;1;93m[Warn ]^[[m: Application marketview.prd not responding (last response 1511 milliseconds ago) --> removing it ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-bf50be8845.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;97;41m[Fatal]^[[m: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. ^[[0;90m2024-04-12 18:35:37.306 ^[[0;1;31;103m[Error]^[[m: ICEORYX error! POPO__CHUNK_LOCKING_ERROR libc++abi: terminating

elBoberido commented 2 months ago

Okay, the deadline was missed only by a little bit. For some reasons the runtime thread which sends the heartbeat was not triggered for a long time. Is this on Windows or on Linux?

As a quick workaround I would suggest to try to change the 5 in this line in iceoryx_posh_types.hpp to 10

constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL; // > PROCESS_KEEP_ALIVE_INTERVAL

It does not fix the problem but should make it much more unrealistic. You could increase the time even more. With e.g. 200 it would be one minute until RouDi would assume a process is dead. The 1.5 seconds came from a time where RouDi did not clean up the resources when a crashed application tried to re-register and therefore we kept it low in order to not have to wait to long. But now it could be indeed set to 1 minute which would not be too bad to clean the remainders of a crashed process up but should virtually prevent killing running applications.

The more I think about it, the more I come to the conclusion to just increase this timeout to eliminate this source for leaving a corrupt mutex behind.

niclar commented 2 months ago

This is on linux, dedicated cores. -I applied the workaround and will get back if we encounter it again.

Thanks!

niclar commented 2 months ago

Still encountering the issue

2024-04-22 11:32:58.571 [Warn ]: Application X not responding (last response 3046 milliseconds ago) --> removing it 2024-04-22 11:32:58.579 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-2ee4ec5b88.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died 2024-04-22 11:32:58.579 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() 2024-04-22 11:32:58.579 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. 2024-04-22 11:32:58.579 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR

elBoberido commented 2 months ago

This is weird, especially in combination with the high socket usage from #2268.

Just to exclude other possibilities. You are using the plain upstream version from the commit mentioned in the description and not a custom branch with that commit cherry-picked?

niclar commented 2 months ago

It might very well be related to the IPC queues yes.

-Yes, the only difference being the attached "patches"; increased_process_keep_alive_intervall.patch service_description.patch

Build options; -DINTROSPECTION=ON -DIOX_MAX_PUBLISHERS=4096 -DIOX_MAX_SUBSCRIBERS=4096 -DIOX_MAX_SUBSCRIBERS_PER_PUBLISHER=64 -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16 -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=256 -DIOX_MAX_CLIENTS_PER_SERVER=512 -DIOX_MAX_NUMBER_OF_NOTIFIERS=512

Roudi conf;

iox::config::CmdLineArgs_t cmdLineArgs; cmdLineArgs.run = conf.net.enabled; cmdLineArgs.logLevel = iox::log::LogLevel::INFO; cmdLineArgs.monitoringMode = iox::roudi::MonitoringMode::ON; cmdLineArgs.compatibilityCheckLevel = iox::version::CompatibilityCheckLevel::PATCH; //cmdLineArgs.uniqueRouDiId cmdLineArgs.processTerminationDelay = iox::roudi::PROCESS_DEFAULT_TERMINATION_DELAY; cmdLineArgs.processKillDelay = iox::roudi::PROCESS_DEFAULT_KILL_DELAY; //cmdLineArgs.configFilePath;

niclar commented 2 months ago

We'll update to head in a day or two and see how that plays out...

elBoberido commented 2 months ago

Those patches should not have any negative effect on the heartbeat. Do you run the iceoryx tests after the build? Since you changed some of the compile time constants it is strongly recommended to run the tests. I think we need to put some emphasis on this in the documentation.

Oh, and could you increase the timeout to to 60_s. If it still fails with such a long timeout, then there is something extremely fishy.

niclar commented 2 months ago

It seems like HEAD might have solved the ipc queue and this cleanup issue , let us run it a while longer, to be sure

elBoberido commented 2 months ago

Yeah, I somehow have the feeling that you still had a version which sent the heartbeat via the unix domain socket.

niclar commented 2 months ago

I managed to get this today again on f00cfe05c0048301bb72ade454771891696ee24c (same settings as above/prior, but now on clang 18.1.4);

^[[0;90m2024-04-23 11:44:43.001 ^[[0;1;93m[Warn ]^[[m: Application X not responding (last response 1550 milliseconds ago) --> removing it ^[[0;90m2024-04-23 11:44:43.006 ^[[0;1;31;103m[Error]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_hoofs/posix/sync/source/mutex.cpp:325 { expected<void, MutexLockError> iox::mutex::lock() -> iox_pthread_mutex_lock } ::: [ 130 ] Owner died ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;31;103m[Error]^[[m: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [Fatal Error] [POPO__CHUNK_LOCKING_ERROR (code = 61)] in module [iceoryx_posh (id = 2)] ^[[0;90m2024-04-23 11:44:43.007 ^[[0;1;97;41m[Fatal]^[[m: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-7813eef5f0.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [PANIC]

niclar commented 2 months ago

I'll try setting the timeout to a minute

niclar commented 2 months ago

we still encounter it with the 1min delay (I let the client process saturate its core doing mostly iceoryx "server" req/reps then kill -9);

2024-04-23 16:10:35.281 [Warn ]: Application X not responding (last response 1501 milliseconds ago) --> removing it 2024-04-23 16:10:35.286 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_hoofs/posix/sync/source/mutex.cpp:325 { expected<void, MutexLockError> iox::mutex::lock() -> iox_pthread_mutex_lock } ::: [ 130 ] Owner died 2024-04-23 16:10:35.286 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent() 2024-04-23 16:10:35.286 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application. 2024-04-23 16:10:35.286 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [Fatal Error] [POPO__CHUNK_LOCKING_ERROR (code = 61)] in module [iceoryx_posh (id = 2)] 2024-04-23 16:10:35.286 [Fatal]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/91696ee24c-11145aa309.clean/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:42 [PANIC]

elBoberido commented 2 months ago

The log does not fit the 1minute timeout. It still terminates the app after 1.5 seconds.

niclar commented 2 months ago

1 minute timeout is set/compiled however. No idea why this happens..

constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 200 * PROCESS_KEEP_ALIVE_INTERVAL;

elBoberido commented 2 months ago

Can you set it to 60_s?

elBoberido commented 2 months ago

Oh, since you are using the main branch, can you also run RouDi with -x commitId to ensure that all applications use the same iceoryx version. If you want to be more strict you could also deactivate reproducible builds and use -x buildDate. The default is patch which allows all applications with the same patch level to connect to RouDi.

PMSRGroup commented 1 month ago

@niclar Have you managed to solve the issue? If so, how? We are getting the same exact problem, currently trying the options listed here but no luck so far. Will update to latest later tonight and let it run for a few days. Currently running: f602986bc77fa9b73aefa9a86dc0af15fa754700

rbdm-qnt commented 1 month ago

Having the same issue here aswell, on commit f6dc4e1 (latest). @niclar @elBoberido any way to make it less frequent or solve it completely?

elBoberido commented 1 month ago

@rbdm-qnt if the monitoring is turned off, then this will not happen but in the case the application crashes, the resources are only cleaned up when RouDi shuts down or when the application re-registers.

Can you try the following changes in iceoryx_posh_types.hpp

-- constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL;
++ constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 60_s;

My guess is that this issue is related to jumps in the system time like mentioned further above. I was just able to trigger the issue by changing the system time with timedatectl. I think the problem is that we rely on the timestamp from the last heartbeat of the application. If there are jumps in the system time, the heartbeat thread of the application as well as the monitoring thread of RouDi will be woken up and now there is a race between the application sending the heartbeat with the updated timestamp and RouDi checking for the liveliness of the applications. If RouDi wins and the jump was into the future, it assumes that the application did not send the heartbeat for too long and therefore is in a degraded state.

Please check if your system time jumps and if increasing the PROCESS_KEEP_ALIVE_TIMEOUT timeout reduces the frequency.

The proper solution would probably be to use a counter instead of a timestamp and then we could have a heuristic like when the counter was not incremented for 5 consecutive times RouDi checks for the liveliness, the application can be considered to be in a degraded state.

elBoberido commented 1 month ago

@rbdm-qnt @PMSRGroup @niclar did you have a chance to check whether my theory with the jumps in system time is correct and the workaround works?

passchaos commented 3 weeks ago

@elBoberido Can the timestamp used by the heartbeat be the system boot time, so that it is not affected by the system time

elBoberido commented 3 weeks ago

@passchaos are you also experiencing this issue?

I just had a look at the code and we are already using CLOCK_MONOTONIC, which is related to the boot time on linux. Unfortunately this is also affected by adjustments performed by adjtime and NTP. There is CLOCK_MONOTONIC_RAW which should not be affected but I can also trigger the error with that clock source by performing these steps.

sudo timedatectl set-ntp false
sudo timedatectl set-time 18:30:00
sudo timedatectl set-true

At least in one of three runs, RouDi assumes the application did not send the heartbeat.

The workaround would be to increase the PROCESS_KEEP_ALIVE_TIMEOUT to a value larger than the jumps in the clock.

A proper fix would be to rework the heartbeat handling and I also have an idea how to do it but I'm currently quite busy with other topics.

passchaos commented 3 weeks ago

@elBoberido I have found the cause of the problem, and it is not roudi's problem, but the problem of the sender sending the keepalive signal

The code https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_hoofs/time/source/duration.cpp#L55 here use the CLOCK_REALTIME. On the other hand, older version, such as v2.0.6, also use CLOCK_REALTIME in the timer sources, see: https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L243 and https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L436

elBoberido commented 3 weeks ago

@passchaos the code in duration.cpp relates only to conversions from Duration to timespec and is not related to the heartbeat.

The heartbeat uses CLOCK_MONOTONIC https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_posh/source/runtime/heartbeat.cpp#L50-L60

passchaos commented 3 weeks ago

@elBoberido Well, I know this is the logic of the heartbeat, and each process involved in the communication sends a Keepalive signal to roudi, but because each process uses a timer to send a Keepalive signal, the timer uses a time source that is not monotonic (see my previous link). Therefore, if the system time is set back to the previous time, the timer considers that the time is not up and does not send a Keepalive signal to roudi. When I change to monotonic, the loss of heartbeat due to time adjustment no longer occurs.

elBoberido commented 3 weeks ago

@passchaos ah, right. The thread is waits on a semaphore and that causes the issue. The problem is that sem_timedwait requires CLOCK_REALTIME, see also https://www.man7.org/linux/man-pages/man3/sem_timedwait.3p.html.

Have you checked whether one of your cores runs at 100% after the change? I assume that the heartbeat thread does not sleep anymore and returns immediately from the sem_timedwait call.

passchaos commented 3 weeks ago

@elBoberido Yeah, I have also noticed this problem, it is a sad thing

hrudhansh commented 3 weeks ago

This issue is showing for me 7/10 times upon stopping publisher process.

How to recreate?

  1. Run a pub-sub with no delay in between publishes.
  2. Register a SIGINT signal handler in your main like signal(SIGINT, SignalHandler);
  3. Upon ctrl+c on the pub process, you would see it stall.
  4. Upon ctrl+c again you would see the above error in iox-roudi.

Hopefully this helps you recreate/debug this on your end.

On my end, I ran gdb on the pub process with -exec handle SIGINT nostop & -exec handle SIGINT pass, a breakpoint on the exit(sigint); and called pkill -SIGINT publisher in a separate terminal. I noticed:

How do I stop/interrupt the publisher thread gracefully? It seems like just letting the de-constructor call is not enough. How do I stop the KeepAlive thread gracefully?

Not advocating a quick fix, but I wonder if we can put an atomic bool somewhere in the publish thread that we can call to abruptly stop what it's doing, and tear down.

Note: This test was done on commit: b2cd72bdc789bcf7601cb112c6078c47d533d798 which is not latest. But the same issue persists on latest.

elBoberido commented 2 weeks ago

@hrudhansh I think the root of your problem is a different one than from this issue. It just manifests in the same error code. Can you please create a separate issue for your problem?

elBoberido commented 2 weeks ago

@niclar @PMSRGroup @rbdm-qnt @passchaos shall we schedule a community meeting to discuss the problem, solution approaches and how to distribute the work?

gpalmer-latai commented 2 weeks ago

Maybe I'm interested as well. I haven't enabled monitoring but I encounter probably the same error case when shutting down RouDi - likely because some process has crashed at some point while holding the mutex.