eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0

improve process is alive detection #1361

Open elfenpiff opened 2 years ago

elfenpiff commented 2 years ago

Brief feature description

Under high CPU load, the heartbeat thread may fail to send its heartbeat within the required time frame. RouDi then cleans up all resources of the application that missed the heartbeat, which can lead to the still-running application using resources that have already been deleted.

The solution should be as efficient as possible and should, if possible, avoid context switches and sending messages. One approach could be to use getpgid, which returns the process group ID of a given pid. If the pid does not exist, the call fails. Since pids can be reused, the pid alone is not enough; if we couple it with the process runtime or creation time, we can identify a process and check whether it is still alive.
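A minimal sketch of the getpgid idea, assuming Linux and assuming the creation time is taken from field 22 (starttime) of /proc/<pid>/stat. The names ProcessIdentity, captureIdentity and isSameProcessAlive are hypothetical and not part of iceoryx:

```cpp
#include <cerrno>
#include <fstream>
#include <iterator>
#include <optional>
#include <sstream>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical identity record: the pid plus the kernel-reported start time.
struct ProcessIdentity {
    pid_t pid;
    unsigned long long startTicks; // field 22 of /proc/<pid>/stat (Linux only)
};

// Reads the start time of a process in clock ticks since boot.
std::optional<unsigned long long> readStartTicks(pid_t pid) {
    std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
    if (!stat) return std::nullopt;
    std::string content((std::istreambuf_iterator<char>(stat)),
                        std::istreambuf_iterator<char>());
    // Field 2 (comm) may contain spaces or parentheses, so skip past the last ')'.
    const auto pos = content.rfind(')');
    if (pos == std::string::npos) return std::nullopt;
    std::istringstream rest(content.substr(pos + 1));
    std::string skipped;
    for (int field = 3; field <= 21; ++field) { // skip fields 3..21
        if (!(rest >> skipped)) return std::nullopt;
    }
    unsigned long long startTicks = 0;
    if (!(rest >> startTicks)) return std::nullopt; // field 22
    return startTicks;
}

// Taken once, e.g. when the application registers.
std::optional<ProcessIdentity> captureIdentity(pid_t pid) {
    const auto start = readStartTicks(pid);
    if (!start) return std::nullopt;
    return ProcessIdentity{pid, *start};
}

// True only if the pid exists and still belongs to the same process instance.
bool isSameProcessAlive(const ProcessIdentity& id) {
    if (getpgid(id.pid) == -1 && errno == ESRCH) return false; // no such pid
    const auto start = readStartTicks(id.pid);
    return start.has_value() && *start == id.startTicks; // guards against pid reuse
}
```

The check needs no cooperation from the monitored process, so it does not depend on that process being scheduled. One caveat worth verifying before relying on it: a terminated process that has not yet been reaped by its parent may still be reported as existing.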

Relates

#1380

mossmaurice commented 2 years ago

@elfenpiff This is related to both #611 and #620. We should follow RAII for the resources of the app. I suppose a hierarchical structure as sketched in the .puml would allow easier handling of the resources in shared memory.

elfenpiff commented 2 years ago

@mossmaurice Loosely related, but the problem here is not the handling of shared memory resources.

RouDi falsely assumes that an application has died because the high CPU load prevented that application from sending its heartbeat within the required time frame.

elBoberido commented 2 years ago

@elfenpiff I think it would be possible to use a pipe or a stream socket. AFAIK, when the writing end of a pipe/stream socket gets closed, the process holding the receiving end gets a POLLHUP via poll.
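A rough sketch of the pipe variant, assuming the monitor holds the read end and the application holds the only write end. PeerState and checkPeer are hypothetical names, not iceoryx API:

```cpp
#include <poll.h>
#include <unistd.h>

enum class PeerState { Alive, HungUp, Error };

// Polls the read end of a pipe whose write end is held by the monitored process.
// timeoutMs = 0 makes this a non-blocking check.
PeerState checkPeer(int readFd, int timeoutMs) {
    pollfd pfd{};
    pfd.fd = readFd;
    pfd.events = POLLIN; // POLLHUP is always reported, it need not be requested
    const int rc = poll(&pfd, 1, timeoutMs);
    if (rc < 0) return PeerState::Error;
    if (rc == 0) return PeerState::Alive; // timeout: write end is still open
    if (pfd.revents & (POLLERR | POLLNVAL)) return PeerState::Error;
    if (pfd.revents & POLLHUP) return PeerState::HungUp; // all write ends closed
    return PeerState::Alive;
}
```

The kernel closes the write end when the process terminates for any reason, so the application needs no heartbeat thread for this check and CPU load on the application side does not affect it.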

qclzdh commented 2 years ago

Some information from my side about monitor mode: when CPU load is high, it is quite likely that the "keepalivemsg" cannot be sent to RouDi within PROCESS_KEEP_ALIVE_TIMEOUT. We use "posix::FileLock::create(runtimeName);" to check whether the process has really died.
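To illustrate the file-lock idea without guessing at the posix::FileLock interface, here is a sketch that uses plain flock(2) instead. The function names and the choice of lock path are assumptions for illustration only:

```cpp
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

enum class LockProbe { HeldByOwner, Free, Error };

// Application side: take the lock once and keep the descriptor open for the
// whole process lifetime. The kernel drops the lock when the process dies.
int acquireLifetimeLock(const std::string& path) {
    const int fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Monitor side: try to take the lock without blocking. If that fails with
// EWOULDBLOCK, some process still holds it, so the owner is alive.
LockProbe probeLifetimeLock(const std::string& path) {
    const int fd = open(path.c_str(), O_RDWR);
    if (fd < 0) return errno == ENOENT ? LockProbe::Free : LockProbe::Error;
    LockProbe result;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        flock(fd, LOCK_UN); // nobody held it: the owner is gone
        result = LockProbe::Free;
    } else {
        result = (errno == EWOULDBLOCK) ? LockProbe::HeldByOwner : LockProbe::Error;
    }
    close(fd);
    return result;
}
```

A monitor could run this probe only after a heartbeat timeout, and clean up the application's resources only when the probe reports Free.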