cms-gem-daq-project / xhal

XHAL interface library
0 stars 10 forks source link

memhub can "deadlock" when killed #130

Open lmoureaux opened 5 years ago

lmoureaux commented 5 years ago

Brief summary of issue

memhub uses semaphores to prevent concurrent access to the CTP7 memory. It tries hard to avoid leaving active semaphores behind, but all these efforts are moot if the process gets killed. This is a caveat of semaphores themselves.

Types of issue

Expected Behavior

A dying process should release all resources it holds.

Current Behavior

When a process gets killed (SIGKILL) in the middle of a memhub call, it leaves behind an active semaphore. The next process trying to use memhub gets stuck.

Steps to Reproduce (for bugs)

  1. Introduce a delay in memhub_read right after the call to memsvc_read
  2. Call memsvc_read (eg through an RPC call)
  3. While the process is waiting, kill -9 it
  4. Call memsvc_read again

Possible Solution

Since this is caused by a caveat in the semaphores API, one has to find another way to synchronize multiple processes. There are basically two ways apart from semaphores:

Context

Want to avoid a CTP7 getting stuck and requiring manual intervention.

Your Environment

lpetre-ulb commented 5 years ago
* Use a `pthread` mutex in a shared memory region. This would be hard to get right for our use case.

Except the race condition at creation, robust POSIX mutexes in shared memory should do the job, no?

* Create a temporary file (eg `/tmp/memhub.lock`) and use the [advisory locking API](https://linux.die.net/man/2/fcntl). Advisory file locks are released when a process is killed. I have a working prototype.

Is there a possibility to implement a timeout for the lock acquisition? I don't like calls that can block forever... I think implementing a timeout for fnctl possible with signals, but, as always with signals, it not the safest way.

A dying process should release all resources it holds.

While I usually agree with that principle is it the best thing to do in our case? I mean, if a process terminates unexpectedly what guarantee can we have about the hardware state? Isn't it better to forbid any access to the system?

Of course, there are many ways to deal with the issue at higher levels in the software stack, but in these cases, there must be a good cooperation between the different processes accessing the hardware.

lmoureaux commented 5 years ago

Except the race condition at creation, robust POSIX mutexes in shared memory should do the job, no?

Setting them up seems a bit harder.

Is there a possibility to implement a timeout for the lock acquisition?

Active polling with F_SETLK seems to be the only signal-free way.

jsturdy commented 5 years ago

Has this been shown to be an issue anywhere yet? Or is this as yet an academic exercise?

Is there a possibility to implement a timeout for the lock acquisition? I don't like calls that can block forever...

Indeed, the initial incantation we implemented to try to solve the bus collisions involved a lock file and a timeout. @evka85 then implemented the version with semaphores

evka85 commented 5 years ago

Yeah I think the lock file approach had some issues, but I can't remember what it was. Indeed if you kill it with -9 (SIGKILL), it will leave any active semaphores hanging, though normally we should just use SIGTERM, which should exit cleanly (this was tested quite extensively)

lmoureaux commented 5 years ago

Has this been shown to be an issue anywhere yet? Or is this as yet an academic exercise?

I'm not aware of this issue happening in production, but it's a race condition involving a rare signal (SIGKILL) so it's rather unlikely.

normally we should just use SIGTERM

To be specific, I'm afraid of the kernel sending SIGKILL in out-of-memory conditions.

evka85 commented 5 years ago

Ok I see, yes the kill from kernel of course is possible, but that is just an indication of something else going terribly wrong, and I think we should instead the root cause, which is running out of memory.

I know that the rw_reg library currently isn't really using memory efficiently, and running multiple instances of it may result in out-of-memory condition. We have discussed possible ways to change the rw_reg to reduce the memory footprint: mainly the memory is wasted by creating duplicate nodes with shifted addresses coming from the generated nodes in the xml file. There are a couple of ways to go around that -- one way would be to use shared memory that contains the register addresses among all processes, this is already done e.g. in the RPC service by using LMDB, and so python tools could also use that; another way would be just to optimize the way that generated nodes are stored in memory by storing just the root node and the generate parameters, and uppon request for any specific generated node from the user, the address would be calculated dynamically.

We should probably open a new issue on that..

lmoureaux commented 5 years ago

Ok I see, yes the kill from kernel of course is possible, but that is just an indication of something else going terribly wrong, and I think we should instead the root cause, which is running out of memory.

I fail to understand how out-of-memory conditions are different from SIGSEGV (segmentation violation), SIGILL (illegal instruction) or SIGABRT (abort), which all leave the semaphore in a consistent state. What's the rationale for treating these errors differently?

evka85 commented 5 years ago

Sorry, I didn't quite understand the question.. We are handling all the signals that we can, but in case the kernel out-of-memory killer kicks in, it just sends SIGKILL, which we can't catch. If this is really a problem in normal operation, we could try to work around it by monitoring the memory situation, and sending out some other signal before the out-of-memory killer kicks in. There may be other ways of registering for some kind of notifications from kernel, I don't know..

evka85 commented 5 years ago

I think I did the signal handling mostly to clean up when the user presses ctrl+c which results in SIGINT, but I added whatever other signals I could just in case.