global_lock() assertion can fail with ERROR_INTERRUPT

StevenLevine commented 7 years ago

While testing, I got this:

git clone git://github.com/php/php-src.git .
Cloning into '.'...
remote: Counting objects: 702417, done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 702417 (delta 7), reused 4 (delta 0), pack-reused 702395
Receiving objects: 100% (702417/702417), 276.54 MiB | 148.00 KiB/s, done.
Assertion info: 95 6% (33230/544139)
Assertion failed: arc == NO_ERROR, file D:/Users/dmik/rpmbuild/BUILD/libcx-0.5.3/src/shared.c, line 470

Killed by SIGABRT
pid=0x237c ppid=0x237b tid=0x0001 slot=0x003e pri=0x0200 mc=0x0001 ps=0x0010
D:\USR\LIBEXEC\GIT-CORE\GIT.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
error: index-pack died of signal 6
fatal: index-pack failed

This might be triggered by low/fragmented memory, but often ERROR_INTERRUPT is recoverable with a retry.

dmik commented 7 years ago

Yes, as you can see from the Assertion info, you get error 95, which is indeed ERROR_INTERRUPT. However, what I see here under low-memory conditions, and in some other cases where a process currently using LIBCx dies unexpectedly, is ERROR_SEM_OWNER_DIED (105), not ERROR_INTERRUPT. And I wonder what the correct way to recover from ERROR_SEM_OWNER_DIED is. Perhaps release the semaphore and re-request it. Perhaps we should simply retry after both errors, but call DosReleaseMutexSem first in the case of ERROR_SEM_OWNER_DIED.

dmik commented 7 years ago

Well, CPREF for DosQueryMuxWaitSem mentions that you should call DosCloseMutexSem in response to ERROR_SEM_OWNER_DIED. This makes some sense.

And I now think that it makes no sense to try to recover from ERROR_SEM_OWNER_DIED in the case of LIBCx. When this happens, chances are very high that the crashing process has messed up the LIBCx shared memory area (after all, this is what the mutex guards), and there is no way to recover properly other than to completely reset the shared memory area (as we don't know what exactly is corrupted there). And resetting the shared memory area from another running process is equivalent to crashing that process as well, because it would invalidate all resources that process might already be using. The assertion we get now is effectively equivalent to a crash, but perhaps we should replace it with a warning message explaining the situation followed by termination, rather than throw an assertion (which will surely make people file bug reports due to its nature).
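To make the "warning plus termination instead of an assertion" idea concrete, here is a minimal sketch. It is not LIBCx code: the function names and the message text are hypothetical, and the error code is defined locally (105, as quoted above) so the fragment builds without the OS/2 toolkit headers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Value quoted in this thread; defined here only for the sketch. */
#define ERROR_SEM_OWNER_DIED 105

/* Hypothetical helper: if arc means the mutex owner died, write an
   explanation into buf and return 1; otherwise clear buf and return 0. */
int describe_fatal_lock_error(unsigned long arc, char *buf, size_t size)
{
    if (arc != ERROR_SEM_OWNER_DIED) {
        if (size > 0)
            buf[0] = '\0';
        return 0;
    }
    snprintf(buf, size,
             "LIBCx: global mutex owner died (error %lu); "
             "shared memory may be corrupted, terminating", arc);
    return 1;
}

/* Hypothetical replacement for the assertion: explain, then exit.
   Any other value of arc is left for the caller to handle. */
void fail_on_dead_owner(unsigned long arc)
{
    char msg[160];
    if (describe_fatal_lock_error(arc, msg, sizeof msg)) {
        fprintf(stderr, "%s\n", msg);
        exit(EXIT_FAILURE);
    }
}
```

Separating the message from the exit keeps the wording easy to check on its own; whether a plain exit is the right way to terminate here is an open assumption.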

What to do in case of ERROR_INTERRUPT is still unclear to me as I have no idea why @StevenLevine gets this, exactly.

StevenLevine commented 7 years ago

The meaning of ERROR_INTERRUPT is murky at best. I think in our case it means that the kernel had to do some other work on the otherwise blocked thread 1 and did not know how to restart the operation, so it leaves this up to the caller.

http://www.edm2.com/index.php/OS/2_signal_handling, which is a repost of http://www.howzatt.demon.co.uk/articles/07dec90.html, discusses how this can happen for signals.

I agree that there is no useful recovery for ERROR_SEM_OWNER_DIED. Whatever work the owner intended to complete while it owned the semaphore did not complete. The recommendation to close the semaphore makes sense because it allows the system resources to be recovered, but it does not fix anything.

dmik commented 7 years ago

I still wonder when ERROR_INTERRUPT is reported. I ran tons of tests over the last week and saw it only once, when I was Ctrl-Break'ing some LIBCx clients immediately after start (hard to reconstruct the exact chain of events right now). However, an attempt to reproduce this situation in a synthetic test gave no result. What I tried was to make one process hold the mutex and another process request it (all on thread 1 in both processes). Then I killed the other process both with Ctrl-Break and with kill -9 (which involves some kernel call via xf86sup.sys). I expected it to return from its DosRequestMutexSem call with ERROR_INTERRUPT, but it didn't. Instead, execution simply went to the process exit callback, leading to eventual process termination. On the other hand, this makes sense: thread 1 is responsible for process termination, and since process termination is unconditional, there is no point in returning ERROR_INTERRUPT at that stage. And I can't think of any other case where the kernel would desperately want normal (i.e. non-terminating) thread execution so that it would cause ERROR_INTERRUPT to be returned.

BTW, we've got a potential PITA here. LIBCx installs a process exit callback so that it gets a chance to terminate itself gracefully on a crash or a kill request. And this exposes a problem given that we request the LIBCx mutex there. If there is another process holding this mutex and spinning up (because of a bug or memory corruption or such) we'll get an infamous unkillable zombie stuck forever in the process exit list handler. We need to resolve that.

dmik commented 7 years ago

Ok, I figured out what could cause ERROR_INTERRUPT after reading the article Steven suggested: a signal handler that decides to continue program execution instead of ending it. But given that we don't have a signal handler in LIBCx, and LIBC (which does have pretty complex signal handler logic that involves emulation of POSIX signals) doesn't continue execution in the cases involved here, this is hardly applicable to us.

Anyway, I will leave this open to collect more info on real cases. Given that LIBCx will now always log to /var/lib/libcx whenever ERROR_INTERRUPT (or any other error) happens in global_lock() and other places, we'll see what's best to do here. Perhaps just retry in a loop like it's always done for EINTR in Posix.
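For reference, the POSIX idiom mentioned above usually looks like the following minimal sketch. `read_retry` is a hypothetical helper name used only for illustration; `read()`, `errno` and `EINTR` are standard POSIX.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: like read(), but restarts the call whenever a
   signal interrupts it before any data was transferred (EINTR). */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n; /* bytes read, 0 on end of file, or -1 on a real error */
}
```

Only EINTR is retried; every other error is returned to the caller unchanged.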

StevenLevine commented 7 years ago

5436_01.TRP.txt 5437_01.TRP.txt GIT-59c554e1-5437.log.txt GIT-59c554ea-5436.log.txt

Here are a couple more test cases. For me, it's easy to recreate: run fsck concurrently on a couple of large git repos. In this case I used Intel's acpica and the samba trunk, but the issue appears to be more a timing issue than a size issue. The failure

24cad354 01 ff 0000 Asrt: Assertion Failed!!!
24cad354 01 ff 0000 Asrt: Function:
24cad354 01 ff 0000 Asrt: File: D:/Users/dmik/rpmbuild/BUILD/libcx-0.6.0/src/shared.c
24cad354 01 ff 0000 Asrt: Line: 511
24cad354 01 ff 0000 Asrt: Expr: arc == NO_ERROR
24cad354 01 ff 0000 Asrt: 95

follows a DosReleaseMutexSem.

The attached files are for two separate occurrences.

dmik commented 5 years ago

This is pretty much obvious now. ERROR_INTERRUPT is generated when LIBC tries to deliver POSIX signals, e.g. SIGCHLD to the parent. And it's also pretty clear that we should behave exactly like POSIX behaves in similar cases: retry the interrupted wait operation until it ends with something other than ERROR_INTERRUPT.
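The retry described here can be sketched as below. This is an illustration only, not the fix that was committed: the callback type, `fake_request` and `global_lock_sketch` are hypothetical, and the return codes are defined locally (0 and 95, as quoted in this thread) so the fragment builds without OS/2 headers. In real code the callback would wrap the actual wait, such as DosRequestMutexSem.

```c
#include <assert.h>

/* Values quoted in this thread; defined here only for the sketch. */
#define NO_ERROR        0
#define ERROR_INTERRUPT 95

/* Hypothetical callback type standing in for the real wait call. */
typedef unsigned long (*request_fn)(void *ctx);

/* Repeat the wait until it ends with something other than ERROR_INTERRUPT. */
unsigned long request_until_not_interrupted(request_fn request, void *ctx)
{
    unsigned long arc;
    do {
        arc = request(ctx);
    } while (arc == ERROR_INTERRUPT);
    return arc;
}

/* Stand-in for the real wait: reports ERROR_INTERRUPT *ctx times,
   then NO_ERROR. Exists only so the loop can be exercised anywhere. */
unsigned long fake_request(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return ERROR_INTERRUPT;
    }
    return NO_ERROR;
}

/* Hypothetical shape of the lock: the assertion stays, but it can no
   longer fire on ERROR_INTERRUPT because that value never leaves the loop. */
void global_lock_sketch(request_fn request, void *ctx)
{
    unsigned long arc = request_until_not_interrupted(request, ctx);
    assert(arc == NO_ERROR);
    (void)arc; /* silences the unused warning when NDEBUG is defined */
}
```

With `int remaining = 3;`, calling `request_until_not_interrupted(fake_request, &remaining)` makes four calls in total and returns NO_ERROR.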

I already fixed it locally a couple of months ago, as I had another failing case for it which greatly annoyed me: GNU make constantly failing when building Qt over ssh. The fix has been in local testing all this time, with not a single ERROR_INTERRUPT in either make or samba since then. I will just commit it.

bitwiseworks / libcx

global_lock() assertion can fail with ERROR_INTERRUPT #39