_fmutex operation failed: LIBC Heap request in Chromium

dmik commented 3 years ago

Sometimes with multi-process mode enabled in Chromium (https://github.com/bitwiseworks/qtwebengine-chromium-os2/issues/12), I get this from one of the processes:

_fmutex operation failed:  LIBC Heap request

Killed by SIGABRT
pid=0x6765 ppid=0x6763 tid=0x0001 slot=0x00c4 pri=0x0200 mc=0x0001 ps=0x0010
D:\CODING\QT5\QT5-DEV-BUILD\QTBASE\LIBEXEC\QTWEBENGINEPROCESS.EXE
Creating 6765_01.TRP

The stack trace is as this:

 Trap  ->  1D2F8D97   LIBCN0    0001:000F8D97  b_panic.c#506 ___libc_Back_panicV + 60B 0001:000F878C (b_panic.obj)
 0082CD78  1D2FA193   LIBCN0    0001:000FA193  b_panic.c#150 ___libc_Back_panic + 26 0001:000FA16D (b_panic.obj)
 0082CD98  1D24770C   LIBCN0    0001:0004770C  signals.c#2028 ___libc_back_ghevWait + 1081C 0003:00036EF0 (signals.obj)
 0082CDB8  1D249DE3   LIBCN0    0001:00049DE3  signals.c#1812 ___libc_back_signalAccept + 682 0001:00049761 (signals.obj)
 0082CEC8  1D24A0FA   LIBCN0    0001:0004A0FA  signals.c#966 ___libc_Back_signalRaise + 129 0001:00049FD1 (signals.obj)
 0082CF68  1D24B1D6   LIBCN0    0001:0004B1D6  b_signalSendPid.c#82 ___libc_Back_signalSendPid + 46 0001:0004B190 (b_signalSendPid.obj)
 0082CF88  1D277512   LIBCN0    0001:00077512  kill.c#76 __std_kill + 22 0001:000774F0 (kill.obj)
 0082CFA8  1D280C2A   LIBCN0    0001:00080C2A  raise.c#56 __std_raise + 1A 0001:00080C10 (raise.obj)
 0082CFC8  1D26D39C   LIBCN0    0001:0006D39C  abort.c#62 __std_abort + C4 0001:0006D2D8 (abort.obj)
 0082D008  1D21758D   LIBCN0    0001:0001758D  fmutex2.c#75 __fmutex_abort + C9 0001:000174C4 (D:\Coding\libc\master\src\emx\src\lib\process\fmutex2.c)
 0082D048  1D20D62D   LIBCN0    0001:0000D62D  fmutex.h#147 __umalloc + 119 0001:0000D514 (D:\Coding\libc\master\src\emx\src\lib\malloc\umalloc.c)
 0082D078  1D218547   LIBCN0    0001:00018547  malloc.c#25 __std_malloc + 27 0001:00018520 (D:\Coding\libc\master\src\emx\src\lib\malloc\malloc.c)
 0082D098  1D1002DC   STDCPP6   0001:000102DC  new_op.cc#50 __Znwj + 1C 0001:000102C0 (D:\Users\dmik\rpmbuild\BUILD\gcc-os2-gcc-9_2_0-release-os2-b3\libstdc++-v3\libsupc++\new_op.cc)
 0082D0B8  B8446A85   QT5WEBC   0001:019E6A85   __ZNSt11_Deque_baseIN2v88internal15TranslatedState14ObjectPositionESaIS3_EE17_M_initialize_mapEj$w$9ZYDaCnZRN7Ug3tT0 + 35 0001:019E6A50 (v8_base_jumbo_25.o)
 0082D128  B847EE26   QT5WEBC   0001:01A1EE26   __ZNK2v88internal14OptimizedFrame9SummarizeEPSt6vectorINS0_12FrameSummaryESaIS3_EE + A6 0001:01A1ED80 (v8_base_jumbo_27.o)
 0082D238  B808A8B8   QT5WEBC   0001:0162A8B8   __ZN2v88internal10FindCallerEPNS0_7IsolateENS0_6HandleINS0_10JSFunctionEEE + 98 0001:0162A820 (v8_base_jumbo_1.o)
 200E9E3C  26A45FED   *Unknown*
 28E493FD  5528E31D   Invalid address: 5528E31D

The "LIBC Heap" mutex is used to serialize access to the LIBC heap. From what I can tell, _um_heap_lock (h) is called in _umalloc which in turn calls _fmutex_checked_request (&h->fsem, _FMR_IGNINT) which calls _fmutex_request with the same arguments which returns non-zero (an error) which causes _fmutex_abort to be called.

I don't have a 100% proof (since that I don't have full LIBC logging in this run) but it really smells like DosWaitEventSem returning ERROR_INTERRUPT while performing a LIBCx handle transfer (see bitwiseworks/libcx#87).

dmik commented 3 years ago

DosWaitEventSem is called in __fmutex_request_internal and it supports ignoring ERROR_INTERRUPT via a _FMR_IGNINT flag (which is used in case of _um_heap_lock) by retrying the wait cycle. When the flag is present, ERROR_INTERRUPT is replaced with ERROR_TIMEOUT which in turn is the only condition to retry the loop (if the wait succeeds or fails with any other error, the cycle is aborted and the result is returned to the caller). ERROR_TIMEOUT is there because I guess that the wait cycle was originally designed to spin in 3 second wait intervals in order to implement owner death (and a subsequent deadlock) detection.

The owner check begins with checking if the mutex is owned (if its owner field is not 0) in the end of the wait loop. The code obviously assumes that if there is no owner but DosWaitEventSem returns ERROR_TIMEOUT, something went really wrong and bails out with a non-zero error code (ERROR_TIMEOUT) indicating that ownership was not transferred. In our case it's most likely ERROR_INTERRUPT which causes this behavior but nevertheless a failure is returned and _fmutex_abort gets triggered.

The actual problem here is that the assumption the code makes about the owner is generally not right. If the owner is zero after ERROR_TIMEOUT (and ERROR_INTERRUPT), it doesn't necessarily means that something went wrong. It could be that the mutex owner simply released the mutex (and reset the owner field to 0) the next moment (time slice) after DosWaitEventSem returned those errors. I.e. the mutex request code should simply gain ownership (set itself as the mutex owner) and return success.

The reason why it could not be seen earlier is because it's just really unlikely that the above scenario will happen exactly after 3 seconds of unsuccessful waiting (where ERROR_TIMEOUT is returned). However, breaking the wait with ERROR_INTERRUPT may happen much more frequently than every 3 seconds in Chromium (every few milliseconds at some stages!) which makes this scenario much more likely.

I will try to fix the code and see if the error is gone.

dmik commented 3 years ago

Note that using _fmutex in pthread (see https://github.com/bitwiseworks/pthread-os2/issues/9) made this problem even more frequent. And the solution I'm going to commit makes it go away according to my testing.

bitwiseworks / libc

_fmutex operation failed: LIBC Heap request in Chromium #97