Make mojo IPC work - Githubissues

dmik commented 4 years ago

As shown in https://github.com/bitwiseworks/qtwebengine-os2/issues/4#issuecomment-661918653, IPC communication is broken now. Perhaps due to invalid socket usage (wrt OS/2 limitations). One of the evidences is this message from libevent:

[30360:10170368:0721/184952.148000:ERROR:broker_posix.cc(40)] Recvmsg error: Socket operation on non-socket (38)

This ticket is to track it down.

dmik commented 4 years ago

I think I tracked it down. This code is really complex and has many layers. In short, Mojo uses a socket pair on Posix (and a named pipe on Windows) to bootstrap the IPC communication. OS/2 goes the Posix way but the code that implements socket fd inheritance is missing in the OS/2 part. I will add it now.

While looking around, I also found that Chromium supports "in-process" mode where the renderer process is run within the browser process (i.e. w/o creating a helper child process). They claim that this mode is used for debugging but we might choose to use it as the main mode for the time being (to rule out potential IPC problems and missing bits). I will try it after doing the above, depending on the result. Hints for finding the respective code parts are:

A call to IsInBrowserProcess in ChildThreadImpl::Init.
A RenderThreadImpl::RenderThreadImp constructor taking InProcessChildThreadParams.

dmik commented 4 years ago

I made IPC bootsprapping work but faced another problem. Chromium wants to send the remote end of a newly created socket pair to the child process via sendmsg(SCM_RIGHTS) over the bootstrap pipe (which works now). This is known not to be supported by the OS/2 TCP/IP stack. In fact, there is no way to inject a file handle from a parent to its already existing child process. I faced this problem already when dealing with IPC in Firefox (it partly uses Chromium code there).

Windows has a similar limitation and uses native named pipes to overcome it (the child process simply connects to its end of the pipe by name which is transferred to it over the bootstrap pipe). We could use native named pipes too but OS/2 already uses the Posix code path in this place and mixing it with the native code path is not the best solution. In fact, the Posix code path provides named pipes as well -- via named sockets. And they are even used on some platforms for this sort of things. It should be easier to use them on OS/2 too instead of introducing native named pipes.

dmik commented 4 years ago

In the mean time, I've managed to run disable IPC at all. It's as simple as passing --single-process to the engine, e.g. via set QTWEBENGINE_CHROMIUM_FLAGS=--single-process. I get a crash somewhere in the Blink subsystem in this case but that's not related to this ticket. I will investigate it elsewhere.

dmik commented 3 years ago

Windows has a similar limitation and uses native named pipes to overcome it

This is not entirely true. In the BrokerHost code (that does sendmsg(SCM_RIGHTS) on Posix to send the socket as https://github.com/bitwiseworks/qtwebengine-chromium-os2/issues/12#issuecomment-663739470 says) Windows uses ::DuplicateHandle for the same purpose, not named pipes. DuplicateHandle is capable of transferring HANDLE access over process boundaries (i.e. make a particular handle accessible in another process) — something OS/2 completely lacks.

I will check how to make use of named sockets here instead.

dmik commented 3 years ago

Status update: named sockets are not actually needed here, we just don't need this BrokerHost and Broker pair at all; it looks like all we need can be done w/o the broker (it is used by Windows and Linux to obtain shared buffers from the broker process via Broker::GetWritableSharedMemoryRegion in some synchronized way but on OS/2 this doesn't seem to be the case — and macOS doesn't use it either).

Regarding passing file handles I have one solution. It will not work for all handles but will work for at least sockets and SHMEM handles from LIBCx. The thing is that both resources are global. I.e. the real OS/2 TCP/IP socket descriptor (returned by _getsockhandle from a LIBC descriptor) may be used in any process. The only thing that needs to be done there — besides passing the socket descriptor value — is calling addsockettolist in the target process to pass it ownership of the socket and removesocketfromlist in the source process to revoke its ownership. As for SHMEM, it's done the other way: the source process gives the right to use the handle to the target process so once it gets the handle it can use it.

As for regular files, we may hack that around using the actual file name (accessible from most regular file descriptors via __LIBC_FH::pszNativePath returned by __libc_FH). I.e. pass the file name to the target process where it will reopen it and get a valid file descriptor for itself.

That said, we don't need SCM_RIGHTS support. We may simply pass this additional information (socket descriptors, SHMEM handles, file names) to the other process in the socket message itself — by reserving space in the Mojo message header (similarly to macOS that does that to pass mach ports which are non-transferrable via SCM_RIGHTS either).

So far I only saw the actual attempt to pass SHMEM between processes so this is what I will try to implement first.

dmik commented 3 years ago

Latest news. I made the prototype work using the above idea when passing SHMEM around. However, there is now a problem with leaking SHMEM handles. On POSIX, the Chromium code expects that the OS takes ownership of handles when they are in transit between the source process and the target process "inside" sendmsg. On Windows, this is not needed because there is a special DuplicateHandle call (DUPLICATE_CLOSE_SOURCE) that allows to immediately (and atomically) transfer the handle to the target and close it in the source process.

On OS/2, we don't have anything like that. So if I let the in-transit handle be closed in the source process, the target process may receive the IPC message carrying the handle after this happens and won't be able to open the handle because it will have gone already. If I let this in-transit handle live, it will all work but the handle in the source process will remain forever open effectively resulting in a handle (and underlying memory) leak. The key problem here is that these messages are intended to be sent asynchronously in Chromium and there is no way for the target process to confirm handle reception. In theory, a special message might be sent out for that but this is not so simple and will require the OS/2 code to significantly diverge from the rest of the platforms which I would really like to avoid.

I could enhance SHMEM api by shmem_transfer that would handle this situation, but there are also sockets for which this won't work. And also if it turns out that we need to pass other entities (like atoms — for passing regular file descriptors) we will also face exactly the same problem as with SHMEM leaks... I need to look for a better solution.

dmik commented 3 years ago

With the LIBCx ticket done, Chromium appears to work in IPC mode (including audio output!). There are some glitches mostly at termination which are not seen in single process mode, like this:

[912:12:0205/215803.239000:FATAL:thread_restrictions.cc(116)] Check failed: !g_base_sync_primitives_disallowed.Get().Get(). Waiting on a //base sync primitive is not allowed on this thread to prevent jank and deadlock. If waiting on a //base sync primitive is unavoidable, do it within the scope of a ScopedAllowBaseSyncPrimitives. If in a test, use ScopedAllowBaseSyncPrimitivesForTesting.
[912:22:0205/215803.259000:WARNING:audio_sync_reader.cc(193)] ASR: No room in socket buffer.: Broken pipe (32)

Also sometimes I get ERROR_OUT_OF_MEMORY from DosCreateMutexSem in pthread_mutex_init for a reason not yet known. Memory footprint is as follows when it happens:

*** 0, provided 3145344 used 2506728 maxfree 40660
*** max pr 175505408 max sh 141557760
*** max hi pr 2297274368 max hi sh 1369313280

First line is LIBC heap stats. To me it looks like there is a lot of free memory so it shouldn't be just memory per se (something I saw earlier before marking the Chromium DLL for loading high — private low memory was fully exhausted in that case). Here we have 175M of low private, 141M of low shared, 2.3G of high private and 1.4G of high shared — pretty much I would say.

I have a suspicion that Chromium creates mutexes like hell so it might be that OS/2 bumps some limit there. I will check that.

dmik commented 3 years ago

When I have a single tab open, everything works in multi-process mode, but opening a second tab (which starts up a new renderer process) gives this in the release build:

[2760:1:0209/021717.216000:ERROR:platform_shared_memory_region_os2.cc(142)] shmem_get_info(0) failed: Invalid argument (22)
[2760:1:0209/021717.216000:FATAL:platform_shared_memory_region_os2.cc(27)] Check failed: CheckPlatformHandlePermissionsCorrespondToMode(handle.get(), mode, size).

Somehow the handle is not properly passed to the second process, I need to debug it.

dmik commented 3 years ago

The above commit fixes the issue. Now I can open many tabs with different web pages and each of them will have its own backing rendering process (QTWEBENGINEPROCESS.EXE) and all appears to generally work. What I don't like here is that for each of them there is also a LIBCX-SPAWN2.WRP executable which is needed to implement thread safety of extended spawn2 functions. This really looks like a waste so I will work around that (in a separate ticket).

What I also see sometimes is this:

LIBC PANIC!!
fmutex deadlock: Recursive mutex!
0x2003013c: Owner=0x2da80001 Self=0x2da80001 fs=0x3 flags=0x0 hev=0x00010004
            Desc="LIBC Heap"
pid=0x2da8 ppid=0x2da7 tid=0x0001 slot=0x00c8 pri=0x0200 mc=0x0000 ps=0x0010
D:\CODING\QT5\QT5-DEV-BUILD\QTBASE\LIBEXEC\QTWEBENGINEPROCESS.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

It means that the libc heap mutex is requested while being already held. I suspect that it might be because of the interrupt service that in theory may kick into the process at any time — even inside a malloc/new call. SO if there is another malloc call from that state, such a problem may show up. This should not happen as I made protection against that (against unexpected reentrancy) but I need to check this.

dmik commented 3 years ago

Got a strange assertion failure while testing #31:

[54998:1:0217/224348.193000:FATAL:page_allocator_internals_os2.h(156)] Check failed: static_cast<ULONG>(0) == arc (0 vs. 95)

It's DosFreeMemEx returning ERROR_INTERRUPT. Clearly, something to do with LIBCx interrupt service (bitwiseworks/libcx#87) again. Need to look at DosFreeMemEx. It should ignore interrupts there.

dmik commented 3 years ago

The above commit fixes the above FATAL assertion, was a LIBC issue related to LIBCx interrupts indeed.

dmik commented 3 years ago

With all the latest fixes, this ticket may be boldly considered done. There is a new test drop with all the new stuff, feel free to test (and report issues in separate tickets): http://rpm.netlabs.org/test/qtwebtest5_5.7z

bitwiseworks / qtwebengine-chromium-os2

Make mojo IPC work #12