There is a good README in the sources about what the Network Service is, located here.
And I was right, the network service was simply disabled in the previous Chromium build; now it's enabled, see: https://github.com/bitwiseworks/qtwebengine-chromium-os2/blob/514a09dcf89f4ffdd46c17c92d1f7d0b1c4d7e1b/chromium/services/network/public/cpp/features.cc#L25
It's also worth mentioning that it's possible to run the network service both in-process and out-of-process, with the latter being the default (except on Android), see: https://github.com/bitwiseworks/qtwebengine-chromium-os2/blob/514a09dcf89f4ffdd46c17c92d1f7d0b1c4d7e1b/chromium/content/public/common/network_service_util.cc#L43
This explains why the new Chromium now starts a utility process for it. I will try to play around with these things: disable it completely, enable it in-process, and see how it behaves.
It appears that it's impossible to turn off the network service completely now. Before, both `IsOutOfProcessNetworkService` and `IsInProcessNetworkService` could return false (if the `NetworkService` feature was disabled, which was the default), so the service was not started. Now the `NetworkService` feature is deprecated, so the service is always enabled and can only be switched between in-process and out-of-process with `--enable-features=NetworkServiceInProcess` in `QTWEBENGINE_CHROMIUM_FLAGS`.
Switching it to in-process here does make the utility process go away, but now the renderer process itself crashes. So there must be something more going on that I can't see yet.
I'm now almost sure that it's not the render process itself that terminates but the main browser process that for some reason prematurely kills the renderer (perhaps mistakenly thinking that it failed to start). What I see in the renderer is that its threads just end in the middle and in different places depending on the timing. Need to dig from that end.
If this turns out to be difficult to track down, you might want to try
SET EXCEPTQ=ZZ
which will generate reports for termination exceptions. See exceptq-shl.txt for the gory details.
It won't work anymore: since https://github.com/bitwiseworks/libc/issues/98, EXCEPTQ exception handlers are not installed; instead, LIBC now generates EXCEPTQ reports from its default panic (unhandled signal) handler, and process termination is not delivered there. And in this particular case it would not give me much (if anything at all).
I guess I got more or less what's going on. The parent process receives some notification from its end of the pipe connecting it to the child renderer process, after which it decides that the pipe was closed and reports to some upper layer that the child process has gone. The upper layer reports this back to Qt as an unexpected death of the renderer process, and we have what we have. It's still unclear to me, though, whether the renderer process terminates on its own or gets killed by the parent. The problem is that there is a lot of Mojo code involved, and Mojo is Chromium's object-oriented IPC (sort of like MS COM / Mozilla XPCOM) with a lot of layers and indirections, which makes it extremely hard to debug.
Here's the stack trace:
Judging from the logs, though, the child (renderer) process is still alive when this notification happens in the parent. But in the case of a debugger session, the child terminates itself after about 15 seconds (`kConnectionTimeoutS`) via a call to `ChildThreadImpl::EnsureConnected`, reasonably deciding that something went wrong since it didn't get a successful connection from the parent. Note that if no debugger kicks in, the parent most likely simply kills the child itself after having got this notification.
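For context, the self-termination path follows the usual delayed-watchdog pattern. A simplified sketch (not a verbatim copy of `child_thread_impl.cc`; the class and member names here are placeholders, only `EnsureConnected` and `kConnectionTimeoutS` come from the actual code):

```cpp
#include "base/bind.h"
#include "base/location.h"
#include "base/process/process.h"
#include "base/threading/sequenced_task_runner_handle.h"
#include "base/time/time.h"

constexpr int kConnectionTimeoutS = 15;

class ChildThreadSketch {
 public:
  // Called shortly after the child starts up.
  void ScheduleConnectionWatchdog() {
    base::SequencedTaskRunnerHandle::Get()->PostDelayedTask(
        FROM_HERE,
        base::BindOnce(&ChildThreadSketch::EnsureConnected,
                       base::Unretained(this)),
        base::TimeDelta::FromSeconds(kConnectionTimeoutS));
  }

  void EnsureConnected() {
    // If the browser never completed the IPC handshake, give up and exit
    // instead of hanging around forever (this is the ~15 s self-termination
    // seen under the debugger).
    if (!channel_connected_)
      base::Process::TerminateCurrentProcessImmediately(0);
  }

 private:
  bool channel_connected_ = false;  // set once the IPC channel is up
};
```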
I suspect that the problem is in some changed socket pipe handling logic, but I still can't get to it due to many obstacles (and the lack of test coverage in our case).
From what I see after many debugging sessions, it's not as simple as I described above. The main process creates a Mojo pipe, but it's not a pipe in TCP/IP terms. Mojo has its own concept of "nodes" (currently matching an OS process) and "ports" (roughly matching an OS file handle). Ports are used to transfer Mojo messages between nodes (i.e. different processes) as well as within the node itself (between its separate "task runners", which often means event queues on separate threads). When the destination is the same node (process), no actual file handles are created: all communication is done via internal memory buffers associated with ports within the node, using event queues (and synchronization primitives). Actual pipes (created with `socketpair`) are only used when a message needs to be passed between two different nodes. And as far as I can see, IPC using Mojo creates only one socket pair per node pair, and all communication (i.e. all messages to all ports) is sent through that socket pair. I need to apply special debugging to see if this pair is actually functional (something tells me that it is and the problem is somewhere else, see below).
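To make the "pipe that is not a pipe" point concrete, here is a minimal sketch of in-process Mojo messaging. It assumes the embedder is initialized with `mojo::core::Init()` and that the raw `WriteMessageRaw`/`ReadMessageRaw` signatures match this Chromium version (they have changed over time), so treat it as an illustration rather than exact code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include "mojo/core/embedder/embedder.h"
#include "mojo/public/cpp/system/message_pipe.h"

int main() {
  mojo::core::Init();  // one-time embedder setup

  // Both ends ("ports") live in this node (process); no OS handle or
  // socketpair is involved in this exchange.
  mojo::MessagePipe pipe;

  const std::string payload = "hello within the same node";
  mojo::WriteMessageRaw(pipe.handle0.get(), payload.data(), payload.size(),
                        /*handles=*/nullptr, /*num_handles=*/0,
                        MOJO_WRITE_MESSAGE_FLAG_NONE);

  std::vector<uint8_t> bytes;
  std::vector<mojo::ScopedHandle> handles;
  mojo::ReadMessageRaw(pipe.handle1.get(), &bytes, &handles,
                       MOJO_READ_MESSAGE_FLAG_NONE);
  // 'bytes' now holds the payload, delivered through in-node event queues;
  // a socketpair would only come into play if handle1 were first sent to
  // another node (process).
  return 0;
}
```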
So, after creating a Mojo message pipe (`mojo::CreateMessagePipe`), the host process sets up an IPC listener on one port of that pipe (via an IPC channel implemented over Mojo ports, `IPC::ChannelMojo`). Then it sends the other port in a Mojo message (Mojo supports sending ports around) to some other recipient via some other, earlier defined Mojo message pipe (I don't understand yet what this recipient is responsible for). That other recipient port still lives in the main process.
Then the host process sends a bunch of messages over the first pipe, and they are received by that transferred port on another thread. Note that all this happens even before the renderer process is started at all.
Then the main process sends a bunch of other ports -- also ends of Mojo message pipes -- to that other recipient. Then, immediately after starting the renderer process, the main process deletes this other recipient port that received many pipe ends, including the first one. After some time, that first pipe end is also deleted (`Node::ErasePort`). This operation causes Mojo to report the port status change as MOJO_HANDLE_SIGNAL_PEER_REMOTE to the other end of that pipe, the one connected to the IPC listener. The IPC subsystem reasonably decides that the pipe is over and reports it as a communication error. Since this listener is somehow assumed to represent a communication channel between the main process and the renderer process, this error is interpreted as the end of communication due to the death of the renderer process, and this is what we see.
I still don't fully know what is going on at the upper level there. It looks like the main process assumes that the renderer process should have received all the ports from the main process via that other pipe, and closes their handles on its end. However, since that's not the case, everything falls apart. I will check this version.
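The relevant ownership rule, as I understand it, is that a handle successfully written into a Mojo message belongs to that message from then on; if the message never reaches a living recipient, Mojo closes the carried handle and the peer of the transferred pipe observes a closure. A hedged sketch (illustration only, not the actual broker/invitation code):

```cpp
#include "mojo/public/cpp/system/message_pipe.h"

void TransferExample() {
  mojo::MessagePipe transfer;  // pipe whose handle1 we want to hand over
  mojo::MessagePipe carrier;   // pipe used to reach the "other recipient"

  // After a successful write the raw handle travels with the message and must
  // not be touched (or closed) again by the sender.
  const char tag[] = "carry";
  MojoHandle raw = transfer.handle1.release().value();
  mojo::WriteMessageRaw(carrier.handle0.get(), tag, sizeof(tag),
                        &raw, /*num_handles=*/1, MOJO_WRITE_MESSAGE_FLAG_NONE);

  // If the carrier.handle1 side never reads the message (e.g. the recipient
  // is gone), the message is discarded, the carried handle is closed, and
  // transfer.handle0 sees its peer disappear -- which the IPC layer then
  // reports as a communication error.
}
```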
Some more info. The parent process creates a special startup socketpair to send an invitation (including Mojo ports etc.) to the child process. However, when it tries to send the invitation message to the socket on its end, `send` returns -1 and errno = 32 (EPIPE). And this error is not handled anywhere. As a result, the parent thinks it communicated everything successfully to the child process and closes its handles for the ports it sent. Since these ports were never actually established by the child, this is seen as a remote port closure event in the parent, and it thinks the child is dead. Now I need to find out why it gets EPIPE.
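This failure mode is easy to reproduce outside Chromium. A minimal standalone sketch showing `send` failing with EPIPE once the other end of the socketpair has been closed (which is effectively what happens when that end never reaches the child):

```cpp
#include <cerrno>
#include <csignal>
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    perror("socketpair");
    return 1;
  }

  // Ignore SIGPIPE so send() reports EPIPE instead of killing the process.
  signal(SIGPIPE, SIG_IGN);

  // Simulate the bug: the child's end is closed without ever being passed on.
  close(fds[1]);

  const char invitation[] = "mojo-invitation";
  if (send(fds[0], invitation, sizeof(invitation), 0) < 0)
    printf("send failed: errno=%d (%s)\n", errno, strerror(errno));  // EPIPE

  close(fds[0]);
  return 0;
}
```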
Okay, I finally made IPC work. The reason it was broken is a poorly written comment which made me disable passing the other end of the `socketpair` to the child in `spawn2` when originally porting Chromium. This part of the code was simply not used on OS/2 in Qt 5.13 and earlier because they used something called the Service Manager IPC mode, which didn't require it. However, in Chromium in 5.15 they deprecated this Service Manager mode in favor of what they now call the legacy IPC bootstrap, and that didn't work because the socket was not being passed (and was eventually closed in the parent, hence EPIPE).
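For illustration only (using plain `fork` rather than kLIBC's `spawn2`, whose exact parameters I won't reproduce here), the bootstrap boils down to the child inheriting its end of the socketpair while the parent keeps the other end and only closes the child's copy after the child exists:

```cpp
#include <cstdio>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
    return 1;

  pid_t pid = fork();
  if (pid == 0) {
    // Child: it inherited fds[1]; this is the end that must survive process
    // creation (with spawn2 it has to be handed over explicitly, which is the
    // part that was disabled in the original port).
    close(fds[0]);
    char buf[64] = {};
    read(fds[1], buf, sizeof(buf) - 1);
    printf("child got: %s\n", buf);
    close(fds[1]);
    _exit(0);
  }

  // Parent: keep fds[0] for itself, close only the child's copy of fds[1].
  close(fds[1]);
  const char invitation[] = "bootstrap-invitation";
  write(fds[0], invitation, sizeof(invitation));
  close(fds[0]);
  waitpid(pid, nullptr, 0);
  return 0;
}
```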
Now, with this bootstrap stuff working, IPC/Mojo communication also seems to work, but I get this assertion at some point for some reason:

```
[2686:4:0722/211150.524000:FATAL:channel_posix.cc(70)] Check failed: message_->data_num_bytes() > offset_ + num_bytes (184 vs. 184)
```
Something is wrong with the sizes or offsets in some IPC message. The identical values (184 vs. 184) suggest the check trips on the boundary case where an operation consumes the message exactly to its end.
The above commit fixes the failed check and Mojo/IPC generally works in multi-process mode now: I can browse pages. Closing this.
An attempt to run e.g. simplebrowser (see https://github.com/bitwiseworks/qtwebengine-os2/issues/8) w/o `--single-process` in `QTWEBENGINE_CHROMIUM_FLAGS` results in messages about the render process crash. Digging into the logs shows that Chromium starts a utility process like this in addition to the usual render process it used to start in Qt 5.13:

It appears that this process starts successfully but then quickly terminates, and Chromium spits this into the log of the main process:
If you hit "start again" in the simplebrowser's UI, it will endlessy try to start this network service process which is then immediately terminated over and over again.
Comparing to Chromium in Qt 5.13, I don't see attempts to start this network service helper process. At least not by default. Looks like this default has changed in newer Chromium and this network service was simply never tried on OS/2 back then.
I guess we may make it behave like in Qt 5.13 (i.e. not start this network service) but first I want to check what it does, why the default was changed and why it terminates so quickly (or crashes w/o a .TRP file). This requires more digging into the internals.