bitwiseworks / qtwebengine-chromium-os2

Port of Chromium and related tools to OS/2
9 stars 2 forks source link

Premature browser process termination #32

Open dmik opened 3 years ago

dmik commented 3 years ago

Sometimes I see that when I close the Qt Simple Browser application, the browser process does not terminate cleanly: it gets terminated by the system before returning from main(), or sometimes after that but before LIBC runs its termination sequence (e.g. some or all atexit handlers).

Catching it from the debugger shows that thread 1 receives XCPT_SIGNAL (with XCPT_SIGNAL_KILLPROC) while in the middle of the normal app termination sequence. The main LIBC exception handler catches it and should raise it as SIGKILL but this does not happen (as if it were interrupted in the middle as well). No assertions, trap reports or POPUPLOG.OS2 entries are generated when it happens. When not running from the debugger, the process vanishes even before receiving XCPT_SIGNAL in LIBC.

So far, it feels like this strange termination is somehow related to audio support via libkai. I.e. I can often trigger it after visiting http://hpr.dogphilosophy.net/test/. And can never trigger it on pages not using any audio. Looks like the audio driver somehow kills the process for some reason. I wonder how I can trace it (or trace where this fatal XCPT_SIGNAL originates from).

There is a possibility that it somehow can be related to LIBCx interrupt_request (which also uses XCPT_SIGNAL_KILLPROC to request various services in other processes) but I couldn't see any proof of that so far. And it just doesn't feel like that. I will inspect LIBCx logs.

StevenLevine commented 3 years ago

Comment imported from #9

Sometimes exceptq can help track down these silent exits.

See the

Z - Report normal terminations

section in exceptq-shl.txt. A raw DosExit() call is going to bypass atexits. This does not explain a silent exit from main(), but the silent exits might have multiple triggers.

dmik commented 3 years ago

SET EXCEPTQ=Z does not help when EXCEPTQ's MYHANDLER is called manually, w/o installing an EXCEPTQ exception handler. I can try to do so from the LIBC handler as a test, or install it via LIBCx again. I will check.

dmik commented 3 years ago

I get this XCPT_SIGNAL report on premature terminations:

______________________________________________________________________

 Exception C0010003 - Signal Received Notice
______________________________________________________________________

 Process:  D:\CODING\QT5\QT5-DEV-BUILD\QTWEBENGINE\EXAMPLES\WEBENGINEWIDGETS\SIMPLEBROWSER\DEBUG\SIMPLEBROWSER.EXE (02/23/2021 22:53:27 2,684,236)
 PID:      2686 (9862)
 TID:      01 (1)
 Priority: 200

 Filename: C:\OS2\DLL\DOSCALL1.DLL (11/24/2011 03:24:00 132,401)
 Address:  FFEF:E411 (0003:0000E411)
 Exception Info: [0]= 00000003 (3)  

______________________________________________________________________

 Failing Instruction
______________________________________________________________________

 FFEF:E402 ADD      [BX+SI+0x36ca], DL     (0090 ca36)
 FFEF:E406 ADD      [BX+SI+0x38ca], DL     (0090 ca38)
 FFEF:E40A ADD      [BX+SI+0x3667], DL     (0090 6736)
 FFEF:E40E CALL FAR WORD [BP+DI+0x4]       (ff5b 04)
 FFEF:E411>JMP FAR  LARGE 0x5b:0x1ffc777c  (66 ea 7c77fc1f 5b00)
 FFEF:E419 XCHG     BX, BX                 (87db)
 FFEF:E41B NOP                             (90)
 FFEF:E41C PUSH     EBP                    (66 55)

______________________________________________________________________

 Registers
______________________________________________________________________

 EAX : 00000142   EBX  : 0000EFC0   ECX : 00000007   EDX  : 00000020
 ESI : 0084EF04   EDI  : 299A0000
 ESP : 0087EFB4   EBP  : 0084EEF8   EIP : 0000E411   EFLG : 00000206
 CS  : FFEF       CSLIM: 00010000   SS  : 043F       SSLIM: 0000F000

 EAX : not a valid address
 EBX : not a valid address
 ECX : not a valid address
 EDX : not a valid address
 ESI : read/write memory on this thread's stack
 EDI : unallocated memory

______________________________________________________________________

 Stack Info for Thread 01
______________________________________________________________________

        Size       Base         SS:SP         Max          Top
 16 : 00800000   042F:0000 -> 043F:EFB4 -> 041F:D000 -> 002F:0000
 32 : 00800000   00850000  -> 0087EFB4  -> 0083D000  -> 00050000

______________________________________________________________________

 Call Stack
______________________________________________________________________

   EBP     Address    Module     Obj:Offset    Nearest Public Symbol
 --------  ---------  --------  -------------  -----------------------
 Trap  ->  FFEF:E411  DOSCALL1  0003:0000E411  between UT32_TARGET + 5 and HT16_STARTUP - B

 0087EF0C  0000EFC0   Invalid address: 0000EFC0

Useless to me. The only thing I notice is that it's some 16-bit code. However, it's normal at this stage I guess.

dmik commented 3 years ago

BTW, this is surely not related to LIBCx interrupt_request as I've just got it reproduced in single process mode (i.e. when interrupt_request is surely not called). Symptoms are exactly the same. So my main guess now is that somehow the (UNIAUD?) driver is causing the kernel to kill the process in an unusual way, bypassing normal flow.

BTW, I sometimes get this to the console after termination:

set_hw_params error -77

and this appears to be from the UNIAUD DLL. At least I see that LIBKAI imports something called _uniaud_pcm_set_hw_params from there.

Anyway, I will postpone this ticket as it doesn't look critical — after all, this is the end of the main application (after all child processes are gone too) so all unreleased resources, if any, should be released by the kernel anyway.

This might affect some things like cache/database file flushing but nothing like that is noticed so far.

StevenLevine commented 3 years ago

I agree that this can probably wait.

FWIW, we have

define ENOLCK 77 / No locks available /

The UT32_TARGET + 5 location is normal when a 32-bit API is thunked to to call the 16-bit implementation.

The signal exception interrupted whatever API was running so when the API returned to ring3, the signal exception was disaptched on thread 1.

I would not mind seeing the full .trp file. ss:bx points to interesting data, but exceptq is not quite smart enough to know this on it's own.

To get 'Z' to work when calling MYHANDLER directly, it should be sufficient to call SetExceptqOptions.

dmik commented 3 years ago

The full report: 2686_01.TRP.txt

I really wonder where this signal is coming from. My guess is still the same: the audio driver (and/or MMPM).

Re Z, well, the problem is that MYHANDLER is currently called from the LIBC panic function. But in case of this faulty XCPT_SIGNAL the panic function is for some reason not reached (there is a long path: raise -> default signal handler -> panic). And I don't know upfront if this path will work or not, so I can't do that directly from the LIBC exception handler.

StevenLevine commented 3 years ago

The question is not so much where the signal is coming from, but rather what process issued the DosKillProcess and why. The signal exception is a side-effect of the DosKillProcess call.

At this point in time, I am going to assume that the DosKillProcess was issued by some SimpleBrowser thread rather some external process.

The .trp file is interesting, but it's only telling us what thread 1 happened to be doing when the DosKillProcess was issued and since the kernel createed a temporary ThunkTIB stack to invoke the 16-bit API, the "Labels on the Stack" listing in probably misleading.

The good news is that unlike DosExit, DosKillProcess is a request. All the other threads are still alive at the time the signal is delivered. It is likely if we force a process dump, we should be able to determine which thread issued the DosKillProcess. I suspect this thread remains blocked in the kernel until the signal exception is handled.

You should be able to use SetExceptqOption("U",...) to configure MYHANDLER to generate a process dump

You should be able to use my pdumpctl.cmd script to set configure the process dump facility.

pdumpctl i simplebrowser v

may given us sufficient data. I'm not quite sure how the process dump facility is going to handle the long process name. PROCDUMP.DOC implies is should do the right thing. If not, try

pdumpctl i simplebr v

dmik commented 3 years ago

Steven, thanks for the hint. I guess I will have to work on this issue sooner than later since I'm also getting these terminations on tab close sometimes. And not only the tab rendering process goes away, but the browser process itself too. Symptoms are exactly the same though.

dmik commented 3 years ago

Regarding DosKillProcess. Yes, it is a request but still, the question is where it comes from. I've just caught a tab close event int the debugger and all threads are in "Runnable" state when it happens. I guess having a live debugger session is better than a "static" process dump. But I wonder what else I can do in it to check for the origin of the signal.

dryeo commented 3 years ago

While probably not related, the Mozilla apps (FF45.9 etc) have a habit of closing with no trace when using the NSS and NSPR4 from netlabs-exp.

StevenLevine commented 3 years ago

My point about DosKillProcess being a request was indicate why it should be relatively to track down the source

As I mentioned, a process dump captured by exceptq should contain the data we need to identify the origin.

The debugger is not really capable of displaying the data I would most want to see. The Runnable state might be a red herring. Of course, if there are not too many threads, you switch from thread to thread and see it the individual thread state tells you if the thread issued the DosKillThread.

The OS/2 Trace Facility can be configured to trace just DosKillProcess, but it is unlikely to reveal the origin we want to see. Most likely it will report some address in DOSCALL1.DLL.

Dave Blaschke's OS2TRACE can also trace the origin of the DosKillProcess, but the the .TRC file might be huge, which could be a problem.

DavidMcKenna commented 3 years ago

Just tried the new drop of QTWebTest5_5, the readme says to report problems here...

I am using ArcaOS 5.0.6 with all the latest driver updates. Also installed all the latest updates from netlab-exp using ANPM and rebooted. My computer has an i5 7400 CPU with 8GB RAM. Downloaded and extracted the QTWebTest5_5.7z file using Zippy, then went to the command prompt, navigated to the extracted folder and ran 'sh run-check.sh'. Simplebrowser-check.log says all files needed are available. Then ran 'sh run.sh' and it immediately returned to the command prompt - simplebrowser.log is 0 bytes and empty. Did this about 15 times and then got a trap (attached), but no POPUPLOG and simplebrowser.log is 0 bytes.

DavidMcKenna commented 3 years ago

Had to go to a Windows machine to attach the trap... 0065_01.TRP.txt

dryeo commented 3 years ago

Running sh run.sh also results in nothing happening here. Setting up BEGINLIBPATH and LIBPATHSTRICT in a cmd window and running the simplebrowser executable does work. After opening a couple of tabs, the system froze, likely due to lack of memory. I also was getting sys1808 IIRC, out of memory errors while testing run.sh I also had to paste some text into the Google search box before it would except text input. Looking good

tomtom-66 commented 3 years ago

0049_01.TRP.txt

When updating required packages to their newer versions, I got a libc panic and a TRP-file was created that looks similar to the one from David M. First updated libc from version 0.1.6-1 to version 0.1.7-1 and rebooted. Next updated libcx from version 0.6.9-2 to version 0.7.0-1 and rebooted. Next updated pthread from version 0.2.2-1 to version 0.2.3-1. And finally tried to update libevent from version 0.2.2-1 to version 0.2.3-1. That produced the libc panic. I have pasted its full text at the top of the TRP-file (which contained a truncated version of the full text of the panic). Also ran into a simplebrowser.log of zero bytes. Changing the last line in run.sh to the previous line fixed that.

dryeo commented 3 years ago

On a different install, on startup, while displaying about:blank, the tab crashed with a popup asking if I wanted to reload the tab. Here's the trp 005F_01.TRP.txt

DavidMcKenna commented 3 years ago

Thanks Tom, REMMing the last line in run.sh and un-REMing the previous one allows the simplebrowser to start. It seems to work OK (with numerous quirks)...

dmik commented 3 years ago

@DavidMcKenna sorry, there is a typo in the last line of run.sh, it should be (the vertical bar symbol is not needed)

./simplebrowser.exe 2>&1 >simplebrowser.log

I just wonder why it doesn't crash here. Perhaps, your RPM environment is not pristine.

@dryeo How much memory do you have? Chromium (and any other heavy modern app) needs the maximum which is 4G in our case, and SET VIRTUALADDRESSLIMIT=3072. Otherwise you will be getting out of RAM pretty soon: a YouTube tab may eat up to 300-400 MB and there is around ~3 GB available for Chromium on a freshly booted system with 4G, VAL=3072 and no other huge apps started.

Your second .TRP looks similar to what David has. Are you sure you fixed run.sh (removed the vertical bar)?

@tomtom-66 Did rebooting solve this? Please create a ticket at https://github.com/bitwiseworks/pthread-os2 if not (I need a way to reproduce and also a .TRP with -debuginfo packages installed for libc/libcx/pthread/gettext/rpm/python).

dryeo commented 3 years ago

@dmik, sorry, I forgot to mark the DLLs high to begin with, afterwards it was mostly stable, eventually vanishing with no trace. I fixed run.sh by adding tee after the vertical bar.

tomtom-66 commented 3 years ago

Yes, I have rebooted (several times) but that did not make the problem disappear. After the update from pthread 0.2.2-1 to 0.2.3-1 every yum command failed with a libc panic. Will open a ticket for that at pthread-os2.

@tomtom-66 Did rebooting solve this? Please create a ticket at https://github.com/bitwiseworks/pthread-os2 if not (I need a way to reproduce and also a .TRP with -debuginfo packages installed for libc/libcx/pthread/gettext/rpm/python).

DavidMcKenna commented 3 years ago

dmik,

Thanks, I changed run.sh as you said and it now starts. Still 0 byte simplebrowser.log though, so I imitated what was in 'run-check.sh' and changed the line to: './simplebrowser.exe >simplebrowser.log 2>&1' and now it starts AND simplebrowser.log collects info. Please let me know if this is not correct. So far I have not had any traps, but like dryeo have had the browser vanish a couple times...(using YouTube). I have marked the browser DLL's high too.

dmik commented 3 years ago

@DavidMcKenna my bad, of course, it should be like you do, 2>&1 should come last. It was done the other way for piping when I wanted to remove excessive logging starting with *** in the previous drop because it would make a ridiculously big .log file.

@dryeo re new NSS/NSPR, yes, it's an interesting guess. In theory, any DLL can issue DosKillProcess on the one it is attached to. Given that in Mozilla you are seeing something similar, it's worth checking at least. Chromium surely does use the same new NSS/NSPR DLLs (in fact, it was updated to meet its requirements).

@StevenLevine can tracing be set up to see which DLL calls which DOS API?

DavidMcKenna commented 3 years ago

004D_01.TRP.txt

I see this trap every time I try to log in to my bank (www.netteller.com). The whole browser crashes, not just the tab.

dmik commented 3 years ago

@DavidMcKenna unfortunately, this .TRP is not very useful because EXCEPTQ currently doesn't support optimized GCC stack frames used in release builds of Chromium. You will need a debug build for that, but that's for later. Anything in the .log file?

Also, are you sure you don't use export QTWEBENGINE_CHROMIUM_FLAGS=--single-process? I read in your upper comments that you did that for some reason. This will make the whole browser crash in case if a tab crashes.

DavidMcKenna commented 3 years ago

simplebrowser.log 0050_01.TRP.txt

dmik,

I don't use the export you mention (at least not intentionally), never have.

Started the browser, went to my bank website, let it render, then clicked on 'Log In'. A new tab is automatically created (netteller is used for the logon) and eventually a window pops up saying the rendering process stopped with error 15 and do you want to reload the page (or something like that). I click 'OK' and the rendering starts agaiin, but the same popup appears. Click 'OK' again and I get the trap, and the browser disappears. There is no POPUPLOG.

dmik commented 3 years ago

@DavidMcKenna Ok, I see. Anyway, this is clearly some assertion from Chromium itself. I wonder why there is no FATAL message in the log. Only the debug build will tell us more. Please create a separate ticket, this a different issue.

StevenLevine commented 3 years ago

@dmik, you should be able to use the OS/2 Trace facility to trace the DosKillProcess API. See the attached doskill.cfg which is used with my Trace Tools at

http://www.warpcave.com/os2diags/TrcTools-20201213.zip

The config file is currently configured to trace top.exe. For top v2.14, the Trace Facility reports:

(OS) Dos32KillProcess Pre-Invocation Event [1] Major [16/0x0010] Minor [275/0x0113] PID [6854/0x1ac6] Length [7] Time [20:17:54.52]
Return address = 000175FB

and the return address points after the call to Dos32KillProcess, as it should.

Once you verify that you can trace DosKillProcess calls from top.exe, adjust the config to trace simplebrowser and see if you can locate where the Dos32KillProcess is originating.

For some reason, the Trace Facility must be configured to trace at least one excutable or the Dos32KillProcess tracepoint will not trigger. This seems to apply to DOSCALL1 APIs, but I don't know if this by design or what.

StevenLevine commented 3 years ago

Let's see if I can convince doskill.cfg to upload this time.

StevenLevine commented 3 years ago

; doskill.cfg - trace kill and exit APIs

; 2021-03-03 SHL Baseline - clone from elsewhere

; Executable to trace - disable to trace all executables ; It appears that tracing DOSCALL1 APIs require explicit process spec ; Exe = simplebrowser.exe Exe = top.exe

; Die on configuration errors (yes/no, default = yes) ; DIEONERROR = no

; Find PID for exe (yes/no, default = no) ; FINDPID = yes

; PID to trace (overrides exe) ; PID = 666

; === KernelR (Major 5, 0x5) ====

; 49 (0x0031) Dos32Exit Dos Post-Invocation ; 12 (0x000C) Dos32ExitList Dos Post-Invocation ; 51 (0x0033) Dos32ExitList Dos Post-Invocation ; 169 (0x00A9) Dos32ExitList Dos Pre-Invocation ; 441 (0x01B9) Dos32KillThread Post-Invocation ; 440 (0x01B8) Dos32KillThread Pre-Invocation ; 206 (0x00CE) DosExit Dos Pre-Invocation ; 2 (0x0002) DosExitCritSec Post-Invocation ; 207 (0x00CF) DosExitCritSec Pre-Invocation ; 208 (0x00D0) DosExitList Dos Pre-Invocation ; 84 (0x0054) DosKillProcess Post-Invocation ; 241 (0x00F1) DosKillProcess Pre-Invocation

Cmd = trace on 5(49,206,84,241)

; === DOSCALL1 (Major 16, 0x10) ===

; DOS32EXITCRITSEC 273 TK PRE,API 32-bit EP to KERNEL API ; DOS32EXIT 274 TK PRE,API 32-bit EP to KERNEL API ; DOS32KILLPROCESS 275 TK PRE,API 32-bit EP to KERNEL API

Cmd = trace on doscall1(274,275)

; === PMSHAPI (Major 192, 0xc0) ====

; 32809 (0x8029) ExitProc Post-invocation ; 41 (0x0029) ExitProc Pre-Invocation

; Cmd = trace on pmshapi(32809,41)

; eof