Open dmik opened 3 years ago
Comment imported from #9
Sometimes exceptq can help track down these silent exits.
See the
Z - Report normal terminations
section in exceptq-shl.txt. A raw DosExit() call is going to bypass atexits. This does not explain a silent exit from main(), but the silent exits might have multiple triggers.
SET EXCEPTQ=Z
does not help when EXCEPTQ's MYHANDLER is called manually, w/o installing an EXCEPTQ exception handler. I can try to do so from the LIBC handler as a test, or install it via LIBCx again. I will check.
I get this XCPT_SIGNAL report on premature terminations:
______________________________________________________________________
Exception C0010003 - Signal Received Notice
______________________________________________________________________
Process: D:\CODING\QT5\QT5-DEV-BUILD\QTWEBENGINE\EXAMPLES\WEBENGINEWIDGETS\SIMPLEBROWSER\DEBUG\SIMPLEBROWSER.EXE (02/23/2021 22:53:27 2,684,236)
PID: 2686 (9862)
TID: 01 (1)
Priority: 200
Filename: C:\OS2\DLL\DOSCALL1.DLL (11/24/2011 03:24:00 132,401)
Address: FFEF:E411 (0003:0000E411)
Exception Info: [0]= 00000003 (3)
______________________________________________________________________
Failing Instruction
______________________________________________________________________
FFEF:E402 ADD [BX+SI+0x36ca], DL (0090 ca36)
FFEF:E406 ADD [BX+SI+0x38ca], DL (0090 ca38)
FFEF:E40A ADD [BX+SI+0x3667], DL (0090 6736)
FFEF:E40E CALL FAR WORD [BP+DI+0x4] (ff5b 04)
FFEF:E411>JMP FAR LARGE 0x5b:0x1ffc777c (66 ea 7c77fc1f 5b00)
FFEF:E419 XCHG BX, BX (87db)
FFEF:E41B NOP (90)
FFEF:E41C PUSH EBP (66 55)
______________________________________________________________________
Registers
______________________________________________________________________
EAX : 00000142 EBX : 0000EFC0 ECX : 00000007 EDX : 00000020
ESI : 0084EF04 EDI : 299A0000
ESP : 0087EFB4 EBP : 0084EEF8 EIP : 0000E411 EFLG : 00000206
CS : FFEF CSLIM: 00010000 SS : 043F SSLIM: 0000F000
EAX : not a valid address
EBX : not a valid address
ECX : not a valid address
EDX : not a valid address
ESI : read/write memory on this thread's stack
EDI : unallocated memory
______________________________________________________________________
Stack Info for Thread 01
______________________________________________________________________
Size Base SS:SP Max Top
16 : 00800000 042F:0000 -> 043F:EFB4 -> 041F:D000 -> 002F:0000
32 : 00800000 00850000 -> 0087EFB4 -> 0083D000 -> 00050000
______________________________________________________________________
Call Stack
______________________________________________________________________
EBP Address Module Obj:Offset Nearest Public Symbol
-------- --------- -------- ------------- -----------------------
Trap -> FFEF:E411 DOSCALL1 0003:0000E411 between UT32_TARGET + 5 and HT16_STARTUP - B
0087EF0C 0000EFC0 Invalid address: 0000EFC0
Useless to me. The only thing I notice is that it's some 16-bit code. However, it's normal at this stage I guess.
BTW, this is surely not related to LIBCx interrupt_request
as I've just got it reproduced in single process mode (i.e. when interrupt_request
is surely not called). Symptoms are exactly the same. So my main guess now is that somehow the (UNIAUD?) driver is causing the kernel to kill the process in an unusual way, bypassing normal flow.
BTW, I sometimes get this to the console after termination:
set_hw_params error -77
and this appears to be from the UNIAUD DLL. At least I see that LIBKAI imports something called _uniaud_pcm_set_hw_params
from there.
Anyway, I will postpone this ticket as it doesn't look critical — after all, this is the end of the main application (after all child processes are gone too) so all unreleased resources, if any, should be released by the kernel anyway.
This might affect some things like cache/database file flushing but nothing like that is noticed so far.
I agree that this can probably wait.
FWIW, we have
The UT32_TARGET + 5 location is normal when a 32-bit API is thunked to to call the 16-bit implementation.
The signal exception interrupted whatever API was running so when the API returned to ring3, the signal exception was disaptched on thread 1.
I would not mind seeing the full .trp file. ss:bx points to interesting data, but exceptq is not quite smart enough to know this on it's own.
To get 'Z' to work when calling MYHANDLER directly, it should be sufficient to call SetExceptqOptions.
The full report: 2686_01.TRP.txt
I really wonder where this signal is coming from. My guess is still the same: the audio driver (and/or MMPM).
Re Z
, well, the problem is that MYHANDLER is currently called from the LIBC panic function. But in case of this faulty XCPT_SIGNAL the panic function is for some reason not reached (there is a long path: raise -> default signal handler -> panic). And I don't know upfront if this path will work or not, so I can't do that directly from the LIBC exception handler.
The question is not so much where the signal is coming from, but rather what process issued the DosKillProcess and why. The signal exception is a side-effect of the DosKillProcess call.
At this point in time, I am going to assume that the DosKillProcess was issued by some SimpleBrowser thread rather some external process.
The .trp file is interesting, but it's only telling us what thread 1 happened to be doing when the DosKillProcess was issued and since the kernel createed a temporary ThunkTIB stack to invoke the 16-bit API, the "Labels on the Stack" listing in probably misleading.
The good news is that unlike DosExit, DosKillProcess is a request. All the other threads are still alive at the time the signal is delivered. It is likely if we force a process dump, we should be able to determine which thread issued the DosKillProcess. I suspect this thread remains blocked in the kernel until the signal exception is handled.
You should be able to use SetExceptqOption("U",...) to configure MYHANDLER to generate a process dump
You should be able to use my pdumpctl.cmd script to set configure the process dump facility.
pdumpctl i simplebrowser v
may given us sufficient data. I'm not quite sure how the process dump facility is going to handle the long process name. PROCDUMP.DOC implies is should do the right thing. If not, try
pdumpctl i simplebr v
Steven, thanks for the hint. I guess I will have to work on this issue sooner than later since I'm also getting these terminations on tab close sometimes. And not only the tab rendering process goes away, but the browser process itself too. Symptoms are exactly the same though.
Regarding DosKillProcess
. Yes, it is a request but still, the question is where it comes from. I've just caught a tab close event int the debugger and all threads are in "Runnable" state when it happens. I guess having a live debugger session is better than a "static" process dump. But I wonder what else I can do in it to check for the origin of the signal.
While probably not related, the Mozilla apps (FF45.9 etc) have a habit of closing with no trace when using the NSS and NSPR4 from netlabs-exp.
My point about DosKillProcess being a request was indicate why it should be relatively to track down the source
As I mentioned, a process dump captured by exceptq should contain the data we need to identify the origin.
The debugger is not really capable of displaying the data I would most want to see. The Runnable state might be a red herring. Of course, if there are not too many threads, you switch from thread to thread and see it the individual thread state tells you if the thread issued the DosKillThread.
The OS/2 Trace Facility can be configured to trace just DosKillProcess, but it is unlikely to reveal the origin we want to see. Most likely it will report some address in DOSCALL1.DLL.
Dave Blaschke's OS2TRACE can also trace the origin of the DosKillProcess, but the the .TRC file might be huge, which could be a problem.
Just tried the new drop of QTWebTest5_5, the readme says to report problems here...
I am using ArcaOS 5.0.6 with all the latest driver updates. Also installed all the latest updates from netlab-exp using ANPM and rebooted. My computer has an i5 7400 CPU with 8GB RAM. Downloaded and extracted the QTWebTest5_5.7z file using Zippy, then went to the command prompt, navigated to the extracted folder and ran 'sh run-check.sh'. Simplebrowser-check.log says all files needed are available. Then ran 'sh run.sh' and it immediately returned to the command prompt - simplebrowser.log is 0 bytes and empty. Did this about 15 times and then got a trap (attached), but no POPUPLOG and simplebrowser.log is 0 bytes.
Had to go to a Windows machine to attach the trap... 0065_01.TRP.txt
Running sh run.sh also results in nothing happening here. Setting up BEGINLIBPATH and LIBPATHSTRICT in a cmd window and running the simplebrowser executable does work. After opening a couple of tabs, the system froze, likely due to lack of memory. I also was getting sys1808 IIRC, out of memory errors while testing run.sh I also had to paste some text into the Google search box before it would except text input. Looking good
When updating required packages to their newer versions, I got a libc panic and a TRP-file was created that looks similar to the one from David M. First updated libc from version 0.1.6-1 to version 0.1.7-1 and rebooted. Next updated libcx from version 0.6.9-2 to version 0.7.0-1 and rebooted. Next updated pthread from version 0.2.2-1 to version 0.2.3-1. And finally tried to update libevent from version 0.2.2-1 to version 0.2.3-1. That produced the libc panic. I have pasted its full text at the top of the TRP-file (which contained a truncated version of the full text of the panic). Also ran into a simplebrowser.log of zero bytes. Changing the last line in run.sh to the previous line fixed that.
On a different install, on startup, while displaying about:blank, the tab crashed with a popup asking if I wanted to reload the tab. Here's the trp 005F_01.TRP.txt
Thanks Tom, REMMing the last line in run.sh and un-REMing the previous one allows the simplebrowser to start. It seems to work OK (with numerous quirks)...
@DavidMcKenna sorry, there is a typo in the last line of run.sh
, it should be (the vertical bar symbol is not needed)
./simplebrowser.exe 2>&1 >simplebrowser.log
I just wonder why it doesn't crash here. Perhaps, your RPM environment is not pristine.
@dryeo How much memory do you have? Chromium (and any other heavy modern app) needs the maximum which is 4G in our case, and SET VIRTUALADDRESSLIMIT=3072
. Otherwise you will be getting out of RAM pretty soon: a YouTube tab may eat up to 300-400 MB and there is around ~3 GB available for Chromium on a freshly booted system with 4G, VAL=3072 and no other huge apps started.
Your second .TRP looks similar to what David has. Are you sure you fixed run.sh
(removed the vertical bar)?
@tomtom-66 Did rebooting solve this? Please create a ticket at https://github.com/bitwiseworks/pthread-os2 if not (I need a way to reproduce and also a .TRP with -debuginfo packages installed for libc/libcx/pthread/gettext/rpm/python).
@dmik, sorry, I forgot to mark the DLLs high to begin with, afterwards it was mostly stable, eventually vanishing with no trace. I fixed run.sh by adding tee after the vertical bar.
Yes, I have rebooted (several times) but that did not make the problem disappear. After the update from pthread 0.2.2-1 to 0.2.3-1 every yum command failed with a libc panic. Will open a ticket for that at pthread-os2.
@tomtom-66 Did rebooting solve this? Please create a ticket at https://github.com/bitwiseworks/pthread-os2 if not (I need a way to reproduce and also a .TRP with -debuginfo packages installed for libc/libcx/pthread/gettext/rpm/python).
dmik,
Thanks, I changed run.sh as you said and it now starts. Still 0 byte simplebrowser.log though, so I imitated what was in 'run-check.sh' and changed the line to: './simplebrowser.exe >simplebrowser.log 2>&1' and now it starts AND simplebrowser.log collects info. Please let me know if this is not correct. So far I have not had any traps, but like dryeo have had the browser vanish a couple times...(using YouTube). I have marked the browser DLL's high too.
@DavidMcKenna my bad, of course, it should be like you do, 2>&1
should come last. It was done the other way for piping when I wanted to remove excessive logging starting with ***
in the previous drop because it would make a ridiculously big .log file.
@dryeo re new NSS/NSPR, yes, it's an interesting guess. In theory, any DLL can issue DosKillProcess
on the one it is attached to. Given that in Mozilla you are seeing something similar, it's worth checking at least. Chromium surely does use the same new NSS/NSPR DLLs (in fact, it was updated to meet its requirements).
@StevenLevine can tracing be set up to see which DLL calls which DOS API?
I see this trap every time I try to log in to my bank (www.netteller.com). The whole browser crashes, not just the tab.
@DavidMcKenna unfortunately, this .TRP is not very useful because EXCEPTQ currently doesn't support optimized GCC stack frames used in release builds of Chromium. You will need a debug build for that, but that's for later. Anything in the .log file?
Also, are you sure you don't use export QTWEBENGINE_CHROMIUM_FLAGS=--single-process
? I read in your upper comments that you did that for some reason. This will make the whole browser crash in case if a tab crashes.
simplebrowser.log 0050_01.TRP.txt
dmik,
I don't use the export you mention (at least not intentionally), never have.
Started the browser, went to my bank website, let it render, then clicked on 'Log In'. A new tab is automatically created (netteller is used for the logon) and eventually a window pops up saying the rendering process stopped with error 15 and do you want to reload the page (or something like that). I click 'OK' and the rendering starts agaiin, but the same popup appears. Click 'OK' again and I get the trap, and the browser disappears. There is no POPUPLOG.
@DavidMcKenna Ok, I see. Anyway, this is clearly some assertion from Chromium itself. I wonder why there is no FATAL message in the log. Only the debug build will tell us more. Please create a separate ticket, this a different issue.
@dmik, you should be able to use the OS/2 Trace facility to trace the DosKillProcess API. See the attached doskill.cfg which is used with my Trace Tools at
http://www.warpcave.com/os2diags/TrcTools-20201213.zip
The config file is currently configured to trace top.exe. For top v2.14, the Trace Facility reports:
(OS) Dos32KillProcess Pre-Invocation
Event [1] Major [16/0x0010] Minor [275/0x0113] PID [6854/0x1ac6] Length [7] Time [20:17:54.52]
Return address = 000175FB
and the return address points after the call to Dos32KillProcess, as it should.
Once you verify that you can trace DosKillProcess calls from top.exe, adjust the config to trace simplebrowser and see if you can locate where the Dos32KillProcess is originating.
For some reason, the Trace Facility must be configured to trace at least one excutable or the Dos32KillProcess tracepoint will not trigger. This seems to apply to DOSCALL1 APIs, but I don't know if this by design or what.
Let's see if I can convince doskill.cfg to upload this time.
; doskill.cfg - trace kill and exit APIs
; 2021-03-03 SHL Baseline - clone from elsewhere
; Executable to trace - disable to trace all executables ; It appears that tracing DOSCALL1 APIs require explicit process spec ; Exe = simplebrowser.exe Exe = top.exe
; Die on configuration errors (yes/no, default = yes) ; DIEONERROR = no
; Find PID for exe (yes/no, default = no) ; FINDPID = yes
; PID to trace (overrides exe) ; PID = 666
; === KernelR (Major 5, 0x5) ====
; 49 (0x0031) Dos32Exit Dos Post-Invocation ; 12 (0x000C) Dos32ExitList Dos Post-Invocation ; 51 (0x0033) Dos32ExitList Dos Post-Invocation ; 169 (0x00A9) Dos32ExitList Dos Pre-Invocation ; 441 (0x01B9) Dos32KillThread Post-Invocation ; 440 (0x01B8) Dos32KillThread Pre-Invocation ; 206 (0x00CE) DosExit Dos Pre-Invocation ; 2 (0x0002) DosExitCritSec Post-Invocation ; 207 (0x00CF) DosExitCritSec Pre-Invocation ; 208 (0x00D0) DosExitList Dos Pre-Invocation ; 84 (0x0054) DosKillProcess Post-Invocation ; 241 (0x00F1) DosKillProcess Pre-Invocation
Cmd = trace on 5(49,206,84,241)
; === DOSCALL1 (Major 16, 0x10) ===
; DOS32EXITCRITSEC 273 TK PRE,API 32-bit EP to KERNEL API ; DOS32EXIT 274 TK PRE,API 32-bit EP to KERNEL API ; DOS32KILLPROCESS 275 TK PRE,API 32-bit EP to KERNEL API
Cmd = trace on doscall1(274,275)
; === PMSHAPI (Major 192, 0xc0) ====
; 32809 (0x8029) ExitProc Post-invocation ; 41 (0x0029) ExitProc Pre-Invocation
; Cmd = trace on pmshapi(32809,41)
; eof
Sometimes I see that when I close the Qt Simple Browser application, the browser process does not terminate cleanly: it gets terminated by the system before returning from
main()
, or sometimes after that but before LIBC runs its termination sequence (e.g. some or allatexit
handlers).Catching it from the debugger shows that thread 1 receives XCPT_SIGNAL (with XCPT_SIGNAL_KILLPROC) while in the middle of the normal app termination sequence. The main LIBC exception handler catches it and should raise it as SIGKILL but this does not happen (as if it were interrupted in the middle as well). No assertions, trap reports or POPUPLOG.OS2 entries are generated when it happens. When not running from the debugger, the process vanishes even before receiving XCPT_SIGNAL in LIBC.
So far, it feels like this strange termination is somehow related to audio support via libkai. I.e. I can often trigger it after visiting http://hpr.dogphilosophy.net/test/. And can never trigger it on pages not using any audio. Looks like the audio driver somehow kills the process for some reason. I wonder how I can trace it (or trace where this fatal XCPT_SIGNAL originates from).
There is a possibility that it somehow can be related to LIBCx
interrupt_request
(which also uses XCPT_SIGNAL_KILLPROC to request various services in other processes) but I couldn't see any proof of that so far. And it just doesn't feel like that. I will inspect LIBCx logs.