dmik opened 6 years ago
As I closed #176, the target issue is now #248.
Tests are built fine now, let's see what we get with them.
These are some docs about the Mozilla test framework: https://developer.mozilla.org/en-US/docs/Running_automated_tests
Ok, making it compile is obviously not enough. For example, running xpcshell-based tests fails with this:
From _tests: Kept 41186 existing; Added/updated 0; Removed 0 files and 0 directories.
Error running mach:
['xpcshell-test']
The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.
You should consider filing a bug for this issue.
If filing a bug, please include the full output of mach, including this error
message.
The details of the failure are as follows:
NotImplementedError: cannot determine number of cpus
File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 373, in run_xpcshell_test
return xpcshell.run_test(**params)
File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 72, in run_test
self.run_suite(**kwargs)
File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 58, in run_suite
return self._run_xpcshell_harness(**kwargs)
File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 77, in _run_xpcshell_harness
import runxpcshelltests
File "D:/Coding/mozilla/master/build/mach_bootstrap.py", line 340, in __call__
module = self._original_import(name, globals, locals, fromlist, level)
File "D:/Coding/mozilla/master/testing/xpcshell/runxpcshelltests.py", line 50, in <module>
NUM_THREADS = int(cpu_count() * 4)
File "C:/USR/lib/python2.7/multiprocessing/__init__.py", line 136, in cpu_count
raise NotImplementedError('cannot determine number of cpus')
Some more porting is needed.
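For reference, the usual way to keep harness code like runxpcshelltests.py from dying here is to guard the cpu_count() call. A minimal sketch (the fallback value of 4 is my assumption, not what upstream uses):

# Guard against multiprocessing.cpu_count() raising NotImplementedError,
# as it does on this OS/2 build; fall back to a fixed thread count.
from multiprocessing import cpu_count

try:
    NUM_THREADS = int(cpu_count() * 4)
except NotImplementedError:
    NUM_THREADS = 4  # assumed fallback; pick whatever suits the port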
Note that getting through all tests is kinda slow. Each small failure requires starting the whole test subdir over, and some tests are quite time-consuming. And given that tests haven't been run for ages, there are lots of small failures. I will be committing fixes in small batches. So far, I'm working on the make check tests (the so-called "compiled-code tests"). The config tests have been fixed (all PASS now). Working on the python tests.
No doubt tests are a good thing per se. I already found a small typo that could potentially result in spontaneous crashes. See f816da11c3 for details.
All Python tests pass now (this required some Python fixes as well, see its Trac). The next stop is JS. Just what we need.
JS actually doesn't seem to provide compiled-code tests; it was some other failure (already fixed).
The next make check stop is GTest on xul.dll:
D:/Coding/mozilla/master-build-test/_virtualenv/bin/python D:/Coding/mozilla/master/testing/gtest/rungtests.py \
--cwd=../../_tests/gtest \
--xre-path=../../dist/bin \
--symbols-path=../../dist/crashreporter-symbols \
../../dist/bin/firefox.exe
gtest INFO | Running gtest
Running GTest tests...
Creating 55EF_01.TRP
Killed by SIGSEGV
pid=0x55ef ppid=0x55ea tid=0x0001 slot=0x0098 pri=0x0200 mc=0x0001 ps=0x0010
D:\CODING\MOZILLA\MASTER-BUILD-TEST\DIST\BIN\FIREFOX.EXE
XUL 0:00469425
cs:eip=005b:0f5f9425 ss:esp=0053:0013f930 ebp=20085be0
ds=0053 es=0053 fs=150b gs=0000 efl=00010202
eax=20085be0 ebx=0013f98c ecx=0f312b90 edx=00000000 edi=20084480 esi=20084490
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
gtest TEST-UNEXPECTED-FAIL | gtest | test failed with return code -11
make: *** [check] Error 1
It seems a special version of XUL.DLL is linked for the GTest framework (Google's version of the testing framework). However, I have no idea why yet.
There are several JS tests out there (there is actually a mess of tests and no clear instruction; at least I couldn't find one). Most require mach (see mach help), but some run manually via make check-SOMETHING. I still couldn't figure out what is run by what and why/when. It's all very complicated and poorly documented; I'll have to learn from the sources, as usual.
I've tried mach check-spidermonkey so far, which seems to be the major JS test, and there is only one failure from what I can see: my blind port of testGCAllocator.cpp (https://github.com/bitwiseworks/mozilla-os2/commit/c9025c48d667c1ab88a93ac9297af47ec1c75b8c#diff-ae279a8150eb41aa248bf1109913de39).
Another problem is that for some reason I get hard crashes (that kill LIBCX) and trap screens when trying to direct test output to a file. This might be related to my latest Python fixes... (or might not be).
Does this help in any way?
I've read that, of course, and I can start these tests manually, but that's not how they are run by mach or make check-XXX. Also, mach check-spidermonkey or mach jsapi-tests seems to do more than just run these. When started manually, both tests seem to succeed (all passed), but I want to make them run as intended by the devs as well.
So far, I'm getting various crashes here and there. E.g. now it crashed Python in DOSCALLS (Trap -> FFF7:7F9D DOSCALL1 0004:00007F9D between VDHWRTTTY + 57 and StrScan - D5). Something is wrong with console output somewhere.
BTW, reverting Python's subprocess impl back to fork + socketpair makes any tests involving Python launchers fail immediately with OSError: [Errno 22] Invalid argument, which isn't surprising as socket pairs are not normal files. It also introduces an unkillable zombie (a usual thing when fork is involved). Anyway, redoing Python to use spawn + a native pipe seems the right thing. I need to sort out why Python crashes DOSCALLS.
Got another interesting crash after fiddling with test runner output to avoid crashes in DOSCALLS. It's in LIBCX mmap:
LIBCx version : 0.6.0_dev
LIBCx module : C:\USR\LIB\LIBCX0.DLL (hmod=05ea)
004f7b65 01 ff 0000 Asrt: Assertion Failed!!!
004f7b65 01 ff 0000 Asrt: Function: <NULL>
004f7b65 01 ff 0000 Asrt: File: D:/Coding/libcx/master/src/mmap/mmap.c
004f7b65 01 ff 0000 Asrt: Line: 1200
004f7b65 01 ff 0000 Asrt: Expr: arc == NO_ERROR
004f7b65 01 ff 0000 Asrt: 87
DosSetMem suddenly returns 87, which points at invalid arguments (I can't tell exactly because view.exe sometimes crashes PMSHELL in DOSCALLS here, as it did just now). Just for the record, this is the PMSHELL popuplog entry:
11-21-2017 18:44:00 SYS3175 PID 0028 TID 0001 Slot 0051
C:\OS2\PMSHELL.EXE
c0000005
1ffc1b65
P1=00000001 P2=00000069 P3=XXXXXXXX P4=XXXXXXXX
EAX=00000069 EBX=80000153 ECX=0004f84c EDX=0093b860
ESI=0004f728 EDI=0093ae10
DS=0053 DSACC=d0f3 DSLIM=bfffffff
ES=0053 ESACC=d0f3 ESLIM=bfffffff
FS=150b FSACC=00f3 FSLIM=00000030
GS=0000 GSACC=**** GSLIM=********
CS:EIP=005b:1ffc1b65 CSACC=d0df CSLIM=bfffffff
SS:ESP=0053:0004f700 SSACC=d0f3 SSLIM=bfffffff
EBP=0004f700 FLG=00010217
DOSCALL1.DLL 0002:00001b65
The only reason for invalid arguments there could be a wrong memory address. However, it's impossible to tell why without detailed LIBCX logging. And for that, I need to find out exactly which JS test is causing this (there is indeed some mmap usage in JS, and it is served by LIBCX now).
Some info about the Python DOSCALLS crash. As opposed to EXCEPTQ (and to the experimental IBM), the good old icsdebug from VAC 3.0 can properly trace the stack of this crash. Note that the address is different compared to the other one, but I'm sure the reason is the same. According to the debugger, the only thing wrong with the string passed to DosWrite is that it contains 0x0A as line separators, which is not nice when you are writing to file handle 1 connected to the console, but it's not a crime. The other nuance is that this string is in high memory, but DosWrite should be fine with that given that it's not a device handle. I don't know if the TTY is reported as a "device", but if so, LIBC should put this string in low memory, given that it checks for the F_DEV flag before calling DosWrite.
Note that when redirecting to a file, this crash does not occur, but I get errno 9 (bad file) at some point, for reasons also unknown so far.
Just for the record, jit_test.py with the default set of options yields 29729 (!) JS tests. So maybe the failures are somehow related to some OOM condition...
Or maybe there is some bug in Python that causes data corruption when too many processes are being spawned and too many threads are being created/destroyed (each test leads to the creation of one process object and two thread objects). The crash here happens when the number of completed tests is around 1300 or so.
Well, the statement about the number of tests is not correct; I modified the Python code to dramatically reduce output and sometimes it crashes in DOSCALLS after just 50 tests. I now smell that it may be somehow related to unserialized access to the stdout/stderr TTY/FILE...
Note also that the code in question uses Python's yield (which is a powerful technique for async processing without threads), and that may also be related to the strange behavior. Something really strange is happening with yield. If I add a simple counter that increments by 1 and prints its value before calling yield, some numbers get skipped somehow. The only explanation is that print doesn't work for them for some reason, but I don't understand yet how this is possible.
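For clarity, this is roughly the kind of instrumentation meant above (all names are illustrative, not the real harness code); if the printed sequence has gaps, it is the print that gets lost, not the yield:

def run_one(test):
    return test  # stand-in for the real per-test runner

def get_results(tests):
    count = 0
    for test in tests:
        result = run_one(test)
        count += 1
        print "yielding result #%d" % count  # gaps in this sequence mean lost output
        yield result

for r in get_results(range(10)):
    pass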
Ok, I got where this mystery is coming from. The new subprocess impl in Python uses dup() to implement std I/O redirection in the child process, and this is not thread-safe (and the JS test runner uses a lot of threads). Created https://trac.netlabs.org/rpm/ticket/280 for that.
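To illustrate the problem (this is only a sketch; the name popen_serialized and the lock-based workaround are mine, not the fix from the ticket): when a spawn-based subprocess temporarily dup()'s the shared fds 0/1/2 to set up child redirection, two threads doing it at once can restore each other's handles. A blunt caller-side mitigation is to serialize Popen:

import subprocess
import threading

_popen_lock = threading.Lock()

def popen_serialized(*args, **kwargs):
    # Allow only one thread at a time inside Popen while it juggles the
    # process-wide fds 0/1/2 with dup(); this trades parallel spawning
    # for correctness until subprocess itself is fixed.
    with _popen_lock:
        return subprocess.Popen(*args, **kwargs)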
FWIW, the console is a device handle. It's the underlying reason for https://trac.netlabs.org/libc/ticket/294, although the ticket did not make this clear. Until DosWrite was added to os2safe.h, applications that used DosWrite to write to the console would trap if the buffer happened to be in upper memory.
@StevenLevine ok, I guessed that. But anyway, since __write handles it manually, it doesn't need SafeDosWrite. And high memory is not the reason in this case. The reason for the crash, I believe, is concurrent access to the stdin/out/err handles while fiddling with them using DosDupHandle.
How about testing with a DosEnterCritSec()/DosExitCritSec() pair?
@komh it's not the only way to solve this problem, but perhaps it's the most efficient one. Anyway, it's off-topic here; check https://trac.netlabs.org/rpm/ticket/280 for the full story.
Although off-topic, I have one more question. Does the DosEnterCritSec()/DosExitCritSec() pair work? ^^
@komh yes, using DosEnterCritSec/DosExitCritSec works as expected, meaning that no output is lost any more due to subprocess.Popen being called from multiple threads. No crashes in DosWrite and no errno 9 when output is redirected to a file.
Now I always get an assertion in the LIBCx mmap handler due to error 87 from DosSetMem (see https://github.com/bitwiseworks/mozilla-os2/issues/247#issuecomment-346048314) after a certain (quite big) number of tests. The assertion is usually followed by a system trap screen. This doesn't have anything to do with the Python test harness, however. Something is wrong in JS.
FTR, the view.exe trap is at DOS32UNSETEXCEPTIONHANDLER+21 and the exception handler chain has been corrupted. This is probably a side effect of some earlier error.
@StevenLevine thx, good to know. Note that this time it happened on a freshly rebooted machine with nothing but a File Commander window open. Interestingly enough, this crash sometimes happens at WPS startup, which makes the Desktop never appear (only the background screen is visible) until you reboot.
For the error 87, I recommend you add the address and m->dos_flag values to the assertion output.
Yes, but of course I can't get the assertion to trigger now due to the heisenbug effect. It only hangs or just traps now. But I know what might be going wrong there. I will check the JS code that uses mmap() and msync() (the latter is what eventually triggers that DosSetMem).
I can see the address in the .TRP file at least, and the length of the segment. The address is 0xBFBFDE000 and the length is 0x153A (5434) bytes; both are pretty valid. Here is the call stack that leads to the assertion:
flush_dirty_pages() in mmap.c#1200
sync_map() in mmap.c#1832
msync() in mmap.c#1891
ShellCloseAsmJSCacheEntryForWrite in js.cpp#5957
I finally made it trigger the assertion: flags before (flags & ~PAG_WRITE) is 0x2 (so it becomes zero when passed to DosSetMem), which perfectly explains the error and also kinda proves my guess at what goes wrong: the mmap bookkeeping and the DOS protection-mode mapping go out of sync. I need to check a few things in the code (and also learn how JS uses mmap).
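Spelling the arithmetic out, with the usual OS/2 PAG_* bit values (assumed here from the toolkit headers):

PAG_READ    = 0x00000001
PAG_WRITE   = 0x00000002
PAG_EXECUTE = 0x00000004

flags = 0x2                      # what the mmap bookkeeping held: PAG_WRITE only
dos_flags = flags & ~PAG_WRITE   # what ends up being passed to DosSetMem
print hex(dos_flags)             # 0x0: no access bits left, hence error 87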
@dmik, assuming you are running VirtualAddressLimit 3072, the address should be fine. The cb parm will be the hard-coded 4096, so that's not it. That leaves the flags. One reason for error 87 is if any of the reserved flag bits are non-zero. DosSetMem is implemented by _VR32SetMem. The error could be coming from one of the support functions, but I'm waiting for the next occurrence before looking any deeper.
I guess we have to wait for the next failure to know more.
JS crashes have been fixed within the above LIBCx ticket. Now the tests run quite well, though I couldn't finish all 30k yet: I have a few failing ones, and further testing is taking extremely long for some reason (there may be especially slow tests, OTOH). Anyway, here are the failed tests so far:
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testResize.js | uncaught exception: out of memory (code 3, args "--baseline-eager --no-fpu")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout-deactivate-reactivate-signals.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout1.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout2.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout3.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout4.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout5.js | Unknown (code -2, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout6.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout7.js | Timeout (code -15, args "")
However, the -nosignals version of the same tests does not fail. Note that one core is pretty busy with something during these timeout tests. (I run tests in single-core mode so far, although they support simultaneous multi-core execution, which I haven't tested yet; that's for later.)
I will run it once again overnight. Note that it's only the jit_test.py tests so far.
The testResize.js failure looks similar to the real Firefox crashes sometimes seen with OOM in the JS code path, in mozalloc or such (I have several reports). The timeout failures might be related to #248, but that's not clear yet. A deeper understanding of the tests is needed.
The testResize.js one fails because we simply run out of memory. The test barfs on this line: https://github.com/bitwiseworks/mozilla-os2/blob/esr45/js/src/jit-test/tests/asm.js/testResize.js#L198. The test case already allocates about 192 MB by the time it wants 80 MB more as a single fragment, but it looks like there is no fragment that big any longer. If I change the code to be less greedy, it works fine. Note that from what I see, the code responsible for these allocations is the one currently using mmap, see #250. I will try to port it to DosAllocMem to see if it helps.
Note also that only the tests started with the --baseline-eager and --no-asmjs JS options fail; the other modes (including the default) work well. These flags seem to influence memory fragmentation somehow. Also note that this test was completely removed from ESR 52 for whatever reason.
As for testTimeoutX.js, it seems that the timeout watchdog somehow does not kick in. Needs a closer look. Has something to do with signals.
Re testResize.js, the problem was that Firefox (JS) used low memory for ArrayBuffer objects, so it simply ran out of it during the test. The details are here: https://github.com/bitwiseworks/mozilla-os2/issues/250#issuecomment-347690062. Forcing high memory for the app's C heap (which is used for ArrayBuffer objects) fixes the test failure on OS/2.
Re testTimeoutN.js, for some reason the service interrupt callback requested with JS_RequestInterruptCallback via a timeout() call is never called. So far, I can't figure out why. It seems I will have to dig deeper into some JS internals.
The essence of the timeout test is some heavy JS loop. In the case of the first test it's just this code:
function f() {} function g() { while(1) { f() } } return g()
which endlessly spins, as I expect, in the JIT code and which is supposed to be interrupted. Either it's not, or the interrupt doesn't get properly delivered.
Ok, I think I know what's wrong. OS/2 uses the Linux code path to break JIT code executing on the main thread: a signal sent with pthread_kill. But this operation is a no-op in our quite basic pthread implementation, so of course such service interrupts are never delivered to the JS code. This may indeed be the reason for #248 and similar hangs. I will try to implement pthread_kill on OS/2. If I fail, we will have to use a different approach (via the thread context, like on Windows, though I'm not sure it'll work on OS/2 either; we'll see).
I see that in kLIBC there is already a mechanism that uses DosKillThread + XCPT_ASYNC_PROCESS_TERMINATE to inject signals into other threads (which is what we need to implement pthread_kill). However, this mechanism does not provide an API to signal a specific thread; it only signals the current thread or all threads. It could easily be added, but I promised once that I wouldn't touch kLIBC until I fork it on GitHub (because the amount of patches is already unbearable)...
I found the JS_NO_SIGNALS (and its synonym JS_DISABLE_SLOW_SCRIPT_SIGNALS) environment variable, which disables this signal machinery completely for the asm.js/Ion mode; some interrupt checks are emitted in the compiled JS code instead. All timeout tests start working well when this variable is set. This doesn't help much in the #248 case, so there is a chance this interrupt malfunction is unrelated to the hangs. Anyway, it's worth doing, as there might be other consequences (and there might be a reason why signals are the default mode upstream).
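A hedged usage sketch of the workaround (the paths and the shell name are illustrative; only the environment variable comes from the text above):

import os
import subprocess

env = dict(os.environ)
env["JS_DISABLE_SLOW_SCRIPT_SIGNALS"] = "1"  # same effect as JS_NO_SIGNALS
subprocess.call(["python", "jit_test.py", "dist/bin/js.exe"], env=env)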
However, hacking LIBC to make pthread_kill work is too much. We already have an OS/2 system exception handler in place in the JS code (disabled due to the change in OOB processing, see above, but easy to re-enable for XCPT_ASYNC_PROCESS_TERMINATE). As for pthread_kill, I will just put abort() there so that any app using it crashes immediately (as any other behavior is unexpected anyway).
BTW, we have DosQueryThreadContext to query the registers of a thread. I wonder if there is a DosSetThreadContext (similar to Windows' SetThreadContext); in that case we wouldn't even need DosKillThread. @StevenLevine?
Answered my own question above within #253.
@dmik, TTBOMK, there's nothing like DosSetThreadContext. FWIW, I'm not quite sure where I would find a use for DosQueryThreadContext.
It sorta makes sense that you cannot effectively change registers in an XCPT_ASYNC_PROCESS_TERMINATE handler. The process is going to end, so the changes will be discarded because the thread will never get dispatched again.
With #253 closed, all JS JIT timeout tests pass now. However, I now get hangs in Python again after many tests (hung at 800 tests this time). We also have some terrible hangs when tests are run in parallel (i.e. 4 parallel tests, matching the number of cores; this is the default Makefile mode, BTW). Perhaps both cases have a common root.
The latter is most likely related to the unfinished http://trac.netlabs.org/rpm/ticket/280. So I have to switch back to python/libcx.
Hooray, I was finally able to run all 30k tests, see http://trac.netlabs.org/rpm/ticket/280#comment:11. Only 18 tests fail out of 29711. I will give it one more run to capture exactly which tests fail.
BTW, JIT tests work much better in -j4 mode too; previously, it would hang almost immediately, after a few tests. Now it passed several thousand and then Python aborted in LIBC with this:
TEST-PASS | js/src/jit-test/tests/basic/bug510655.js | Success (code 0, Traceback (most recent call last)
File "D:/Coding/mozilla/master/js/src/jit-test/jit_test.py", line 314, in <module>
main(sys.argv[1:])
File "D:/Coding/mozilla/master/js/src/jit-test/jit_test.py", line 302, in main
ok = jittests.run_tests(job_list, prefix, options)
File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 567, in run_tests
ok = process_test_results(gen, num_tests, pb, options)
File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 533, in process_test_results
print_automation_format(ok, res)
File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 416, in print_automation_format
result, res.test.relpath_top, message, res.rc, jitflags))
IOError: [Errno 9] Bad file number
LIBC fatal error - streams: fmutex_request failed
Killed by SIGABRT
pid=0x03ce ppid=0x03cc tid=0x0001 slot=0x0090 pri=0x0200 mc=0x0001 ps=0x0010
C:\USR\BIN\PYTHON.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
LIBC fatal error - streams: fmutex_request failed
Killed by SIGABRT
pid=0x03ce ppid=0x03cc tid=0x0001 slot=0x0090 pri=0x0200 mc=0x0002 ps=0x0017
C:\USR\BIN\PYTHON.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
Which means some pipe gets closed before a final write attempt. This is known and pops up very often on OS/2 when porting software from the Unix world, which makes me think pipes just behave differently on OS/2 compared to Unix. I once came to the conclusion that we should ignore EBADF in such cases; let's see if I come to it again now.
In that case, EPIPE occurs, not EBADF, doesn't it?
AFAIR EBADF is also possible under some circumstances when you deal with pipes, but the relation to pipes in this particular case is just a guess, after all. Closer analysis shows that the error happens when thread 1 of the test framework (which coordinates the worker threads that start test processes and processes their results) fails to printf to STDOUT (it returns EBADF). STDOUT in this particular run was redirected to a file using tee, so pipes were definitely involved.
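For what it's worth, the "ignore late EBADF/EPIPE on output" idea mentioned earlier would look roughly like this at the point where the harness prints (purely illustrative, not an actual patch):

import errno
import sys

def safe_print(line):
    try:
        sys.stdout.write(line + "\n")
    except IOError as e:
        # The other end of the redirected STDOUT may already be gone
        # (tee exited, pipe closed); swallow only these two errors.
        if e.errno not in (errno.EBADF, errno.EPIPE):
            raise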
What puzzles me is why LIBC aborted. This particular abort results from either stream_lock or stream_trylock, both of which are called when LIBC writes something to a FILE stream (to serialize multi-threaded output). The only reason why requesting the mutex might fail this way is when DosRequestMutexSem returns 105 (ERROR_SEM_OWNER_DIED), and that's what I guess happens here. This typically means the sudden death of another thread while it was holding a stream mutex (e.g. a crash inside printf or so). However, since there is no evidence of what exactly happened, it's still just a guess. I will try enabling kLIBC logging, but I guess it won't survive thousands of parallel test runs.
The only other error that could occur there besides error 105 is error 95 (ERROR_INTERRUPT), but that one is even more of a mystery, as the exact reasons for it are unknown.
We need to build the tests (and especially the JS tests) in order to find the problem with unresponsive JS in FF 45 (see e.g. #176).
Most tests have never been ported to OS/2, so I'm facing build breaks here and there (which are relatively easy to fix so far). This issue is to track that work.