bitwiseworks / mozilla-os2

Mozilla for OS/2 and OS/2-based systems

Build tests #247

Open dmik opened 6 years ago

dmik commented 6 years ago

We need to build tests (and especially JS tests) in order to find a problem with unresponsive JS in FF 45 (see e.g. #176).

Most tests have never been ported to OS/2, so I'm facing build breaks here and there (relatively easy to fix so far). This issue is to track that work.

dmik commented 6 years ago

As I closed #176, the target issue is now #248.

dmik commented 6 years ago

Tests build fine now; let's see what we get from them.

These are some docs about the Mozilla test framework: https://developer.mozilla.org/en-US/docs/Running_automated_tests

dmik commented 6 years ago

Ok, making it compile is obviously not enough. For example, running xpcshell-based tests fails with this:

From _tests: Kept 41186 existing; Added/updated 0; Removed 0 files and 0 directories.
Error running mach:

    ['xpcshell-test']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

NotImplementedError: cannot determine number of cpus

  File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 373, in run_xpcshell_test
    return xpcshell.run_test(**params)
  File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 72, in run_test
    self.run_suite(**kwargs)
  File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 58, in run_suite
    return self._run_xpcshell_harness(**kwargs)
  File "D:/Coding/mozilla/master/testing/xpcshell/mach_commands.py", line 77, in _run_xpcshell_harness
    import runxpcshelltests
  File "D:/Coding/mozilla/master/build/mach_bootstrap.py", line 340, in __call__
    module = self._original_import(name, globals, locals, fromlist, level)
  File "D:/Coding/mozilla/master/testing/xpcshell/runxpcshelltests.py", line 50, in <module>
    NUM_THREADS = int(cpu_count() * 4)
  File "C:/USR/lib/python2.7/multiprocessing/__init__.py", line 136, in cpu_count
    raise NotImplementedError('cannot determine number of cpus')

Some more porting is needed.
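
For reference, here is a minimal sketch (my own illustration, not the actual fix) of the kind of fallback needed: guard the cpu_count() call and fall back to an environment hint or to 1.

import os

def safe_cpu_count(default=1):
    # multiprocessing.cpu_count() raises NotImplementedError on platforms it
    # doesn't know about (such as OS/2); fall back to an environment hint.
    try:
        from multiprocessing import cpu_count
        return cpu_count()
    except NotImplementedError:
        try:
            # NUMBER_OF_PROCESSORS is just a common hint, used here as an assumption.
            return int(os.environ.get('NUMBER_OF_PROCESSORS', default))
        except ValueError:
            return default

NUM_THREADS = safe_cpu_count() * 4  # mirrors the intent of runxpcshelltests.py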

dmik commented 6 years ago

Note that getting through all tests is kinda slow. Each small failure requires starting the whole test subdir over, and some tests are quite time-consuming. And given that tests haven't been run for ages, there are lots of small failures. I will be committing fixes in small batches. So far, I'm working on the make check tests (so-called "compiled-code tests"). The config tests have been fixed (all PASS now). Working on the python tests.

No doubt tests are a good thing per se. I already found a small typo that could potentially result in spontaneous crashes. See f816da11c3 for details.

dmik commented 6 years ago

All python tests pass now (required some Python fixes as well, see its trac). The next stop is JS. Just what we need.

dmik commented 6 years ago

JS doesn't actually seem to provide compiled-code tests; it was some other failure (already fixed).

The next make check stop is GTest on xul.dll:

D:/Coding/mozilla/master-build-test/_virtualenv/bin/python D:/Coding/mozilla/master/testing/gtest/rungtests.py \
        --cwd=../../_tests/gtest \
        --xre-path=../../dist/bin \
        --symbols-path=../../dist/crashreporter-symbols \
        ../../dist/bin/firefox.exe
gtest INFO | Running gtest
Running GTest tests...
Creating 55EF_01.TRP

Killed by SIGSEGV
pid=0x55ef ppid=0x55ea tid=0x0001 slot=0x0098 pri=0x0200 mc=0x0001 ps=0x0010
D:\CODING\MOZILLA\MASTER-BUILD-TEST\DIST\BIN\FIREFOX.EXE
XUL 0:00469425
cs:eip=005b:0f5f9425      ss:esp=0053:0013f930      ebp=20085be0
 ds=0053      es=0053      fs=150b      gs=0000     efl=00010202
eax=20085be0 ebx=0013f98c ecx=0f312b90 edx=00000000 edi=20084480 esi=20084490
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
gtest TEST-UNEXPECTED-FAIL | gtest | test failed with return code -11
make: *** [check] Error 1

It seems a special version of XUL.DLL is linked for the GTest framework (Google's testing framework). However, I have no idea why yet.

dmik commented 6 years ago

There are several JS test suites out there (it's actually a mess of tests with no clear instructions, or at least I couldn't find any). Most require mach (see mach help), but some are run manually via make check-SOMETHING. I still couldn't figure out what is run by what, and why/when. It's complicated and very poorly documented; I'll have to learn it from the sources, as usual.

So far I've tried mach check-spidermonkey, which seems to be the major JS test suite, and from what I can see there is only one failure: my blind port of testGCAllocator.cpp (https://github.com/bitwiseworks/mozilla-os2/commit/c9025c48d667c1ab88a93ac9297af47ec1c75b8c#diff-ae279a8150eb41aa248bf1109913de39).

Another problem is that for some reason I get hard crashes (that kill LIBCX) and trap screens when trying to redirect test output to a file. This might be related to my latest Python fixes... (or might not be).

StevenLevine commented 6 years ago

Does this help in any way?

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Running_Automated_JavaScript_Tests

dmik commented 6 years ago

I've read that, of course, and I can start these tests manually, but that's not how they are run by mach or make check-XXX. Also, mach check-spidermonkey and mach jsapi-tests seem to do more than just run these.

When started manually, both test suites seem to succeed (all passed), but I also want to make them run as intended by the devs.

dmik commented 6 years ago

So far, I'm getting various crashes here and there. E.g. now it crashed python in DOSCALLS (Trap -> FFF7:7F9D DOSCALL1 0004:00007F9D between VDHWRTTTY + 57 and StrScan - D5). Something is wrong with console output somewhere.

dmik commented 6 years ago

BTW, reverting Python's subprocess impl back to fork + socketpair makes any tests involving Python launchers fail immediately with OSError: [Errno 22] Invalid argument, which isn't surprising as socket pairs are not normal files. It also introduces an unkillable zombie (a usual thing when fork is involved). Anyway, redoing Python to use spawn + a native pipe seems like the right thing. I need to sort out why Python crashes DOSCALLS.

dmik commented 6 years ago

Got another interesting crash after fiddling with test runner output to avoid crashes in DOSCALLS. It's in LIBCX mmap:

LIBCx version : 0.6.0_dev
LIBCx module  : C:\USR\LIB\LIBCX0.DLL (hmod=05ea)
004f7b65 01 ff 0000 Asrt: Assertion Failed!!!
004f7b65 01 ff 0000 Asrt: Function: <NULL>
004f7b65 01 ff 0000 Asrt: File:     D:/Coding/libcx/master/src/mmap/mmap.c
004f7b65 01 ff 0000 Asrt: Line:     1200
004f7b65 01 ff 0000 Asrt: Expr:     arc == NO_ERROR
004f7b65 01 ff 0000 Asrt: 87

DosSetMem suddenly returns 87, which indicates invalid parameters (I can't tell exactly because view.exe sometimes crashes PMSHELL in DOSCALLS here, as it did just now. Just for the record, this is the PMSHELL popuplog entry:

11-21-2017  18:44:00  SYS3175  PID 0028  TID 0001  Slot 0051
C:\OS2\PMSHELL.EXE
c0000005
1ffc1b65
P1=00000001  P2=00000069  P3=XXXXXXXX  P4=XXXXXXXX
EAX=00000069  EBX=80000153  ECX=0004f84c  EDX=0093b860
ESI=0004f728  EDI=0093ae10
DS=0053  DSACC=d0f3  DSLIM=bfffffff
ES=0053  ESACC=d0f3  ESLIM=bfffffff
FS=150b  FSACC=00f3  FSLIM=00000030
GS=0000  GSACC=****  GSLIM=********
CS:EIP=005b:1ffc1b65  CSACC=d0df  CSLIM=bfffffff
SS:ESP=0053:0004f700  SSACC=d0f3  SSLIM=bfffffff
EBP=0004f700  FLG=00010217

DOSCALL1.DLL 0002:00001b65

).

dmik commented 6 years ago

The only reason for invalid arguments there could be a wrong memory address. However, it's impossible to tell why without detailed LIBCX logging, and for that I need to find out exactly which JS test is causing this (there is indeed some mmap usage in JS, and it is served by LIBCX now).

dmik commented 6 years ago

Some info about the Python DOSCALLS crash. As opposed to EXCEPTQ (and to the experimental IBM one), good old icsdebug from VAC 3.0 can properly trace the stack of this crash:

(screenshot: 2017-11-21 20 27 24)

Note that the address is different compared to the other one but I'm sure the reason is the same. According to the debugger, the only thing which is wrong with the string passed to DosWrite is that it contains 0x0A as line separators which is not nice when you are writing to file handle 1 connected to the console but it's not a crime. The other nuance is that this string is from high mem but DosWrite should be fine with that given that it's not a device handle. I don't know if TTY is reported as a "device" but if so, LIBC should put this string to low mem given that it checks for the F_DEV flag before calling DosWrite.

Note that when redirecting to a file, this crash does not occur, but I get errno 9 (bad file number) at some point, for reasons also unknown so far.

dmik commented 6 years ago

Just for the record, jit_test.py with the default set of options produces 29729 (!) JS tests. So maybe the failures are somehow related to some OOM condition...

dmik commented 6 years ago

Or maybe there is some bug in Python that causes data corruption when too many processes are being spawned and too many threads are being created/destroyed (each test creates one process object and two thread objects). The crash here happens when the number of tests performed reaches around 1300 or so.

dmik commented 6 years ago

Well, the statement about the number of tests is not correct; I modified the Python code to dramatically reduce output and sometimes it crashes in DOSCALLS after just 50 tests. I now suspect that it may be somehow related to unserialized access to the stdout/stderr TTY/FILE...

dmik commented 6 years ago

Note also that the code in question uses Python's yield (a powerful technique for async processing without threads), and that may also be related to the strange behavior.

dmik commented 6 years ago

Something really strange is happening with yield. If I add a simple counter that increases by 1 and prints its value before calling yield, some numbers get skipped somehow. The only explanation is that print doesn't work for them for some reason. But I don't understand yet how this is possible.

dmik commented 6 years ago

OK, I figured out where this mystery is coming from. The new subprocess impl in Python uses dup() to implement std I/O redirection for the child process, and this is not thread-safe (and the JS test runner uses a lot of threads). I created https://trac.netlabs.org/rpm/ticket/280 for that.
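
To illustrate the problem (a minimal, Unix-flavored sketch of my own, not the actual Python port code): a spawn-based implementation has to redirect the parent's own standard handles around the spawn call, and that window races with any other thread writing to stdout unless it is serialized.

import os
import sys
import threading

_spawn_lock = threading.Lock()   # serialize the redirection window

def spawn_with_stdout(path, argv, out_path):
    # Redirect our own fd 1 around the spawn call so the child inherits it.
    with _spawn_lock:
        out_fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        saved_stdout = os.dup(1)          # save the parent's real stdout
        try:
            os.dup2(out_fd, 1)            # parent's fd 1 now points at the file;
                                          # without the lock, another thread's
                                          # print() in this window goes astray
            pid = os.spawnv(os.P_NOWAIT, path, argv)
        finally:
            os.dup2(saved_stdout, 1)      # restore the parent's stdout
            os.close(saved_stdout)
            os.close(out_fd)
    return os.waitpid(pid, 0)[1]

if __name__ == '__main__':
    print('before spawn')
    spawn_with_stdout(sys.executable,
                      [sys.executable, '-c', 'print("hello from the child")'],
                      'child-out.txt')
    print('after spawn')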

StevenLevine commented 6 years ago

FWIW, the console is a device handle. It's the underlying reason for https://trac.netlabs.org/libc/ticket/294, although the ticket did not make this clear. Until DosWrite was added to os2safe.h, applications that used DosWrite to write to the console would trap if the buffer happened to be in upper memory.

dmik commented 6 years ago

@StevenLevine ok, I guessed that. But anyway, since __write handles it manually, it doesn't need SafeDosWrite. And high mem is not the reason in this case. The reason for the crash, I believe, is concurrent access to the stdin/out/err handles while fiddling with them using DosDupHandle.

komh commented 6 years ago

How about testing with a DosEnterCritSec()/DosExitCritSec() pair?

dmik commented 6 years ago

@komh it's not the only way to solve this problem but perhaps it's the most efficient one. Anyway, it's off-topic here, check https://trac.netlabs.org/rpm/ticket/280 for a full story.

komh commented 6 years ago

Although off-topic, I have one more question: does the DosEnterCritSec()/DosExitCritSec() pair work? ^^

dmik commented 6 years ago

@komh yes, using DosEnterCritSec/DosExitCritSec works as expected, meaning that no output is lost any more due to subprocess.Popen being called from multiple threads. No crashes in DosWrite and no errno 9 when output is redirected to a file.

Now I always get an assertion in the LIBCx mmap handler due to error 87 from DosSetMem (see https://github.com/bitwiseworks/mozilla-os2/issues/247#issuecomment-346048314) after a certain (quite big) number of tests. The assertion is usually followed by a system trap screen. This doesn't have anything to do with the Python test harness, however. Something is wrong in JS.

StevenLevine commented 6 years ago

FTR, the view.exe trap is at DOS32UNSETEXCEPTIONHANDLER+21 and the exception handler chain has been corrupted. This is probably a side effect of some earlier error.

dmik commented 6 years ago

@StevenLevine thx, good to know. Note that this time it happened on a freshly rebooted machine with nothing but a File Commander window open. Interestingly enough, this crash sometimes happens at WPS startup, which makes the Desktop never appear (only the background screen is visible) until you reboot.

StevenLevine commented 6 years ago

For the error 87, I recommend you add the address and m->dos_flag values to the assertion output.

dmik commented 6 years ago

Yes, but of course I can't get the assertion to trigger now due to the heisenbug effect; it only hangs or just traps now. But I have an idea of what might be going wrong there. I will check the JS code which uses mmap() and msync() (the latter is what eventually triggers that DosSetMem).

I can at least see the address and the length of the segment in the .TRP file. The address is 0xBFBFDE000 and the length is 0x153A (5434) bytes; both look pretty valid. Here is the call stack that leads to the assertion:

flush_dirty_pages() in mmap.c#1200
sync_map() in mmap.c#1832
msync() in mmap.c#1891
ShellCloseAsmJSCacheEntryForWrite in js.cpp#5957

dmik commented 6 years ago

I finally made it trigger the assertion: the flags value before (flags & ~PAG_WRITE) is 0x2, so it becomes zero when passed to DosSetMem, which perfectly explains the error and also kinda proves my guess about what goes wrong: the mmap protection state and the DOS protection mode mapping go out of sync. I need to check a few things in the code (and also learn how JS uses mmap).
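
In other words (a tiny illustration of the arithmetic, assuming the standard toolkit values of the PAG_* flags):

PAG_READ  = 0x01   # OS/2 page protection bits (standard toolkit values)
PAG_WRITE = 0x02

flags = PAG_WRITE                # what the mapping had recorded in this case
new_flags = flags & ~PAG_WRITE   # 0x2 & ~0x2 == 0: no protection bits left
# DosSetMem(addr, len, 0) is then rejected with error 87 (ERROR_INVALID_PARAMETER).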

StevenLevine commented 6 years ago

@dmik, assuming you are running VirtualAddressLimit 3072 the address should be fine. The cb parm will be the hard-coded 4096, so that's not it. That leaves the flags. One reason for error 87 is if any of the reserved flag bits are non-zero. DosSetMem is implemented by _VR32SetMem. The error could be coming from one of the support functions, but I'm waiting for the next occurrence before looking any deeper.

I guess we have to wait for the next failure to know more.

dmik commented 6 years ago

JS crashes have been fixed within the above LIBCx ticket. Now the tests run quite well, though I couldn't finish all 30k yet: I have a few failing ones, and further testing is taking extremely long for some reason (OTOH, there may be some especially slow tests). Anyway, here are the failed tests so far:

TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testResize.js | uncaught exception: out of memory (code 3, args "--baseline-eager --no-fpu")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout-deactivate-reactivate-signals.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout1.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout2.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout3.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout4.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout5.js | Unknown (code -2, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout6.js | Timeout (code -15, args "")
TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/asm.js/testTimeout7.js | Timeout (code -15, args "")

However, the -nosignals version of the same tests does not fail. Note that one core is pretty busy with something during these timeout tests. (I run tests in single core mode so far although they support simultaneous multicore execution which I haven't tested yet; that's for later).

I will run it once again overnight. Note that it's only the jit_test.py tests so far.

dmik commented 6 years ago

The testResize.js failure looks similar to the real Firefox crashes seen sometimes: OOM crashes in the JS code path, in mozalloc or such (I have several reports). The timeout failures might be related to #248, but that's not clear yet. A deeper understanding of the tests is needed.

dmik commented 6 years ago

The testResize.js one fails because we simply run out of memory. The test barfs on this line: https://github.com/bitwiseworks/mozilla-os2/blob/esr45/js/src/jit-test/tests/asm.js/testResize.js#L198. The test case has already allocated about 192 MB by the time it wants 80 MB more as a single fragment, but it looks like there is no contiguous fragment that big any longer. If I change the code to be less greedy, it works fine. Note that from what I see, the code responsible for these allocations is the one currently using mmap, see #250. I will try to port it to DosAllocMem to see if it helps.

Note also that only the tests started with the --baseline-eager and --no-asmjs JS options fail; the other modes (including the default) work well. These flags seem to influence memory fragmentation somehow. Also note that this test was completely removed from ESR 52 for whatever reason.

As for testTimeoutX.js, it seems that the timeout watchdog somehow does not kick in. It needs a closer look; it has something to do with signals.

dmik commented 6 years ago

Re testResize.js, the problem was that Firefox (JS) used low memory for ArrayBuffer objects, so it simply ran out of it during the test. The details are here: https://github.com/bitwiseworks/mozilla-os2/issues/250#issuecomment-347690062. Forcing high memory for the app's C heap (which is used for ArrayBuffer objects) fixes the test failure on OS/2.

dmik commented 6 years ago

Re testTimeoutN.js, for some reason the service interrupt callback requested with JS_RequestInterruptCallback via a timeout() call is never called. So far, I can't figure out why. It seems I will have to dig deeper into some JS internals.

dmik commented 6 years ago

The essence of the timeout test is some heavy JS loop. In the case of the first test it's just this code:

function f() {} function g() { while(1) { f() } } return g()

which spins endlessly (in the JIT code, I expect) and which is supposed to be interrupted. Either it isn't interrupted, or the interrupt doesn't get properly delivered.

dmik commented 6 years ago

OK, I think I know what's wrong. OS/2 uses the Linux code path to break out of JIT code executing on the main thread: a signal sent with pthread_kill. But this operation is a no-op in our quite basic pthread implementation, so of course such service interrupts are never delivered to the JS code. This may indeed be the reason for #248 and similar hangs. I will try to implement pthread_kill on OS/2. If I fail, we will have to use a different approach (via the thread context, like on Win, though I'm not sure it'll work on OS/2 either; we'll see).
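
For reference, here is a small sketch (illustration only, Python 3 on a POSIX system, not SpiderMonkey's code) of the mechanism this code path relies on: a watchdog thread uses pthread_kill() to deliver a signal to the main thread and break it out of a busy loop. If pthread_kill() is a no-op, the loop never gets interrupted.

import signal
import threading
import time

interrupted = False

def on_signal(signum, frame):
    # Runs on the main thread when the watchdog's signal is delivered.
    global interrupted
    interrupted = True

signal.signal(signal.SIGUSR1, on_signal)
main_thread_id = threading.get_ident()

def watchdog():
    time.sleep(1.0)                                   # the "timeout"
    signal.pthread_kill(main_thread_id, signal.SIGUSR1)

threading.Thread(target=watchdog, daemon=True).start()

while not interrupted:                                # stands in for the heavy JS loop
    pass

print('interrupted by the watchdog signal')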

dmik commented 6 years ago

I see that kLIBC already has a mechanism that uses DosKillThread + XCPT_ASYNC_PROCESS_TERMINATE to inject signals into other threads (which is what we need to implement pthread_kill). However, this mechanism does not provide an API to signal a specific thread; it only signals the current thread or all threads. It could easily be added, but I once promised that I won't touch kLIBC until I fork it on GitHub (because the amount of patches is already unbearable)...

dmik commented 6 years ago

I found the JS_NO_SIGNALS environment variable (and its synonym JS_DISABLE_SLOW_SCRIPT_SIGNALS) that disables this signal machinery completely for asm.js/Ion mode; interrupt checks are emitted into the compiled JS code instead. All timeout tests start working well when this variable is set. This doesn't help much in the #248 case, so there is a chance this interrupt malfunction is unrelated to the hangs. Anyway, it's worth doing, as there might be other consequences (and there might be a reason why signals are the default mode upstream).
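
For reference, a quick way to run the JIT tests with the signal machinery disabled (an illustrative sketch; the paths to jit_test.py and the js shell are assumptions):

import os
import subprocess
import sys

env = dict(os.environ)
env['JS_NO_SIGNALS'] = '1'   # or JS_DISABLE_SLOW_SCRIPT_SIGNALS=1

# Paths are illustrative: jit_test.py takes the path to the built js shell.
subprocess.check_call([sys.executable, 'js/src/jit-test/jit_test.py',
                       '../master-build-test/dist/bin/js'],
                      env=env)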

However, hacking LIBC to make pthread_kill work is too much. We already have an OS/2 system exception handler in place in the JS code (disabled due to the change in OOB processing, see above, but easy to re-enable for XCPT_ASYNC_PROCESS_TERMINATE). As for pthread_kill, I will just put abort() there so that any app using it crashes immediately (as any other behavior is unexpected anyway).

dmik commented 6 years ago

BTW, we have DosQueryThreadContext to query the registers of a thread. I wonder if there is a DosSetThreadContext (similar to Win's SetThreadContext); in that case we wouldn't even need DosKillThread. @StevenLevine?

dmik commented 6 years ago

Answered my own question above within #253.

StevenLevine commented 6 years ago

@dmik, TTBOMK, there's nothing like DosSetThreadContext. FWIW, I'm not quite sure where I would find a use for DosQueryThreadContext.

It sorta makes sense that you cannot effectively change registers in an XCPT_ASYNC_PROCESS_TERMINATE handler. The process is going to end, so the changes will be discarded because the thread will never get dispatched again.

dmik commented 6 years ago

With #253 closed, all JS JIT timeout tests pass now. However, I now get hangs in Python again after many tests (it hung at around 800 tests this time). We also get some terrible hangs when tests are run in parallel (i.e. 4 parallel tests, matching the number of cores; this is the default Makefile mode, BTW). Perhaps both cases have a common root.

dmik commented 6 years ago

The latter is most likely related to the unfinished http://trac.netlabs.org/rpm/ticket/280, so I have to switch back to Python/LIBCx.

dmik commented 6 years ago

Hooray, I was finally able to run all 30k tests, see http://trac.netlabs.org/rpm/ticket/280#comment:11. Only 18 tests fail out of 29711. I will give it one more run to record exactly which tests fail.

dmik commented 6 years ago

BTW, the JIT tests work much better in -j4 mode too; previously, they would hang almost immediately, after a few tests. Now they passed several thousand and then PYTHON aborted in LIBC with this:

TEST-PASS | js/src/jit-test/tests/basic/bug510655.js | Success (code 0, Traceback (most recent call last)
  File "D:/Coding/mozilla/master/js/src/jit-test/jit_test.py", line 314, in <module>
    main(sys.argv[1:])
  File "D:/Coding/mozilla/master/js/src/jit-test/jit_test.py", line 302, in main
    ok = jittests.run_tests(job_list, prefix, options)
  File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 567, in run_tests
    ok = process_test_results(gen, num_tests, pb, options)
  File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 533, in process_test_results
    print_automation_format(ok, res)
  File "D:/Coding/mozilla/master/js/src/tests/lib/jittests.py", line 416, in print_automation_format
    result, res.test.relpath_top, message, res.rc, jitflags))
IOError: [Errno 9] Bad file number

LIBC fatal error - streams: fmutex_request failed

Killed by SIGABRT
pid=0x03ce ppid=0x03cc tid=0x0001 slot=0x0090 pri=0x0200 mc=0x0001 ps=0x0010
C:\USR\BIN\PYTHON.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

LIBC fatal error - streams: fmutex_request failed

Killed by SIGABRT
pid=0x03ce ppid=0x03cc tid=0x0001 slot=0x0090 pri=0x0200 mc=0x0002 ps=0x0017
C:\USR\BIN\PYTHON.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

This means some pipe gets closed before some last write attempt. This is a known pattern that pops up very often when porting software from the Unix world to OS/2, which makes me think that OS/2 simply behaves differently from Unix when it comes to pipes. I once came to the conclusion that we should ignore EBADF in such cases; let's see if I come to it again now.
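
A minimal sketch of the kind of workaround meant here (my own illustration, not the actual harness change): tolerate EBADF/EPIPE when the output stream has gone away under us.

from __future__ import print_function
import errno
import sys

def safe_print(*args):
    # Swallow only the "stream went away" cases (e.g. stdout piped through tee
    # whose reader has gone); anything else is still a real error.
    try:
        print(*args)
        sys.stdout.flush()
    except (IOError, OSError) as e:
        if e.errno not in (errno.EBADF, errno.EPIPE):
            raise

safe_print('TEST-PASS | some/test.js | Success')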

komh commented 6 years ago

In that case, EPIPE occurs, not EBADF, doesn't it?

dmik commented 6 years ago

AFAIR EBADF is also possible under some circumstances when you deal with pipes, but the relation to pipes in this particular case is just a guess, after all. Closer analysis shows that the error happens when thread 1 of the test framework (which coordinates the worker threads starting test processes and processes their results) fails to printf to STDOUT (it returns EBADF). STDOUT in this particular run was redirected to a file using tee, so pipes were definitely involved.

What puzzles me is why LIBC aborted. This particular abort results from either stream_lock or stream_trylock, both of which are called when LIBC writes something to a FILE stream (to serialize multi-threaded output). The only reason why requesting the mutex might fail this way is when DosRequestMutexSem returns 105 (ERROR_SEM_OWNER_DIED), and that's what I guess happens in this case. This typically means the sudden death of another thread while it was holding a stream mutex (e.g. a crash inside printf or so). However, since there is no evidence of what exactly happened, it's still just a guess. I will try logging kLIBC, but I guess it won't survive thousands of parallel test runs.

The only error that could occur there besides error 105 is error 95 (ERROR_INTERRUPT), but that error is even more of a mystery, as the exact reasons for it are unknown.