test failures in OpenBSD

kernigh commented 6 years ago

These are the other test failures in OpenBSD, after the recent fixes of #429 (zzz problem) and #443 (vfork problem). I attach the test log from commit f5b9c64 in OpenBSD 6.3 on amd64.

Attachment: testlog.txt

arith, comvario, treemove: bug in OpenBSD

  7/153 arith                                   FAIL     0.57 s
--- stderr ---
<E> arith[592]: printf '%g\n' Nan failed -- expected nan, got -nan
<E> arith[594]: print -- $((Nan)) failed -- expected nan, got -nan
<E> arith[592]: printf '%g\n' -Nan failed -- expected -nan, got nan
<E> arith[594]: print -- $((-Nan)) failed -- expected -nan, got nan
<W> arith[-1]: error_count = 4

 34/153 comvario                                FAIL    10.15 s
--- stderr ---
/tmp/ksh.comvario.uR9mZ3s/comvario.sh: warning: -lm does not support signbit(-Na
N)
<E> comvario[241]: Expected ...my_nan=nan... == ...my_nan=-nan...

137/153 treemove                                FAIL     0.29 s
--- stderr ---
<E> treemove[328]: typeset -m not working when passed a reference to an local argument from a calling function
<E> treemove[67]: typeset -C -a c.ar=((typeset -l -E i=-nan) (typeset -l -E i=inf) (typeset -l -E i=24) (typeset -l -E i=4) (typeset -l -E i=2) (typeset -l -E i=1) (typeset -l -E i=-1) (typeset -l -E i=-inf)) is not idempotent
<W> treemove[-1]: error_count = 2

Not a bug in ksh. This is a bug in OpenBSD/amd64, where libc functions like strtod(3) flip the sign so "nan" becomes -nan and "-nan" becomes nan. I hacked my libc to fix strtod(), but I didn't fix strtold(), so I still see failures. I sent a bug report to OpenBSD: https://marc.info/?l=openbsd-bugs&m=152391654313171&w=2

directoryfd: no /proc, no /dev/fd/11/*

 43/153 directoryfd                             FAIL     2.11 s
--- stderr ---
<E> directoryfd[116]: ...got 'mkdir: /proc/53773/fd/11//test1: No such file or directory'
<E> directoryfd[116]: ...got 'mkdir: /proc/self/fd/11//test1: No such file or directory'
<E> directoryfd[116]: ...got 'mkdir: /dev/fd/11//test1: Not a directory'
<E> directoryfd[116]: ...got 'mkdir: /proc/17630/fd/11//test1: No such file or directory'

OpenBSD has no /proc directory. It does have /dev/fd/11, an fd(4) device node. To open /dev/fd/11 is to duplicate file descriptor 11. Paths like /dev/fd/11/test1 don't work because they don't open /dev/fd/11, so /dev/fd/11 is still a device node, not a directory.

OpenBSD doesn't have these paths, but the test expects them to work. Does ksh emulate these paths? Perhaps mkdir must be a builtin and not /bin/mkdir?

pty: problem with pty or vi mode

 89/153 pty                                     FAIL    21.12 s
--- stderr ---
<E> pty[395]: POSIX sh 251(C): line 19: expected "echo repeat-2", got EOF
<W> pty[-1]: error_count = 1

This uses pty to test vi mode in ksh, set -o vi. I tried entering vi mode. Commands like /, ?, n, N seem to work as well in ksh as in other shells. I don't know if another command in vi mode is broken, or if pty is broken.

sh_match: no ulimit -M or -V

113/153 sh_match                                FAIL     0.45 s
--- stderr ---
<E> sh_match[420]: test_xmlfragment1/0/testfile1.xml: Expected empty stderr, got $'test1_script.sh[2]: ulimit: as: is read only\ntest1_script.sh[3]: ulimit: vmem: is read only'
<E> sh_match[420]: test_xmlfragment1/1/testfile2.xml: Expected empty stderr, got $'test1_script.sh[2]: ulimit: as: is read only\ntest1_script.sh[3]: ulimit: vmem: is read only'
<W> sh_match[-1]: error_count = 2

The "as:" error comes from ulimit -M, and "vmem:" from ulimit -v, in the line:

    # memory safeguards to prevent out-of-control memory consumption
    ulimit -M \$(( 1024 * 1024 ))
    ulimit -v \$(( 1024 * 1024 ))
    ulimit -d \$(( 1024 * 1024 ))

OpenBSD doesn't support ulimit -M and -v. Now ksh knows this, so ulimit -a prints "not supported" for both -M and -v. The test doesn't know to ignore the errors.

In OpenBSD, ulimit -d controls memory: getrlimit(2) says of RLIMIT_DATA,

The maximum size (in bytes) of the data segment for a process; this includes memory allocated via malloc(3) and all other anonymous memory mapped via mmap(2).

In some other systems, RLIMIT_DATA only controls sbrk(2), and different limits control other allocators like mmap(2).

sigchld: unknown problem

118/153 sigchld/shcomp                          FAIL    25.10 s
--- stderr ---
<E> sigchld[202]: 99416 already finished and reaped
<W> sigchld[-1]: error_count = 1

In this run, sigchld/shcomp failed but sigchld succeeded. Some failures that don't happen in every test run; this seems to be one of them.

signal: no sigqueue(2)

119/153 signal                                  FAIL    82.21 s
--- stderr ---
<E> signal[546]: ${c.car[0].value.q} is  0 but should be 4
<E> signal[547]: ${c.car[1].value.q} is  0 but should be 5
<W> signal[-1]: error_count = 2

OpenBSD doesn't have sigqueue(2), so ksh doesn't use it. The test tries to pass 4 and 5 through sigqueue(2); the values get lost. One might modify the test to accept systems where both values are 0.

variables: locale "x" is valid

142/153 variables                               FAIL     7.22 s
--- stderr ---
<E> variables[713]: LC_ALL=x failed -- expected locale diagnostic
<E> variables[715]: LC_ALL=x failed -- expected '', got 'x'
<E> variables[717]: LC_ALL=C; LC_ALL=x failed -- expected 'C', got 'x'
<E> variables[713]: LC_CTYPE=x failed -- expected locale diagnostic
<E> variables[715]: LC_CTYPE=x failed -- expected '', got 'x'
<E> variables[717]: LC_CTYPE=C; LC_CTYPE=x failed -- expected 'C', got 'x'
<E> variables[713]: LC_MESSAGES=x failed -- expected locale diagnostic
<E> variables[715]: LC_MESSAGES=x failed -- expected '', got 'x'
<E> variables[717]: LC_MESSAGES=C; LC_MESSAGES=x failed -- expected 'C', got 'x'
<E> variables[713]: LC_COLLATE=x failed -- expected locale diagnostic
<E> variables[715]: LC_COLLATE=x failed -- expected '', got 'x'
<E> variables[717]: LC_COLLATE=C; LC_COLLATE=x failed -- expected 'C', got 'x'
<E> variables[713]: LC_NUMERIC=x failed -- expected locale diagnostic
<E> variables[715]: LC_NUMERIC=x failed -- expected '', got 'x'
<E> variables[717]: LC_NUMERIC=C; LC_NUMERIC=x failed -- expected 'C', got 'x'
<W> variables[-1]: error_count = 15

All these tests expect that "x" isn't a valid locale name. OpenBSD takes "x" as a valid alias of "C", because setlocale(3) says,

The syntax and semantics of the locale argument are not standardized and vary among operating systems. On OpenBSD, if the locale string ends with “.UTF-8”, the UTF-8 locale is selected; otherwise, the “C” locale is selected, which uses the ASCII character set. If the locale contains a dot but does not end with “.UTF-8”, setlocale() fails.

So "x.x" is a bad locale name in OpenBSD. One might assign bad=x.x and test LC_ALL=$bad. If different systems needed different bad names, then one might check $(uname) and assign bad=x, bad=x.x, or another value.

siteshwar commented 6 years ago

OpenBSD has no /proc directory. It does have /dev/fd/11, an fd(4) device node. To open /dev/fd/11 is to duplicate file descriptor 11. Paths like /dev/fd/11/test1 don't work because they don't open /dev/fd/11, so /dev/fd/11 is still a device node, not a directory.

OpenBSD doesn't have these paths, but the test expects them to work. Does ksh emulate these paths? Perhaps mkdir must be a builtin and not /bin/mkdir?

directoryfd test case and the code to emulate these paths were added after the last stable release. We should either make this test case portable or disable it altogether.

siteshwar commented 6 years ago

Not a bug in ksh. This is a bug in OpenBSD/amd64, where libc functions like strtod(3) flip the sign so "nan" becomes -nan and "-nan" becomes nan. I hacked my libc to fix strtod(), but I didn't fix strtold(), so I still see failures. I sent a bug report to OpenBSD: https://marc.info/?l=openbsd-bugs&m=152391654313171&w=2

strtold function was earlier provided by libast and it was removed with 206f37769aea679564afc3f46eb6c16467d603b8. It's possible that we will see similar other regression on other platforms as we have started using system provided standard functions.

siteshwar commented 6 years ago

So "x.x" is a bad locale name in OpenBSD. One might assign bad=x.x and test LC_ALL=$bad. If different systems needed different bad names, then one might check $(uname) and assign bad=x, bad=x.x, or another value.

I tried x.x on Linux and FreeBSD and it's detected as invalid locale. If it's invalid on macOS too, we can use it.

siteshwar commented 6 years ago

118/153 sigchld/shcomp                          FAIL    25.10 s
--- stderr ---
<E> sigchld[202]: 99416 already finished and reaped
<W> sigchld[-1]: error_count = 1

Has this started failing after we switched to fork()/exec() ? If you were able to build from 2016-01-10-beta branch, does it fail there ?

kernigh commented 6 years ago

I failed to build 2016-01-10-beta branch. I get errors because _ast_intmax_t is long but OpenBSD's intmax_t is long long, so they aren't the same type. Then src/lib/libast/comp/strtoll.c gets error: conflicting types for '_ast_strtoll' because ast_map.h declares it to return _ast_intmax_t but strtoi.h defines it to return intmax_t. Same with strtull.c and _ast_strtoull. The build continues, but libast is missing, so it fails to link and run nmake.

I built commit 6815d6c (the parent of 566ec82, "Use fork()/exec() instead of posix_spawn() by default"), and the first run of meson test sigchld failed with this testlog.txt:

Log of Meson test suite run on 2018-04-19T18:28:11.185679

1/1 sigchld                                 FAIL    24.66 s

--- command ---
SHELL='/home/kernigh/park/ast/build/src/cmd/ksh93/ksh' LANG='' LD_LIBRARY_PATH='/home/kernigh/park/ast/build/src/lib/libast:/home/kernigh/park/ast/build/src/lib/libcmd:/home/kernigh/park/ast/build/src/lib/libcoshell' /home/kernigh/park/ast/build/src/cmd/ksh93/ksh /home/kernigh/park/ast/src/cmd/ksh93/tests/sigchld.sh sigchld
--- stdout ---
childsig
--- stderr ---
    [213]: 70492 already finished and reaped
-------

OK:         0
FAIL:       1
SKIP:       0
TIMEOUT:    0

krader1961 commented 6 years ago

FWIW, @kernigh, even though I don't use OpenBSD on a regular basis I have been building and running the ksh unit tests on a OpenBSD 6.2 VM. As well as macOS and FreeBSD. I'm seeing the same problems you've reported in this issue. Thank you very much for your detailed and concise problem reports which help ensure people are aware of these problems and we don't lose track of them. Obviously the AT&T team was either not aware of, or wasn't particularly concerned about, the problems with the code on modern BSD based systems.

krader1961 commented 6 years ago

Also, the locale specific tests that currently fail because a locale is not supported need a change similar to those I've made recently to the src/cmd/ksh93/tests/wchar.sh unit test to determine whether the system supports a given locale. This supports the usual variants and maximizes test coverage regardless of the system:

locales="en_US.UTF-8 en_US.utf8 en_US.ISO-8859-15 en_US.iso885915 zh_CN.GB18030 zh_CN.gb18030"
supported="C.UTF-8"
locale -a > locale.txt

for lc_all in $locales
do
    if grep -q $lc_all locale.txt
    then
        supported+=" $lc_all"
        log_info "LC_ALL=$lc_all is supported and will be tested"
    else
        log_info "LC_ALL=$lc_all not supported"
    fi
done

kernigh commented 6 years ago

The problem is that OpenBSD supports a locale named "x", but locale -a does not list "x". The test requires a system where "x" is an invalid locale.

krader1961 commented 6 years ago

The problem is that OpenBSD supports a locale named "x",...

That's not really the correct characterization of the situation. Every system I'm familiar with supports a locale named "x". In as much as they all treat it as equivalent to the "C" (or "POSIX" if you prefer) locale assuming you haven't actually created a locale named "x". The difference is that only on OpenBSD does setlocale() return a pointer to "x" rather than a NULL pointer. The intent is that a NULL pointer makes it clear to the caller that the locale is invalid and could not be honored. That OpenBSD returns a pointer to the locale string, implying the locale is supported, is wrong. I'm sorry but it seems to me that OpenBSD is wrong.

I use fish as my interactive shell. I contributed a lot of improvements to it including how it handles manipulating the locale. Those improvements include diagnostic messages involving manipulating the locale. On every system I have access to, including MacOS, calling setlocale(..., "x") it returns a NULL pointer. Except OpenBSD 6.2 which is broken IMHO.

I hate to do it but we should probably just special-case OpenBSD for this test and bypass the test with a warning.

krader1961 commented 6 years ago

Also, @kernigh, I see that my second most previous comment completely missed the point of the issue with respect to setting LC_ALL=x on OpenBSD. So thank you for clarifying the issue and why my comment didn't make much sense. But I still maintain that OpenBSD is FUBAR in this regard :smile:.

krader1961 commented 6 years ago

directoryfd test case and the code to emulate these paths were added after the last stable release. We should either make this test case portable or disable it altogether.

I have no objection to platform specific tests. Rather than disabling or removing them globally we should simply ensure they only run on platforms where they make sense. Having said that given the nature of ksh it does seem odd to test for behaviors that are not universally valid.

kernigh commented 6 years ago

Some of the /dev/fd/* and /proc paths become valid (in OpenBSD) if I use the builtin mkdir:

$ mkdir /dev/fd/3/subdir
mkdir: /dev/fd/3/subdir: Not a directory
$ builtin mkdir
$ mkdir /dev/fd/3/subdir
$ mkdir /proc/self/fd/3/subdir2
$ mkdir /proc/$$/fd/3/subdir3
$ ls -d /tmp/subdir*
/tmp/subdir/    /tmp/subdir2/   /tmp/subdir3/

We don't build the builtins of cat and rmdir; I might need all 3 builtins for createdir1_cmdgroup to pass. Right now, directoryfd.sh comments out createdir1_redirect because of the same 3 builtins, but doesn't comment out createdir1_cmdgroup:

log_info 'TODO: This is a test for mkdir/cat/rmdir builtins. We should reenable it if we choose to make these commands builtin.'
#        (
#            testname='createdir1_redirect'
#            cmd='redirect {dirfd}<"./tmp" ; mkdir [dirfd]/test1 ; print "foo2" >tmp/test1/a ; cat [dirfd]/test1/a ; rm tmp/test1/a ; rmdir [dirfd]/test1'
#            stdoutpattern='foo2'
#        )
        (
            testname='createdir1_cmdgroup'
            cmd='{ mkdir [dirfd]/test1 ; print "foo3" >tmp/test1/a ; cat [dirfd]/test1/a ; rm tmp/test1/a ; rmdir [dirfd]/test1 ; } {dirfd}<"tmp" ;'
            stdoutpattern='foo3'
        )

I don't see the TODO in testlog.txt. The log_info line might not run because it's in the middle of a compound literal, so the shell doesn't run it as a command.

krader1961 commented 6 years ago

Some of the /dev/fd/* and /proc paths become valid (in OpenBSD) if I use the builtin mkdir.

I'm afraid to look at the code to see what it's doing. Having a command as a builtin for speed or predictability of its flags and behavior is fine. Having a builtin that pretends to support a feature the OS doesn't support is questionable at best.

krader1961 commented 6 years ago

Regarding the directoryfd.sh tests that are failing (see this comment by @kernigh): You will find this comment (among others) that were added to src/lib/libast/path/pathcanon.c after the ksh93u+ (ast.master branch) release:

pathdev() handles ast specific /dev/ and /proc/ special files

Note that function pathcanon(), which is at the heart of issue #492, is simply a thin wrapper around pathdev(). At first glance this appears similar to the non-standard AST support for its special C.UTF-8 locale that you won't find supported by any non-AST environment. And like the magic AST locale support that only works for AST commands this should be removed. It is absurd that /dev/fd/something works for a ksh builtin like mkdir but not the system provided mkdir program.

krader1961 commented 6 years ago

@siteshwar wrote:

directoryfd test case and the code to emulate these paths were added after the last stable release. We should either make this test case portable or disable it altogether.

I not only vote that we disable those tests I also vote we revert the code in src/lib/libast/path/pathcanon.c that supports the idiom on systems that don't natively support it. This is another example of trying to impose AST behavior on the world by expecting everyone to use the libast code. That isn't going to happen.

Lots of other shells, compliant with POSIX 1003 (e.g., bash and zsh) and not compliant (e.g., fish and elvish), manage to achieve wide acceptance without resorting to these tricks. I can appreciate that the AST team was trying to innovate. But this libast feature is an example of innovation that is not central to ksh, will probably never be widely adopted, and should be dropped in order to focus on keeping ksh relevant.

kernigh commented 6 years ago

I tested d952fdf (May 7, 2018) in OpenBSD/amd64 6.3 with my patch to fix strtold() in OpenBSD libc. The only failures were in pty/* and sometimes sigchld/*. Each of sigchld/C, sigchld/C.UTF-8, sigchld/shcomp has about a 1/2 chance to fail. I suspect that coprocess/ has a small chance to fail, but I did not see the failure today, and [coprocess/ can also fail in Linux](https://github.com/att/ast/pull/465#issuecomment-379927082).

kernigh commented 6 years ago

After the addition of the API/* tests, a test times out and Meson gets stuck:

  1/272 API/base64.c                            OK       0.01 s                
  2/272 API/taso.c                              OK       0.03 s                
  3/272 API/tproc.c                             OK       0.08 s                
  4/272 API/tthread.c                           OK       0.24 s                
Traceback (most recent call last):
...
  File "/home/kernigh/park/meson/mesonbuild/mtest.py", line 307, in _run_cmd   
    os.killpg(os.getpgid(p.pid), signal.SIGKILL)                               
PermissionError: [Errno 1] Operation not permitted

Not a bug in ksh. My report to Meson is "PermissionError when a test times out in OpenBSD", https://github.com/mesonbuild/meson/issues/3569

krader1961 commented 6 years ago

@kernigh I'm seeing the same thing. So there is obviously a bug in Meson. However, I can't help but think this behavior is a clue to something that is wrong with how ksh behaves on OpenBSD.

krader1961 commented 6 years ago

Vis-a-vis @kernigh's last comment see https://github.com/att/ast/issues/525

kernigh commented 6 years ago

I cloned Meson's Git, made a symlink from meson in PATH to meson.py in the clone, and made this one-line change:

diff --git a/mesonbuild/mtest.py b/mesonbuild/mtest.py
index 110a94e8..b4f9f402 100644
--- a/mesonbuild/mtest.py
+++ b/mesonbuild/mtest.py
@@ -304,7 +304,7 @@ class SingleTestRunner:
                 subprocess.call(['taskkill', '/F', '/T', '/PID', str(p.pid)])
             else:
                 try:
-                    os.killpg(os.getpgid(p.pid), signal.SIGKILL)
+                    os.killpg(p.pid, signal.SIGKILL)
                 except ProcessLookupError:
                     # Sometimes (e.g. with Wine) this happens.
                     # There's nothing we can do (maybe the process

Now Meson kills tests that time out. These tests aren't OK:

  5/272 API/tlock.c                             TIMEOUT 30.03 s                
  6/272 API/tannounce.c                         TIMEOUT 30.01 s                
  7/272 API/tbags.c                             TIMEOUT 30.04 s
  8/272 API/tdeque.c                            TIMEOUT 30.02 s
  9/272 API/tevent.c                            TIMEOUT 30.04 s
 10/272 API/tinstall.c                          TIMEOUT 30.02 s
 11/272 API/tlist.c                             TIMEOUT 30.03 s
 12/272 API/tobag.c                             TIMEOUT 30.01 s
 13/272 API/tqueue.c                            TIMEOUT 30.04 s
 14/272 API/trehash.c                           TIMEOUT 30.01 s
 15/272 API/trhbags.c                           TIMEOUT 30.03 s
 16/272 API/tsafehash.c                         TIMEOUT 30.03 s
 17/272 API/tsafetree.c                         TIMEOUT 30.04 s
 18/272 API/tsearch.c                           TIMEOUT 30.01 s
 19/272 API/tshare.c                            TIMEOUT 30.02 s
 20/272 API/tstack.c                            TIMEOUT 30.01 s
 21/272 API/tstringset.c                        TIMEOUT 30.02 s
 22/272 API/tuser.c                             TIMEOUT 30.02 s
 23/272 API/tvsafehash.c                        TIMEOUT 30.04 s
 24/272 API/tvsaferehash.c                      TIMEOUT 30.04 s
 25/272 API/tvsafetree.c                        TIMEOUT 30.03 s
 26/272 API/tvthread.c                          TIMEOUT 30.03 s
 27/272 API/twalk.c                             TIMEOUT 30.02 s                
 28/272 API/tview.c                             TIMEOUT 30.03 s
115/272 API/tsharemem.c                         FAIL     0.38 s
116/272 API/tsharesimple.c                      FAIL     0.58 s
208/272 pty/C                                   FAIL    21.26 s
209/272 pty/C.UTF-8                             FAIL    21.29 s
210/272 pty/shcomp                              FAIL    21.00 s
235/272 sigchld/C                               FAIL    29.91 s

OK:       242
FAIL:      30
SKIP:       0
TIMEOUT:   24

kernigh commented 6 years ago

Here's the problem with API/tlock.c in OpenBSD:

$ ninja src/cmd/tests/aso/tlock.c
ninja: no work to do.
$ time src/cmd/tests/aso/tlock.c
    2m41.96s real     0m09.48s user     0m12.67s system

The test sleeps too much. The spinlock uses compare-and-swap and usleep(100). (The compare-and-swap uses x86 lock cmpxchg.) The test wants to sleep for 100 microseconds, but OpenBSD sleeps for a minimum of 10000 to 20000 microseconds. This is too slow.

To measure the sleep time, I wrote this test program:

#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    struct timeval t1, t2, ti;
    int i;
    char c;

    if (argc != 2 || sscanf(argv[1], "%d %c", &i, &c) != 1) {
        fprintf(stderr, "usage: measure integer\n");
        return 1;
    }
    gettimeofday(&t1, NULL);
    usleep(i);
    gettimeofday(&t2, NULL);
    timersub(&t2, &t1, &ti);
    printf("usleep(%d) took %lld s %lld us\n", i, (long long)ti.tv_sec, (long long)ti.tv_usec);
    return 0;
}

$ cc -o measure measure.c
$ ./measure 100 
usleep(100) took 0 s 15777 us

The sleep is always between 10000 and 20000 microseconds.

$ sysctl kern.timecounter kern.clockrate
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=tsc
kern.timecounter.choice=i8254(0) acpihpet0(1000) tsc(2000) acpitimer0(1000) dummy(-1000000)
kern.clockrate=tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100

I tried changing the timecounter with commands as root like sysctl kern.timecounter.hardware=i8254, but this did not change the sleep time.

The kernel's clock runs at 100 hz, so each tick has 10000 microseconds. In libc, usleep() calls nanosleep(). In the kernel, nanosleep() converts the sleep time to ticks. So usleep(100) sleeps for 1 tick. The sleep seems to begin at the next tick and end at the second-next tick, so it sleeps for 10000 to 20000 microseconds.

krader1961 commented 6 years ago

Good debugging, @kernigh. I'm actually surprised the test presumes it can sleep for ~100us. Because on older operating systems minimum sleep durations were typically on the order of 1ms, 10ms, or even 100ms. So I would expect this test to fail on a lot of operating systems unless run with a ridiculously long timeout. I strongly suspect that historically the AST team ran these tests with no (or a huge) timeout. If you run the tests with a large multiplier (e.g., meson test -t10) does that mitigate the problem?

kernigh commented 6 years ago

@krader1961 just taught me that meson test has a -t option.

Running meson test -t10 fixes all the time-outs. Each test from 5/272 API/tlock.c down to 28/272 API/tview.c takes 160 to 165 seconds:

  5/272 API/tlock.c                             OK      163.06 s               
  6/272 API/tannounce.c                         OK      162.30 s               
  7/272 API/tbags.c                             OK      162.26 s               
...
 26/272 API/tvthread.c                          OK      164.15 s               
 27/272 API/twalk.c                             OK      163.47 s               
 28/272 API/tview.c                             OK      163.99 s

This run has 7 failures:

115/272 API/tsharemem.c                         FAIL     0.35 s                
116/272 API/tsharesimple.c                      FAIL     0.50 s                
208/272 pty/C                                   FAIL    21.72 s                
209/272 pty/C.UTF-8                             FAIL    21.61 s                
210/272 pty/shcomp                              FAIL    21.46 s                
236/272 sigchld/C.UTF-8                         FAIL    29.19 s                
237/272 sigchld/shcomp                          FAIL    28.30 s                

OK:       265                                                                  
FAIL:       7                                                                  
SKIP:       0                                                                  
TIMEOUT:    0

krader1961 commented 6 years ago

Thanks for that work, @kernigh. I'd really like to see us reduce the run time of those tests to something more reasonable given what they do. But as a stopgap measure I'd be okay with changing the default timeout for the API tests to five minutes. That is, adding a timeout: 600 argument to the test() invocation in src/cmd/tests/aso/meson.build and src/cmd/tests/cdt/meson.build. See src/cmd/ksh93/meson.build where we selectively increase the timeout for some tests. Along with a "TODO" comment to remind us to figure out how to make these tests more efficient.

kernigh commented 6 years ago

I'm still using a patch for OpenBSD libc to fix the arith, comvario, treemove failures. The patch went into OpenBSD-current, and should be in the next release (OpenBSD 6.4), but I am still running OpenBSD 6.3 on amd64. The patch only affects alpha, amd64, arm, i386.

Thanks to recent changes by @krader1961, API tests no longer take an annoyingly long time to run. I saw no API test failures today. I got 7 failures: pty failed as usual, sigchld randomly passed or failed as usual, but arrays is a new failure:

 99/242 arrays/C                                FAIL     1.37 s                 
100/242 arrays/C.UTF-8                          FAIL     1.66 s                 
101/242 arrays/shcomp                           FAIL     1.33 s                 
178/242 pty/C                                   FAIL    21.60 s                 
179/242 pty/C.UTF-8                             FAIL    21.62 s                 
180/242 pty/shcomp                              FAIL    21.58 s                 
205/242 sigchld/C                               FAIL    28.30 s

This is the new failure in arrays. I don't know the cause of failure, but I might want to look at vmalloc removal (#396) first.

--- stdout ---                                                        [420/1994]
<I> arrays[-1]: TEST_DIR=/tmp/ksh.arrays.tCXwnYT                                
--- stderr ---                                                                  
<E> arrays[747]: copying a large array fails                                    
<W> arrays[-1]: error_count = 1

The failure in sigchld has a new message: ksh(67930) in malloc(): recursive call. Now that vmalloc is gone, OpenBSD's malloc writes this error message and dumps core. The recursive call to malloc() is in a signal handler. This is the same kind of bug as #563, but #563 has a different test coprocess in a different signal handler.

--- stdout ---                                                                  
<I> sigchld[-1]: TEST_DIR=/tmp/ksh.sigchld.1ViTC95                              
childsig                                                                        
childsig                                                                        
childsig                                                                        
--- stderr ---                                                                  
<E> sigchld[214]: 70355 already finished and reaped                             
ksh(67930) in malloc(): recursive call                                          
/home/kernigh/park/ast/src/cmd/ksh93/tests/util/run_test.sh: line 89: 67930: Abo
rt(coredump)

This is the backtrace from egdb for the sigchld failure. (OpenBSD 6.3 for amd64 comes with an old version of gdb, but pkg_add gdb installs a newer version as egdb.)

Program terminated with signal SIGABRT, Aborted.                               
#0  thrkill () at -:3                                                          
3       -: No such file or directory.                                          
(gdb) bt                                
#0  thrkill () at -:3                                                          
#1  0x0000144506b1a4ae in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51
#2  0x0000144506b173a9 in wrterror (d=0x1444e0af6a30,                          
    msg=0x144506caf548 "recursive call")                                       
    at /usr/src/lib/libc/stdlib/malloc.c:288                                   
#3  0x0000144506b16bd7 in malloc (size=<optimized out>)                        
    at /usr/src/lib/libc/stdlib/malloc.c:1241                                  
#4  0x000014428fc4968a in set_trapinfo (shp=0x144290118ee0 <sh>, sig=20,       
    info=0x7f7ffffdf4b0) at ../src/cmd/ksh93/sh/fault.c:123                    
#5  0x000014428fc49cc3 in sh_fault (sig=20, info=0x7f7ffffdf4b0,               
    context=0x7f7ffffdf3c0) at ../src/cmd/ksh93/sh/fault.c:219                 
#6  <signal handler called>                                                    
#7  memset () at /usr/src/lib/libc/arch/amd64/string/memset.S:46               
#8  0x0000144506b16fad in omalloc (pool=<optimized out>, sz=<optimized out>,   
    zero_fill=<optimized out>, f=<optimized out>)                              
    at /usr/src/lib/libc/stdlib/malloc.c:1152                                  
#9  0x0000144506b18e3a in calloc (nmemb=1, size=16)                            
    at /usr/src/lib/libc/stdlib/malloc.c:1724                                  
#10 0x000014428fc5e6d8 in init_savelist () at ../src/cmd/ksh93/sh/jobs.c:89    
#11 0x000014428fc62d67 in job_post (shp=0x144290118ee0 <sh>, pid=62247, join=0)
    at ../src/cmd/ksh93/sh/jobs.c:1206  
#12 0x000014428fcc5df1 in _sh_fork (shp=0x144290118ee0 <sh>, parent=62247,     
---Type <return> to continue, or q <return> to quit---                         
    flags=106, jobid=0x7f7ffffe0384) at ../src/cmd/ksh93/sh/xec.c:2732         
#13 0x000014428fcc5a1f in sh_fork (shp=0x144290118ee0 <sh>, flags=106,         
    jobid=0x7f7ffffe0384) at ../src/cmd/ksh93/sh/xec.c:2805                    
#14 0x000014428fcbf7af in sh_exec (shp=0x144290118ee0 <sh>, t=0x144544ddc360,  
    flags=4) at ../src/cmd/ksh93/sh/xec.c:1525                                 
#15 0x000014428fcc25cb in sh_exec (shp=0x144290118ee0 <sh>, t=0x144544ddc040,  
    flags=4) at ../src/cmd/ksh93/sh/xec.c:2150                                 
#16 0x000014428fcc161d in sh_exec (shp=0x144290118ee0 <sh>, t=0x144544ddc040,  
    flags=4) at ../src/cmd/ksh93/sh/xec.c:1957                                 
#17 0x000014428fc7a3a1 in exfile (shp=0x144290118ee0 <sh>, iop=0x144540189d00, 
    fno=11) at ../src/cmd/ksh93/sh/main.c:536                                  
#18 0x000014428fc7b85c in sh_main (ac=3, av=0x7f7ffffe1ea8, userinit=0x0)      
    at ../src/cmd/ksh93/sh/main.c:335                                          
#19 0x000014428fcab596 in main (argc=3, argv=0x7f7ffffe1ea8)                   
    at ../src/cmd/ksh93/sh/pmain.c:41

krader1961 commented 6 years ago

Thanks, @kernigh, for your excellent, well written, observations. I'm working on a change to reintroduce the AST Vmalloc vmbusy() behavior while using the system malloc. That should take care of the sigchld test failure and might fix the arrays test failure.

kernigh commented 6 years ago

I now confirm that commit 489c672 (Replace calls to AST Vmalloc with system malloc) causes the arrays test to fail.

krader1961 commented 6 years ago

The array test is failing due to a SIGSEGV. It's blowing up on this statement in the unit test: v=$(typeset -p hugecpv). So we have another memory management bug in the name/value code.

#3  0x14d6fc98 in handle_sigsegv (signo=11) at ../src/cmd/ksh93/sh/fault.c:107
#4  <signal handler called>
#5  genvalue (argv=0x80bf2d48, prefix=0x821bf138 "hugecpv", n=0, wp=0xcf7f80b8)
    at ../src/cmd/ksh93/sh/nvtree.c:932
#6  0x14dc05a8 in walk_tree (np=0x7c950540, xp=0x0, flags=8192)
    at ../src/cmd/ksh93/sh/nvtree.c:1099
#7  0x14dc5b86 in nv_getvtree (np=0x7c950540, fp=0x7c954ba0)
    at ../src/cmd/ksh93/sh/nvtree.c:1133
#8  0x14db961d in nv_getv (np=0x7c950540, nfp=0x7c954ba0)
    at ../src/cmd/ksh93/sh/nvdisc.c:50
#9  0x14db501a in nv_getval (np=0x7c950540) at ../src/cmd/ksh93/sh/name.c:2388
#10 0x14d363d2 in print_value (iop=0x34d7880c, np=0x7c950540, tp=0xcf7f8620)
    at ../src/cmd/ksh93/bltins/typeset.c:525
#11 0x14d33680 in setall (argv=0x821bf0d4, flag=0, troot=0x7c950e40, tp=0xcf7f8620)
    at ../src/cmd/ksh93/bltins/typeset.c:648
#12 0x14d35edb in b_typeset (argc=3, argv=0x821bf0d0, context=0x34d7c494)
    at ../src/cmd/ksh93/bltins/typeset.c:474
#13 0x14df4cc7 in sh_exec (shp=0x34d7c17c, t=0x821bf070, flags=5)
    at ../src/cmd/ksh93/sh/xec.c:1233
#14 0x14dea4cb in sh_subshell (shp=0x34d7c17c, t=0x821bf070, flags=5, comsub=3)
    at ../src/cmd/ksh93/sh/subshell.c:561
#15 0x14d9c1c0 in comsubst (mp=0x7c94c8c0, t=0x821bf070, type=3)
    at ../src/cmd/ksh93/sh/macro.c:1958
#16 0x14d9d115 in varsub (mp=0x7c94c8c0) at ../src/cmd/ksh93/sh/macro.c:1103
#17 0x14d98a6b in copyto (mp=0x7c94c8c0, endch=0, newquote=0)
    at ../src/cmd/ksh93/sh/macro.c:583
#18 0x14d97c2f in sh_mactrim (shp=0x34d7c17c, str=0x821bf019 "v=$(typeset -p hugecpv)",
    mode=-1) at ../src/cmd/ksh93/sh/macro.c:173
#19 0x14da98c1 in sh_setlist (shp=0x34d7c17c, arg=0x821bf010, flags=131584, typ=0x0)
    at ../src/cmd/ksh93/sh/name.c:297
#20 0x14df3a06 in sh_exec (shp=0x34d7c17c, t=0x821bf034, flags=4)
    at ../src/cmd/ksh93/sh/xec.c:1062
#21 0x14da6765 in exfile (shp=0x34d7c17c, iop=0x34d787a4, fno=0)
    at ../src/cmd/ksh93/sh/main.c:536
#22 0x14da7ebf in sh_main (ac=1, av=0xcf7fb1f4, userinit=0)

krader1961 commented 6 years ago

/facepalm The failure is in this block of code:

if (argv[1]) {
    ssize_t r = (cp - argv[0]) + strlen(cp);
    if (argv[1][r] == '.' && strncmp(argv[0], argv[1], r) == 0) {
        wp->flags &= ~NV_COMVAR;
    }
}

Specifically, the argv[1][r] dereferences an address that is not mapped. It's one past the end of the string pointed to by argv[1] and the last character of that string ends on a page boundary with no adjacent mapped page.

This failure can occur on any distro and theoretically even with AST Vmalloc rather than the system malloc. Whether or not the SIGSEGV occurs is simply a consequence of whether or not the second of two var names passed to this function happens to end on a page boundary with no adjacent mapped page. Something that is unlikely but not impossible.

krader1961 commented 6 years ago

Double /facepalm The logic appears to make an incorrect assumption. It's trying to determine if two consecutive var names have a shared compound var name prefix. But unless there are non-obvious constraints on the caller of genvalue() the manner in which it does this can have false positives AFAICT.

For those interested in more details the problem is seen here:

(gdb) p argv[0]
$22 = 0x7de57fd5 "hugecpv.var5229"
(gdb) p argv[1]
$23 = 0x7de57ff1 "hugecpv.var523"
(gdb) p r
$24 = 15
(gdb) p cp
$25 = 0x7de57fdd "var5229"
(gdb) p argv[1] + 14
$27 = 0x7de57fff ""
(gdb) p argv[1] + 15
$26 = 0x7de58000 <Address 0x7de58000 out of bounds>

The problem occurs because the var name pointed to by argv[0] has length 15. But the var name pointed to by argv[1] is only 14 chars long and the 15'th character is the first byte of the next page which is not mapped into the virtual address space of the process.

krader1961 commented 6 years ago

The sigchld unit test is flakey on OpenBSD and sometimes dies from a SIGSEGV. I captured a core dump. While analysing the core dump I noticed the following block at the top of sigalrm():

static_fn void sigalrm(int sig, siginfo_t *info, void *context) {
...
    if (shp->st.trapcom[SIGALRM] && *shp->st.trapcom[SIGALRM]) {
        shp->siginfo[SIGALRM] = malloc(sizeof(siginfo_t));
        memcpy(shp->siginfo[SIGALRM], context, sizeof(siginfo_t));
    }

It should be copying info not context (which is type ucontext_t). That's not the cause of the SIGSEGV but it is wrong.

krader1961 commented 6 years ago

LOL and crying at the same time. The code block in my previous comment is broken in a more serious way. The shp->siginfo[] is a list of struct Siginfo *, not siginfo_t *. See set_trapinfo(). And this has been masked because struct Shell declared it as void **info and lots of explicit casts are used to hide the fact the wrong types are being used.

A rough guess is 90% of the explicit casts are not needed. Or wouldn't be needed if void * wasn't being used where another type is intended. Or a function had a foo_t * parameter but actually expects it to be a bar_t * which has a foo_t as its first member. And I'm certain some of those casts are hiding bugs.

krader1961 commented 6 years ago

The sigchld test is failing sporadically with errors like

<E> sigchld[214]: 57082 already finished and reaped

This is from the final test which spawns 512 sleep commands in the background. Instrumenting the test shows that PIDs are being recycled quickly. Something allowed by the relevant standards but is unusual for most of the UNIX kernels I've worked with.

Too, that test is pointless in addition to being broken by virtue of relying on an incorrect assumption about how PIDs are assigned. It would be better if it tracked the PID of each sleep put in the background then verified we processed a SIGCHLD for each one. And at the end verified there were no PIDs for which we did not run the SIGCHLD trap.

krader1961 commented 6 years ago

Closing since I've merged all the changes needed to have testing on OpenBSD have failure rates comparable to other platforms. That is, there are still a few bugs such as #576 which manifest occasionally on OpenBSD but they affect all platforms in a random fashion. So I'm closing this since all the systemic failures have been addressed.

att / ast