gavinhoward / bc

An implementation of the POSIX bc calculator with GNU extensions and dc, moved away from GitHub. Finished, but well-maintained.
https://git.gavinhoward.com/gavin/bc

Test failure on aarch64 Alpine Linux #36

Closed maxice8 closed 2 years ago

maxice8 commented 2 years ago

All bc tests passed.

***********************************************************************
Killed

dc crashed (137) on test:

    tests/dc/errors/33.txt
make: *** [Makefile:337: test_dc_errors] Error 137

The referenced file, tests/dc/errors/33.txt, is a binary file.
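For readers unfamiliar with the exit code in the log: shells report a process terminated by a signal as 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of decoding such a status follows; `signal_from_status` is a hypothetical helper name, not part of bc's test suite.

```shell
# Hypothetical helper: map a shell exit status to a signal name, if any.
# Assumes the usual shell convention that status = 128 + signal number.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

# Decode the status seen in the build log above.
signal_from_status 137
```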

gavinhoward commented 2 years ago

This is not good.

Can you tell me what version of bc you are testing, along with the version of Alpine and musl?

gavinhoward commented 2 years ago

Oh, I forgot. Could you also tell me the compiler, its version, and the commands you are using to build bc?

maxice8 commented 2 years ago

Version: 5.0.1
Alpine: Edge
musl: 1.2.2-r5

gcc:

gcc (Alpine 10.3.1_git20210625) 10.3.1 20210625
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

commands:

build() {
    PREFIX=/usr DESTDIR="$pkgdir" EXECSUFFIX=-howard ./configure.sh -GN
    make
}

check() {
    make test
}

package() {
    make install
}

Full log of the build and failure here: https://build.alpinelinux.org/buildlogs/build-edge-aarch64/testing/howard-bc/howard-bc-5.0.1-r0.log

gavinhoward commented 2 years ago

Thank you. I'm trying to get a cross-compilation toolchain working and will debug the issue.

gavinhoward commented 2 years ago

I have bad news: I can't reproduce.

I can't reproduce on x86_64 with gcc and glibc. I can't reproduce on the same platform with the same musl. I finally got a cross-compilation toolchain working, and I can't reproduce it under QEMU.

I think that in order to reproduce this, I need to buy an aarch64 machine. This may take me a while.

Is there an imminent release for Alpine coming up?

maxice8 commented 2 years ago

Looking at the Alpine CI for aarch64, it passed completely. It seems like only the builders have this failure.

There is no release imminent, but it is always nice to keep stuff working and available.

@ikke is there something different on the builders that could cause the crash? Maybe a blacklisted syscall?

Ikke commented 2 years ago

I can reproduce it in my aarch64 lxc container. It reaches this point:

Running dc error file tests/dc/errors/33.txt...pass
Running dc error file tests/dc/errors/33.txt through cat...

and then starts to use an unbounded amount of memory, and eventually gets killed by the oom-killer:

oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=lxc.payload.build-edge-aarch64,mems_allowed=1,global_oom,task_memcg=/lxc.payload.ikke-edge-aarch64,task=dc,pid=49558,uid=1000
Out of memory: Killed process 49558 (dc) total-vm:819831256kB, anon-rss:162755232kB, file-rss:556kB, shmem-rss:0kB, UID:1000 pgtables:318760kB oom_score_adj:0
oom_reaper: reaped process 49558 (dc), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

gavinhoward commented 2 years ago

Hmm...I had this problem with FreeBSD too. What happens is that this test is meant to make sure that bc does not crash when it can't allocate memory. Obviously, this test will fail when the OS lies too much about the memory it can give bc.

I haven't had this problem on glibc, though, on a machine with 32GB of memory, so I don't know why it decided to have problems here.
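As an aside on the overcommit behavior described above: one general way to make an oversized allocation fail inside the process, instead of succeeding on paper and later attracting the OOM killer, is to cap the process's address space before running it. The sketch below assumes a shell whose `ulimit` builtin supports `-v` (common, but not required by POSIX); `run_with_vm_limit` is a hypothetical name, and nothing in this thread says bc's test suite works this way.

```shell
# Hypothetical helper: run a command with its virtual address space capped.
# Assumes the shell's ulimit builtin supports -v (limit given in KiB).
run_with_vm_limit() {
    limit_kib=$1
    shift
    # The subshell keeps the lowered limit from leaking into the caller.
    (
        ulimit -v "$limit_kib" || exit 1
        exec "$@"
    )
}

# Example: the child shell reports the 1 GiB cap it inherited.
run_with_vm_limit 1048576 sh -c 'ulimit -v'
```

With such a cap in place, an allocation beyond the limit should be refused by the kernel, so the program's own out-of-memory handling gets exercised.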

I have a solution, though: I can accept a SIGKILL as a passing result. I don't think that doing this will be a problem because SIGKILL always has to come from outside, not from some problem in bc itself (besides allocating too much memory, but that's a problem with the OS if it lies too much).

I will have a release out within the day with that change, if that is an acceptable solution to you both.
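A rough sketch of what "accept a SIGKILL as a passing result" could look like in a shell test wrapper is shown here. This is an illustration only: `run_tolerating_sigkill` is a hypothetical name, and the actual test scripts in bc may be structured quite differently.

```shell
# Hypothetical wrapper, not the project's actual test script.
# Treats exit status 137 (128 + SIGKILL) as a pass and passes every
# other exit status through unchanged.
run_tolerating_sigkill() {
    "$@"
    status=$?
    if [ "$status" -eq 137 ]; then
        echo "killed by SIGKILL (possibly the OOM killer); counting as pass"
        return 0
    fi
    return "$status"
}

# Example: a child that kills itself with SIGKILL is counted as a pass.
run_tolerating_sigkill sh -c 'kill -KILL $$'
```

The reasoning in the comment above is what makes this tolerable: SIGKILL is delivered from outside the process, so a status of 137 says nothing about a bug such as a segmentation fault, which would show up as a different status.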

Ikke commented 2 years ago

The most likely issue, then, is that even though the host has lots of memory, the containers are limited to the memory of a single NUMA domain, which is half the memory available on the host.

gavinhoward commented 2 years ago

I have changed the test to not cause OOM conditions. I'm going to run my release regimen and release 5.0.2 for you.

gavinhoward commented 2 years ago

5.0.2 is out. I hope this one works for you all!

Please reopen if it does not.

Ikke commented 2 years ago

Thanks, can confirm that 5.0.2 is no longer failing.