Closed: igsilya closed this issue 1 month ago.
Even a simple ./configure run sometimes crashes:
configure:4721: $? = 0
configure:4710: clang -V >&5
clang: error: argument to '-V' is missing (expected 1 value)
clang: error: no input files
configure:4721: $? = 1
configure:4710: clang -qversion >&5
clang: error: unknown argument '-qversion'; did you mean '--version'?
clang: error: no input files
configure:4721: $? = 1
configure:4710: clang -version >&5
clang: error: unknown argument '-version'; did you mean '--version'?
clang: error: no input files
configure:4721: $? = 1
configure:4741: checking whether the C compiler works
configure:4763: clang -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined conftest.c >&5
configure:4767: $? = 0
configure:4817: result: yes
configure:4820: checking for C compiler default output file name
configure:4822: result: a.out
configure:4828: checking for suffix of executables
configure:4835: clang -o conftest -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined conftest.c >&5
configure:4839: $? = 0
configure:4862: result:
configure:4884: checking whether we are cross compiling
configure:4892: clang -o conftest -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined conftest.c >&5
configure:4896: $? = 0
configure:4903: ./conftest
./configure: line 4905: 5356 Segmentation fault (core dumped) ./conftest$ac_cv_exeext
configure:4907: $? = 139
configure:4914: error: in `/home/runner/work/ovs/ovs':
configure:4916: error: cannot run C compiled programs.
The issue is caused by an incompatibility between LLVM 14, as provided in the ubuntu-22.04 image, and the much newer kernel configured with high-entropy ASLR.
In 20240304.1.0:
$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
In 20240310.1.0:
$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 32
vm.mmap_rnd_compat_bits = 16
The issue was fixed in newer versions of LLVM: https://github.com/llvm/llvm-project/commit/fb77ca05ffb4f8e666878f2f6718a9fb4d686839 https://reviews.llvm.org/D148280
So, we either need:
- an updated version of llvm/clang that is compatible with the new kernel,
- or a kernel config change to reduce the entropy,
- or a global sysctl change to set vm.mmap_rnd_bits = 28.
As a workaround, the following addition to the workflow makes builds work with the new image:
      - name: Fix kernel mmap rnd bits
        # Asan in llvm 14 provided in ubuntu 22.04 is incompatible with
        # high-entropy ASLR in much newer kernels that GitHub runners are
        # using, leading to random crashes: https://reviews.llvm.org/D148280
        run: sudo sysctl vm.mmap_rnd_bits=28
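To make this class of failure easier to spot before a job starts running tests, a pre-flight check could be added along these lines. This is a sketch: the `check_aslr_entropy` helper name is made up here, and 28 is used as the threshold because it is the entropy the older image shipped with and the highest value known to work with ASan in LLVM 14 (per https://reviews.llvm.org/D148280):

```shell
#!/bin/sh
# Illustrative helper: report whether the kernel's mmap ASLR entropy is
# within the range ASan in LLVM 14 tolerates (<= 28 bits).
check_aslr_entropy() {
    # $1: value of vm.mmap_rnd_bits, e.g. from `sysctl -n vm.mmap_rnd_bits`
    if [ "$1" -gt 28 ]; then
        echo "incompatible"
    else
        echo "ok"
    fi
}

check_aslr_entropy 32   # entropy on the 20240310.1.0 image -> incompatible
check_aslr_entropy 28   # entropy on the 20240304.1.0 image -> ok
```

In a real workflow the argument would come from `sysctl -n vm.mmap_rnd_bits` on the runner itself.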
Hello @igsilya! Thank you for reporting this! We will investigate the issue and come back to you as soon as we have any updates.
I can confirm that we (systemd) seem to suffer from the same issue. Some example runs:
Thanks a lot, @igsilya, for finding out the root cause and a potential workaround!
Interestingly, I always thought that higher entropy causes slightly fewer problems than lower entropy. I guess we are going to tune the kernel parameter for the time being; I'll prepare the changes.
We also experienced random segfaults when using GCC with -fsanitize. Maybe it's also related to this issue?
Update: I see https://stackoverflow.com/questions/77894856/possible-bug-in-gcc-sanitizers
Also for https://github.com/OISF/suricata
We are seeing this too: https://github.com/rpm-software-management/rpm/actions
All of a sudden, some random Python tests started crashing in our test-suite with DEADLYSIGNAL from ASAN, first sporadically and now all the runs fail due to those. After chasing a good many ghosts, finally spotted the version difference from the logs: all the successful runs are with 20240304.1.0, and all the failures are with 20240310.1.0.
We haven't been able to reproduce this locally on other systems, it only happens on GH actions.
Curiously, in our case it's only Python related tests that misbehave. If I disable ASAN, the tests pass just fine.
Our test-suite runs inside a Fedora docker image, so for that to be affected it kinda has to be something to do with the kernel in the new Ubuntu image. https://stackoverflow.com/questions/77894856/possible-bug-in-gcc-sanitizers#comment137326102_77894856 looks like a possible clue...
(yet another edit): I can also confirm the sysctl workaround as curing it. Million thanks for that @igsilya !
This is also affecting g++-13 (AddressSanitizer:DEADLYSIGNAL) and Swift 5.10 (Segmentation fault (core dumped)) in https://github.com/fusionlanguage/fut
Thanks @igsilya for the workaround!
Also happening to us for example here. Do you guys have some timeline for deploying a new image?
The fix is in main and will be deployed by the end of the next working week (next Friday)
@mikhailkoliada is there a way to use the new image, or to speed up the deployment? We're using Docker, so I don't think the workaround will work, and all our builds are currently blocked; waiting a week to unblock CI is a bit painful (we're also investigating other workarounds for now).
We're using Docker so I don't think the workaround will work
It will with --privileged: https://github.com/PowerDNS/pdns/pull/13907/files
@0xTim @Habbie It looks like the fix is going to be the same as the workaround: https://github.com/actions/runner-images/pull/9513 . So, you'll probably need to keep the --privileged even after the image upgrade.
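For jobs that run inside a container, one possible shape is to start the container privileged so the sysctl step can run inside it (alternatively, apply the sysctl on the runner host in a step before the container starts). This is a sketch only; the image name is illustrative:

```yaml
jobs:
  build:
    runs-on: ubuntu-22.04
    container:
      image: fedora:39          # illustrative image
      options: --privileged     # needed so sysctl -w works inside the container
    steps:
      - name: Fix kernel mmap rnd bits
        run: sysctl -w vm.mmap_rnd_bits=28
```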
Maybe @mikhailkoliada can confirm if the kernel config will actually be changed or not.
if the kernel config will actually be changed or not.
@igsilya If I understand this PR correctly then yes.
@wojsmol It's not a kernel config, it's just a sysctl change. A kernel config change would be to decrease the actual CONFIG_ARCH_MMAP_RND_BITS.
It looks like the fix is going to be the same as a workaround:
except it's applied way before our yaml is even parsed, so I suspect it will "just work". We'll see. The workaround is functional right now anyway :)
For those interested, I added details about which commits are responsible for the failures and consecutive fixes in the various repositories involved in https://github.com/actions/runner-images/issues/9524#issuecomment-2002065399.
The issue is caused by an incompatibility between LLVM 14, as provided in the ubuntu-22.04 image, and the much newer kernel configured with high-entropy ASLR.
In 20240304.1.0:
$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
In 20240310.1.0:
$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 32
vm.mmap_rnd_compat_bits = 16
The issue was fixed in newer versions of LLVM: llvm/llvm-project@fb77ca0 https://reviews.llvm.org/D148280
So, we either need:
- an updated version of llvm/clang that is compatible with the new kernel,
- or a kernel config change to reduce the entropy,
- or a global sysctl change to set vm.mmap_rnd_bits = 28.
As a workaround, the following addition to the workflow makes builds work with the new image:
      - name: Fix kernel mmap rnd bits
        # Asan in llvm 14 provided in ubuntu 22.04 is incompatible with
        # high-entropy ASLR in much newer kernels that GitHub runners are
        # using, leading to random crashes: https://reviews.llvm.org/D148280
        run: sudo sysctl vm.mmap_rnd_bits=28
For option 1, yes, I can verify that asan/ubsan/tsan work with the latest llvm, but msan does not.
Below are my CI bots:
So it is probably not the same case as this issue, @Alexey-Ayupov, or option 1 doesn't work for my case. You need another upstream fix for msan.
I am using clang binary very close to ToT https://github.com/llvm/llvm-project/commit/a0b3dbaf which certainly contains the fix mentioned in option 1. FYI, you can download the toolchain from https://commondatastorage.googleapis.com/chromium-browser-clang/Linux_x64/clang-llvmorg-19-init-2941-ga0b3dbaf-22.tgz
You need another upstream fix for msan
That would probably be https://github.com/llvm/llvm-project/pull/85142
@igsilya can you share how you root-caused this issue? I'm curious.
Thank you for the identification -- we were hitting this issue too.
The fix has been deployed. The kernel parameters are unlikely to change, so we will have to live with the live sysctl patching for the time being. Thank you for all your patience!
sudo sysctl vm.mmap_rnd_bits=28
We can't do this in a running container unless it's privileged, right? I have an infinite "AddressSanitizer:DEADLYSIGNAL" loop while running gcc10 in a Focal container on a Jammy host (kernel 6.5), which is indeed solved by applying the above command to the host (vm.mmap_rnd_bits is 32 at boot).
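For diagnosis from inside an unprivileged container, the current value can still be inspected: reading /proc/sys/vm/mmap_rnd_bits needs no privileges, only writing it does. A minimal sketch (the `unknown` fallback is illustrative, for hosts where the file is absent or unreadable):

```shell
#!/bin/sh
# Read the host kernel's mmap randomization entropy from inside a container.
# Reading this /proc entry is unprivileged; changing it needs --privileged
# (or a sysctl on the host).
bits=$(cat /proc/sys/vm/mmap_rnd_bits 2>/dev/null || echo unknown)
echo "vm.mmap_rnd_bits=$bits"
```

A value above 28 here would point at the incompatibility discussed in this thread.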
Description
For some reason we have all the binaries built with
clang -fsanitize=address,undefined
crashing on the new 20240310.1.0 Ubuntu 22.04 images. Examples: https://github.com/ovsrobot/ovs/actions/runs/8237626849 https://github.com/igsilya/ovs/actions/runs/8238397886
Tests that happened to get the 20240310.1.0 image are all crashing on startup:
The above is a crash from a test with a very simple application.
Platforms affected
Runner images affected
Image version and build link
Runner Image
Image: ubuntu-22.04
Version: 20240310.1.0
Is it a regression?
Yes. 20240304.1.0 works just fine. For example another job from the same run: https://github.com/ovsrobot/ovs/actions/runs/8237626849/job/22527053666
Expected behavior
The built binaries should not crash.
Actual behavior
All the built binaries are crashing on start.
Repro steps
Run this workflow: https://github.com/openvswitch/ovs/actions/workflows/build-and-test.yml with the openvswitch/ovs repository.