actions / runner-images

GitHub Actions runner images

Crashing Clang with Asan+Ubsan builds in 20240310.1.0 Ubuntu 22.04. #9491

Closed: igsilya closed this issue 1 month ago

igsilya commented 2 months ago

Description

For some reason, all binaries built with clang -fsanitize=address,undefined are crashing on the new 20240310.1.0 Ubuntu 22.04 images.

Examples: https://github.com/ovsrobot/ovs/actions/runs/8237626849 https://github.com/igsilya/ovs/actions/runs/8238397886

Tests that happened to run on the 20240310.1.0 image are all crashing on startup:

#                             -*- compilation -*-
740. json.at:261: testing exponent must contain at least one digit (1) - C ...
../../../tests/json.at:261: printf %s "[1e]" > input
../../../tests/json.at:261: ovstest test-json  input
--- /dev/null   2024-03-11 19:04:04.198924076 +0000
+++ /home/runner/work/ovs/ovs/openvswitch-3.3.90/_build/sub/tests/testsuite.dir/at-groups/740/stderr    2024-03-11 19:13:51.408292432 +0000
@@ -0,0 +1 @@
+/home/runner/work/ovs/ovs/openvswitch-3.3.90/_build/sub/tests/testsuite.dir/at-groups/740/test-source: line 29: 66606 Segmentation fault      (core dumped) ovstest test-json input
stdout:
../../../tests/json.at:261: exit code was 139, expected 1
input:
> [1e]740. json.at:261: 740. exponent must contain at least one digit (1) - C (json.at:261): FAILED (json.at:261)

The crash above comes from a test of a very simple application.

Platforms affected

Runner images affected

Image version and build link

Runner Image
Image: ubuntu-22.04
Version: 20240310.1.0

Is it regression?

Yes. 20240304.1.0 works just fine. For example, another job from the same run: https://github.com/ovsrobot/ovs/actions/runs/8237626849/job/22527053666

Expected behavior

The built binaries should not crash on startup.

Actual behavior

All the built binaries crash on startup.

Repro steps

Run this workflow: https://github.com/openvswitch/ovs/actions/workflows/build-and-test.yml with the openvswitch/ovs repository.
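Independent of OVS, a minimal sketch along these lines (the file name and flags are made up for illustration, not from the workflow above) should reproduce the same startup crash on the affected image with the clang 14 it provides:

    # Minimal reproduction sketch: any ASan-instrumented binary crashes at
    # startup on the affected image, before main() even runs.
    printf 'int main(void) { return 0; }\n' > repro.c
    clang -g -O2 -fsanitize=address,undefined repro.c -o repro
    ./repro   # on 20240310.1.0: Segmentation fault / AddressSanitizer:DEADLYSIGNAL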

igsilya commented 2 months ago

Even a plain ./configure run sometimes crashes:

configure:4721: $? = 0
configure:4710: clang -V >&5
clang: error: argument to '-V' is missing (expected 1 value)
clang: error: no input files
configure:4721: $? = 1
configure:4710: clang -qversion >&5
clang: error: unknown argument '-qversion'; did you mean '--version'?
clang: error: no input files
configure:4721: $? = 1
configure:4710: clang -version >&5
clang: error: unknown argument '-version'; did you mean '--version'?
clang: error: no input files
configure:4721: $? = 1
configure:4741: checking whether the C compiler works
configure:4763: clang -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined   conftest.c  >&5
configure:4767: $? = 0
configure:4817: result: yes
configure:4820: checking for C compiler default output file name
configure:4822: result: a.out
configure:4828: checking for suffix of executables
configure:4835: clang -o conftest -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined   conftest.c  >&5
configure:4839: $? = 0
configure:4862: result: 
configure:4884: checking whether we are cross compiling
configure:4892: clang -o conftest -g -O2 -Wno-error=unused-command-line-argument -fno-omit-frame-pointer -fno-common -fsanitize=address,undefined   conftest.c  >&5
configure:4896: $? = 0
configure:4903: ./conftest
./configure: line 4905:  5356 Segmentation fault      (core dumped) ./conftest$ac_cv_exeext
configure:4907: $? = 139
configure:4914: error: in `/home/runner/work/ovs/ovs':
configure:4916: error: cannot run C compiled programs.
igsilya commented 2 months ago

The issue is caused by an incompatibility between LLVM 14, as shipped in the ubuntu-22.04 image, and the much newer kernel, which is configured with high-entropy ASLR. In short, ASan in LLVM 14 assumes at most 28 bits of mmap address randomization; with 32 bits, randomized mappings can collide with ASan's fixed shadow-memory layout, so instrumented binaries crash at startup.

In 20240304.1.0:

$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8

In 20240310.1.0:

$ sudo sysctl -a | grep vm.mmap.rnd
vm.mmap_rnd_bits = 32
vm.mmap_rnd_compat_bits = 16

The issue was fixed in newer versions of llvm: https://github.com/llvm/llvm-project/commit/fb77ca05ffb4f8e666878f2f6718a9fb4d686839 https://reviews.llvm.org/D148280

So, we need one of the following:

  1. an updated version of llvm/clang that is compatible with the new kernel;
  2. a kernel config change to reduce the entropy; or
  3. a global sysctl change to set vm.mmap_rnd_bits = 28.

As a workaround, the following addition to the workflow makes builds work with the new image:

    - name: Fix kernel mmap rnd bits
      # Asan in llvm 14 provided in ubuntu 22.04 is incompatible with
      # high-entropy ASLR in much newer kernels that GitHub runners are
      # using, leading to random crashes: https://reviews.llvm.org/D148280
      run: sudo sysctl vm.mmap_rnd_bits=28
MaksimZhukov commented 2 months ago

Hello @igsilya! Thank you for reporting this! We will investigate the issue and get back to you as soon as we have any updates.

mrc0mmand commented 2 months ago

I can confirm that we (systemd) seem to suffer from the same issue. Some example runs:

Thanks a lot, @igsilya, for finding out the root cause and a potential workaround!

mikhailkoliada commented 2 months ago

Interestingly, I always thought that higher entropy causes somewhat fewer problems than lower entropy. I guess we are going to tune the kernel parameter for the time being; I'll prepare the changes.

njzjz commented 2 months ago

We also experienced random segfaults when using GCC and fsan. Maybe it's also related to this issue?


Update: I see https://stackoverflow.com/questions/77894856/possible-bug-in-gcc-sanitizers

catenacyber commented 2 months ago

Also for https://github.com/OISF/suricata

pmatilai commented 2 months ago

We are seeing this too: https://github.com/rpm-software-management/rpm/actions

All of a sudden, some random Python tests started crashing in our test suite with DEADLYSIGNAL from ASAN, first sporadically and now in every run. After chasing a good many ghosts, we finally spotted the version difference in the logs: all the successful runs are with 20240304.1.0, and all the failures are with 20240310.1.0.

We haven't been able to reproduce this locally on other systems; it only happens on GitHub Actions.

Curiously, in our case it's only Python-related tests that misbehave. If I disable ASAN, the tests pass just fine.

Our test suite runs inside a Fedora Docker image, so for it to be affected it pretty much has to be something to do with the kernel in the new Ubuntu image. https://stackoverflow.com/questions/77894856/possible-bug-in-gcc-sanitizers#comment137326102_77894856 looks like a possible clue...

(yet another edit): I can also confirm that the sysctl workaround cures it. A million thanks for that, @igsilya!

pfusik commented 2 months ago

This is also affecting g++-13 (AddressSanitizer:DEADLYSIGNAL) and Swift 5.10 (Segmentation fault (core dumped)) in https://github.com/fusionlanguage/fut. Thanks, @igsilya, for the workaround!

EduPonz commented 2 months ago

This is also happening to us, for example here. Do you have a timeline for deploying a new image?

mikhailkoliada commented 2 months ago

The fix is in main and will be deployed by the end of the next working week (next Friday).

0xTim commented 2 months ago

@mikhailkoliada is there a way to use the new image or to speed up the deployment? We're using Docker, so I don't think the workaround will work, and all our builds are currently blocked; waiting a week to unblock CI is a bit painful (we're also investigating other workarounds for now).

Habbie commented 2 months ago

We're using Docker so I don't think the workaround will work

it will with --privileged: https://github.com/PowerDNS/pdns/pull/13907/files
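A rough sketch of what that looks like for a containerized job (the image name and build command are placeholders, not taken from the linked PR):

    # Sketch only: --privileged lets sysctl lower the host-wide mmap
    # randomization from inside the container, since vm.mmap_rnd_bits is
    # a kernel-global setting and is not namespaced.
    docker run --privileged --rm my-build-image \
        sh -c 'sysctl -w vm.mmap_rnd_bits=28 && make check'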

igsilya commented 2 months ago

@0xTim @Habbie It looks like the fix is going to be the same as a workaround: https://github.com/actions/runner-images/pull/9513 . So, you'll probably need to keep the --privileged even after the image upgrade.

Maybe @mikhailkoliada can confirm if the kernel config will actually be changed or not.

wojsmol commented 2 months ago

if the kernel config will actually be changed or not.

@igsilya If I understand this PR correctly then yes.

igsilya commented 2 months ago

if the kernel config will actually be changed or not.

@igsilya If I understand this PR correctly then yes.

@wojsmol It's not a kernel config change, it's just a sysctl change. A kernel config change would be to decrease the actual CONFIG_ARCH_MMAP_RND_BITS.
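To illustrate the difference (the /boot path below is the usual Ubuntu location and is an assumption here): the sysctl only adjusts the runtime value, while the kernel build configuration defines the baked-in default and range:

    # Runtime value that the sysctl workaround/fix adjusts:
    sysctl vm.mmap_rnd_bits
    # Compile-time values baked into the kernel config (typical Ubuntu location):
    grep 'CONFIG_ARCH_MMAP_RND_BITS' /boot/config-"$(uname -r)"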

Habbie commented 2 months ago

It looks like the fix is going to be the same as a workaround:

except it's applied way before our yaml is even parsed, so I suspect it will "just work". We'll see. The workaround is functional right now anyway :)

phil-blain commented 2 months ago

For those interested, I added details about which commits are responsible for the failures and consecutive fixes in the various repositories involved in https://github.com/actions/runner-images/issues/9524#issuecomment-2002065399.

Chilledheart commented 2 months ago

The issue is caused by an incompatibility between LLVM 14 shipped in the ubuntu-22.04 image and the much newer kernel configured with high-entropy ASLR. [...] So, we need one of the following:

  1. an updated version of llvm/clang that is compatible with the new kernel;
  2. a kernel config change to reduce the entropy; or
  3. a global sysctl change to set vm.mmap_rnd_bits = 28.

For option 1, yes, I can verify that asan/ubsan/tsan work with the latest llvm, but msan does not.

Below are my CI bots:

[Screenshot of CI run results, 2024-03-18 18:56]

So it is probably not the same case as this issue, @Alexey-Ayupov, or option 1 doesn't work in my case. You need another upstream fix for msan.

I am using a clang binary very close to ToT (https://github.com/llvm/llvm-project/commit/a0b3dbaf), which certainly contains the fix mentioned in option 1. FYI, you can download the toolchain from https://commondatastorage.googleapis.com/chromium-browser-clang/Linux_x64/clang-llvmorg-19-init-2941-ga0b3dbaf-22.tgz
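For anyone who wants to try option 1 without waiting for a new image, a rough sketch of using such a prebuilt toolchain follows; the extraction directory and build commands are illustrative assumptions, not part of the report above:

    # Sketch only: fetch a recent prebuilt clang (contains the D148280 fix)
    # and point an autoconf-based build at it.
    mkdir -p "$HOME/clang-tot"
    curl -fsSL https://commondatastorage.googleapis.com/chromium-browser-clang/Linux_x64/clang-llvmorg-19-init-2941-ga0b3dbaf-22.tgz \
        | tar -xz -C "$HOME/clang-tot"
    export CC="$HOME/clang-tot/bin/clang"
    export CXX="$HOME/clang-tot/bin/clang++"
    ./configure CFLAGS="-g -O2 -fsanitize=address,undefined"
    make -j"$(nproc)"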

evverx commented 2 months ago

You need another upstream fix for msan

That would probably be https://github.com/llvm/llvm-project/pull/85142

fzakaria commented 2 months ago

@igsilya can you share how you root-caused this to that issue? I'm curious.

Thank you for the identification -- we were hitting this issue too.

mikhailkoliada commented 1 month ago

The fix has been deployed. The kernel parameters are unlikely to change, so we will have to live with the live sysctl patching for the time being. Thank you for all your patience!

cmigliorini commented 2 weeks ago

sudo sysctl vm.mmap_rnd_bits=28

We can't do this in a running container unless it's privileged, right? I have an infinite "AddressSanitizer:DEADLYSIGNAL" loop while running gcc10 in a Focal container on a Jammy host (kernel 6.5); it is indeed solved by applying the above command to the host (vm.mmap_rnd_bits is 32 at boot).