actions / runner-images

GitHub Actions runner images

Ubuntu runner/archlinux container; incompatibilities introduced ~February 4, 2021 #2658

Closed backwardsEric closed 3 years ago

backwardsEric commented 3 years ago

Description
Test cases on the Angband project, run with the ubuntu-latest runner and the archlinux container, succeeded up to February 4, 2021 (last working example: https://github.com/angband/angband/pull/4631/checks?check_run_id=1835133558 ). Some time after that, the runs began to fail in actions/checkout@v2 with a message like "/__e/node12/bin/node: /usr/lib/libc.so.6: version `GLIBC_2.33' not found (required by /usr/lib/libstdc++.so.6)" (first example of that failure: https://github.com/angband/angband/pull/4632/checks?check_run_id=1850279958 ).

Performing a system update in the container (pacman -Syu), or updating only glibc, resolves that but introduces other problems: running autoconf fails with "This script requires a shell more modern than all the shells that I found on your system.", pacman -Qi glibc fails with the updated glibc, and investigating the autoconf failure with this workflow, https://github.com/backwardsEric/angband/blob/test-docker-failures/.github/workflows/broken-shell.yaml , shows that [ -x $SHELL ] fails in the container after updating glibc (sample output of that workflow for the test with the updated glibc: https://github.com/backwardsEric/angband/runs/1857512456?check_suite_focus=true ).

A bug posted on the Arch Linux tracker, https://bugs.archlinux.org/task/69563?project=1&order=dateopened&sort=desc , seems to point at the interaction between the host and the container as the source of the latter problems: specifically how the host handles faccessat2 operations used by glibc 2.33.

Area for Triage:
Servers Containers

Question, Bug, or Feature?:
Bug

Virtual environments affected

Image version 20210131.1

Expected behavior

Either actions/checkout@v2 works from the archlinux container without updating glibc from 2.32 to 2.33 or, with an update to glibc 2.33 in the container, a file access check like [ -x $SHELL ] succeeds.

Actual behavior

actions/checkout@v2 fails with the stock glibc in the container: "/__e/node12/bin/node: /usr/lib/libc.so.6: version `GLIBC_2.33' not found (required by /usr/lib/libstdc++.so.6)". With the updated glibc (2.33-3), actions/checkout@v2 works but the tests run by https://github.com/backwardsEric/angband/blob/test-docker-failures/.github/workflows/broken-shell.yaml report

[] on /bin/bash +f
/usr/bin/test on /bin/bash +x+f

after running

echo "[] on $SHELL "[ -x $SHELL ] && echo +x``[ -L $SHELL ] && echo +L``[ -f $SHELL ] && echo +f echo "/usr/bin/test on $SHELL "`/usr/bin/test -x $SHELL && echo +x/usr/bin/test -L $SHELL && echo +L/usr/bin/test -f $SHELL && echo +f

Repro steps
For the failures with cloning the repository:

  1. Running this workflow, https://github.com/backwardsEric/angband/blob/test-docker-failures/.github/workflows/broken-clone-project.yaml , in my fork of the Angband repository gives this failed result, https://github.com/backwardsEric/angband/actions/runs/549200450 , when glibc in the container is not updated.

For the file access issues:

  1. Running this workflow, https://github.com/backwardsEric/angband/blob/test-docker-failures/.github/workflows/broken-shell.yaml , indicates that [ -x $SHELL ] fails with the updated glibc, at least when run within backticks. The result is here, https://github.com/backwardsEric/angband/actions/runs/549013406 .
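
For a rough local approximation of the same failure mode without a runner, the following sketch (not the exact workflows above) assumes a host whose Docker still uses a runc older than 1.0.0-rc93 with the default seccomp profile; on such a host the fresh-shell access check is expected to fail once the container's glibc is updated to 2.33, matching the workflow behaviour above:

# Upgrade glibc inside a stock Arch container, then repeat the access check
# from a fresh shell (so the newly installed glibc is actually loaded).
docker run --rm archlinux:latest bash -c '
  pacman -Syu --noconfirm >/dev/null 2>&1
  bash -c "[ -x /bin/bash ] && echo access-check-ok || echo access-check-failed"
'
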
maxim-lobanov commented 3 years ago

Hello @backwardsEric, here is what I have found.

From the Set up Job step, you can see that both builds ran on image version 20210131.1. That means the successful and the failed builds ran on identical VMs with an identical configuration and software set, so the issue is not related to the hosted VM. Comparing the logs of the Initialize containers step, the Docker image itself was not updated either, so it can't be the root cause. The actions/checkout action also has not been updated since November 2020 (https://github.com/actions/checkout/releases). The single difference between the builds is the Install Build Dependencies step (successful build on the left, failed on the right): [screenshot]. I suspect one of these updated dependencies has broken your build.

backwardsEric commented 3 years ago

In agreement with your analysis, updating the container's packages (pacman -Syu) along with installing the other build dependencies does avoid the build breaking in actions/checkout@v2. The build then fails because autoconf cannot find a shell that it likes (example here, https://github.com/angband/angband/pull/4635/checks?check_run_id=1854604597 ). From what I've seen (the sort of tests in the broken-shell.yaml workflow cited in the original post), that's due to problems with file access checks. The Arch Linux bug report for failures in hosted environments for Arch and its glibc-2.33-3 package ( https://bugs.archlinux.org/task/69563?project=1&order=dateopened&sort=desc ) points the blame at the host environment for the access check problems. The comments in that report mention a change to runc ( https://github.com/opencontainers/runc/pull/2750 ) that would help the host environment handle the updated version of Arch.

aminvakil commented 3 years ago

I maintain this repo (https://github.com/aminvakil/docker-archlinux), which runs CI every night. Starting tonight, it is hitting the issue @backwardsEric has mentioned: https://bugs.archlinux.org/task/69563#comment196482.

I think using a version of docker-ce which has faccessat2 support would solve the issue.

nihaals commented 3 years ago

I think this issue is related

nihaals commented 3 years ago

My MWE might make testing easier

yan12125 commented 3 years ago

The comments in that report mention a change to runc ( opencontainers/runc#2750 ) that would help the host environment handle the updated version of Arch.

From a recent build log (https://github.visualstudio.com/08427f54-005b-4b34-b700-dca767ba7c14/_apis/build/builds/98565/logs/7), VMs for GitHub Actions appear to use runc 1.0.0~rc92 instead of the 1.0.0~rc93 mentioned in the Arch Linux bug report (https://bugs.archlinux.org/task/69563). Maybe that's the missing bit.

vsafonkin commented 3 years ago

https://github.com/actions/virtual-environments/issues/2698#issuecomment-779262068

cyphar commented 3 years ago

For some more context, the issue is that glibc 2.33 uses faccessat2 which is not permitted under the default Docker seccomp profile in older releases (or the host's libseccomp version is outdated, which means that even if the profile allows faccessat2 it will be blocked because libseccomp doesn't know what it is).

The patch to runc (which is in 1.0.0-rc93) fixes this problem by returning -ENOSYS for "syscalls newer than any listed in the profile" which means that faccessat2 gets -ENOSYS on older hosts -- which then glibc 2.33 handles gracefully. The solution is to update the host runc to 1.0.0-rc93. We worked on solving this issue some time ago -- see opencontainers/runc#2750 and the many linked issues.
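
As a stopgap on hosts that cannot be updated yet, the default seccomp profile can be bypassed per container. This is only a sketch of a workaround, not a recommendation: it removes syscall filtering for that container entirely, and the real fix remains updating the host runc/Docker as described above.

# Run an Arch container without the default seccomp profile so faccessat2 is
# not blocked; assumes the image ends up with glibc 2.33 after the update.
docker run --rm --security-opt seccomp=unconfined archlinux:latest \
  sh -c 'pacman -Syu --noconfirm >/dev/null 2>&1 && pacman -Qi glibc'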

MiguelNdeCarvalho commented 3 years ago

Hey,

I have added the fix to my Dockerfile, but I still get a problem with pacman-key: I think it fails to update the repositories and install the packages. I worked around it by adding IgnorePkg = glibc to pacman.conf.
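
For reference, the change described above would look roughly like this in the container's /etc/pacman.conf (placement under [options] assumed):

[options]
# Temporary workaround: hold glibc back so pacman -Syu does not pull in 2.33,
# which trips the faccessat2/seccomp problem on unpatched hosts.
IgnorePkg = glibc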

Thanks, MiguelNdeCarvalho

soloturn commented 3 years ago

@maxim-lobanov there is a bug filed at Arch, and a glibc fix has been rolled out since February 13, a week ago. Levente Polyak, as well as the pacman maintainer Allan McRae, says that "whitelisting syscalls in non-Arch packages/software is not our problem". If it is not Arch, it is the host, isn't it? Like runc, which has also been fixed for three weeks.

maxim-lobanov commented 3 years ago

@soloturn see https://github.com/actions/virtual-environments/issues/2698#issuecomment-779262068 . The PR in the Docker repo was merged recently; the issue should be fixed with the next Docker release.

soloturn commented 3 years ago

Thank you @maxim-lobanov! Do you think it would be possible for GitHub to offer Arch as a host platform in addition to Ubuntu, to avoid issues that drag on for weeks in the future? Rolling-release systems notice such problems early and are quick to fix them.

maxim-lobanov commented 3 years ago

@soloturn , we are tracking requests for adding more platforms to GitHub Actions but currently we don't have plans to add additional platforms due to maintenance concerns.

soloturn commented 3 years ago

@soloturn , we are tracking requests for adding more platforms to GitHub Actions but currently we don't have plans to add additional platforms due to maintenance concerns.

@maxim-lobanov I see. Would there be a possibility to vote on a ticket to help inform that decision?

soloturn commented 3 years ago

@soloturn see #2698 (comment) PR in docker repo was merged recently. Issue should be fixed with next docker release.

When will this be?

fthiery commented 3 years ago

@soloturn see #2698 (comment) PR in docker repo was merged recently. Issue should be fixed with next docker release.

Can you link the Docker PR, please?

cyphar commented 3 years ago

https://github.com/moby/moby/pull/41994 and the relevant backports (https://github.com/moby/moby/pull/42014 and https://github.com/moby/moby/pull/42015). Please note that they're all merged, we're just waiting for the next release.

fthiery commented 3 years ago

Thanks !

Does that mean workarounds (such as https://github.com/MiguelNdeCarvalho/docker-baseimage-archlinux/pull/8/files) will no longer be necessary once Docker 20.10.4 is released?

I'm confused because I compiled and installed runc v1.0.0-rc93 and replaced /usr/bin/runc with a symlink to /usr/local/bin/runc, yet the problem was still happening (until I included the workaround in my Arch-based Dockerfiles). Note that my host is running Debian 10 with a backported 5.9 kernel.

cyphar commented 3 years ago

I'm confused because I compiled and installed runc v1.0.0-rc93 and replaced /usr/bin/runc with a symlink to /usr/local/bin/runc, yet the problem was still happening (until I included the workaround in my Arch-based Dockerfiles). Note that my host is running Debian 10 with a backported 5.9 kernel.

Which problem specifically? I don't know about the GLIBC_2.33 one, but the problem of faccessat2 causing permission errors doesn't happen when I try to reproduce this (I didn't use Debian but a different stable distribution that also has an older libseccomp version). Are you sure that Docker was using the new version of runc? At the very least, the faccessat2 issue will be fixed by the upgrade.
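
One way to check which runc is actually in use (a sketch; dockerd may resolve runc from its own configuration or packaging rather than your shell's PATH, which is exactly the mismatch suspected here):

# What the daemon reports it is using
# (older Docker versions print only a commit hash here):
docker info 2>/dev/null | grep -i 'runc version'
# What your PATH resolves to, which is not necessarily the same binary:
command -v runc && runc --version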

eyenx commented 3 years ago

Looks to me as if it works now.

nihaals commented 3 years ago

Looks to me as if it works now.

No repro.

eyenx commented 3 years ago

Looks to me as if it works now.

No repro.

Works for me.

nihaals commented 3 years ago

Weird, I wonder what's going on here. The fixed Docker version (20.10.4) was released on 2021-02-26, so I'm not sure why I'm still having issues. Our runner versions match too.

maxim-lobanov commented 3 years ago

Docker has not been updated on the images yet. A new image with the updated version will be deployed by the end of this week. As I understand it, @eyenx is updating Docker at runtime.

miketimofeev commented 3 years ago

@maxim-lobanov @nihaals I'm afraid the Azure docker-moby package is not yet updated; the version is Docker-Moby Server 20.10.3+azure, so it will probably take one more week.

nihaals commented 3 years ago

Yeah, I just checked the README; I had assumed it was updated early based on the successful workflow run.

For future reference, the update will be shown in the tools list when it changes to >20.10.3.
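
A quick way to surface the runner's Docker version from a workflow run itself (a sketch; the template fields are standard docker CLI ones, and the command runs on the runner rather than inside the Arch container):

# Print the Docker client and daemon versions the runner provides
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'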

miketimofeev commented 3 years ago

@nihaals we're probably a bit delayed with Docker 20.10.4 due to this bug: https://github.com/moby/moby/issues/42093

fthiery commented 3 years ago

FWIW, I deployed Docker 20.10.4 on Debian, and the Arch Linux workaround is still necessary for me (https://github.com/MiguelNdeCarvalho/docker-baseimage-archlinux/pull/8/files).

Without it I still get "error: failed to initialize alpm library".

NB: I'm not using GitHub Actions, but GitLab.

nihaals commented 3 years ago

This seems to be fixed now

MiguelNdeCarvalho commented 3 years ago

This seems to be fixed now

Have you tested this in GitHub Actions?

nihaals commented 3 years ago

This seems to be fixed now

Have you tested this in GitHub Actions?

Yes, three of my repos, one of which is the MWE (you can check the Actions log), are working as expected now.

MiguelNdeCarvalho commented 3 years ago

Thanks @nihaals, I think @backwardsEric should close the issue as it is solved.

nihaals commented 3 years ago

It's weird that this issue is even fixed, though: Docker hasn't been updated, and according to the README it's still on 20.10.3+azure.

@miketimofeev do you have any idea what might have changed? It was fixed between 2021-03-04 00:24 UTC and 2021-03-05 00:25 UTC.

aminvakil commented 3 years ago

I can also confirm it got fixed in https://github.com/aminvakil/docker-archlinux.

nihaals commented 3 years ago

It sounded like alpine:edge was also affected by this issue; maybe that should be tested too? It could be something Arch-specific, in which case I wouldn't say this issue is fixed.

miketimofeev commented 3 years ago

@nihaals we haven't updated our environments yet, so it's probably something that changed on the Docker images' side.

catthehacker commented 3 years ago

moby-engine and moby-containerd have been updated in the Microsoft repo; as of now, moby-containerd depends on moby-runc=1.0.0~rc93+azure-1, which includes the https://github.com/opencontainers/runc/pull/2750 fix. Reference: #2725. Versions of the moby-* packages in the GitHub Actions environment: https://github.com/catthehacker/GitHubActions/runs/2052228937?check_suite_focus=true#step:3:13

@miketimofeev the ubuntu-* environments are 99% updated to the latest image, which includes the updated packages above. [screenshot]
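
To double-check on a given runner, something like the following should work (a sketch; it assumes the Ubuntu images install Docker via the Microsoft moby-* debs, as the linked log suggests):

# List the installed moby-* packages and the runc version they bring along
dpkg -l | grep -E '^ii +moby-'
runc --version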

MiguelNdeCarvalho commented 3 years ago

Hey,

I have tested on my server running Debian 10 and I still get the problem.

Docker version: Docker version 20.10.5, build 55c4c88

Thanks, MiguelNdeCarvalho

catthehacker commented 3 years ago

@MiguelNdeCarvalho can you provide a link to the build log?

MiguelNdeCarvalho commented 3 years ago

Hey,

I am trying to build my image: https://github.com/MiguelNdeCarvalho/docker-miguelndecarvalho-repo . Here are the logs:

Sending build context to Docker daemon  269.3kB
Step 1/7 : FROM ghcr.io/miguelndecarvalho/docker-baseimage-archlinux:latest
latest: Pulling from miguelndecarvalho/docker-baseimage-archlinux
3a1a2b5435e4: Already exists 
091399994feb: Already exists 
f304b80498d4: Already exists 
fd10e48cfaab: Already exists 
db9840ef1daa: Already exists 
c462ce7f520d: Already exists 
6e96a2176f27: Already exists 
4edf1bb483ba: Already exists 
Digest: sha256:457ae5c4b9b67423c6192621fb4297dc0e3e7148dee724c8684ca1eec6133bcc
Status: Downloaded newer image for ghcr.io/miguelndecarvalho/docker-baseimage-archlinux:latest
 ---> b960d30de19e
Step 2/7 : LABEL maintainer="MiguelNdeCarvalho <geral@miguelndecarvalho.pt>"
 ---> Running in 09dffedef5ec
Removing intermediate container 09dffedef5ec
 ---> e68154fd1a65
Step 3/7 : RUN echo "- install packages needed -" &&     pacman -Syu --noconfirm     base-devel     git         cronie
 ---> Running in 2e8281db37f6
- install packages needed -
error: failed to initialize alpm library
(could not find or read directory: /var/lib/pacman/)
The command '/bin/sh -c echo "- install packages needed -" &&     pacman -Syu --noconfirm     base-devel     git        cronie' returned a non-zero code: 255

Thanks, MiguelNdeCarvalho

catthehacker commented 3 years ago

From the repo it seems the image has been building fine on GitHub Actions for the past 2 days. Wherever you are trying to build the image, you have to investigate yourself why it fails and whether the docker/containerd/runc versions include the fix.

MiguelNdeCarvalho commented 3 years ago

Hey,

I checked my server and found I was using runc version 1.0.0-rc92. I saw there was an update for containerd.io which brought runc version 1.0.0-rc93, and now the container is working just fine.
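
For anyone else on Debian/Ubuntu using Docker's own repositories, a hedged way to confirm the same thing:

# runc is shipped by the containerd.io package in Docker's apt repo
apt-cache policy containerd.io
runc --version   # wants 1.0.0-rc93 or newer for the faccessat2/ENOSYS fix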

Thanks, MiguelNdeCarvalho

fthiery commented 3 years ago

For information, I understood what was happening in my own case: while my Debian host was up to date, I was using a docker:stable dind image (Alpine-based) to build, which was still on Docker version 19.03.14, build 5eb3275 (the fixes are apparently not included yet). Upgrading to docker:20.10-dind fixed it for me. Russian dolls... ;)

backwardsEric commented 3 years ago

The test cases I included in the original report all now work as expected. I'll close this as fixed.