actions/runner-images

Regression using ubuntu linux/amd64 host with linux/386 container #7695

Closed molinav closed 1 year ago

molinav commented 1 year ago

Description

I observed that one of the project workflows I maintain is no longer able to build 32-bit packages on 64-bit GNU/Linux hosts, and the only thing that has changed is the GitHub runner image version.

Passing the --platform option in the container setup is not a workaround either, because the option and its argument are not forwarded to the docker pull call during container preparation, and the issue pointing out this problem was closed long ago (https://github.com/actions/runner/issues/648).
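
Roughly speaking, the difference is the following (a sketch, not the literal runner code; the image tag is the one used in my workflows):

# what the runner effectively executes while preparing the job container
docker pull pylegacy/x86-python:3.6-debian-4

# what would be needed for a 32-bit image on a 64-bit host (this flag never reaches the pull)
docker pull --platform linux/386 pylegacy/x86-python:3.6-debian-4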

Platforms affected

Runner images affected

Image version and build link

Before (working, 20230426.1): https://github.com/matplotlib/basemap/actions/runs/4884953600/jobs/8718596379
Now (failing, 20230517.1): https://github.com/matplotlib/basemap/actions/runs/5218138554/jobs/9418704735

Is it regression?

Yes, because with runner image version 20230426.1 it was working.

Expected behavior

The ubuntu-latest 64-bit runners should be able to run linux/386 containers as before.

Actual behavior

The ubuntu-latest 64-bit runners are failing because they do not identify linux/386 as a valid architecture.

Repro steps

The workflow below reproduces the bug: https://github.com/matplotlib/basemap/blob/v1.3.7/.github/workflows/basemap-for-manylinux.yml

In particular, the following job is enough to reproduce it; the job does not even start because the container cannot be created: https://github.com/matplotlib/basemap/blob/v1.3.7/.github/workflows/basemap-for-manylinux.yml#LL78-L125

vpolikarpov-akvelon commented 1 year ago

Hey @molinav. Thank you for reporting. We will investigate it.

vpolikarpov-akvelon commented 1 year ago

Hey @molinav. We updated some underlying infrastructure that may relate to this issue. Could you try running your workflow again?

molinav commented 1 year ago

Hi @vpolikarpov-akvelon. Unfortunately the problem is still triggered (Runner Image Provisioner is now 2.0.226.1); see below: https://github.com/matplotlib/basemap/actions/runs/5256661680/jobs/9498570331

Current runner version: 2.304.0
Operating System
  Ubuntu 22.04.2 LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20230517.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230517.1/images/linux/Ubuntu2204-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230517.1
Runner Image Provisioner
  2.0.226.1
GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/download-artifact@v1' (SHA:18f0f591fbc635562c815484d73b6e8e3980482e)
Download action repository 'actions/upload-artifact@v1' (SHA:3446296876d12d4e3a0f3145a3c87e67bf0a16b5)
Complete job name: build-geos (x86)
Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  '1.41'
  Docker daemon API version: '1.41'
  /usr/bin/docker version --format '{{.Client.APIVersion}}'
  '1.41'
  Docker client API version: '1.41'
Clean up resources from previous jobs
  /usr/bin/docker ps --all --quiet --no-trunc --filter "label=ed866e"
  /usr/bin/docker network prune --force --filter "label=ed866e"
Create local container network
  /usr/bin/docker network create --label ed866e github_network_9547124535194f69a2c677db1907e35a
6925b09582f071c74d6c21b1ab7f99ce765195ba00475d3eceacef9aceb785de
Starting job container
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  no matching manifest for linux/amd64 in the manifest list entries
  3.6-debian-4: Pulling from pylegacy/x86-python
  Warning: Docker pull failed with exit code 1, back off 7.413 seconds before retry.
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  3.6-debian-4: Pulling from pylegacy/x86-python
  no matching manifest for linux/amd64 in the manifest list entries
  Warning: Docker pull failed with exit code 1, back off 7.159 seconds before retry.
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  3.6-debian-4: Pulling from pylegacy/x86-python
  no matching manifest for linux/amd64 in the manifest list entries
  Error: Docker pull failed with exit code 1

molinav commented 1 year ago

I also tested on my personal computer (Windows 10 Pro x64, WSL with Debian 11) to ensure that the linux/386 image can actually be run from my 64-bit machine.

On Windows with Docker Desktop + Linux containers:

[vic@onyx] C:\Users\vic> docker run pylegacy/x86-python:3.6-debian-4 sh -c 'echo "Hello, world!"'
Unable to find image 'pylegacy/x86-python:3.6-debian-4' locally
3.6-debian-4: Pulling from pylegacy/x86-python
138bac1fe8c9: Pull complete
8ea2a5bcb8cc: Pull complete
44710418a973: Pull complete
9a878bf3e276: Pull complete
8c2d7412451a: Pull complete
Digest: sha256:1ec7445d6482d32da785550a660a014124e97eceb63d8bbb6edbd663fa5abe28
Status: Downloaded newer image for pylegacy/x86-python:3.6-debian-4
Hello, world!
[vic@onyx] C:\Users\vic>

On WSL with Docker CLI:

vic@onyx:~$ docker run pylegacy/x86-python:3.6-debian-4 sh -c 'echo "Hello, world!"'
Unable to find image 'pylegacy/x86-python:3.6-debian-4' locally
3.6-debian-4: Pulling from pylegacy/x86-python
138bac1fe8c9: Already exists
8ea2a5bcb8cc: Already exists
44710418a973: Already exists
9a878bf3e276: Already exists
8c2d7412451a: Already exists
Digest: sha256:1ec7445d6482d32da785550a660a014124e97eceb63d8bbb6edbd663fa5abe28
Status: Downloaded newer image for pylegacy/x86-python:3.6-debian-4
Hello, world!
vic@onyx:~$
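
For completeness, a quick way to double-check which variant actually got pulled (a sketch using the standard image inspect fields):

docker image inspect pylegacy/x86-python:3.6-debian-4 --format '{{.Os}}/{{.Architecture}}'
# should print linux/386 if the 32-bit variant was pulled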

molinav commented 1 year ago

To keep this alive, I have been re-running the same workflows whenever new runner images became available, and the exact same problem persists in all of them (the last runner image version tested was 20230619.1.0).

vpolikarpov-akvelon commented 1 year ago

Hey, @molinav. I have carefully investigated the information you provided once more.

I noticed that there were only three successful builds, on May 4 and May 5. Two weeks later, on May 18, the Docker image pylegacy/x86-python was updated. The workflow failures started on June 9. Since we haven't made any significant changes on our end, I suspect that the problem might be caused by the image update. Unfortunately, I couldn't access the version of the image from before May 18, but if it was mistakenly built for amd64 instead of 386, that would explain why you had successful builds before. I suggest checking whether this was the case.

In any case, the runner pulls the image using a plain docker pull: link to source code. There doesn't seem to be any logic to manually specify the platform, and it appears that there never was. I also couldn't find any Docker daemon options that configure fallback behavior for the container platform.

Regarding your local PC, the reason you can pull the image without explicitly specifying the platform may be due to the Docker version. The Docker version on GitHub-hosted runners is currently 20.10.25+azure-2 while the current latest version is 24.0.2. If you are using Docker Desktop, the behavior may differ even more.
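
(If it helps to compare, this is roughly how the engine version can be read on both machines; just a convenience check, nothing the runner itself relies on.)

docker version --format '{{.Server.Version}}'
# e.g. 20.10.25+azure-2 on the hosted runner vs 24.0.2 locally, per the versions above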

Considering all this information, I don't believe it is related to the runner image update. If there is something I overlooked, please let us know in the comment.

molinav commented 1 year ago

Thanks for the feedback, @vpolikarpov-akvelon!

The Docker image update on May 18 should just be a rebuild of the same Dockerfile with the latest Python versions built from source (very likely only the Python patch versions changed for the still-supported Python releases).

As you indicated, the plain docker pull call has been there for a while, without any special platform argument or environment variable. However, it seems to me that the target architecture of my images is correct, based on the GitHub Actions runs from before; let me explain:

I hope this clarifies the behaviour I was seeing at the beginning of May compared with the behaviour I have seen since June. Could it be that the Docker version has changed, and that the latest Docker version in the runner images behaves differently in these multi-platform cases?

My tests on Windows and WSL2 were done with Docker Desktop, which is providing Docker 24.0.2 at the moment.

It seems that it is possible to override the default Docker platform used when pulling through the DOCKER_DEFAULT_PLATFORM environment variable, so if I could set it on the host that initialises the job container in the GitHub Action, my workflows would probably work again:

export DOCKER_DEFAULT_PLATFORM=linux/386

but I could not figure out how to do this (if it is even possible), because my exported environment variables are only set after the job container has been created.
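
Locally the variable does what I need, for example (a sketch from my WSL setup; obviously this is not something I can inject before the runner creates the job container):

export DOCKER_DEFAULT_PLATFORM=linux/386
docker pull pylegacy/x86-python:3.6-debian-4
# the linux/386 variant is selected without passing --platform explicitly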

vpolikarpov-akvelon commented 1 year ago

Well, I tried reverting the moby-engine upgrade that took place here on a VM created from the runner image, and it did indeed help. It looks like, until version 20.10.25, moby-engine ignored the architecture completely during pull and could download a non-matching architecture even when the platform was specified explicitly.

I didn't find any related changes in the moby-engine changelog, but I think it may be caused by the update of the dependent package opencontainers/image-spec from v1.0.3 to v1.1.0-rc2. The new version of image-spec introduces additional annotations for the architecture. So the behavior we are talking about seems to be a new feature, not a bug.

We can't pin the version of the moby-engine package, so the only way for you to restore the functionality you have lost is to request it in the runner repo. I think this feature may be re-requested, taking into account the new information and the recent spec updates.

As a workaround, I can suggest pinning the images by digest, like this:

  build-geos:
    strategy:
      matrix:
        image:
          - "pylegacy/x64-python:3.6-debian-4@sha256:41f8377e5294575bae233cc2370c6d4168c4fa0b24e83af4245b64b4940d572d"
          - "pylegacy/x86-python:3.6-debian-4@sha256:91bc1c1b2e60948144cc32d5009827f2bf5331c51d43ff4c4ebfe43b0b9e7843"

It's quite dumb, I know, but I don't see any other options for now.
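
If you need to look up the digests, something along these lines should work (a sketch using buildx; any other way of reading the per-platform manifest digests is fine too):

docker buildx imagetools inspect pylegacy/x86-python:3.6-debian-4
# lists the manifest entries with their platform and sha256 digest, which is the value to pin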

molinav commented 1 year ago

Thanks for your detailed analysis, @vpolikarpov-akvelon. I am currently inspecting other possible sources of the issue, since my naive rebuild of the Docker images (which you pointed out yesterday) could also have had an impact that I was not aware of.

It seems that since BuildKit v0.10.0, buildx generates multi-platform manifests even when only a single architecture is built. The workflows building my Docker images use a BuildKit setup action that fetches the latest BuildKit available (no version pinning). So it could be that my Docker images from before May were built with an older BuildKit (0.9.1?) that generated the old manifest format, while the ones after May are built with this multi-platform manifest format even for single-platform images.
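
(In case it helps to correlate, this is roughly how I would check which buildx/BuildKit versions the setup action ended up using; just a sanity check on my side.)

docker buildx version              # buildx client version
docker buildx inspect --bootstrap  # builder details, which should include the BuildKit version of the node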

I could find similar issues and pull requests from the last few weeks (https://github.com/docker/buildx/issues/1533, https://github.com/open-policy-agent/opa/pull/6052, https://github.com/freelawproject/courtlistener/pull/2830#issuecomment-1599998581). I am currently rebuilding my Docker images with the --provenance false switch. When they are ready, I will retry my failing Python workflows and see whether this is a possible workaround.
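
The rebuild command looks roughly like this (a sketch; the tag is one of my real images, but the rest of my actual build arguments are omitted):

docker buildx build --platform linux/386 --provenance=false \
    --tag pylegacy/x86-python:3.6-debian-4 --push .
# --provenance=false should keep the result a plain single-platform image instead of a manifest list with attestations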

molinav commented 1 year ago

@vpolikarpov-akvelon I think I can confirm the source of the issue, and it is not related to any runner-images update but, as you said, to an (unexpected) change in my Docker image rebuilds caused by BuildKit. In summary:

With the old Docker versions, amd64 hosts can run i386 images if:

  1. The image repository is single-platform (i386-only) and the architecture is compatible with the host architecture. A warning is printed to the console.
  2. The image repository is multi-platform, the target architecture (i386) is compatible with the host architecture (amd64), and the switch --platform linux/386 is given explicitly. Otherwise, Docker only looks for amd64 in the multi-platform manifest and raises an error if it is not found.

The old BuildKit generates single-platform images with buildx when only one platform is passed to the build call; the new BuildKit generates multi-platform images even if only one platform is passed. So last month I was in situation 1 and everything worked, while now I am in situation 2 and everything fails because the switch --platform linux/386 is not given explicitly. Newer Docker versions (24.0.x) seem to understand that, if the amd64 platform is not available in the manifest but the i386 platform is, then the i386 image is the one to pull, and in that case they do not raise any warning as before.

The current workaround that solves my problem is to force --provenance false with the newer BuildKit versions, since this switch forces the generation of single-platform images as with the older BuildKit versions, and then I am back in situation 1.

I rebuilt my Docker images with the --provenance false switch and now the workflows are working again with the latest runner images pulled by the GitHub Actions, which confirms that the problem was not on the runner-images side: https://github.com/matplotlib/basemap/actions/runs/5415246195

Thanks for your effort and time, @vpolikarpov-akvelon!

vpolikarpov-akvelon commented 1 year ago

@molinav, thank you for the solution and the detailed explanation. As the problem seems to be resolved now, I'm closing this thread. Feel free to reach out again if you have other problems or questions.