antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Migrate Arm image building and testing to this repo #6453

Open antoninbas opened 2 weeks ago

antoninbas commented 2 weeks ago

Antrea has had support for the arm64 and arm/v7 platforms for a while now. antrea/antrea-agent-ubuntu and antrea/antrea-controller-ubuntu are multi-platform image manifests.

The way the build is currently structured is as follows:

  1. Whenever the main branch is updated, a GitHub workflow runs and invokes ./hack/build-antrea-linux-all.sh --pull --push-base-images. The workflow then tags and pushes antrea/antrea-agent-ubuntu-amd64 and antrea/antrea-controller-ubuntu-amd64. At this point, the multi-platform manifests have not been updated.
  2. As a final step, the above workflow triggers a separate workflow hosted in a different repository (vmware-tanzu/antrea-build-infra). This repository supports a handful of self-hosted Arm64 runners. The repository is private because a public repository with non-ephemeral self-hosted runners would not be secure (https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#self-hosted-runner-security). The runners are graciously provided by OSUOSL, which supports multiple open-source projects.
  3. The workflow builds the Agent and Controller Docker images for arm64 (antrea/antrea-agent-ubuntu-arm64, antrea/antrea-controller-ubuntu-arm64:latest) and arm/v7 (antrea/antrea-agent-ubuntu-arm, antrea/antrea-controller-ubuntu-arm:latest). Note that an Arm64 machine can build 32-bit Arm artifacts without emulation. At that point, the multi-platform manifest is created and pushed to the registry. This completes the process of updating antrea/antrea-agent-ubuntu:latest and antrea/antrea-controller-ubuntu:latest.
  4. Finally, a new GitHub workflow (also in vmware-tanzu/antrea-build-infra) is triggered to test the Arm images, on the same set of self-hosted Arm64 runners.
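
For illustration, the manifest assembly at the end of step 3 could look roughly like the sketch below. It assumes the `docker manifest` CLI is used and that all three per-architecture images share the same tag and have already been pushed; the issue does not say which tool the workflow actually uses. The `push_manifest` helper and the `DOCKER` override are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Sketch only: assemble a multi-platform manifest from per-architecture images.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands instead of
# running them.
set -euo pipefail

DOCKER="${DOCKER:-docker}"

# push_manifest IMAGE TAG (hypothetical helper)
# Assumes IMAGE-amd64:TAG, IMAGE-arm64:TAG and IMAGE-arm:TAG already exist in
# the registry, following the naming pattern mentioned in this issue.
push_manifest() {
    local image="$1" tag="$2"
    $DOCKER manifest create "${image}:${tag}" \
        "${image}-amd64:${tag}" \
        "${image}-arm64:${tag}" \
        "${image}-arm:${tag}"
    # --purge removes the local copy of the manifest list after pushing it.
    $DOCKER manifest push --purge "${image}:${tag}"
}
```

Calling `push_manifest antrea/antrea-agent-ubuntu latest` and `push_manifest antrea/antrea-controller-ubuntu latest` would then correspond to the final update of the two multi-platform manifests.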

The same process is used for Antrea tagged releases.

The drawbacks of the current approach are:

The alternative considered when Arm support was introduced was to use QEMU emulation to build the multi-platform images. This would only be practical if we built OVS, and potentially the Antrea Go binaries, without emulation, using cross-compilation support from the C and Go compilers. Otherwise, building OVS for Arm using QEMU would take way too much time. This would require making the build (Dockerfiles) more complex and harder to maintain. Even then, the build would be slow, as other things such as installing system packages / dependencies could take a while. As for testing, emulation is just not practical.

Even today, emulation is unlikely to be a good option for us. But recently there have been some interesting developments, with the availability (or upcoming availability) of hosted native Arm64 runners for GitHub workflows:

  1. The CNCF has its own program to make Arm64 runners available to CNCF projects: https://actuated.dev/blog/arm-ci-cncf-ampere
  2. GitHub has just announced that hosted Arm64 runners are in Beta for Enterprise accounts: https://github.blog/changelog/2024-06-03-actions-arm-based-linux-and-windows-runners-are-now-in-public-beta/. IIRC the CNCF uses an Enterprise account.

Using one of these options, we would no longer need to manage self-hosted Arm64 runners. We could also move all of the build infrastructure to this repository, and remove the dependency on vmware-tanzu/antrea-build-infra (at least for building; we may initially want to keep testing the Arm-based Antrea images using the OSUOSL machines, to keep our GitHub runner usage low).

I am currently asking the CNCF if option 2 (GitHub-hosted Arm64 runners) is available for CNCF projects. I will update this issue once I find out. Edit: according to CNCF staff, this is already enabled and available to all projects under the CNCF GitHub Enterprise account, so option 2 is something we could pursue right away. I have not tested it yet.

antoninbas commented 4 days ago

I have been experimenting with the GitHub-hosted Arm runners provided by the CNCF. At the moment, I am running into an issue where I cannot get the arm/v7 version of the Docker images to build on these runners, which use the aarch64 architecture. Most aarch64 CPUs that implement the Armv8-A architecture are compatible with 32-bit arm/v7 binaries, and we actually leverage this in our current setup, which uses self-hosted aarch64 runners. However, with the GitHub-hosted runners (which also use the Ampere platform), I keep getting the following error when building the antrea-openvswitch image:

2024-06-27T21:41:51.5264487Z #5 [context ubuntu] ubuntu:24.04
2024-06-27T21:41:51.5267799Z #5 sha256:aa9f84d6e529483956a3454f02193e9a0f758a08b5191f2199c928065d307720 8.39MB / 26.82MB 0.2s
2024-06-27T21:41:51.6375370Z #5 sha256:aa9f84d6e529483956a3454f02193e9a0f758a08b5191f2199c928065d307720 26.82MB / 26.82MB 0.3s done
2024-06-27T21:41:51.7904855Z #5 extracting sha256:aa9f84d6e529483956a3454f02193e9a0f758a08b5191f2199c928065d307720
2024-06-27T21:41:52.1948378Z #5 extracting sha256:aa9f84d6e529483956a3454f02193e9a0f758a08b5191f2199c928065d307720 0.6s done
2024-06-27T21:41:52.1949794Z #5 DONE 0.9s
2024-06-27T21:41:52.3411096Z 
2024-06-27T21:41:52.3414037Z #7 [ovs-debs 1/6] RUN echo "xyz"
2024-06-27T21:41:52.3421499Z #7 0.034 xyz
2024-06-27T21:41:52.3421990Z #7 DONE 0.1s
2024-06-27T21:41:52.3422245Z 
2024-06-27T21:41:52.3422648Z #8 [ovs-debs 2/6] RUN apt-get update
2024-06-27T21:41:52.3423379Z #8 0.060 The futex facility returned an unexpected error code.
2024-06-27T21:42:01.5620525Z #8 9.281 Aborted (core dumped)
2024-06-27T21:42:01.5842816Z #8 ERROR: process "/bin/sh -c apt-get update" did not complete successfully: exit code: 134
2024-06-27T21:42:01.5843829Z ------
2024-06-27T21:42:01.5844339Z  > [ovs-debs 2/6] RUN apt-get update:
2024-06-27T21:42:01.5845044Z 0.060 The futex facility returned an unexpected error code.
2024-06-27T21:42:01.5845751Z 9.281 Aborted (core dumped)

The echo command was added by me to the Dockerfile, to show that commands can run successfully. But running apt fails immediately with what I think is a libc error. I tried with both ubuntu:22.04 and ubuntu:24.04 base images.
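
One way to check whether the failure is specific to the image build is to run the same command in a plain arm/v7 container on the runner. The sketch below is an assumption about how such a check could be done, not something reported in this issue; `repro_apt_armv7` and the `DOCKER` override are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: run `apt-get update` in a 32-bit Arm container, outside of any
# image build, to see whether the futex error also appears there.
set -euo pipefail

DOCKER="${DOCKER:-docker}"

# repro_apt_armv7 IMAGE (hypothetical helper)
# --platform selects the linux/arm/v7 variant of a multi-platform base image.
repro_apt_armv7() {
    local image="$1"
    $DOCKER run --rm --platform linux/arm/v7 "$image" apt-get update
}
```

For example, `repro_apt_armv7 ubuntu:24.04` and `repro_apt_armv7 ubuntu:22.04` would cover the two base images tried above.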

We can wait a bit and see if the issue gets resolved as software is updated on the runners. We could also try QEMU emulation to build the arm/v7 images, and see if it is fast (enough) on aarch64.
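
A rough sketch of what the QEMU experiment could look like, assuming Docker Buildx and the `tonistiigi/binfmt` image are used to register the emulator; neither is confirmed as the plan in this issue. The `build_armv7_emulated` helper, its arguments, and the `DOCKER` override are hypothetical. Whether the registered handler actually takes precedence over native 32-bit execution on a given aarch64 host would need to be verified.

```shell
#!/usr/bin/env bash
# Sketch only: build an arm/v7 image with QEMU user-mode emulation.
set -euo pipefail

DOCKER="${DOCKER:-docker}"

# build_armv7_emulated DOCKERFILE TAG (hypothetical helper)
build_armv7_emulated() {
    local dockerfile="$1" tag="$2"
    # Register the QEMU binfmt_misc handler for 32-bit Arm binaries.
    $DOCKER run --privileged --rm tonistiigi/binfmt --install arm
    # Build for linux/arm/v7 and load the result into the local image store.
    $DOCKER buildx build --platform linux/arm/v7 \
        -f "$dockerfile" -t "$tag" --load .
}
```

Wrapping the call in `time`, for example `time build_armv7_emulated <Dockerfile> <tag>`, would give a first idea of whether emulation is fast enough.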