actions / runner-images

GitHub Actions runner images
MIT License
Build breaks with `Version: 20220717.1` #5934

Closed Praveenrajmani closed 1 year ago

Praveenrajmani commented 1 year ago

Description

The test builds on GitHub runners have been failing for the past 24 hours. On investigation, it looks like the virtual environment version was upgraded.

Working virtual environment version: 20220710.1
Working virtual environment provisioner version: 1.0.0.0-main-20220616-1

Non-working virtual environment version: 20220717.1
Non-working virtual environment provisioner version: 1.0.0.0-main-20220701-2

Is there a way to select a specific virtual environment version for the GitHub Actions runners?

Platforms affected

Virtual environments affected

Image version and build link

Working build: https://github.com/minio/directpv/runs/7389713364?check_suite_focus=true
Non-working build: https://github.com/minio/directpv/runs/7408016471?check_suite_focus=true

Is it regression?

yes

Expected behavior

The builds were succeeding 24 hours earlier; I expect them to succeed with the latest virtual environment version as well.

Actual behavior

caused a regression with the latest release

Repro steps

Simply rerunning any previously successful build fails.

Also raised a dummy PR without any changes to test the regression.

al-cheb commented 1 year ago

Hey @Praveenrajmani. I see that the pipelines you provided are not the same.

Working - https://github.com/minio/directpv/actions/runs/2690641775/workflow

      - name: Setup Minikube
        uses: manusa/actions-setup-minikube@v2.4.3
        with:
          minikube version: 'v1.24.0'
          kubernetes version: 'v1.22.5'
          github token: ${{ secrets.GITHUB_TOKEN }}

Non-working - https://github.com/minio/directpv/runs/7408016471?check_suite_focus=true

      - name: Setup Minikube
        uses: manusa/actions-setup-minikube@v2.4.3
        with:
          minikube version: 'v1.24.0'
          kubernetes version: 'v1.20.14'
          github token: ${{ secrets.GITHUB_TOKEN }}

Praveenrajmani commented 1 year ago

Hi @al-cheb,

Let me share the correct ones

working: https://github.com/minio/directpv/runs/7389713364?check_suite_focus=true
non-working: https://github.com/minio/directpv/runs/7409833561?check_suite_focus=true

Praveenrajmani commented 1 year ago

@al-cheb One question: is there a way to select a specific virtual environment version for the GitHub Actions runners?

al-cheb commented 1 year ago

> @al-cheb One question - Is there a way to pick a selective virtual environment version for the github action runners?

Unfortunately, it's not possible to pick a specific version.
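
Since the image version cannot be pinned, one option is to record it in every run and flag a change. The following is only a sketch: it assumes the hosted runner exports the image version in an `ImageVersion` environment variable (not confirmed anywhere in this thread), `KNOWN_GOOD` is taken from the working run reported above, and `compare_image_version` is a hypothetical helper name.

```shell
#!/bin/sh
# Compare the runner's image version against the last version known to work.
# ASSUMPTION: the hosted runner exports ImageVersion (e.g. 20220717.1).
KNOWN_GOOD="20220710.1"

# Prints "same", "newer", or "older" for version $2 relative to version $1.
# Versions look like YYYYMMDD.N: compare the date part first, then the suffix.
compare_image_version() {
  base_date=${1%%.*}; base_rev=${1#*.}
  cur_date=${2%%.*};  cur_rev=${2#*.}
  if [ "$cur_date" -gt "$base_date" ]; then echo newer
  elif [ "$cur_date" -lt "$base_date" ]; then echo older
  elif [ "$cur_rev" -gt "$base_rev" ]; then echo newer
  elif [ "$cur_rev" -lt "$base_rev" ]; then echo older
  else echo same
  fi
}

current="${ImageVersion:-unknown}"
if [ "$current" = "unknown" ]; then
  echo "ImageVersion is not set; probably not a hosted runner"
else
  status=$(compare_image_version "$KNOWN_GOOD" "$current")
  # "::warning::" is the workflow command that annotates the run.
  [ "$status" = "same" ] || echo "::warning::Image version is $current ($status than $KNOWN_GOOD)"
fi
```

Run as an early step, this does not prevent a regression, but it makes an image change visible in the run log next to the first failure.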

sylus commented 1 year ago

I believe I have the same problem. As of Version: 20220717.1, the Docker overlay doesn't seem to be working correctly.

Build Success: https://github.com/drupalwxt/wxt/actions/runs/2676949961
Build Fail: https://github.com/drupalwxt/wxt/actions/runs/2699844768

sylus commented 1 year ago

Very odd, though: when I switched to 22.04, it worked.

Environment: ubuntu-22.04
Version: 20220717.1

Wonder if it works for you @Praveenrajmani using 22.04?

varunsh-coder commented 1 year ago

Ubuntu 20.04.4 with image version 20220717.1 is also causing an issue for https://github.com/step-security/harden-runner

I noticed that in the latest update, the linux kernel version for Ubuntu 20.04.4 is updated to 5.15.0-1014-azure. In previous updates, it was 5.13.0-XXXXX. PR: https://github.com/actions/virtual-environments/commit/ce779a6f6ea52b63a32ce2200a319be149c9203f

I was looking at the mapping of ubuntu versions and linux kernel versions here, and see 5.13 next to Ubuntu 20.04.4.

I am not an expert at these mappings, but please check if this change to 5.15 kernel is expected for Ubuntu 20.04.4

Praveenrajmani commented 1 year ago

It works when I downgrade the Ubuntu version to 18.04 @sylus

al-cheb commented 1 year ago

@Praveenrajmani, I was able to reproduce this issue on a clean Azure VM, so the issue is not related to the GitHub runner image. Feel free to ping me if you need any logs from this VM.

varunagrawal commented 1 year ago

Maybe related, we're seeing failures for GTSAM on 18.04 as well on PRs that didn't have issues before. From looking at the error message, it's an OOM issue for the compiler, so maybe something is up with how some devices are configured?

narasamdya commented 1 year ago

Our CI builds started to fail when we use 20220717.1 image; some of our Linux unit tests started to fail.

Successful build: Pipelines - Run 20220715.12 (azure.com) - on 20220710.1 image

Failed build: Pipelines - Run 20220719.2 (azure.com) - on 20220717.1 image

The standard error shows messages like:

/usr/bin/ld: cannot open output file main: Function not implemented
collect2: error: ld returned 1 exit status

/bin/cp: cannot create regular file '/tmp/czbp230e.x0p/file-to-cp.txt.copy': Function not implemented

al-cheb commented 1 year ago

@narasamdya, @sylus, could you provide minimal steps to reproduce the issue?

varunsh-coder commented 1 year ago

> Ubuntu 20.04.4 with image version 20220717.1 is also causing issue for https://github.com/step-security/harden-runner
>
> I noticed that in the latest update, the linux kernel version for Ubuntu 20.04.4 is updated to 5.15.0-1014-azure. In previous updates, it was 5.13.0-XXXXX. PR: ce779a6
>
> I was looking at the mapping of ubuntu versions and linux kernel versions here, and see 5.13 next to Ubuntu 20.04.4.
>
> I am not an expert at these mappings, but please check if this change to 5.15 kernel is expected for Ubuntu 20.04.4

Hi @al-cheb I wanted to check if you got a chance to review the mapping of ubuntu versions and linux kernel versions here? Is it expected to use 5.15 kernel for Ubuntu 20.04.4? I only see 5.13 next to Ubuntu 20.04.4 in the Ubuntu kernel support lifecycle picture...Thanks!

al-cheb commented 1 year ago

> Hi @al-cheb I wanted to check if you got a chance to review the mapping of ubuntu versions and linux kernel versions here? Is it expected to use 5.15 kernel for Ubuntu 20.04.4? I only see 5.13 next to Ubuntu 20.04.4 in the Ubuntu kernel support lifecycle picture...Thanks!

Looks like the documentation is stale because we use a vm template for Ubuntu Server 20.04 created by Canonical:

~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

~$ uname -a
Linux u2001 5.15.0-1014-azure #17~20.04.1-Ubuntu SMP Thu Jun 23 20:01:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

~$ date
Wed Jul 20 19:56:42 UTC 2022

PS > (Get-AzVM -Name u2004 -Status).StorageProfile.ImageReference

Publisher               : canonical
Offer                   : 0001-com-ubuntu-server-focal
Sku                     : 20_04-lts
Version                 : latest
ExactVersion            : 20.04.202207130
SharedGalleryImageId    : 
CommunityGalleryImageId : 
Id                      :

Praveenrajmani commented 1 year ago

> I was able to reproduce this issue on a clean Azure vm

What was the virtual environment version here? @al-cheb

al-cheb commented 1 year ago

> > I was able to reproduce this issue on a clean Azure vm
>
> What was the virtual environment version here? @al-cheb

@Praveenrajmani, as I mentioned before, I took a clean VM and pre-installed only moby-engine on it. Maybe it's a bug in the new Linux kernel version, 5.15.0-1014-azure.

varunsh-coder commented 1 year ago

I did more digging for my scenario and found the root cause for it.

I have a GitHub Action (https://github.com/step-security/harden-runner) that uses Linux Audit Framework to audit events on the Ubuntu VM.

In the latest release of the Ubuntu 20.04.4 VM, there is already a process that is listening to the audit events. The process is /opt/microsoft/auoms/bin/auomscollect.

In the previous release of the Ubuntu 20.04.4 VM and in the latest release of the Ubuntu 22.04 VM, this is not the case.

As a result, https://github.com/step-security/harden-runner is not able to listen to audit events, as only one process can modify the audit rules.

Searching for auomscollect, I found this project - https://github.com/microsoft/OMS-Auditd-Plugin. Do you know if having this enabled in the Runner VM is by design? If not, can this be turned off?
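
A quick way to spot this kind of conflict is to ask the kernel which PID is currently registered as the audit daemon before trying to attach. This is a rough sketch, not the check harden-runner itself performs: it assumes `auditctl -s` prints a `pid <N>` line (older audit releases use a different single-line format), that the script runs with root privileges, and the helper names `audit_owner_pid` and `process_name` are hypothetical.

```shell
#!/bin/sh
# Read `auditctl -s` style output on stdin and print the registered audit PID.
# ASSUMPTION: the status output contains a line of the form "pid <N>".
audit_owner_pid() {
  awk '$1 == "pid" { print $2; exit }'
}

# Print the command name for a PID, or "unknown" if it cannot be read.
process_name() {
  cat "/proc/$1/comm" 2>/dev/null || echo unknown
}

if command -v auditctl >/dev/null 2>&1; then
  pid=$(auditctl -s 2>/dev/null | audit_owner_pid)
  if [ -n "$pid" ] && [ "$pid" != "0" ]; then
    echo "Audit events are already owned by PID $pid ($(process_name "$pid"))"
  else
    echo "No audit daemon reported (or insufficient privileges)"
  fi
else
  echo "auditctl not found; skipping audit check"
fi
```

On an affected image, the expectation would be that the reported process is `auomscollect`; on an unaffected one, the PID should be 0 or belong to the stock audit daemon.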

al-cheb commented 1 year ago

@varunagrawal, The OMS Audit data collection daemon is active by default on Azure runners.

varunsh-coder commented 1 year ago

> @varunagrawal, The OMS Audit data collection daemon is active by default on Azure runners.

Hi @al-cheb auomscollect was not enabled until this release of Ubuntu 20.04.4 VM. Also in the current release of Ubuntu 22.04 VM, it is not enabled. It is definitely a change from the previous release of Ubuntu 20.04.4 VM and it is odd that the same change is not applied to Ubuntu 22.04 VM yet.

al-cheb commented 1 year ago

> > @varunagrawal, The OMS Audit data collection daemon is active by default on Azure runners.
>
> Hi @al-cheb auomscollect was not enabled until this release of Ubuntu 20.04.4 VM. Also in the current release of Ubuntu 22.04 VM, it is not enabled. It is definitely a change from the previous release of Ubuntu 20.04.4 VM and it is odd that the same change is not applied to Ubuntu 22.04 VM yet.

@varunsh-coder, sorry, I was wrong. I have reached out to the right team to clarify this. Currently, it’s unexpected behavior that AzSecPack is being installed on Ubuntu 18/20, but we are in control of it at the moment. I will let you know if I get more information.

varunagrawal commented 1 year ago

@al-cheb you're tagging the wrong person. 😂

varunsh-coder commented 1 year ago

> > > @varunagrawal, The OMS Audit data collection daemon is active by default on Azure runners.
>
> > Hi @al-cheb auomscollect was not enabled until this release of Ubuntu 20.04.4 VM. Also in the current release of Ubuntu 22.04 VM, it is not enabled. It is definitely a change from the previous release of Ubuntu 20.04.4 VM and it is odd that the same change is not applied to Ubuntu 22.04 VM yet.
>
> @varunsh-coder, sorry, I was wrong. I have reached out the right team to clarify that moment. Currently, it’s unexpected behavior that AzSecPack is being installed on Ubuntu18/20, but we are in control of at the moment. I will let you know if I get more information.

Thanks @al-cheb! Here is a workflow I created to view status of auoms on ubuntu-latest and ubuntu-22.04. It shows that the service is running on one and not running on the other.

https://github.com/varunsh-coder/actions-playground/actions/runs/2712687332

al-cheb commented 1 year ago

@varunsh-coder, The ubuntu-latest spec is linked to ubuntu-20.04.

al-cheb commented 1 year ago

@varunsh-coder, we are planning to deploy new Ubuntu 18/20 images next week with the OMS daemon disabled.

tsal commented 1 year ago

We've spent the last two days troubleshooting why our Gradle builds (Android) started failing a couple of days ago. After a lot of struggling with memory settings (which we don't believe are the problem) and testing the builds, we finally got something from the logs, but it isn't any more useful than telling us that GitHub's runner cancelled the job run; it was not an OOM kill.

##[debug]System.OperationCanceledException: The operation was canceled.
##[debug]   at System.Threading.CancellationToken.ThrowOperationCanceledException()
##[debug]   at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage)
##[debug]   at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Build APK

At first, we thought maybe the Runner was changing the MaxMetaspaceSize JVM option, and we tried changing that.

I can't provide any solutions for this, because I think it's either internal code GitHub owns, or it's the OS sending a HUP because of a malloc limit (not quite an OOM). Either way, I think your latest images for Ubuntu 18 and 20 are broken in a way that only shows up in large build contexts.

EDIT: To be clear, we can reproduce this failure on jobs that executed fine last week, on the same commit. The builds will always fail at least one of the matrix jobs, causing the whole thing to fail.

As others have mentioned above, 22.04 seems to "fix" this issue where the job is getting cancelled without any reason given.

varunsh-coder commented 1 year ago

> @varunsh-coder, We are planning to deploy new Ubuntu18/20 images next week with disabled OMS daemon.

Thanks a lot @al-cheb! This is great to hear.

I have a suggestion/ idea to improve the process. Do you think it would be possible to list the services that are expected to run as part of the VM README? As of now, it lists the software installed, but not the services that are expected to run.

Listing the services that will run on the VM will help inform users of any new services that are getting added/ removed as part of a release. If this process was in place, we would have seen that a new service is getting added for Ubuntu 18/20 images.
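
Until such a list is published, a workflow can approximate it by comparing the services running on the image against a baseline captured on a known-good image. This is a minimal sketch under stated assumptions: the baseline file name `services.baseline` and the helper names are hypothetical, the baseline must hold one unit name per line and be sorted with `sort -u`, and the image is assumed to use systemd.

```shell
#!/bin/sh
# List the names of currently running systemd services, one per line.
running_services() {
  systemctl list-units --type=service --state=running --no-legend --plain |
    awk '{ print $1 }'
}

# Read service names on stdin and print those missing from baseline file $1.
# The baseline must already be sorted (sort -u) for comm to work correctly.
new_services() {
  sort -u | comm -13 "$1" -
}

BASELINE="${BASELINE:-services.baseline}"
if command -v systemctl >/dev/null 2>&1 && [ -f "$BASELINE" ]; then
  running_services | new_services "$BASELINE"
else
  echo "No baseline at $BASELINE (or no systemctl); nothing to compare"
fi
```

A baseline could be produced on a working image with `running_services | sort -u > services.baseline`. Any line printed on a later image is a service that was not running before, which is the kind of change this thread had to discover by hand.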

al-cheb commented 1 year ago

Hey @Praveenrajmani, @varunsh-coder , @sylus , @tsal. The new Ubuntu Server 18.04/20.04 images have been deployed.

al-cheb commented 1 year ago

@Praveenrajmani, I checked the new Ubuntu Server 20.04 image and it works for me.

Praveenrajmani commented 1 year ago

Hi @al-cheb , I can still see the same problem, and my virtual environment provisioner version remains the same - https://github.com/minio/directpv/runs/7519147904?check_suite_focus=true

Shouldn't the provisioner version be updated too?

al-cheb commented 1 year ago

@Praveenrajmani, I checked, and it worked for "Run upgrade test from v2.0.9". It looks like you should fix the v1.4.6 version and adapt it to the new kernel version.

Praveenrajmani commented 1 year ago

looks like 1.4.6 is not supported on latest kernel versions @al-cheb

what is the kernel version here? @al-cheb

al-cheb commented 1 year ago

> looks like 1.4.6 is not supported on latest kernel versions @al-cheb
>
> what is the kernel version here? @al-cheb

Linux kernel version: 5.15.0-1014-azure

Praveenrajmani commented 1 year ago

Thanks @al-cheb, may I know the kernel version that was used in 20220710.1? Are there any dependency docs explaining this?

al-cheb commented 1 year ago

> thanks @al-cheb , May i know the kernel version that was used in 20220710.1? Are there any dependency docs explaining this?

Linux kernel version: 5.13.0-1031-azure
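
Because the two images differ in kernel series (5.13 versus 5.15, per the answers above), a workflow that depends on kernel behavior can log the series and branch on it. The sketch below assumes release strings of the form reported in this thread, such as `5.15.0-1014-azure`; the helper names `kernel_series` and `kernel_at_least` are hypothetical.

```shell
#!/bin/sh
# Extract "major.minor" from a kernel release string such as 5.15.0-1014-azure.
kernel_series() {
  echo "$1" | cut -d. -f1,2
}

# Succeeds when release $1 belongs to series $2 (e.g. "5.15") or a newer one.
kernel_at_least() {
  have=$(kernel_series "$1")
  have_major=${have%%.*}; have_minor=${have#*.}
  want_major=${2%%.*};    want_minor=${2#*.}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

release=$(uname -r)
echo "Running kernel: $release (series $(kernel_series "$release"))"
if kernel_at_least "$release" 5.15; then
  echo "Kernel is 5.15 or newer, the series reported for image 20220717.1"
fi
```

Logging this at the start of a job makes it easier to tell a kernel change apart from a change in the preinstalled software.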

varunsh-coder commented 1 year ago

> Hey @Praveenrajmani, @varunsh-coder , @sylus , @tsal. The new Ubuntu Server 18.04/20.04 images have been deployed.

Thanks @al-cheb! I can confirm that https://github.com/step-security/harden-runner is working fine now. I did share ideas in a separate comment on how to prevent a regression like this. Please let me know if you have thoughts on that. Thanks!

al-cheb commented 1 year ago

> I have a suggestion/ idea to improve the process. Do you think it would be possible to list the services that are expected to run as part of the VM README? As of now, it lists the software installed, but not the services that are expected to run.
>
> Listing the services that will run on the VM will help inform users of any new services that are getting added/ removed as part of a release. If this process was in place, we would have seen that a new service is getting added for Ubuntu 18/20 images.

Thank you for the proposal. We will think about it.

al-cheb commented 1 year ago

I am planning to close the thread, as the issue is reproducible on a clean Azure VM. Feel free to reopen the thread if you have any concerns.

varunsh-coder commented 1 year ago

> > I have a suggestion/ idea to improve the process. Do you think it would be possible to list the services that are expected to run as part of the VM README? As of now, it lists the software installed, but not the services that are expected to run. Listing the services that will run on the VM will help inform users of any new services that are getting added/ removed as part of a release. If this process was in place, we would have seen that a new service is getting added for Ubuntu 18/20 images.
>
> Thank you for the proposal. We will think about it.

Hi @al-cheb, I wanted to check if there is a way to test out a GitHub Actions runner VM before it is released. As an example, I would like to test a few of my workflows on a new image 1-2 weeks before it is released for GitHub-hosted runners. Is there a way I can do this? Thanks!

al-cheb commented 1 year ago

@varunsh-coder, hey, unfortunately, we can't provide early access to a new image. What kind of tests do you want to run? Maybe we could integrate them with our current tests.

varunsh-coder commented 1 year ago

> @varunagrawal , hey, unfortunatly, we can't provide early access to a new image. What kind of tests do you want to run, maybe we could integrate them with our current tests?

@al-cheb, you tagged the wrong Varun again :). I am @varunsh-coder. I need to run a few workflows that use https://github.com/step-security/harden-runner GitHub Action on the new image, to make sure the workflows pass.

al-cheb commented 1 year ago

@varunsh-coder, I will create an internal ticket to investigate how we can integrate them. Which workflows should we use?

varunsh-coder commented 1 year ago

> @varunsh-coder, I will create an internal ticket to investigate how we can integrate them. Which workflows should we use?

Thanks a lot @al-cheb for your help with this! I currently run an integration test on a set of workflows on different repos. See this: https://github.com/step-security/harden-runner/actions/runs/3156704291/jobs/5136693292#step:5:8. But even a simple workflow that runs for more than 5 minutes, and makes a few outbound calls should do. If you want, I can create a workflow and share with you.
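
The body of such a job step could look roughly like the following. This is a sketch of the "runs for more than 5 minutes and makes a few outbound calls" idea only, not the linked integration test: it assumes harden-runner has already been set up as an earlier step, and `canary_loop`, `CANARY_URL`, and the 330-second and 30-second values are hypothetical choices.

```shell
#!/bin/sh
# Run a command repeatedly for at least $1 seconds, pausing $2 seconds between
# attempts. Returns non-zero as soon as one attempt fails.
canary_loop() {
  duration=$1; interval=$2; shift 2
  start=$(date +%s)
  while :; do
    "$@" || return 1
    now=$(date +%s)
    [ $((now - start)) -ge "$duration" ] && return 0
    sleep "$interval"
  done
}

# Only run for real when a target is supplied, e.g. CANARY_URL=https://github.com
if [ -n "${CANARY_URL:-}" ]; then
  # A little over 5 minutes of outbound calls, one every 30 seconds.
  canary_loop 330 30 curl --silent --show-error --fail --output /dev/null "$CANARY_URL"
fi
```

Keeping the loop generic (it accepts any command) means the same step can exercise other outbound tools without changing the timing logic.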

al-cheb commented 1 year ago

@varunsh-coder, Thank you. If we have a separate workflow it will help us a lot.

varunsh-coder commented 1 year ago

> @varunsh-coder, Thank you. If we have a separate workflow it will help us a lot.

ok, let me work on that and get back. thank you!

varunsh-coder commented 1 year ago

> > @varunsh-coder, Thank you. If we have a separate workflow it will help us a lot.
>
> ok, let me work on that and get back. thank you!

Hi @al-cheb, here is the workflow: https://github.com/varunsh-coder/actions-playground/blob/main/.github/workflows/harden-runner-test.yml

Please let me know if you can include this as part of your new image tests. It can run independently in any repository. It should pass for new images, but if it fails, I would like to know before the new image gets released. Thank you!

erik-bershel commented 1 year ago

Hello @varunsh-coder! There is a problem at the moment. We use Azure DevOps for the phases you are interested in. Thus, the GH Actions workflow is not really suitable for integrating with our Canary tests. I propose to think together about what can be done in this case. Could you provide a more detailed description of the required steps so that we can rewrite the code ourselves, or do you have examples of snippets that run on the ADO platform?

varunsh-coder commented 1 year ago

> Hello @varunsh-coder! There is a problem at the moment. We use Azure DevOps for the phases you are interested in. Thus, the GH Actions workflow is not really suitable for integrating with our Canary tests. I propose to think together about what can be done in this case. Could you provide a more detailed description of the required steps so that we can rewrite the code ourselves, or do you have examples of snippets that run on the ADO platform?

Hi @erik-bershel, thanks a lot for the info. As next steps, I can port the GitHub Action to an ADO task and set up an ADO pipeline. It will take me some time to do this, but once I am done, I can share the ADO pipeline YAML file.

erik-bershel commented 1 year ago

Hi @varunsh-coder, is there anything I can help with? Any news about the implementation?

varunsh-coder commented 1 year ago

> Hi @varunsh-coder, is there anything I can help with? Any news about the implementation?

Thanks @erik-bershel for following up! Not yet, should be done in a couple of weeks. Will get back once it is ready. Thanks!

varunsh-coder commented 1 year ago

> > Hi @varunsh-coder, is there anything I can help with? Any news about the implementation?
>
> Thanks @erik-bershel for following up! Not yet, should be done in a couple of weeks. Will get back once it is ready. Thanks!

Hi @erik-bershel, the ADO pipeline is ready. It is here: https://github.com/varunsh-coder/actions-playground/blob/main/azure-pipelines.yml

It has 4 jobs. 2 run on ubuntu-20.04 and 2 on ubuntu-22.04. They need to be triggered as part of canary test for new image, so I guess you will need to update it to use vmImage tags for the unreleased image.

If any of the jobs fail during the canary test, I would like to be notified. Is that OK? Please let me know if you have any questions. I can share my email address or set up a notification method as needed. Thanks again!