Closed Praveenrajmani closed 1 year ago
Hey @Praveenrajmani. I see that the pipelines you provided are not the same.
Working - https://github.com/minio/directpv/actions/runs/2690641775/workflow
- name: Setup Minikube
  uses: manusa/actions-setup-minikube@v2.4.3
  with:
    minikube version: 'v1.24.0'
    kubernetes version: 'v1.22.5'
    github token: ${{ secrets.GITHUB_TOKEN }}
Non-working - https://github.com/minio/directpv/runs/7408016471?check_suite_focus=true
- name: Setup Minikube
  uses: manusa/actions-setup-minikube@v2.4.3
  with:
    minikube version: 'v1.24.0'
    kubernetes version: 'v1.20.14'
    github token: ${{ secrets.GITHUB_TOKEN }}
Hi @al-cheb,
Let me share the correct ones
Working: https://github.com/minio/directpv/runs/7389713364?check_suite_focus=true
Non-working: https://github.com/minio/directpv/runs/7409833561?check_suite_focus=true
@al-cheb One question: is there a way to select a specific virtual environment version for the GitHub Actions runners?
Unfortunately, it's not possible to pick a specific version.
I believe I have the same problem: as of Version: 20220717.1, the Docker overlay doesn't seem to be working correctly.
Build Success: https://github.com/drupalwxt/wxt/actions/runs/2676949961
Build Fail: https://github.com/drupalwxt/wxt/actions/runs/2699844768
Very odd, though: when I switched to 22.04, it worked.
Environment: ubuntu-22.04 Version: 20220717.1
Wonder if it works for you @Praveenrajmani using 22.04?
Ubuntu 20.04.4 with image version 20220717.1 is also causing issues for https://github.com/step-security/harden-runner
I noticed that in the latest update, the Linux kernel version for Ubuntu 20.04.4 is updated to 5.15.0-1014-azure. In previous updates, it was 5.13.0-XXXXX.
PR: https://github.com/actions/virtual-environments/commit/ce779a6f6ea52b63a32ce2200a319be149c9203f
I was looking at the mapping of Ubuntu versions and Linux kernel versions here, and see 5.13 next to Ubuntu 20.04.4.
I am not an expert at these mappings, but please check if this change to the 5.15 kernel is expected for Ubuntu 20.04.4.
It works when I downgrade the Ubuntu version to 18.04, @sylus.
@Praveenrajmani, I was able to reproduce this issue on a clean Azure VM, so the issue is not related to the GitHub runner image. Feel free to ping me if you need any logs from this VM.
Maybe related, we're seeing failures for GTSAM on 18.04 as well on PRs that didn't have issues before. From looking at the error message, it's an OOM issue for the compiler, so maybe something is up with how some devices are configured?
Our CI builds started to fail when we use the 20220717.1 image; some of our Linux unit tests started to fail.
Successful build: Pipelines - Run 20220715.12 (azure.com) - on 20220710.1 image
Failed build: Pipelines - Run 20220719.2 (azure.com) - on 20220717.1 image
The standard error shows messages like:
/usr/bin/ld: cannot open output file main: Function not implemented
collect2: error: ld returned 1 exit status
/bin/cp: cannot create regular file '/tmp/czbp230e.x0p/file-to-cp.txt.copy': Function not implemented
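For anyone trying to narrow this down: "Function not implemented" is the message for errno `ENOSYS` on Linux, so a small probe that repeats the failing operations (create a file, then copy it) and reports the errno could be a starting point for a minimal repro. The sketch below is only an assumption about what such a probe might look like, not the test from the failing pipeline; `probe_file_ops` and the file names are hypothetical.

```python
import errno
import os
import shutil
import tempfile


def probe_file_ops(base_dir=None):
    """Create a file and copy it inside a temp dir; return (ok, message)."""
    try:
        with tempfile.TemporaryDirectory(dir=base_dir) as tmp:
            src = os.path.join(tmp, "file-to-cp.txt")
            dst = src + ".copy"
            with open(src, "w") as fh:
                fh.write("probe\n")
            shutil.copy(src, dst)
            with open(dst) as fh:
                copied = fh.read()
        return copied == "probe\n", "ok"
    except OSError as exc:
        # ENOSYS is the errno behind "Function not implemented".
        if exc.errno == errno.ENOSYS:
            return False, f"ENOSYS: {exc}"
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return False, f"other OSError ({name}): {exc}"


if __name__ == "__main__":
    print(probe_file_ops())
```

Running it once on the working image and once on the failing one (optionally passing the directory where the tests write, via `base_dir`) should show whether plain file creation and copying are enough to trigger the error.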
@narasamdya, @sylus, could you provide minimal steps to reproduce the issue?
Hi @al-cheb, I wanted to check if you got a chance to review the mapping of Ubuntu versions and Linux kernel versions here? Is it expected to use the 5.15 kernel for Ubuntu 20.04.4? I only see 5.13 next to Ubuntu 20.04.4 in the Ubuntu kernel support lifecycle picture... Thanks!
Looks like the documentation is stale because we use a vm template for Ubuntu Server 20.04 created by Canonical:
~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
~$ uname -a
Linux u2001 5.15.0-1014-azure #17~20.04.1-Ubuntu SMP Thu Jun 23 20:01:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~$ date
Wed Jul 20 19:56:42 UTC 2022
PS > (Get-AzVM -Name u2004 -Status).StorageProfile.ImageReference
Publisher : canonical
Offer : 0001-com-ubuntu-server-focal
Sku : 20_04-lts
Version : latest
ExactVersion : 20.04.202207130
SharedGalleryImageId :
CommunityGalleryImageId :
Id :
I was able to reproduce this issue on a clean Azure vm
What was the virtual environment version here? @al-cheb
@Praveenrajmani, as I mentioned before, I took a clean VM and pre-installed only moby-engine on it. Maybe it's a bug in the new Linux kernel version, 5.15.0-1014-azure.
I did more digging for my scenario and found the root cause for it.
I have a GitHub Action (https://github.com/step-security/harden-runner) that uses Linux Audit Framework to audit events on the Ubuntu VM.
In the latest release of the Ubuntu 20.04.4 VM, there is already a process that is listening to the audit events. The process is /opt/microsoft/auoms/bin/auomscollect.
In the previous release of the Ubuntu 20.04.4 VM and in the latest release of the Ubuntu 22.04 VM, this is not the case.
As a result, https://github.com/step-security/harden-runner is not able to listen to audit events, as only one process can modify the audit rules.
Searching for auomscollect, I found this project - https://github.com/microsoft/OMS-Auditd-Plugin. Do you know if having this enabled in the Runner VM is by design? If not, can this be turned off?
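To check whether the collector is present on a given runner, one option is to scan `/proc` for a process whose executable path matches the one above. This is a rough sketch under that assumption, not how harden-runner detects the conflict; `find_processes` is a hypothetical helper, and it only inspects the first argument of each command line.

```python
import os


def find_processes(exe_suffix, proc_root="/proc"):
    """Return sorted PIDs whose argv[0] ends with exe_suffix."""
    pids = []
    if not os.path.isdir(proc_root):
        return pids  # not a Linux-style /proc; nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                argv0 = fh.read().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if argv0.endswith(exe_suffix):
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    pids = find_processes("/auoms/bin/auomscollect")
    print("auomscollect running:", bool(pids), pids)
```

Running this as a step on both ubuntu-20.04 and ubuntu-22.04 would show on which image the process exists.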
@varunagrawal, The OMS Audit data collection daemon is active by default on Azure runners.
Hi @al-cheb, auomscollect was not enabled until this release of the Ubuntu 20.04.4 VM. Also, in the current release of the Ubuntu 22.04 VM, it is not enabled. It is definitely a change from the previous release of the Ubuntu 20.04.4 VM, and it is odd that the same change is not applied to the Ubuntu 22.04 VM yet.
@varunsh-coder, sorry, I was wrong. I have reached out to the right team to clarify this. Currently, it’s unexpected behavior that AzSecPack is being installed on Ubuntu 18/20, but we are in control of it at the moment. I will let you know if I get more information.
@al-cheb you're tagging the wrong person. 😂
Thanks @al-cheb! Here is a workflow I created to view the status of auoms on ubuntu-latest and ubuntu-22.04. It shows that the service is running on one and not running on the other.
https://github.com/varunsh-coder/actions-playground/actions/runs/2712687332
@varunsh-coder, The ubuntu-latest spec is linked to ubuntu-20.04.
@varunsh-coder, We are planning to deploy new Ubuntu 18/20 images next week with the OMS daemon disabled.
We've spent the last two days troubleshooting why our Gradle builds (Android) started failing a couple of days ago. After a lot of struggling with memory settings (that we don't believe are the problem) and testing the builds, we finally got something from the logs, but it isn't any more useful than telling us that GitHub's runner cancelled the job run, rather than an OOM kill.
##[debug]System.OperationCanceledException: The operation was canceled.
##[debug] at System.Threading.CancellationToken.ThrowOperationCanceledException()
##[debug] at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage)
##[debug] at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug] at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Build APK
At first, we thought maybe the Runner was changing the MaxMetaspaceSize JVM option, and we tried changing that.
I can't provide any solutions for this, because I think it's either internal code GitHub owns, or it's the OS sending a HUP because of a malloc limit (not quite an OOM). Either way, I think your latest images for Ubuntu 18 and 20 are broken in a way that only shows up in large build contexts.
EDIT: To be clear, we can reproduce this failure on jobs that executed fine last week, on the same commit. The builds will always fail at least one of the matrix jobs, causing the whole thing to fail.
As others have mentioned above, 22.04 seems to "fix" this issue where the job is getting cancelled without any reason given.
Thanks a lot @al-cheb! This is great to hear.
I have a suggestion/idea to improve the process. Do you think it would be possible to list the services that are expected to run as part of the VM README? As of now, it lists the software installed, but not the services that are expected to run.
Listing the services that will run on the VM will help inform users of any new services that are getting added/removed as part of a release. If this process had been in place, we would have seen that a new service was getting added for Ubuntu 18/20 images.
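As a rough illustration of the idea: if each release published its list of running services, a simple set difference would surface additions and removals. The sketch below assumes such lists exist; `diff_services` is a hypothetical helper and the service lists are made-up placeholders (only `auoms` comes from this thread).

```python
def diff_services(previous, current):
    """Return (added, removed) service names between two image releases."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)


# Hypothetical service lists for two releases of the same image.
services_old = ["docker", "ssh", "walinuxagent"]
services_new = ["auoms", "docker", "ssh", "walinuxagent"]

added, removed = diff_services(services_old, services_new)
print("added:", added)      # added: ['auoms']
print("removed:", removed)  # removed: []
```

A diff like this in the release notes would have flagged the new audit daemon before workflows started failing.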
Hey @Praveenrajmani, @varunsh-coder, @sylus, @tsal. The new Ubuntu Server 18.04/20.04 images have been deployed.
@Praveenrajmani, I checked the new Ubuntu Server 20.04 image and it works for me.
Hi @al-cheb, I can still see the same problem, and my virtual environment provisioner version remains the same - https://github.com/minio/directpv/runs/7519147904?check_suite_focus=true
Shouldn't the provisioner version be updated too?
@Praveenrajmani, I checked, and it worked for Run upgrade test from v2.0.9. It looks like you should fix the v1.4.6 version and adapt it to the new kernel version.
looks like 1.4.6 is not supported on latest kernel versions @al-cheb
what is the kernel version here? @al-cheb
Linux kernel version: 5.15.0-1014-azure
Thanks @al-cheb, may I know the kernel version that was used in 20220710.1? Are there any dependency docs explaining this?
Linux kernel version: 5.13.0-1031-azure
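For reference, the two kernel releases reported in this thread can be compared programmatically, which may be useful for tools that need to branch on the kernel series. This is a minimal sketch; `kernel_major_minor` is a hypothetical helper, and on a live runner the release string could be read with `platform.release()` instead of the hard-coded values.

```python
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string like '5.15.0-1014-azure'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


# Kernel releases reported in this thread for the two image versions.
KERNELS = {
    "20220710.1": "5.13.0-1031-azure",
    "20220717.1": "5.15.0-1014-azure",
}

old = kernel_major_minor(KERNELS["20220710.1"])
new = kernel_major_minor(KERNELS["20220717.1"])
print(old, new, "kernel series changed:", old != new)
```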
Thanks @al-cheb! I can confirm that https://github.com/step-security/harden-runner is working fine now. I did share ideas in a separate comment on how to prevent regressions. Please let me know if you have thoughts on that. Thanks!
Thank you for the proposal. We will think about it.
I am planning to close the thread, since the issue is reproducible on a clean Azure VM. Feel free to reopen the thread if you have any concerns.
Hi @al-cheb, I wanted to check if there is a way to test out a GitHub Actions runner VM before it is released. As an example, I would like to test a few of my workflows on a new image 1-2 weeks before it gets released for GitHub-hosted runners. Is there a way I can do this? Thanks!
@varunsh-coder, hey, unfortunately, we can't provide early access to a new image. What kind of tests do you want to run? Maybe we could integrate them with our current tests.
@varunagrawal, hey, unfortunately, we can't provide early access to a new image. What kind of tests do you want to run? Maybe we could integrate them with our current tests.
@al-cheb, you tagged the wrong Varun again :). I am @varunsh-coder. I need to run a few workflows that use https://github.com/step-security/harden-runner GitHub Action on the new image, to make sure the workflows pass.
@varunsh-coder, I will create an internal ticket to investigate how we can integrate them. Which workflows should we use?
Thanks a lot @al-cheb for your help with this! I currently run an integration test on a set of workflows on different repos. See this: https://github.com/step-security/harden-runner/actions/runs/3156704291/jobs/5136693292#step:5:8. But even a simple workflow that runs for more than 5 minutes and makes a few outbound calls should do. If you want, I can create a workflow and share it with you.
@varunsh-coder, Thank you. If we have a separate workflow it will help us a lot.
ok, let me work on that and get back. thank you!
Hi @al-cheb, here is the workflow: https://github.com/varunsh-coder/actions-playground/blob/main/.github/workflows/harden-runner-test.yml
Please let me know if you can include this as part of your new image tests. It can run independently in any repository. It should pass for new images, but if it fails, I would like to know before the new image gets released. Thank you!
Hello @varunsh-coder! There is a problem at the moment. We use Azure DevOps for the phases you are interested in. Thus, the GH Actions workflow is not really suitable for integrating with our Canary tests. I propose to think together about what can be done in this case. Could you provide a more detailed description of the required steps so that we can rewrite the code ourselves, or do you have examples of snippets that run on the ADO platform?
Hi @erik-bershel, thanks a lot for the info. As next steps, I can port the GitHub Action to an ADO task and set up an ADO pipeline. It will take me some time to do this, but once I am done, I can share the ADO pipeline YAML file.
Hi @varunsh-coder, is there anything I can help with? Any news about the implementation?
Thanks @erik-bershel for following up! Not yet, should be done in a couple of weeks. Will get back once it is ready. Thanks!
Hi @erik-bershel, the ADO pipeline is ready. It is here: https://github.com/varunsh-coder/actions-playground/blob/main/azure-pipelines.yml
It has 4 jobs: 2 run on ubuntu-20.04 and 2 on ubuntu-22.04. They need to be triggered as part of the canary test for a new image, so I guess you will need to update it to use vmImage tags for the unreleased image.
If any of the jobs fail during the canary test, I would like to be notified. Is that ok? Please let me know if you have any questions. I can share my email address or set up a notification method as needed. Thanks again!
Description
The test builds on GitHub runners have been failing for the past 24 hours. On investigation, it looks like the virtual environment version was upgraded.
Working virtual environment version: 20220710.1
Working virtual environment provisioner version: 1.0.0.0-main-20220616-1
Non-working virtual environment version: 20220717.1
Non-working virtual environment provisioner version: 1.0.0.0-main-20220701-2
Is there a way to select a specific virtual environment version for the GitHub Actions runners?
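Since the image version cannot be pinned, a workflow can at least detect which image it landed on and report it early. The sketch below assumes the runner exposes the image version in an `ImageVersion` environment variable; that variable name is an assumption to verify on your runner, and `check_image_version` is a hypothetical helper.

```python
import os

# Image versions reported in this issue.
KNOWN_GOOD = "20220710.1"
KNOWN_BAD = {"20220717.1"}


def check_image_version(env=None):
    """Return a short status string for the runner image version in env."""
    env = os.environ if env is None else env
    # Assumed variable name; confirm it exists on your runner.
    version = env.get("ImageVersion")
    if version is None:
        return "unknown: ImageVersion is not set"
    if version in KNOWN_BAD:
        return f"affected: image {version} (last known good: {KNOWN_GOOD})"
    return f"ok: image {version}"


if __name__ == "__main__":
    print(check_image_version())
```

This does not select an image; it only makes the version visible in the job log so a failure can be tied to an image rollout quickly.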
Platforms affected
Virtual environments affected
Image version and build link
Working build: https://github.com/minio/directpv/runs/7389713364?check_suite_focus=true
Non-working build: https://github.com/minio/directpv/runs/7408016471?check_suite_focus=true
Is it regression?
yes
Expected behavior
The builds were succeeding 24 hours earlier; we expect them to succeed with the latest virtual environment version as well.
Actual behavior
caused a regression with the latest release
Repro steps
Simply rerunning any previously succeeded builds would fail
Also raised a dummy PR without any changes to test the regression.