containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

CI Cleanup: Remove cgroups v1 & runc support #23020

Closed cevich closed 3 months ago

cevich commented 3 months ago

With the (esp. Debian) CI VM images built by https://github.com/containers/automation_images/pull/338, CI no longer tests with runc or cgroups v1. Add logic to fail under those conditions, and prune back the high-level YAML/script env vars and logic formerly required to support them.

Does this PR introduce a user-facing change?

None
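The fail-fast logic the description mentions could be sketched roughly like this (illustrative only, not the PR's actual code; `die` and the `CI_DESIRED_RUNTIME` variable are assumed names for this example):

```shell
#!/usr/bin/env bash
# Illustrative sketch only -- not the PR's actual code. die() and
# CI_DESIRED_RUNTIME are assumed names invented for this example.

die() { echo "FATAL: $*" >&2; exit 1; }

# Map a /sys/fs/cgroup filesystem type (from stat -f) to a cgroups version.
cgroups_version() {
    case "$1" in
        cgroup2fs) echo 2 ;;  # unified hierarchy, i.e. cgroups v2
        tmpfs)     echo 1 ;;  # v1 mounts per-controller dirs under tmpfs
        *)         echo 0 ;;
    esac
}

# Fail fast if the CI VM is still on cgroups v1 or configured for runc.
check_ci_environment() {
    local fstype
    fstype=$(stat -f -c %T /sys/fs/cgroup)
    [ "$(cgroups_version "$fstype")" -eq 2 ] || \
        die "CI no longer supports cgroups v1 (found: $fstype)"
    [ "${CI_DESIRED_RUNTIME:-crun}" != runc ] || \
        die "CI no longer supports runc"
}
```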
cevich commented 3 months ago

@edsantiago I sent you a slack message, posting it here assuming it was lost:

I noticed a TON of system test calls to skip_if_cgroupsv1() and some is_cgroupsv2(). I'm guessing there are a lot of similar e2e conditionals. I'm unsure if I should bother updating/removing all of them for the new "cgroups v2 only" world-order. I'm assuming no, but do you have a different opinion?
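For context, helpers like these are commonly implemented along the following lines in bats system tests (a sketch, not the exact podman helpers.bash code; `skip` is the function bats provides inside tests):

```shell
# Sketch of typical bats skip helpers; not the exact podman implementation.

# True when the unified (v2) cgroup hierarchy is mounted.
is_cgroupsv2() {
    test "$(stat -f -c %T /sys/fs/cgroup 2>/dev/null)" = "cgroup2fs"
}

# Skip the current test under cgroups v1; `skip` is provided by bats.
skip_if_cgroupsv1() {
    if ! is_cgroupsv2; then
        skip "${1:-test does not work under cgroups v1}"
    fi
}
```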

Luap99 commented 3 months ago

> I noticed a TON of system test calls to skip_if_cgroupsv1() and some is_cgroupsv2(). I'm guessing there are a lot of similar e2e conditionals. I'm unsure if I should bother updating/removing all of them for the new "cgroups v2 only" world-order. I'm assuming no, but do you have a different opinion?

Does downstream QE run these tests with cgroupsv1 on RHEL 9? If so, I think it is best to keep them for a while at least. If not, I'd like to remove them, although I wouldn't block this PR on it. That could happen in a follow-up; I think there are bigger CI priorities right now compared to removing a bunch of conditionals.

edsantiago commented 3 months ago

Saw your message, am still catching up from PTO. I'd say removing the cgroups conditionals is a rainy-day exercise for the future. Although it seems trivial, it won't be (I expect linter issues, easy but tedious). Oh, and Paul's point is a good one: I had assumed that podman v5 is cgroupsv2-only, but I never know what RHEL is going to do.

For the time being, I think it's best to not tackle cgroups conditionals.

cevich commented 3 months ago

Thanks for the feedback guys, I too had not considered the RHEL case :blush:

Do we even ever have "rainy days" :rofl:

edsantiago commented 3 months ago
[+0815s] not ok 277 [120] podman image scp transfer in 2599ms
...
<+052ms> # # podman image scp foo.bar/nonesuch/c_9yzja1xujd:mytag some9825dude@localhost::
...
         # time="2024-06-18T15:55:35Z" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"

This smells like an ssh problem. Maybe a missing loginctl, or some sort of pam setup not being done in debian for rootless?

cevich commented 3 months ago

Looks like we cross-posted.

> Maybe a missing loginctl, or some sort of pam setup not being done in debian for rootless?

I'm pretty sure there is no loginctl used for the rootless setup. Since that was my thought too, let's give it a try...
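The idea being tried here, as a sketch (the function name and verification step are illustrative, not the CI setup's actual code):

```shell
# Illustrative sketch: enable lingering for the rootless test user so a
# per-user systemd instance (and session D-Bus) exists without a login.
enable_rootless_linger() {
    local user=$1
    loginctl enable-linger "$user"
    # Expect "Linger=yes" in the output once enabled:
    loginctl show-user "$user" --property=Linger
}
```

Usage would be something like `enable_rootless_linger "$ROOTLESS_USER"` during VM setup, before any rootless tests run.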

cevich commented 3 months ago

Force-push: Added test fixme! commit to see if enabling rootless lingering fixes podman image scp problem on rootful debian.

cevich commented 3 months ago

Same/similar failure despite lingering being enabled for the rootless user. Thinking more, I wonder if this is happening because CGROUP_MANAGER=systemd is set in /etc/ci_environment and is getting passed through into the rootless user environment somehow :thinking:

cevich commented 3 months ago

> I wonder if

Answer: Doesn't appear to be. There's no modification of the rootless user's .bashrc or .bash_profile or anything else that would load /etc/ci_environment for the user, unless maybe podman itself is passing the current CGROUP_MANAGER value through?

I think the next step is to just go hands-on with hack/get_ci_vm.sh where the entire setup can be simulated for experimentation.
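One quick way to check for such leakage, hypothetically: capture the rootless user's environment (e.g. via `ssh user@localhost env` or `sudo -iu user env`) and inspect it. The tiny helper below, invented for illustration, just extracts the value from an env(1)-style dump:

```shell
# Hypothetical helper: read an env(1)-style dump on stdin and print the
# value of CGROUP_MANAGER if it is set (empty output means no leak).
leaked_cgroup_manager() {
    sed -n 's/^CGROUP_MANAGER=//p'
}

# Example use (hypothetical user name):
#   ssh rootlessuser@localhost env | leaked_cgroup_manager
```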

Luap99 commented 3 months ago

What systemd version is used? On Fedora we saw a regression where lingering was broken on 256-rc1 through rc3; using 256-rc4 or the final release fixed it again, AFAIK.

edsantiago commented 3 months ago

I think that may be the problem: systemd on debian is 256~rc3-5. The important thing there is the rc3, which is bad; I had misread the 5 as being >4 and therefore good. That was the wrong part to look at.

Since rootless tests work despite the bad systemd, I would suggest just leaving this ssh test disabled for now. Unless someone feels like building new CI VMs.
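For the record: in a Debian version string like `256~rc3-5`, everything after the final `-` is the Debian package revision, so the upstream part is `256~rc3`, and the `rc3` is what matters for the lingering regression. A small sketch of pulling that out (illustrative helper, not anything in the repo):

```shell
# Sketch: extract the upstream rc number from a Debian systemd version
# string, or print "final" for non-rc releases.
upstream_rc() {
    local v=${1%-*}            # strip Debian revision: 256~rc3-5 -> 256~rc3
    case "$v" in
        *~rc*) echo "${v##*~rc}" ;;
        *)     echo final ;;
    esac
}
```

So `upstream_rc 256~rc3-5` yields `3`, the bad rc, while the trailing `5` is only the packaging revision.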

cevich commented 3 months ago

> I would suggest just leaving this ssh test disabled

Maybe add a timebomb() onto it?
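The idea behind a timebomb-style skip: skip the test until a deadline, then start failing so the skip can't be silently forgotten. A pure sketch of that decision logic (not the actual helper in the test suite):

```shell
# Sketch of the timebomb idea: before the deadline a test is skipped;
# after it, the skip turns into a failure. Pure helper for illustration.
timebomb_state() {
    local deadline=${1//-/}            # "2024-09-01" -> 20240901
    local today=${2:-$(date +%Y%m%d)}  # second arg injectable for testing
    if (( today <= deadline )); then
        echo skip
    else
        echo fail
    fi
}
```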

cevich commented 3 months ago

Confirmed, looks like Lokesh's recent builds have 256~rc3-5 :disappointed:

cevich commented 3 months ago

Force-push: Added skip for scp test on debian

cevich commented 3 months ago

This is a new one for me:

<+016ms> # # podman pull quay.io/libpod/testimage:20240123
<+0120s> # Trying to pull quay.io/libpod/testimage:20240123...
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+005ms> # [ rc=124 ]
         # *** TIMED OUT ***
         # # [teardown]

Assuming it's a flake and re-running.

cevich commented 3 months ago

@edsantiago want me to wait for #23058 to go in, then re-test this w/o the debian/systemd scp test skip?

edsantiago commented 3 months ago

@cevich CI is blowing up hard right now (see #23059); I don't know if it's a github problem, or cirrus, or something to do with the new cirrus.yml skips. Let's get that resolved before pushing anything.

(But yes, once things clear up, I think it'd be good to rebase with the debian skip removed)

cevich commented 3 months ago

> CI is blowing up hard right now

Yeah, I saw #23059. IMO (I didn't look closely) ISTM it could easily be a networking/quay timeout of some form. I think it's probably just a coincidence with the new skips.

cevich commented 3 months ago

force-push: Rebased on top of https://github.com/containers/podman/pull/23059 w/ updated CI VM images (https://github.com/containers/podman/pull/23058).

edsantiago commented 3 months ago

I don't understand why you included #23059, but otherwise LGTM. Fingers crossed for debian CI

edsantiago commented 3 months ago

Sigh

cevich commented 3 months ago

Oh my bad, for some reason I thought the timeout fixed the scp problem. Must have been brain-tired :blush:

rhatdan commented 3 months ago

/approve /lgtm

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)~~ [rhatdan]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.