cgwalters commented 3 years ago

CoreOS CI 2021

Nothing in this is a firm commitment. Including "2021" in the title is intended to imply that e.g. we may change small or large aspects of this in 2022 (or earlier). Individual repository owners may choose to do different things. However, having standardized and well-maintained centralized CI flows is a huge benefit to our team.

This issue is moved from https://github.com/coreos/fedora-coreos-tracker/issues/263

Proposal: Use a combination of CoreOS CI Jenkins, OpenShift Prow and per-repository Github actions.

CoreOS CI Jenkins (CCI)

It is what we use on various repositories, and is how FCOS is released today. We have a lot of institutional knowledge around this and it gives us a place where we can easily control the end-to-end interactions. Jenkins is a well understood tool.

This is deployed in CentOS CI which is a bare metal OpenShift cluster where nested virt is enabled.

Advantages:

We own it, from GH webhook

Disadvantages:

We own it (except the underlying cluster infra)
Non-elastic and can hit capacity issues

Current and proposed use cases:

e2e CoreOS testing (particularly with qemu)

OpenShift Prow

Prow is heavily oriented towards testing OpenShift container components. However, as of recently we enabled nested virt on the build02 GCP cluster, which means we can create "container native" flows that still test the OS with coreos-assembler.

For Fedora CoreOS, we are independent of OpenShift release cycles. For RHEL CoreOS, we are tightly tied to them. It is really a requirement for openshift/os to increasingly tightly integrate with Prow. Specifically for openshift/os for example we want to follow the same release-4.X git branches as the platform.

We also use it as a "merge bot" on some repositories with /approve and /lgtm.

Advantages of Prow:

"We" (in the OpenShift sense) own it and is maintained by a dedicated (not us) team
Has CI cloud credentials paid for by the OpenShift org - e.g. we can wire up /test azure and /test aws to e.g. Ignition
Makes it easy for us to centralize CI flows
Fully elastic infrastructure

Disadvantages of Prow:

The openshift/release repository is enormous and unwieldy, but if we define our own centralized steps hopefully we wouldn't need to touch them often
By default, removes control of merge timing from the PR submitter (unless they /hold)
Very spammy in PRs, which makes notifications significantly less usable (walters: I have an email rule that trashes these)
Removes PR references from Git history (can this be configured?)
Assumes output and/or testing artifact is a container. It's harder to create KVM-type containers (Gangplank can help here)
Prow test results can't be opened in background tabs (https://github.com/kubernetes/test-infra/issues/19499)

Current and proposed use cases:

e2e CoreOS testing (qemu and cloud infra)

GitHub Actions

Free for small scale, nice to use. This is a good option for per-repository specific things that don't need centralization.

Advantages:

UX nicely integrated with GitHub

Disadvantages:

Actions running fully within a repo have write access to the repo, requiring extra attention to security

Example use cases:

Validate a project builds with an older Rust toolchain
Validate gofmt/rustfmt

Other CI systems

Ideally, we focus on these 3 and sharing as much as possible between them. The more CI systems we have, the more overhead there is for engineers and particularly new contributors to understand.

On this topic, this proposal specifically calls for dropping Travis usage.

CI types

Our different repositories

CI for CI itself - e.g. testing changes to the CI flow. For Prow, this is covered by "rehearsals". For GH actions, it is naturally covered by PRs to the repo. coreos-ci-lib would need special handling.
CI for config git (i.e. fedora-coreos-config and openshift/os)
CI for coreos-assembler/kola (can affect everything )
Owned component (rpm-ostree, ignition, afterburn)
Unowned component (kernel, systemd, NetworkManager)
Production delivery (FCOS/RCHOS devel and final builds)

Each type has different requirements and tolerances.

cgwalters commented 3 years ago

In particular a major sub-thread I want to pursue is this:

Has CI cloud credentials paid for by the OpenShift org - e.g. we can wire up /test azure and /test aws to e.g. Ignition

Examples:

/test azure, /test aws, /test vsphere for ignition/afterburn
The same but for fedora-coreos-config git as well as release jobs
The same for openshift/os

Also, we could add an optional /test e2e-okd-aws to fedora-coreos-config that would launch and test an OKD cluster. We could also add a Prow periodic that does the same.

In particular a flow I'd really like to enable that crosses all domains - as a Fedora kernel developer, test this kernel package in OKD. (And the same for C8S/RHEL).

bgilbert commented 3 years ago

Is it possible to run tests with Prow, but not have it spam PR comments and not have it responsible for merging?

cgwalters commented 3 years ago

Yes, I think if we e.g. removed the approve and lgtm from https://github.com/openshift/release/blob/91f2752387252f26d3d9e1530823f2b0c530751a/core-services/prow/02_config/_plugins.yaml#L3837 then you wouldn't get the "This PR is approved" comment etc. It may be possible to remove all the plugins indeed to have Prow just run CI jobs and doesn't control merging. (If it doesn't work, the OpenShift Prow maintainers are also upstream Prow developers)

Skimming through that file, e.g. ComplianceAsCode turned off most plugins and looking at a random recent PR there I don't see any GH comments from the Prow bot.

cgwalters commented 3 years ago

The "merge logic" is a whole sub-thread to this. I would say ideally we have some consistency across at least coreos/ repositories on this. The Prow logic predates the existence of the Github-native PR approval, not to mention the even more recent Github-native "Merge this PR after CI is green" logic.

We could maybe take a vote on the options of:

Use Prow as a merge bot
Use GH-native logic for approval/merge
Per repository choices

Maybe a sub-spike on this is adding a Prow job to e.g. coreos/ignition but turn off the plugins and see how that goes? (I assume ignition doesn't want the comment spam?)

cgwalters commented 3 years ago

Let's try out https://github.com/openshift/release/pull/16706

cgwalters commented 3 years ago

Even more sub-details around this:

When configured to handle merging, Prow will re-test PRs against the latest master. It solves the "semantically but not textually conflicting PRs" problem, see https://github.com/barosl/homu#why-is-it-needed Prow also automatically handles e.g. batching PRs together. It's not clear to me that at our scale this problem occurs often. But, we will need to watch out for it.

The comment spam around failing tests isn't unique to Prow; e.g. Github Actions also seems to default to sending a direct email with PR results, and Travis did the same for a long time. The core problem they're solving is that if your CI takes longer than a minute, the submitter is going to context switch away; and without an async notification of failure they need to poll.

Now honestly, I've been trashing the Prow emails for a long time and indeed I rely on polling; I periodically check on my PRs across various repos. But, it's a bit awkward.

bgilbert commented 3 years ago

The GitHub Actions email is harmless by comparison, since it doesn't conflate humans and bots into the same thread (email or web).

Prow feels like it's designed for a workflow where only the bot is thinking about the overall state of the repo, and humans are laser-focused on individual PRs. That could make sense for projects at a certain scale, but to me it feels intrusive for smaller projects. The lack of control of merge timing [1] and the assumption that automated tests have sufficient coverage [2] both seem like sources of friction.

[1] Both its tendency to encourage premature merges by accepting the first /lgtm, and its habit of waiting 30+ minutes when it does decide to merge [2] It assumes that rebases are safe as long as tests pass, and that rebases don't discard information (which is only true if manual testing is not used).

bgilbert commented 3 years ago

(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)

cgwalters commented 3 years ago

Both its tendency to encourage premature merges by accepting the first /lgtm,

Hmm, I think /lgtm is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like ":+1: LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.

and its habit of waiting 30+ minutes when it does decide to merge

I usually only see that kind of latency when it's doing retesting (for the reason of semantic PR conflicts), and on repos which have nontrivial CI (which is definitely true for many of the Kubernetes and OpenShift repos, as well as ours - we're using Prow for "heavy lifting" jobs).

(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)

Yeah...I think I agree that for probably all of our repos it's a high cost relative to the value. The standard other mitigation for this is to have "post merge periodics" that build and test git master - we definitely do that too. Periodics/post merge are a good place to do more expensive CI too.

Anyways, https://github.com/openshift/release/pull/16706 merged and you can see the effect here: No Prow comments in https://github.com/coreos/rpm-ostree/pull/2655 (edit: I mean other than it replying to me around the use of /override - but it specifically didn't add a comment about approvals or the tests failing)

What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.

cgwalters commented 3 years ago

OK on a non-Prow topic: The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.

Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff. That's OK I guess, but we should be sure that we have required status checks and branch protection on I think.

bgilbert commented 3 years ago

Hmm, I think /lgtm is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like "+1 LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.

Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default. That makes it easier for the submitter to make revisions based on discussion, or to wait for more review.

The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.

I'm not thrilled with it either. The obvious case is safe (jobs don't get write access if they're running on PRs submitted from a fork) but that only makes the other cases more subtle.

Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff.

Yup, that's right.

[1] When the submitter has ongoing responsibility for the repo, which usually means when they have write access.

bgilbert commented 3 years ago

Does Prow get us any CI on s390x?

cgwalters commented 3 years ago

Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default.

If I don't want a PR to merge I usually make it draft, personally.

bgilbert commented 3 years ago

If I don't want a PR to merge I usually make it draft, personally.

That has different semantics, though. I read "draft" as "incomplete" and "non-draft" as "is believed ready to merge as-is". But the latter might change due to events, and it's not great to have to leap for the "convert to draft" button.

cgwalters commented 3 years ago

Does Prow get us any CI on s390x?

The answer appears to be "not today", but there is a whole Multi-Arch CI thing internally that is doing related things, and that work might lead into supporting this for us.

cgwalters commented 3 years ago

What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.

xref https://github.com/openshift/release/pull/16910

bgilbert commented 3 years ago

I've enabled required PR reviews in the cosa master branch and marked continuous-integration/jenkins/pr-merge as required.

cgwalters commented 3 years ago

OK there's a new pain point. CI for rpm-ostree is currently failing here https://jenkins-coreos-ci.apps.ocp.ci.centos.org/blue/organizations/jenkins/github-ci%2Fcoreos%2Frpm-ostree/detail/PR-2694/3/artifacts due to:

[2021-03-24T23:49:50.814Z] [Warning][coreos-ci/pod-cc34d3db-a641-4ee4-88be-7a87e1536d09-9rdh2-4gbmc][Failed] Failed to pull image "registry.svc.ci.openshift.org/coreos/cosa-buildroot:latest": rpc error: code = Unknown desc = error pinging docker registry registry.svc.ci.openshift.org: invalid status code from registry 503 (Service Unavailable)

Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci. But then later, the DPTP team is shutting off registry.svc in favor of registry.ci which is more rigorous about what goes into it - it also requires authentication.

I think what we should be doing is this: https://docs.ci.openshift.org/docs/how-tos/mirroring-to-quay/

But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.

Longer term though at least for cosa I think it would make a lot of sense to ensure that the rhcos-4.X git branches are built and tested in Prow.

This topic also relates to https://github.com/coreos/fedora-coreos-config/pull/740 and of course the general question of where our CI builds run. Following that perhaps it should be quay.io/coreos/fedora-coreos-buildroot:stable or so.

jlebon commented 3 years ago

Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci.

Hmm, you should have access to the image. I've also invited you to the org itself. (I'm not entirely sure what the story is there wrt why it's in a separate namespace vs just coreos/. We should probably eventually fold it back into coreos/ but... we'd probably want to keep mirroring at the old location for a while.)

But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.

No strong opinion either way. Maintenance-wise, the simplest would be to just have Quay.io build the buildroot image to keep it consistent with how the main image is built. (Maybe just quay.io/coreos-assembler/buildroot given the namespace). But if you'd like, we can exercise the new mirroring path and set up whatever secrets needed for that.

cgwalters commented 3 years ago

quay.io/coreos-assembler/buildroot

The thing is though that cosa is (somewhat) of a "cross" tool, whereas the buildroot is Fedora right now and we can't escape that. So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next that would be 34 for example?

jlebon commented 3 years ago

So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next that would be 34 for example?

Hmm, I agree that's where we want to go. But until we do https://github.com/coreos/fedora-coreos-config/pull/740, it's a bit of a lie since it's really tracking the cosa $releasever. So maybe for now, just quay.io/coreos-assembler/fcos-buildroot:latest, but once we do https://github.com/coreos/fedora-coreos-config/pull/740, we set up builds for each branch as you've described? (I think in practice, we don't really need one for each stream since stable and testing are just two weeks apart. I think just testing and next would be sufficient; or maybe that should be testing-devel and next-devel to match the branch names we'd want the image to track.)

jlebon commented 3 years ago

Meh, if we're agreed on https://github.com/coreos/fedora-coreos-config/pull/740, I can just brush that up right now and we get it in and set up those tags.

cgwalters commented 3 years ago

Meh, if we're agreed on coreos/fedora-coreos-config#740, I can just brush that up right now and we get it in and set up those tags.

SGTM!

cgwalters commented 3 years ago

That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.

jlebon commented 3 years ago

Longer term though at least for cosa I think it would make a lot of sense to ensure that the rhcos-4.X git branches are built and tested in Prow.

Yeah, CI for those branches is in a sad state right now. CoreOS CI wants to build Fedora and of course we don't have access to RHEL packages like api.ci does. (Though worth noting that it's slightly less broken now that CoreOS CI knows how to build images properly, which means that it should use the right Fedora base image at least.)

So this is definitely a good argument for building and testing in Prow.

My only hesitation would be whether we can make the registry.ci locations pullable publicly (i.e. no auth). I personally don't like the idea of introducing mirroring into the mix just to work around that because it adds lag and complexity.

Hmm, but actually, we could just use Prow for testing RHCOS builds on PRs targeting those branches, but still have pushed commits build on Quay.io like today, right?

That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.

Do we need a promotion model for cosa other than the PR workflow? If a change breaks, we revert in git and rebuild. It seems like that model has worked pretty well so far.

cgwalters commented 3 years ago

Perhaps one debate to have is creating github.com/openshift/coreos-assembler - i.e. we drop the rhcos-$x branches in github.com/coreos/coreos-assembler.

jlebon commented 3 years ago

OK, we now have https://quay.io/repository/coreos-assembler/fcos-buildroot?tab=builds! We'll need to update repos which used the buildroot image to start building with that image instead. I opened https://github.com/coreos/coreos-ci-lib/pull/66 to make this easier.

cgwalters commented 3 years ago

OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md

jlebon commented 3 years ago

OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md

SGTM!

cgwalters commented 3 years ago

OK since we got rough consensus here, closing. We can open new issues or even PRs to the doc for proposed plans.

eriksjolund commented 3 years ago

I followed the link in https://github.com/coreos/fedora-coreos-tracker/issues/764#issue-828291335 and noticed that things have now changed regarding read-only tokens for GitHub Actions.

GitHub has now implemented more granular permissions. See https://github.community/t/read-only-token-for-ci/17295/8 https://docs.github.com/en/actions/reference/authentication-in-a-workflow#permissions-for-the-github_token

cgwalters commented 3 years ago

Hi, thanks for following up. Indeed, we (particularly bgilbert) have been gradually enabling those restrictions, e.g. https://github.com/coreos/stream-metadata-go/pull/28 etc.

bgilbert commented 3 years ago

As far as I know, all of our Actions workflows (except for bootupd) should have restrictions enabled now.

coreos / fedora-coreos-tracker

CI for CoreOS projects: 2021 plan #764

CoreOS CI 2021

CoreOS CI Jenkins (CCI)

OpenShift Prow

GitHub Actions

Other CI systems

CI types