Closed cgwalters closed 3 years ago
In particular a major sub-thread I want to pursue is this:
Has CI cloud credentials paid for by the OpenShift org - e.g. we can wire up /test azure and /test aws to e.g. Ignition
Examples:
/test azure
, /test aws
, /test vsphere
for ignition/afterburnAlso, we could add an optional /test e2e-okd-aws
to fedora-coreos-config that would launch and test an OKD cluster. We could also add a Prow periodic that does the same.
In particular a flow I'd really like to enable that crosses all domains - as a Fedora kernel developer, test this kernel package in OKD. (And the same for C8S/RHEL).
Is it possible to run tests with Prow, but not have it spam PR comments and not have it responsible for merging?
Yes, I think if we e.g. removed the approve
and lgtm
from https://github.com/openshift/release/blob/91f2752387252f26d3d9e1530823f2b0c530751a/core-services/prow/02_config/_plugins.yaml#L3837 then you wouldn't get the "This PR is approved" comment etc. It may be possible to remove all the plugins indeed to have Prow just run CI jobs and doesn't control merging. (If it doesn't work, the OpenShift Prow maintainers are also upstream Prow developers)
Skimming through that file, e.g. ComplianceAsCode turned off most plugins and looking at a random recent PR there I don't see any GH comments from the Prow bot.
The "merge logic" is a whole sub-thread to this. I would say ideally we have some consistency across at least coreos/ repositories on this. The Prow logic predates the existence of the Github-native PR approval, not to mention the even more recent Github-native "Merge this PR after CI is green" logic.
We could maybe take a vote on the options of:
Maybe a sub-spike on this is adding a Prow job to e.g. coreos/ignition but turn off the plugins and see how that goes? (I assume ignition doesn't want the comment spam?)
Let's try out https://github.com/openshift/release/pull/16706
Even more sub-details around this:
When configured to handle merging, Prow will re-test PRs against the latest master. It solves the "semantically but not textually conflicting PRs" problem, see https://github.com/barosl/homu#why-is-it-needed Prow also automatically handles e.g. batching PRs together. It's not clear to me that at our scale this problem occurs often. But, we will need to watch out for it.
The comment spam around failing tests isn't unique to Prow; e.g. Github Actions also seems to default to sending a direct email with PR results, and Travis did the same for a long time. The core problem they're solving is that if your CI takes longer than a minute, the submitter is going to context switch away; and without an async notification of failure they need to poll.
Now honestly, I've been trashing the Prow emails for a long time and indeed I rely on polling; I periodically check on my PRs across various repos. But, it's a bit awkward.
The GitHub Actions email is harmless by comparison, since it doesn't conflate humans and bots into the same thread (email or web).
Prow feels like it's designed for a workflow where only the bot is thinking about the overall state of the repo, and humans are laser-focused on individual PRs. That could make sense for projects at a certain scale, but to me it feels intrusive for smaller projects. The lack of control of merge timing [1] and the assumption that automated tests have sufficient coverage [2] both seem like sources of friction.
[1] Both its tendency to encourage premature merges by accepting the first /lgtm
, and its habit of waiting 30+ minutes when it does decide to merge
[2] It assumes that rebases are safe as long as tests pass, and that rebases don't discard information (which is only true if manual testing is not used).
(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)
Both its tendency to encourage premature merges by accepting the first /lgtm,
Hmm, I think /lgtm
is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like ":+1: LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.
and its habit of waiting 30+ minutes when it does decide to merge
I usually only see that kind of latency when it's doing retesting (for the reason of semantic PR conflicts), and on repos which have nontrivial CI (which is definitely true for many of the Kubernetes and OpenShift repos, as well as ours - we're using Prow for "heavy lifting" jobs).
(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)
Yeah...I think I agree that for probably all of our repos it's a high cost relative to the value. The standard other mitigation for this is to have "post merge periodics" that build and test git master - we definitely do that too. Periodics/post merge are a good place to do more expensive CI too.
Anyways, https://github.com/openshift/release/pull/16706 merged and you can see the effect here: No Prow comments in https://github.com/coreos/rpm-ostree/pull/2655 (edit: I mean other than it replying to me around the use of /override
- but it specifically didn't add a comment about approvals or the tests failing)
What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.
OK on a non-Prow topic: The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.
Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff. That's OK I guess, but we should be sure that we have required status checks and branch protection on I think.
Hmm, I think
/lgtm
is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like "+1 LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.
Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default. That makes it easier for the submitter to make revisions based on discussion, or to wait for more review.
The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.
I'm not thrilled with it either. The obvious case is safe (jobs don't get write access if they're running on PRs submitted from a fork) but that only makes the other cases more subtle.
Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff.
Yup, that's right.
[1] When the submitter has ongoing responsibility for the repo, which usually means when they have write access.
Does Prow get us any CI on s390x?
Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default.
If I don't want a PR to merge I usually make it draft, personally.
If I don't want a PR to merge I usually make it draft, personally.
That has different semantics, though. I read "draft" as "incomplete" and "non-draft" as "is believed ready to merge as-is". But the latter might change due to events, and it's not great to have to leap for the "convert to draft" button.
Does Prow get us any CI on s390x?
The answer appears to be "not today", but there is a whole Multi-Arch CI thing internally that is doing related things, and that work might lead into supporting this for us.
What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.
I've enabled required PR reviews in the cosa master branch and marked continuous-integration/jenkins/pr-merge
as required.
OK there's a new pain point. CI for rpm-ostree is currently failing here https://jenkins-coreos-ci.apps.ocp.ci.centos.org/blue/organizations/jenkins/github-ci%2Fcoreos%2Frpm-ostree/detail/PR-2694/3/artifacts due to:
[2021-03-24T23:49:50.814Z] [Warning][coreos-ci/pod-cc34d3db-a641-4ee4-88be-7a87e1536d09-9rdh2-4gbmc][Failed] Failed to pull image "registry.svc.ci.openshift.org/coreos/cosa-buildroot:latest": rpc error: code = Unknown desc = error pinging docker registry registry.svc.ci.openshift.org: invalid status code from registry 503 (Service Unavailable)
Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci. But then later, the DPTP team is shutting off registry.svc in favor of registry.ci which is more rigorous about what goes into it - it also requires authentication.
I think what we should be doing is this: https://docs.ci.openshift.org/docs/how-tos/mirroring-to-quay/
But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.
Longer term though at least for cosa I think it would make a lot of sense to ensure that the rhcos-4.X
git branches are built and tested in Prow.
This topic also relates to https://github.com/coreos/fedora-coreos-config/pull/740 and of course the general question of where our CI builds run. Following that perhaps it should be quay.io/coreos/fedora-coreos-buildroot:stable
or so.
Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci.
Hmm, you should have access to the image. I've also invited you to the org itself. (I'm not entirely sure what the story is there wrt why it's in a separate namespace vs just coreos/. We should probably eventually fold it back into coreos/ but... we'd probably want to keep mirroring at the old location for a while.)
But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.
No strong opinion either way. Maintenance-wise, the simplest would be to just have Quay.io build the buildroot image to keep it consistent with how the main image is built. (Maybe just quay.io/coreos-assembler/buildroot given the namespace). But if you'd like, we can exercise the new mirroring path and set up whatever secrets needed for that.
quay.io/coreos-assembler/buildroot
The thing is though that cosa is (somewhat) of a "cross" tool, whereas the buildroot is Fedora right now and we can't escape that. So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable
where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next
that would be 34 for example?
So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next that would be 34 for example?
Hmm, I agree that's where we want to go. But until we do https://github.com/coreos/fedora-coreos-config/pull/740, it's a bit of a lie since it's really tracking the cosa $releasever
. So maybe for now, just quay.io/coreos-assembler/fcos-buildroot:latest
, but once we do https://github.com/coreos/fedora-coreos-config/pull/740, we set up builds for each branch as you've described? (I think in practice, we don't really need one for each stream since stable and testing are just two weeks apart. I think just testing
and next
would be sufficient; or maybe that should be testing-devel
and next-devel
to match the branch names we'd want the image to track.)
Meh, if we're agreed on https://github.com/coreos/fedora-coreos-config/pull/740, I can just brush that up right now and we get it in and set up those tags.
Meh, if we're agreed on coreos/fedora-coreos-config#740, I can just brush that up right now and we get it in and set up those tags.
SGTM!
That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.
Longer term though at least for cosa I think it would make a lot of sense to ensure that the
rhcos-4.X
git branches are built and tested in Prow.
Yeah, CI for those branches is in a sad state right now. CoreOS CI wants to build Fedora and of course we don't have access to RHEL packages like api.ci does. (Though worth noting that it's slightly less broken now that CoreOS CI knows how to build images properly, which means that it should use the right Fedora base image at least.)
So this is definitely a good argument for building and testing in Prow.
My only hesitation would be whether we can make the registry.ci locations pullable publicly (i.e. no auth). I personally don't like the idea of introducing mirroring into the mix just to work around that because it adds lag and complexity.
Hmm, but actually, we could just use Prow for testing RHCOS builds on PRs targeting those branches, but still have pushed commits build on Quay.io like today, right?
That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.
Do we need a promotion model for cosa other than the PR workflow? If a change breaks, we revert in git and rebuild. It seems like that model has worked pretty well so far.
Perhaps one debate to have is creating github.com/openshift/coreos-assembler - i.e. we drop the rhcos-$x branches in github.com/coreos/coreos-assembler.
OK, we now have https://quay.io/repository/coreos-assembler/fcos-buildroot?tab=builds! We'll need to update repos which used the buildroot image to start building with that image instead. I opened https://github.com/coreos/coreos-ci-lib/pull/66 to make this easier.
OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md
OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md
SGTM!
OK since we got rough consensus here, closing. We can open new issues or even PRs to the doc for proposed plans.
I followed the link in https://github.com/coreos/fedora-coreos-tracker/issues/764#issue-828291335 and noticed that things have now changed regarding read-only tokens for GitHub Actions.
GitHub has now implemented more granular permissions. See https://github.community/t/read-only-token-for-ci/17295/8 https://docs.github.com/en/actions/reference/authentication-in-a-workflow#permissions-for-the-github_token
Hi, thanks for following up. Indeed, we (particularly bgilbert) have been gradually enabling those restrictions, e.g. https://github.com/coreos/stream-metadata-go/pull/28 etc.
CoreOS CI 2021
Nothing in this is a firm commitment. Including "2021" in the title is intended to imply that e.g. we may change small or large aspects of this in 2022 (or earlier). Individual repository owners may choose to do different things. However, having standardized and well-maintained centralized CI flows is a huge benefit to our team.
This issue is moved from https://github.com/coreos/fedora-coreos-tracker/issues/263
Proposal: Use a combination of CoreOS CI Jenkins, OpenShift Prow and per-repository Github actions.
CoreOS CI Jenkins (CCI)
It is what we use on various repositories, and is how FCOS is released today. We have a lot of institutional knowledge around this and it gives us a place where we can easily control the end-to-end interactions. Jenkins is a well understood tool.
This is deployed in CentOS CI which is a bare metal OpenShift cluster where nested virt is enabled.
Advantages:
Disadvantages:
Current and proposed use cases:
OpenShift Prow
Prow is heavily oriented towards testing OpenShift container components. However, as of recently we enabled nested virt on the
build02
GCP cluster, which means we can create "container native" flows that still test the OS with coreos-assembler.For Fedora CoreOS, we are independent of OpenShift release cycles. For RHEL CoreOS, we are tightly tied to them. It is really a requirement for openshift/os to increasingly tightly integrate with Prow. Specifically for openshift/os for example we want to follow the same
release-4.X
git branches as the platform.We also use it as a "merge bot" on some repositories with
/approve
and/lgtm
.Advantages of Prow:
/test azure
and/test aws
to e.g. IgnitionDisadvantages of Prow:
/hold
)Current and proposed use cases:
GitHub Actions
Free for small scale, nice to use. This is a good option for per-repository specific things that don't need centralization.
Advantages:
Disadvantages:
Example use cases:
Other CI systems
Ideally, we focus on these 3 and sharing as much as possible between them. The more CI systems we have, the more overhead there is for engineers and particularly new contributors to understand.
On this topic, this proposal specifically calls for dropping Travis usage.
CI types
Our different repositories
Each type has different requirements and tolerances.