[PR Workflow] Automate blocking PRs from being merged (EventHub)

konrad-jamrozik commented 1 year ago

Per email from @rkmanda, the following needs to be automated (for details and additional criteria, see email thread RE: About Spec PR review requirements from 6/2/2023, as well as RE: Brainstorm ways to automate the responsibilities of the PR assignee):

Are all checks other than the ARM specific ones already codified into the required checks today? Heres the list of the merge criteria we had documented earlier

Common [between control and data plane] a. All required checks in the pipeline passed b. If BreakingChangesRequired label exists check existence of the BreakingChangesApproved label. c. If Individual language breaking change label exists, check existence of the appropriate corresponding approval labels.
Control plane a. Existence of ARMSignedOff b. Existence of ArcSignedOff if its applicable
Suppression review a. Not being performed by anyone today b. If SuppressionReviewRequired label exists, check existence of the Approved-Suppression label
PR assignee confirms with the author when it is ok to merge a. In remote scenarios - Force merge when there are pipeline issues b. NOTE: DoNotMerge label exists but sometimes authors forget.

Possible approaches

One way of doing it is to do all these checks in the unified pipeline Summary task and make it fail if they fail. Summary task passing would mean PR is ready to be merged.

Alternatively, instead of failing the Summary task, we could introduce a new task, named like VerifyLabelsAllowMerging, that must pass. This way we will avoid bolting additional responsibilities on top of the Summary task.

Update 6/14/2023: suppressions

The SuppressionReviewRequired label should be blocking (unless appropriate "suppression approved" label is present (if applicable)). Per:

https://github.com/Azure/azure-rest-api-specs/pull/24227#issuecomment-1570639108

Update 7/5/2023: ShiftLeft approval

Approved-OKToMerge label needs to be present for ShiftLeft scenarios.

For full details, see this Teams discussion.

It also participates in the workflow of merge queues in private repo.
See this Teams discussion and this PR:

https://github.com/Azure/azure-rest-api-specs/pull/25231

Update 7/18/2023: rules for data-plane

Per this comment:

https://github.com/Azure/azure-sdk-tools/issues/6184#issue-1712414883

Pull Requests

Deployed 7/6/2023: Pull Request 481115: Refactor + improve logs in PipelineEventListener.ts; remove dead code from pipelineChecks.ts.
Pull Request 481115: Refactor + improve logs in PipelineEventListener.ts; remove dead code from pipelineChecks.ts.
Pull Request 482899: Improve diagnostic logs around sending events to EventHubs + rename refactor
Pull Request 483276: Add experimental stub support for check AllMergingRequirementsFulfilled + major refactoring and logging update of renderUnifiedPipelineChecks and related code
Pull Request 484831: Apply few bugfixes to the experimental check
Pull Request 484966: AllMergingRequirementsMet experimental check now checks for appropriate required checks and labels states
Pull Request 485263: De-anonymize "All merging requirements met" check; fix bug in evaluating missing required labels; improve the check description
Pull Request 485529: Improve "all merging requirements met" check description with help guidance
Pull Request 485873: Improve data-plane rules for "all merging requirements met" check
plus few more - see openapi-alps repo.

Follow-up work

https://github.com/Azure/azure-sdk-tools/issues/6613

konrad-jamrozik commented 1 year ago

@weshaggard I think this task boils down to the following subtasks:

Review which of the existing CI checks should actually be required and set them as appropriate. Setting the status checks on GitHub side is explained in step 7 here and also here. The ADO doc about integrating with GH status checks refers to the same doc, as can be seen in ADO docs on that here.
- I think the ADO pipeline tasks themselves have simple return code of success/failure, as explained here.
- Full lists of checks running is defined in openapi-alps .azure-pipelines.yml. There might be some complexity involved in post-processing them by our bots, but I grokked the code and I think I know where is what.
- Specifically, the pipeline-bot PipelineEventListener may be doing some relevant post-processing, but again, I think I can find my way around.
Add another task to the openapi-alps .azure-pipelines.yml, to run at the end, like VerifyLabelsAllowMerging, to capture all the rules for labels on the PR. If any of them is violated, the check will fail.

weshaggard commented 1 year ago

For #1 once the checks are ready I can update the required checks in the branch protection. I don't want to make them required until we are confident, they work as expected and can be trusted.

For your VerifyLablesAllowMerging that one likely cannot be a task in the pipeline as we need to update it based on our other bots as well. I suspect this should be a github status check instead of the standard Devops check suite/run so that we can mutate it from all our bots and pipelines as the process proceeds.

For the name instead of "VerifyLabelsAllowMerging" I would suggest "All Reviews Complete" or something along those lines. We will need to have our pipelines and bots also update the text for that status check to indicate what is currently left to be reviewed, we might also want to consider adding that to our status comment.

konrad-jamrozik commented 1 year ago

@weshaggard re this:

For https://github.com/Azure/azure-sdk-tools/pull/1 once the checks are ready I can update the required checks in the branch protection. I don't want to make them required until we are confident, they work as expected and can be trusted.

I think we could capture what people are already doing. For example, I know that checks like breaking change verification, production LintDiff or Avocado, in practice must pass, so they should be required. But some other checks appear to be bypassed most of the time. So I would say we could start by making an assessment what is required and what not, giving our partners an opportunity to review it, and setting the checks based on that.

I could compile a list of all checks and reach out to our partners.

Re this:

For your VerifyLablesAllowMerging that one likely cannot be a task in the pipeline as we need to update it based on our other bots as well. I suspect this should be a github status check instead of the standard Devops check suite/run so that we can mutate it from all our bots and pipelines as the process proceeds.

I realized that we probably want for this to be the Summary task after all. This is because the Summary task processes the labels, including making any label corrections if labels have been manually tampered with. Hence if we would have a separate GitHub check of kind status, it would, in practice, have to do very similar work to the Summary task. Otherwise someone could tamper with the labels after Summary task ran, and thus bypass the labelling requirements.

konrad-jamrozik commented 1 year ago

I had a chat with @weshaggard. Here are our conclusions.

We will ensure the PR is ready to be merged by adding a GitHub status check of type status, like All Merging Requirements Fulfilled.
That status check will be set by the pipeline-bot by the pullRequestLabeled event handler, which triggers when the PR is labelled or unlabelled. It will principally check for two things:
1. The Summary task check needs to be green, denoting the CI check suite ran successfully.
2. All the labels are in consistent state. This will be achieved by reading all the labels on the PR using Octokit (Summary task is currently doing this too) and doing label state check.
The status check update will also trigger on "check suite completed" event to handle cases where a PR has no label. This happens for PRs that modify infrastructure or docs. In other words, for PRs that don't have the resource-manager or data-plane label i.e. PRs that do not touch specs. An example of such PR is #24280.

Specifically, the pipeline-bot is processing the labelling events via PipelineEventListener.processEvent. This calls renderUnifiedPipelineChecks, which ends up applying the check update via a call to context.appClient.checks.update.

Handling race conditions

Both pipeline-bot and workflow-bot process labelled/unlabelled events, and workflow-bot may add or remove labels. Hence it is possible that pipeline-bot will trigger, read the label states, and update the All Merging Requirements Fulfilled status check just before workflow-bot updates the label. As a result, the PR may end up being in inconsistent state. However, this should be transient. For example, someone may add ARMSignedOff label which will make the pipeline-bot denote the PR as ready to merge, and moments later workflow-bot may remove that label because, say, it was invalid. This should trigger unlabelling event, which should make pipeline-bot run again, and put the PR in invalid state again.

Work plan

I will work on this iteratively. The plan for first PR is to add a capability to the pipeline-bot to do the following on labelled or unlabelled event:

Read the Summary task check status.
Read all the labels from GitHub API (Octokit) and apply some most basic label validation logic, to be improved in future PR.
Update the AllMergingRequirementsFulfilled check based on the above. That check needs to be added.

In second PR, or possibly also in the first PR, I will also trigger the same logic upon "check suite run completed" event. Details of doing this need to be determined.

Future work

In later iterations on this work we might consider some way of preserving the state that was used to compute the labels, to protect against manual tampering with them. For example, currently the Summary task computes the label denoting breaking changes presence based on the output of the "Breaking Change" task, which runs in the same pipeline. This information is not readily available during "on labeled" event of pipeline-bot. It would have to be pulled e.g. from given ADO build run for given PR, assuming that build would persist it e.g. as an artifact drop.

konrad-jamrozik commented 1 year ago

A note on suppressing checks

CI checks can be suppressed by adding appropriate label. For example, if the Swagger BreakingChange check is failing, adding the Approved-BreakingChange will immediately change its state from failed to neutral. You can even observe the text:

The check status is neutral due to the check was suppressed by label Approved-BreakingChange.

My cursory investigation shows that the code emitting that state change lives in pipeline-bot, in renderUnifiedPipelineChecks, here.

The production environment config file that defines which label can suppress which check appears to be config/production/Azure/azure-rest-api-specs.yml.

The function that specifically translates ADO task status to GitHub check status is checkStatusFromTask.

Related discussion

https://github.com/Azure/azure-sdk-tools/issues/6327#issuecomment-1604408542

maririos commented 1 year ago

FYI @praveenkuttappan for Release Planner

konrad-jamrozik commented 1 year ago

@mikekistler I am about to deploy the checks described in this work item. What checks should I add for data plane reviews?

I see labels like:

What labelling rules I should add? Any required checks that should pass specifically for data-plane? Anything else?

mikekistler commented 1 year ago

@konrad-jamrozik I think the rules we want are described in this issue:

https://github.com/Azure/azure-sdk-tools/issues/6184

konrad-jamrozik commented 1 year ago

@rkmanda @raych1 @tadelesh @msyyc in the private repo, I see labels like these:

But they do not align with the public repo. I.e. the private repo has Approved-BreakingChange-Go while public repo has Approved-SdkBreakingChange-Go (note the extra Sdk after Approved-). The config file for private repo has the same labels as the config for public repo. In fact, I cannot find Approved-BreakingChange-Go in the source code at all. Can you let me know how to interpret this? Perhaps the PRs in the private repo are getting mislabeled?

tadelesh commented 1 year ago

Seems mislabeled. @raych1 may know more about the background.

konrad-jamrozik commented 1 year ago

Progress update 7/20/2023

The experimental check is deployed in non-blocking state and has most of the rules codified in it. I announced it on the ARM swagger reviews channel.

Major big remaining work left to do

DONE Add proper support for data plane "review required" label - currently there is a todo (TODO:) in the code for that.
POSTPONED Make the check required, so it is blocking the PR.
DONE Add more diagnostic logs - currently there is a todo (kja) in the code for that.
ISSUE CREATED Instead of hardcoding the list of required checks (requiredChecksNames), read the list of required checks on the PR from GitHub API. Note that we should warn on mismatch between the these checks and the list of tasks defined in the config file, e.g. here. Notably, we should catch all cases where there is a required check without a corresponding task.
- https://github.com/Azure/azure-sdk-tools/issues/6613
PART OF #6613 Support different sets of requirements for different branches. This overlaps with the un-hardcoding the list of required checks. As of 7/24/2023 discussing this with @heaths.
DONE Discuss if the check should fail if the PR is not approved. If not, then it should mention the approval is still needed. Conclusion: it should pass even when there is no PR approval.
FUTURE WORK At some point integrate with @mikeharder's work on the specs - typespec - pr check (and the corresponding check for the -pr repo), which is going to replace CadlValidation and TypeSpec Validation checks. Note it is completely new setup, disconnected from the openapi-alps infrastructure.

Future work

Upgrade "swagger validation report" to "Requirements to merge this PR" with all the helpful guidance. Make the check description focus on the low-level, technical details. This is related to:
- https://github.com/Azure/azure-sdk-tools/issues/6286

benbp commented 1 year ago

@konrad-jamrozik the github checks api is a bit of a beast with lots of little edge cases. Let me know when you get to the "list required checks via api" step.

konrad-jamrozik commented 1 year ago

Closing as principally done. It is deployed to production and running. Primary follow-up work moved to:

https://github.com/Azure/azure-sdk-tools/issues/6613

Azure / azure-sdk-tools