dotnet / runtime


[User Story] CI Health: Redefining CI investigations and Health #75243

Open hoyosjs opened 2 years ago

hoyosjs commented 2 years ago

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is a system where 80%+ of all PRs are merged with a green "Build Analysis" check and all issues understood, lowering the time wasted on repetitive investigation of known issues.

The work streams are roughly:

Future Work

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/runtime-infrastructure

See info in area-owners.md if you want to be subscribed.

Issue Details
# [User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is a system where 80%+ of all PRs are merged with a green "Build Analysis" check and all issues understood, lowering the time wasted on repetitive investigation of known issues.

The work streams are roughly:

- Issues can be easily searched for throughout the different components of a PR:
  - [x] Build issue search within AzDO has been deployed.
  - [ ] Helix test log searching. Rolled out, and the tab identifies issues, but issue counts are not accurate yet and don't update properly.
- It's easy to report issues directly from the `Build Analysis` check tab:
  - [x] Build issues are easy to report as infrastructure issues, with retry capability for issues like AzDO feeds.
  - [x] Test issues are easy to report from the failed build. The report includes all relevant information, and all the end user has to do is provide identifiable information for automation to find the correct issue.
  - [ ] Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing a source (i.e. a PR backlink) and an accurate count of hits on a sliding window.
- Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:
  - [ ] https://github.com/dotnet/runtime/pull/74615 largely achieved this work, but it needs to be updated for the issue-opening workflow that got enabled, as well as some of the timing expectations for the system.
- Tests should have failures logged in a format that:
  - [ ] The legacy system (xUnit) is not properly surfacing asserts. Ensure that StdErr for the child process is properly redirected as much as possible.
  - [ ] Ensure the new source-generated testing framework allows for proper attribution at the test level. This includes an analysis of catastrophe-style issues that are now reported as workitem failures. @davidwrighton was taking a cursory look at this.
  - [ ] Ensure timeouts and hang dumps are properly handled in the new testing system and surfaced in a way build analysis can surface them.
- Redefine merge on red: Make build analysis the definition for merge on red
  - [ ] Turning 'Build Analysis' into a required check requires:
    - Reporting an issue should rerun the check against it to move it to the known column.
    - Correlating an issue manually is possible (even if undesirable) to unblock merging.
    - Re-running a check is necessary to some extent; otherwise PRs need to wait 1+ hours for DWV to rerun.
  - [ ] Define a metric that measures how successful this new definition is at helping people quickly distinguish errors caused by their PRs from known issues.
  - [ ] Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge.
  - [ ] Define a mechanism to study what failures need hardening and what issues should be invested in. The dashboard could surface

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

Author: hoyosjs
Assignees: -
Labels: `area-Infrastructure`, `User Story`
Milestone: 8.0.0
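
The "accurate count of hits on a sliding window" item in the issue body can be illustrated with a small sketch. This is not how Build Analysis tracks occurrences; the class name `SlidingWindowHits`, its methods, and the `pr-a`/`pr-b`/`pr-c` source labels are all hypothetical, and the sketch assumes hits are recorded in chronological order and that queries use a non-decreasing `now`.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindowHits:
    """Hypothetical per-issue hit tracker; assumes hits arrive in time order."""

    def __init__(self, window: timedelta):
        self.window = window
        self._hits = deque()  # (timestamp, source) pairs, oldest first

    def record(self, when: datetime, source: str) -> None:
        # 'source' stands in for the PR backlink the issue asks for.
        self._hits.append((when, source))

    def _evict(self, now: datetime) -> None:
        # Drop hits that fell out of the window ending at 'now'.
        cutoff = now - self.window
        while self._hits and self._hits[0][0] < cutoff:
            self._hits.popleft()

    def count(self, now: datetime) -> int:
        self._evict(now)
        return len(self._hits)

    def sources(self, now: datetime) -> list:
        self._evict(now)
        return sorted({source for _, source in self._hits})


now = datetime(2022, 9, 30)
hits = SlidingWindowHits(timedelta(days=7))
hits.record(now - timedelta(days=10), "pr-a")  # outside the 7-day window
hits.record(now - timedelta(days=3), "pr-b")
hits.record(now - timedelta(days=1), "pr-c")
print(hits.count(now))    # 2
print(hits.sources(now))  # ['pr-b', 'pr-c']
```

Keeping the source next to each timestamp means the count and the list of backlinks always describe the same window.
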

JulieLeeMSFT commented 2 years ago

cc @jeffschwMSFT @mangod9.

danmoseley commented 2 years ago

> Ensure timeouts and hang dumps are properly handled in the new testing system,

Does this include making sure hangs lead to dumps? Or do we believe that's already the case (I wasn't aware)? I do agree this would really help with a category of test failures that aren't currently actionable.

hoyosjs commented 2 years ago

@danmoseley This happens for coreclr tests. Libraries tests have no provision for this, other than dotnet-test based runs and I think there were reasons not to move to that?
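
As a generic illustration of the timeout and StdErr handling discussed here, the following sketch runs a test as a child process, captures its stderr, and reports a hang when the timeout expires. It is not what the coreclr or libraries test harnesses do, and it does not collect a dump; `run_with_timeout` and `RunResult` are hypothetical names, and a real runner would trigger dump collection before killing the hung process.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RunResult:
    exit_code: Optional[int]  # None when the child was killed on timeout
    stderr: str
    timed_out: bool


def _as_text(data) -> str:
    # TimeoutExpired may carry bytes or str depending on the Python version.
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> RunResult:
    """Hypothetical helper: run cmd, capture stderr, flag hangs."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising; a real harness
        # would hook in here (or earlier) to capture a hang dump.
        return RunResult(None, _as_text(exc.stderr), True)
    return RunResult(done.returncode, done.stderr, False)


failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
print(run_with_timeout(failing, timeout_s=30))
```

Surfacing `timed_out` separately from the exit code lets a reporting layer distinguish a hang from an ordinary failure.
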

danmoseley commented 2 years ago

Ah. Yes, for moving to dotnet-test I think we discussed that we need some more lightweight runner due to being bottom of the stack. @ViktorHofer do we have anything like that on the backlog still?

ViktorHofer commented 2 years ago

> @ViktorHofer do we have anything like that on the backlog still?

I filed https://github.com/microsoft/vstest/issues/3595 for that a while ago. We basically need a way to run our tests in-proc with a minimal set of dependencies.