dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Creating a separate pipeline to stage onboarding new test targets #44031

Closed jkoritzinsky closed 3 years ago

jkoritzinsky commented 3 years ago

Our repo has been working diligently to increase test coverage for all of our various platforms, but very often when we first onboard a new test target (e.g. WASM on Browser tests, Android runtime tests, etc.) we experience flakiness over the first few days due to issues in the underlying infrastructure that only pop up at the scale we test (https://github.com/dotnet/runtime/issues/43983 is a recent example).

In CI Council, we discussed possibly creating a separate "runtime-staging" pipeline of sorts where we initially include the new testing legs. Once the new legs stabilize, we'd move them into the regular "runtime" pipeline. This would enable us to continue to add new platforms without the spurious failures that only appear at scale affecting the main PR/CI pipelines and tanking our pass rate.

cc: @dotnet/runtime-infrastructure @ViktorHofer

ghost commented 3 years ago

Tagging subscribers to this area: @ViktorHofer See info in area-owners.md if you want to be subscribed.

ViktorHofer commented 3 years ago

We would still run the separate pipeline per PR to gather the high amount of data but mark it as non failing so that it doesn't impact the PR's result and the main pipeline's pass rate.
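
To make this concrete, here is a minimal sketch of what such a non-failing staging pipeline could look like in Azure Pipelines YAML. The file name, job, and steps below are hypothetical, not the repo's actual definitions; the relevant pieces are the PR trigger and continueOnError, which turns a failing job into a partially succeeded build instead of a failed one:

```yaml
# Hypothetical eng/pipelines/runtime-staging.yml -- names and steps are
# placeholders, not the repo's actual definitions.
trigger: none        # no rolling/CI trigger while the new legs are being proven

pr:
  branches:
    include:
    - main

jobs:
- job: Android_arm64_runtime_tests       # example of a freshly onboarded leg
  displayName: Android arm64 runtime tests (staging)
  continueOnError: true                  # report results, but never fail the PR
  pool:
    vmImage: ubuntu-latest
  steps:
  - script: ./build.sh --ci              # placeholder for the real build/test steps
    displayName: Build and run tests
```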

trylek commented 3 years ago

If we're about to run the envisioned "runtime-staging" pipeline per PR, then it should be "a delta to the existing runtime pipeline"; if it were "a staging version of the runtime pipeline that gets copied to the main one once in a while", we'd be duplicating most runs in each PR, which is probably beyond our lab budget.

ViktorHofer commented 3 years ago

Yes, the idea is a delta: just the platforms/devices that are fresh and likely to be flaky (e.g. the recent addition of Android runtime tests would have benefited from this).

directhex commented 3 years ago

I hope we can leverage this to help with CI of community-supported architectures - e.g. @nealef has been working on IBM zSystem support, and it'd be great to get builds going in our upstream CI to help spot issues early.

ViktorHofer commented 3 years ago

Makes sense for this new pipeline. Depending on hardware budget we could run certain jobs only for rolling builds or even scheduled builds.

nealef commented 3 years ago

Would https://github.com/microsoft/azure-pipelines-agent.git allow us to plug the existing Linux on Z virtual machines into this infrastructure? Those are currently Jenkins workers, but I can install whatever you like on them, and they already have Docker up and running if you need to do work in containers.

ViktorHofer commented 3 years ago

cc @mthalman @MichaelSimons @MattGal for the question above.

akoeplinger commented 3 years ago

For the zSystem support I talked with @nealef and we'll setup a branch in dotnet/runtimelab to work on that. We can already cross-build thanks to https://github.com/dotnet/arcade/pull/5863 and https://github.com/dotnet/dotnet-buildtools-prereqs-docker/pull/351 so that should cover the build aspect.

For running tests we'd need a Helix queue but we might be able to use the s390x Docker images running via QEMU if we can't plug in a physical system.

directhex commented 3 years ago

IBM has previously offered a VM in its s390x cloud for CI, but someone with sufficient authority needs to sign a liability waiver (and someone needs to configure it to actually work).


MattGal commented 3 years ago

For the zSystem support I talked with @nealef and we'll setup a branch in dotnet/runtimelab to work on that. We can already cross-build thanks to dotnet/arcade#5863 and dotnet/dotnet-buildtools-prereqs-docker#351 so that should cover the build aspect.

For running tests we'd need a Helix queue but we might be able to use the s390x Docker images running via QEMU if we can't plug in a physical system.

Linux on Z is new to me; we'd have to try it out but from reading it seems like we could just use Helix Docker support for testing. It's an easy experiment to try when we're ready to do it.

safern commented 3 years ago

I want to start a discussion here for the next proposal after talking to @sunandabalu and @markwilkie and gathering some data points:

Make this build definition a two-purpose build definition:

  1. Run innerloop libraries/runtime tests on new test targets when onboarding a new platform, runtime, etc.
    • These new test targets will be added to this pipeline until we get a 1-week period with no flaky failures unrelated to the PR changes. Once it is in a stable state (after a 1-week period with a reasonable pass rate, > 90% maybe?) it can be moved to the main PR pipeline.
    • The crew responsible for the new target should keep an eye on test failures or infrastructure issues and fix them by looking at the history of runs on PRs and CI.
  2. Run a set of tests that are identified as flaky in a base set of configurations.
    • Tests will be marked as "Category=staging" or "Category=quarantine", and then we will only run these tests as part of this pipeline on the "stable" set of platforms (see the sketch after this list).
    • Once a test hits a 2-week period of stability, we can move it back to the innerloop tests.
    • Should the owner of the test be the area owner, or the person who added the test?
    • Should there be a deadline to move the test from staging back into innerloop? If the test is not fixed, should we delete it after a period of time? How do we make sure test coverage is not regressed?
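
As a rough illustration of the staging-test bullet above, a quarantined-test job in this pipeline might select tests by their xunit trait. This is only a sketch: dotnet/runtime's real test runs go through the Arcade/Helix infrastructure rather than a plain dotnet test invocation, and the job name, pool, and filter below are placeholders.

```yaml
- job: Quarantined_tests_windows_x64
  displayName: Quarantined tests (staging, windows x64)
  continueOnError: true                  # results are informational only
  pool:
    vmImage: windows-latest
  steps:
  # Tests marked with [Trait("Category", "staging")] can be selected via a
  # dotnet test filter; the Helix-based runs would need an equivalent filter.
  - script: dotnet test --filter "Category=staging"
    displayName: Run only staging/quarantined tests
```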

Open Questions

These are questions that have come up from talking to people about this.

Should we just make the outerloop libraries build be this pipeline, running it as part of PRs without failing the build and just making it partially succeed?

I gathered some numbers on the amount of machine time we spend to run a Windows x64 test payload using CoreCLR for innerloop tests vs. outerloop tests. The result was:

"InnerloopTime": 00:51:30.2820000, "OuterloopTime": 01:32:56.3050000, "Diff": 00:41:26.0230000

Based on this I would be very cautious about doing that, as adding ~1:32 of machine time to each Helix queue per PR would be very bad; I don't think we have enough capacity to manage that.

How are we going to ensure the build is relevant to people in their PRs?

The build is always going to be green, which means someone can break a staging test or an innerloop test on these new platforms. How do we make sure the build is not constantly broken, and that test failures don't accumulate to the point where the data is too noisy for this to be useful?

Instead of adding 1 pipeline, should we add 2?

Because of the concern from the previous question, there was an idea of adding an onboarding pipeline and a staging pipeline: the onboarding pipeline's purpose is to host the new, not-yet-stable platforms, while the staging pipeline just runs staging tests.

cc: @danmosemsft @dotnet/runtime-infrastructure @jkotas @stephentoub for opinions?

ViktorHofer commented 3 years ago

Based on this I would be very cautious about doing that, as adding ~1:32 of machine time to each Helix queue per PR would be very bad; I don't think we have enough capacity to manage that.

In my opinion, reliability should be the metric that defines in which pipeline a test/leg runs, and cost should be the metric that defines how often a test/leg runs. What's the benefit of introducing yet another pipeline?

safern commented 3 years ago

What's the benefit of introducing yet another pipeline?

In my opinion, it would be about being intentional regarding what tests run on that pipeline. Outerloop tests are not tests that are flaky or in quarantine trying to make their way back to the main pipeline; outerloop tests are tests that take a really long time, modify machine state, require admin elevation, etc. I think the Outerloop pipeline has a very different purpose than this one. For example, that one is used by the networking team because there are tests that will never be in the main pipeline because of network reliability. So how would you separate new platforms, tests that are staged, and tests that will never be reliable to run on CI or that take a really long time?

sunandabalu commented 3 years ago

To echo what @safern has said above, each of the pipelines has a different purpose, lifespan, and outcome for the tests running in it. It also helps in setting clear owners and expectations around test runs. Plus, it's easier to look at test histories in each of the pipelines and spot trends, etc.

For example:

danmoseley commented 3 years ago

I'm a little unclear whether what's being proposed is a solution to bringing up new platforms (which makes sense) or some generalized staging system for flaky or unproven tests. I am not sure of the value of the latter, at least for libraries. Our flakiness seems evenly distributed among new and old tests, and among test/product bugs vs. environmental problems (OS, network). The latter can only be addressed by some of the resilience ideas we've discussed before -- e.g. reruns on other machines, or statistical measures.

steveisok commented 3 years ago

I think the onboarding approach is a good idea and can be helpful both for new platforms and any that may significantly misbehave after the fact. I can see some benefits to the flaky pipeline, but maybe as Dan said, built in resilience solves most of the problems.

sunandabalu commented 3 years ago

Agreed that the right long-term solution to handle flaky/transient tests is to have built-in resiliency mechanisms like auto-retry. Moving them out / Sequestering them to a separate pipeline is an intermediate band-aid to get the main PR builds green and have a path to be able to not merge on red (which is happening 50% of the time now).

jkotas commented 3 years ago

Moving them out / Sequestering them to a separate pipeline is an intermediate band-aid

We have an existing interim solution, which is to disable the flaky test and open an issue on it. Why is that not good enough as an interim solution?

I think it would be better to spend energy on creating the long-term solution.

safern commented 3 years ago

Now that these points are brought up, I agree with them. Moving a test to "Category=staging" would require a code change, so opening an issue and using the ActiveIssueAttribute should be sufficient.

I think we should just focus on the new-platform onboarding pipeline and simply disable flaky tests, or add retry to the ones that can't be fixed to be more reliable. (I believe there are some crypto ones that hit an OS bug that repros about once in a million runs.)

jkotas commented 3 years ago

Also, some of the flakiness we are seeing is in the lower layers of the stack and can affect pretty much any test (examples #44114, #43571). There is no good way to sequester this type of flakiness into a separate pipeline.

ViktorHofer commented 3 years ago

So how would you separate new platforms, tests that are staged, and tests that will never be reliable to run on CI or that take a really long time?

The same way we separate platforms that only run per rolling build from platforms that run per PR build. We have that mechanism already in place. It's possible to only run a set of tests on the same pipeline and filter on the type of build: rolling, PR, scheduled, triggered by comment.
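
For reference, a hedged sketch of that mechanism: a job inside a single pipeline can be gated on the predefined Build.Reason variable (PullRequest, IndividualCI/BatchedCI for rolling builds, Schedule, Manual), so new or expensive legs only run for rolling or scheduled builds. The job and script names here are made up.

```yaml
- job: New_platform_tests_rolling_only
  displayName: New platform tests (rolling and scheduled builds only)
  condition: in(variables['Build.Reason'], 'IndividualCI', 'BatchedCI', 'Schedule')
  pool:
    vmImage: ubuntu-latest
  steps:
  - script: ./build.sh --ci
    displayName: Build and test
```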

Not saying we should do that but we should definitely consider the pros and cons of adding a new pipeline vs. reusing an existing one.

safern commented 3 years ago

Not saying we should do that but we should definitely consider the pros and cons of adding a new pipeline vs. reusing an existing one.

But if we do that, we're moving away from the purpose of this issue, which is onboarding new platforms. Running outerloop tests in PRs without failing the build provides no value to that scenario whatsoever; in fact, it just pollutes the build with data: if there are a bunch of outerloop test failures, investigating what is needed to get the new platform stable will be harder.

I think outerloop should stay where it is. If we want to move it to run as rolling, that's a separate discussion and not part of this issue.

ViktorHofer commented 3 years ago

Running outerloop tests in PRs without failing the build provides no value to that scenario whatsoever; in fact, it just pollutes the build with data: if there are a bunch of outerloop test failures, investigating what is needed to get the new platform stable will be harder.

I never said we should run tests that are currently marked as Outerloop per PR.

safern commented 3 years ago

I never said we should run tests that are currently marked as Outerloop per PR.

Oh, that is what I was understanding from previous offline discussions. Then it seems like this issue has the info I need for new platform onboarding.

ViktorHofer commented 3 years ago

Outerloop tests should continue to run as-is. My point is that we should consider using the outerloop pipeline for the platform bring-up as well. Again, I'm not saying we should do that but I would like to discuss the pros and cons of it.

From the point of view of a developer who interacts with dotnet/runtime, it's already confusing that some PR legs link to entirely different build pipelines. In addition to that, I don't think it's clear what the difference of runtime vs runtime-live-build is. But that's a general statement, not relevant for this discussion.

On a broader topic, is there documentation available with best practices on when to set up a new build pipeline vs. using an existing one?

safern commented 3 years ago

My point is that we should consider using the outerloop pipeline for the platform bring-up as well. Again, I'm not saying we should do that but I would like to discuss the pros and cons of it.

Pros of a new pipeline:

Cons:

I don't think it's clear what the difference of runtime vs runtime-live-build is

That I agree with... I think we should rename runtime-live-build to runtime-dev-workflow.

On a broader topic, is there documentation available with best practices on when to set up a new build pipeline vs. using an existing one?

No, there is not. But I think each pipeline should have its own purpose. So adding a new pipeline vs. using an existing one should be based on: what is the purpose of the new pipeline, and can we achieve it with an existing pipeline? If reusing an existing pipeline would defeat that purpose, then add a new pipeline. The less special a pipeline is and the fewer things it does, the better, in my opinion.

Also, another open question I have for this discussion: for new platforms, should we only ignore test failures when computing the build outcome? That is, should the build still fail in PRs if there is a build failure? I would say yes, as we should give the feature crew at least some protection for building the vertical on that new platform.
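
One possible way to express that split, sketched in Azure Pipelines YAML with placeholder scripts (not the repo's actual entry points): only the test step is marked continueOnError, so a build break still fails the PR while test failures merely produce a partially succeeded result.

```yaml
- job: New_platform_onboarding
  displayName: New platform (onboarding)
  pool:
    vmImage: ubuntu-latest
  steps:
  - script: ./build.sh --ci
    displayName: Build (a failure here fails the PR)
  - script: ./build.sh --test --ci
    displayName: Run tests (failures only warn)
    continueOnError: true
```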

markwilkie commented 3 years ago

We have an existing interim solution, which is to disable the flaky test and open an issue on it. Why is that not good enough as an interim solution?

The thing I would ask is that we're able to query (or somehow know) which tests have been deemed "flaky", which will verify assumptions and retry logic for the resiliency work. This is because there are two types of flakiness: specific tests which fail intermittently, and random tests which fail intermittently. I presume only the former would be annotated as "flaky"?

The immediate goal here is to get the noise level down by taking care of the stuff we know of. (e.g. new platform bring up, etc) Once that's complete, we'll have much better optics into the nature of the flakiness such that we can start to implement mechanisms to be more resilient.

safern commented 3 years ago

Based on the conversation above, we're ready to at least create the new staging pipeline and use it for new test platforms until they stabilize. We have a good reason now: we will be moving the Android test runs that use an emulator. Plus, this will be used for new targets we plan to add to our CI.

Interpreter legs for iOS and Android, and also tvOS, will come soon.

Working on this so that we can get something working next week.