dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

CoreCLR test suite optimization proposal: support for test project grouping #54512

Open trylek opened 3 years ago

trylek commented 3 years ago

Problem description

The current CoreCLR Pri1 test set has over 10K individual test projects. This is beyond the means of a single msbuild execution and is mitigated by partitioning the test projects into subgroups. Today at least three such partitionings exist (partitioning during test build, partitioning into XUnit wrappers, partitioning for Helix execution). While @echesakov did his best to make the Helix partitioning as good as possible, the entire logic adds enormous complexity to the test system, complicates developer ramp-up and is a constant cause of developer complaints. The 10K separate apps also mean 10K .NET Core runtime startups incurring enormous testing cost; it's not hard to imagine that the repeated .NET Core runtime initializations take an equal or greater amount of time than the actual test code execution.

Caveat - we don't yet have any hard data to substantiate this claim. I'm working on figuring out how to produce it in some form.

Ideal state

As I personally heard in presentations by @jaredpar and @stephentoub, perf optimization of Roslyn and libraries tests that took place several years ago involved the reduction of the number of separate test apps as a key step. I believe we should take the same route in CoreCLR testing; in bulk testing (local or lab Pri0 / Pri1 testing) we should run fewer than 1K test apps, ideally less than 500. Once that happens, we should be able to remove all the partitioning goo and just run the tests one by one, both locally and in Helix.

Downsides, challenges and problems to solve

Today, about 3/4 of the test suite corresponds to the JIT unit tests - a search in my runtime repo clone under src\tests\JIT for *.csproj/ilproj yields 7312 matches. If we're serious about this effort, we must tackle JIT tests first. According to the proposed ideal state, we should strive to reduce the number of separate apps to about 300~400. I think that roughly corresponds to two subdirectory levels under JIT (e.g. Methodical\divrem) but I have yet to provide more precise numbers.

While the test aggregation is expected to solve a known set of problems (test system complexity caused by the partitioning systems, performance of test build and execution), it has the potential to introduce a new set of problems we should plan ahead for and work on fixing or mitigating as part of the proposal. In particular, a larger number of tests being run as a single app can complicate debugging, profiling, TTT analysis, and JIT dump analysis; a runtime failure and / or hard crash in one test tears down the subsequent tests in an aggregated test app, reducing test coverage in the presence of failures.

The counter-arguments clearly highlight sets of tests that are unsuitable for aggregation - typically interop tests where the individual tests sometimes tamper with the machine state (e.g. by registering COM classes), perhaps also the GC tests that are often lengthy and / or have the potential to tear down the app like in the case of negative OOM tests.

Even in cases where the test aggregation is expected to be benign, e.g. in the case of the JIT methodical tests, we still need to address the question of aggregation hampering developer productivity, typically in various diagnostic scenarios. @AndyAyersMS proposed a dual system where the tests would be aggregated by default in bulk testing but the developer could explicitly request the build of a single test case to mitigate the aforementioned complications.

Proposed solution

I have yet to make any real experiments in this space but it seems to me that we might be able to solve much of this puzzle by introducing group projects. My initial thinking is that, for a particular test project, e.g. JIT\Methodical\divrem\div\i4div_cs_do.csproj, we would use a new property to declare that the test is part of a test group project, say, JIT\Methodical\divrem\divrem_do.csproj (JIT tests often come in groups that require different optimization flags, so that would need preserving in the groupings). Hopefully it should be possible to tweak msbuild to normally build just the group projects; these would need to use either some form of code generation or reflection to run all the relevant test “cases” represented by the grouped projects, but that should no longer blow up msbuild as we could easily build the individual group projects serially.
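
To illustrate the reflection option mentioned above, here is a minimal sketch of what a group project's entrypoint could look like; the TestEntryPoint naming convention and the overall shape are assumptions used for illustration, not the actual infrastructure:

using System;
using System.Linq;
using System.Reflection;

// Hypothetical entrypoint of a group project such as divrem_do.csproj: it
// discovers the grouped test "cases" via reflection and runs them one by one
// inside a single process, assuming the grouped tests are compiled into the
// group assembly and expose a public static parameterless method named
// "TestEntryPoint" (the renamed former Main).
public static class DivRemGroupRunner
{
    public static int Main()
    {
        int failures = 0;

        foreach (MethodInfo test in typeof(DivRemGroupRunner).Assembly
                     .GetTypes()
                     .SelectMany(t => t.GetMethods(BindingFlags.Public | BindingFlags.Static))
                     .Where(m => m.Name == "TestEntryPoint" && m.GetParameters().Length == 0))
        {
            try
            {
                test.Invoke(null, null);
                Console.WriteLine($"PASS: {test.DeclaringType}.{test.Name}");
            }
            catch (Exception ex)
            {
                failures++;
                Console.WriteLine($"FAIL: {test.DeclaringType}.{test.Name}: {ex.InnerException ?? ex}");
            }
        }

        // 100 is the CoreCLR test convention for success.
        return failures == 0 ? 100 : 101;
    }
}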

I already have a work item on adding a new command-line option to src\tests\build.cmd/sh to let developers build just a particular test project or project subtree. It should be trivial to consolidate this option with the proposed project grouping such that in bulk testing we’d end up with just the group projects whereas targeted local scenarios would end up producing a single-test executable (as before) with the caveat that trying to build the entire tree in this “separate” mode would likely trigger an msbuild OOM or some other failure.

Proposed sequencing

  1. I’m going to perform at least a series of local experiments to measure how much of the running time of the individual tests is coming from runtime initialization vs. actual test code execution and I’ll share them on this issue thread. I have yet to see whether this approach can be easily applied in the lab. Locally it might suffice to tweak R2RTest to use ETW mode to monitor at which point Main got executed.

  2. Assuming the perf experiments do confirm a perf win in test grouping (especially for tiny tests like the JIT unit tests) and we agree on this proposal in some form, I’ll look into implementing its basic underpinnings in the CoreCLR test build / execution infra scripts and I’ll test the approach on a small suite of JIT tests.

  3. Once the PR per (2) is merged in, we can trigger a “quality-week-like” combined effort to apply the technique to additional CoreCLR test areas. At this point we would be still using the pre-existing infrastructure including the XUnit wrappers and test partitionings, we’d just gradually reduce the number of test apps being run. (The proposed conservative approach doesn’t address actual test code merging i.e. the test build time win will likely be smaller if any. This is further aggravated by the fact that many of the JIT unit tests come in form of IL source code.)

  4. The work per (3) should yield gradually accumulating benefits in form of reducing the total CoreCLR test running time, both locally and in the lab. Once the work advances enough so that we get under the envisioned 1K test projects, we can proceed to experimenting with removal of the test partitionings. At that point we may be also able to consider removing the Pri0 / Pri1 distinction and always run all the tests.

Thanks

Tomas

/cc @dotnet/runtime-infrastructure

ghost commented 3 years ago

Tagging subscribers to this area: @hoyosjs See info in area-owners.md if you want to be subscribed.

Author: trylek
Assignees: -
Labels: `area-Infrastructure-coreclr`
Milestone: -
trylek commented 3 years ago

/cc @dotnet/jit-contrib

trylek commented 3 years ago

/cc @dotnet/gc

trylek commented 3 years ago

/cc @agocke @jkotas @janvorli @mangod9

trylek commented 3 years ago

/cc @tommcdon @hoyosjs

trylek commented 3 years ago

/cc @naricc @fanyang-mono

jkotas commented 3 years ago

we should strive to reduce the number of separate apps to about 300~400.

This still feels like way too much. I think we should be shooting for < 40.

It is common to have thousands of tests per test app in the libraries partition. Having a few hundred tests per test app would still be less than what you regularly see in libraries.

the test aggregation

There are two independent aggregations:

I think we should deal with both types of aggregation at the same time, so that it is solved once for good. I think the ideal state is:

We would need to change how the tests are authored to make this happen. The tests cannot use the regular Main method as the entrypoint anymore since you cannot have multiple Main methods per binary.

My proposal would be:

The reason for using a source generator and not the XUnit runner to discover the tests is debuggability. The XUnit runner is a reflection stress test and thus not suitable as a test driver for the low-level runtime.

The nice side-effect of using the standard XUnit attributes for runtime tests is that the authoring of core runtime tests will become more similar to authoring of libraries tests.

hoyosjs commented 3 years ago

One thing I was thinking about with this approach: does this mean catastrophic failures in one test will take down the whole work item execution? Maybe this is something the remote executor can help with. Also, with the generated Main approach we would probably need to work out the reporting mechanism + coredump mechanism, as what we have today would fall short and Helix wouldn't report these.

trylek commented 3 years ago

I can theoretically imagine that we might be able to tweak the test scripts such that, when the aggregate test app crashes in a catastrophic manner, we'd run it a second time to execute the individual test cases one by one as separate apps; I guess that roughly corresponds to the remote executor. As for test authoring, I guess the biggest challenge is the JIT IL tests; I was originally thinking we might be able to keep them unchanged, but if that doesn't work, I'm definitely open to other ideas.

jkoritzinsky commented 3 years ago

My main request if we go a remote-executor route would be that there is some mode to have the remote executor spit out the command line required to launch the process it is starting. One of the hardest problems with RemoteExecutor is being able to figure out how to debug the child process.

Additionally, if we go the route of a source-generated xunit-esque test execution runner with RemoteExecutor-esque features for tests that require out-of-proc launching, I'd like it if we could design the support such that a test author could also reuse whatever infra we have for launching the child process and capturing diagnostics for specialized cases (like the COM tests with native entry-points that test activation)

trylek commented 3 years ago

Frankly speaking, I think we should work hard to avoid child process executions whenever possible as I believe it makes a crucial difference w.r.t. test perf. For isolation-sensitive tests like interop tests we'll add specific provisions based on auditing where process isolation is required.

naricc commented 3 years ago

We are already doing a kind of build-aggregation of tests for Android and iOS tests, because it was simply impractical to package up each test as a separate app. (@fanyang-mono and @imhameed worked on this respectively.) I think this will need to be true for wasm-aot as well, because each individual wasm app takes a long time to compile.

If we do this "test group" thing, we may be able to also put each group in an app, which would simplify the design of those test lanes. But I am not sure if the tradeoffs are the same/compatible (i.e. how many tests can go in each app).

BruceForstall commented 3 years ago

A JIT team requirement is to execute as little managed code as possible before getting to the test being debugged. It sounds like the proposal above might mostly achieve this even with aggregated tests, for most debugging scenarios. A counter-example is the libraries tests, where debugging them involves JITing and running gobs of xunit w/ reflection, which is super slow and interferes with debugging (e.g., if you set a JIT/GC stress mode, xunit is also stressed before even getting to the desired code). I like the proposal that tests could optionally be built standalone, if possible. Small, standalone tests help greatly in platform bring-up scenarios.

I like Jan's suggestion about mass grouping, noting that the build grouping doesn't necessarily need to reflect the Helix run-time grouping: if we have X built test assemblies and want to parallelize runs on Y machines, we don't need X == Y, especially if we can choose which subset of tests in a test assembly get run in any particular invocation. E.g., copy X.dll to two Helix machines, run half of the tests in X.dll on one machine, half on the other. This might not work quite so transparently, however, for crossgen tests, which will crossgen the entire test assembly no matter what subset of tests is run.

Grouping the tests probably makes it easier/simpler to copy tests between machines, e.g., from a Linux/x64 box doing cross-compilation to a Linux/arm32 "run" box.

The "test driver" will need to be very clear about which test is being run, which has passed/failed, how to rerun a failure (or pass). Of course, we need the results surfaced to Helix/AzDO properly.

How will per-test timeouts work? Will they only be per-test-assembly? That could cause a misbehaving test early in the run to prevent getting results from tests later in the run sequence.

trylek commented 3 years ago

Thanks @BruceForstall for your detailed and insightful feedback. I don't yet have all the answers to your questions; as a first step I'm trying to collect some actual perf numbers and as part of this task I noticed a bunch of test duplicates. Would you be fine with cleaning these up as a preparatory step, or is there some more subtle distinction to what I perceive as mere duplication? Examples:

https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/NaN/arithm32_cs_d.csproj https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/NaN/arithm32_d.csproj

(and seven other pairs in the same folder)

https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/xxobj/operand/refanyval.csproj https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/xxobj/operand/_dbgrefanyval.csproj

I see about two dozen similar cases and my local tooling should let me automate their identification. If you agree to the preparatory cleanup, I'll work on putting up a PR.

Thanks

Tomas

BruceForstall commented 3 years ago

Those look like dups to me.

Note that src\tests\JIT\CheckProjects\CheckProjects.cs is a tool to ensure the tests set various properties correctly. I haven't run it recently (and I'm not sure it runs in an automated fashion anywhere).

trylek commented 3 years ago

I have performed a simple experiment to get an initial reading on the perf implications of the proposed test merging. The results seem to indicate potential for substantial build time speedup; I'm also seeing some runtime speedup, but frankly not as pronounced as I expected. Most of the motivation in the issue description remains in place, I just have less hope that the change will automatically translate to a drastic reduction of test running time - there is still a chance the change will substantially speed up Helix execution by means of reducing the payloads, but that's speculation at this point.

As the target for my initial experiment I picked the JIT\Methodical tests that seem well suited for merging. The Pri1 suite contains about 2K tests in this subtree approximately half of which are csproj and the other half are ilproj projects. I have limited the initial experiment to csproj as the IL is much harder to transform. I have basically open-coded a simple managed app that uses some shortcuts to try to mimic the proposed transformation - changing the Main methods in the individual tests to be normal methods and directly calling them from a generated wrapper project / source file. I have excluded about 50 tests that use constructs incompatible with my simplistic C# / project rewriter. The runtime perf results on my laptop are as follows (x64 release):

  1. Vanilla src\tests\run release - 33 seconds.
  2. Test wrapper calling into the original test assemblies - 29 seconds.
  3. All tests compiled into a single assembly - 26 seconds.

It's probably worth noting that I'm respecting the d/r/do/ro distinctions so I'm actually generating four projects and running them in sequence. As you can see, the "fastest" variant (putting all tests in the subtree in a single assembly) reduces the execution time by about 22%.

On the other hand, putting all tests in a single assembly does reduce test build time in a substantial manner. On my laptop the managed test build portion of src\tests\build release -priority=1 (still limited to csproj projects under JIT\Methodical) takes about 190 seconds while the build of the four combined projects representing JIT\Methodical tests in the four build combinations (d / do / r / ro) only takes about 24 seconds, i.e. about an 8x improvement.

Summary: merging many tests together does have a drastic effect on total test build time; runtime perf improvement is also measurable but much less pronounced. Please note this could still be a big win for PR / CI runs as the test build constitutes a non-trivial portion of the total running time of these pipelines. In the latest CI run Pri0 test build took about 23 minutes; in the last outerloop run, Pri1 test build took about 47 minutes. It is also worth noting that this part is fundamental as all the Helix runs depend on it.

If, purely hypothetically, we were able to reduce test build time 8x as the results for the JIT\Methodical tests suggest, i.e. from 23 to ~3 minutes and from 47 to ~6 minutes, that would directly translate into the total running times of the CoreCLR pipelines. This estimate has many caveats, e.g. many Interop tests use native components with different build time characteristics, merging multiple ilproj tests into a single assembly requires more complex IL transformations, etc., but I believe there definitely is potential for improvement along the lines of this proposal.

BruceForstall commented 3 years ago

Note that we build and run the tests in a Checked config almost 100% of the time, so I'd measure that instead of release. There, I'd expect a bigger improvement.

jkotas commented 3 years ago

Vanilla src\tests\run release - 33 seconds. Test wrapper calling into the original test assemblies - 29 seconds. All tests compiled into a single assembly - 26 seconds.

Was the CPU utilization same between the different cases?

+1 on measuring checked JIT and runtime flavor

trylek commented 3 years ago

Thanks Bruce and Jan for your additional feedback. You're right on both counts. In checked mode, the three numbers are:

  1. Vanilla src\tests\run checked - 202 seconds.
  2. Test wrapper calling into the original test assemblies - 96 seconds.
  3. All tests compiled into a single assembly - 89 seconds.

For now I just observed CPU utilization in the task manager while running the tests. In (1), xUnit is obviously running the tests in parallel - after the initial test discovery CPU utilization quickly goes to 100% and stays there for the entire test duration. In contrast, both my "new" measurements per (2) and (3) involve CPU happily sitting at 22~25% utilization corresponding to just 1 out of my 4 cores being used. In other words, by fully leveraging parallelism we should be able to further improve case (2) and (3) to 96/4 ~ 24 seconds (about 8 times speedup). I assume that the difference between (2) and (3) is less pronounced in checked mode as the slower JIT and runtime in general dwarf the OS loader time needed to load the multiple assemblies in case (2).

trylek commented 3 years ago

In case anyone's interested in further experiments in this area, I have put the tool I wrote on our internal share

\\clrmain\public\writable\users\trylek\TestGrouping.zip

It basically receives a path into the GIT clone as its command-line argument (e.g. D:\git\runtime\src\tests\JIT\Methodical), rewrites the C# code and projects in the subtree and generates the eight wrapper projects - four projects per (2) and four per (3) - into the folder. This way git checkout / git clean -xdf can be easily used to undo the transformations when experimenting with the tool. The wrapper projects conform to the normal CoreCLR test project style so that they can be individually built using dotnet msbuild ...csproj and executed using the generated cmd script. I'll be happy to carry out any additional measurements using the tool based on your suggestions, at the end of the day it's actually quite easy to use.

trylek commented 3 years ago

As a next step in my experiments I have recently managed to leverage the internal iDNA technology to measure that in the archetypal "tiny JIT" test I'm always mentioning, i4div_cs_do (on Windows x64 release), we carry out about 70M instructions before entering Main and then about 15M instructions within it. While anecdotal, I believe it further confirms that there is at least some value in test merging for lab testing purposes.

For our upcoming Quality week (the week of 8/23) I have proposed starting the initial preparatory steps, in particular cleaning up test duplicates and renaming tests to remove entrypoint name duplicates (getting rid of pairs of tests with the same qualified entrypoint name, in other words where the assembly, class and entrypoint are the same). Once this is done, I'll start working on the next step: actually converting tests to XUnit style and on support for their merging.

danmoseley commented 2 years ago

This still feels like way too much. I think we should be shooting for < 40.

Just curious, why would we not just have a single assembly for all JIT tests, at that point? The hit on developer loop (even though C# compilation is fast)?

ericstj commented 2 years ago

While we're at it, please consider https://github.com/dotnet/runtime/issues/59821 as an input to this effort.

jkotas commented 2 years ago

Just curious, why would we not just have a single assembly for all JIT tests, at that point?

I do not expect that there will be a large efficiency gain between 40 and 1. And if we optimize the system for 1, it has a high probability of getting us into trouble on the opposite end of the spectrum. One giant binary is as bad as thousands of tiny binaries. The sweet spot for binary size is in the >100kB and <10MB range.

trylek commented 2 years ago

I have finally managed to figure out more detail regarding the test consolidation plan. I am proposing to base the plan on the following assumptions:

I’m looking forward to any feedback and additional suggestions.

Thanks

Tomas

jkotas commented 2 years ago

It is not realistic to convert all 10K tests in a single PR. At least in the short term we need a hybrid infra that will let the various teams gradually audit and convert their tests.

+1

For this reason I propose to limit this effort to merging in-proc test executions, not actually compiling multiple tests into a single managed assembly

There is significant overhead in building the thousands of little binaries too. I do not have exact numbers, but just watching the build console makes it very visible.

I think we should agree on the final shape that we want to get to and then decide about the best way to get there.

I am not convinced that doing the conversion in multiple steps will save us anything (it is more likely the opposite).

I’ll write a new managed app that will scan the test output directories after the Roslyn build

Can this be a Roslyn source generator that is usable with any xunit-based test project (i.e. not specific to the CLR test infrastructure)? Maybe this source generator can live in the http://github.com/dotnet/arcade repo, for now at least.

janvorli commented 2 years ago

One thing that is not clear to me is how we would handle the test cases that use special extra script bits generated from their project files via CLRTestBatchPreCommands / BashCLRTestPreCommands, especially the per-test settings of env variables that have to be done before the test process executes. Also, I had thought that we would preserve an optional ability to run individual tests for cases like new platform bring-ups where the runtime is not fully working yet.

I am also worried about issues leaking from one test case to another when running many tests in process. What I mean is that a test case can hit some issue in the runtime, e.g. a GC hole, that would stay hidden until some later test case triggers a GC. And this is not limited just to the GC; there can be some runtime data structure corruption triggered by one test case and revealed much later in an unrelated one. Or the case where a test issue would not fire if the runtime was already "warmed up". It seems that the new way will make debugging and identifying such issues much more difficult. So I feel like the ability to run individual tests one by one, in a way where very little code is executed before the actual test code, is very important.

trylek commented 2 years ago

Thanks Jan & Jan for your detailed feedback. Based on your comments I'm trying to reformulate the principles of the conversion.

For source-level merging into larger test apps, I believe we must solve three non-trivial challenges; these will likely require some amount of manual adjustments in the tests:

trylek commented 2 years ago

With help from Tomas Matousek I managed to get a better idea about how the Roslyn source generators work. From this perspective I now think that the test wrapper creation should involve three steps:

While my current understanding confirms that the Roslyn analyzer / generator technology is sufficient for the first and third step, I still need to get my head around the second step as I don't yet see how to integrate it into the Roslyn analyzer / generator framework apart from subjecting the entire test subtree to a single Roslyn compilation and I have no idea what will happen if I do.

jkotas commented 2 years ago

For things like GC holes the merged system might be actually more "efficient" in the sense of being more stressful as short executions of the individual test cases can easily hide GC inconsistencies

+1

a switch over the first command-line parameter specifying the test to run in the generated wrapper would probably go a long way towards satisfying this requirement.

Yep, we should not need to build the tests differently to allow single test execution. Check the sketch of the auto-generated Main method below for what it may look like. We are running quite a bit of managed code during startup, so I would hope that adding one extra managed method should not be a big deal. We can iterate on what the wrapper should look like exactly to make it as unintrusive as possible while still supporting all required features.

Dealing with IL-based projects.

It is ok to keep the IL-based tests in separate .dlls. However, we should allow multiple tests to exist in one .il file.

For a merged test app there’s no way to run a script between the test cases, in fact it’s one of the main goals to get shell out of the picture and run test cases within the merged app with as little overhead and process launching / switching as possible.

This problem was solved for libraries tests using remote executor. It would be nice to use the same scheme here if possible. Also, we only have a few hundred of these tests that require env variables, etc. We can deal with them last during final cleanup. We do not need to agree on the exact solution now.

I'm most worried about the performance of this step as it ultimately ends up feeding the 10K or more source files to the analyzer

The analyzer should be only running on one group at a time. I think we should be fine as long as we keep the group size under 1K.

Note that it is not unusual for libraries test .dlls to contain thousands of tests. For example, https://github.com/dotnet/runtime/tree/main/src/libraries/System.Runtime/tests contains 3_700+ tests, but it still compiles and executes in a very reasonable amount of time. I do not remember anybody complaining about System.Runtime tests compilation or execution speed hampering productivity.

I still need to get my head around the second step as I don't yet see how to integrate it into the Roslyn analyzer / generator framework

The Roslyn generator should be responsible for generating the Main method of the entrypoint test .exe. Let's say my test source files look like this:

using Xunit;

public class MyTests
{
    [Fact]
    public static void Test1()
    {
        Assert.Equal(0, 0);
    }
}
public class OtherTests
{
    [Fact]
    public static void Test2()
    {
        Assert.Equal(42, 42);
    }
}

The source generator should produce a Main method that looks like this:

[MethodImpl(MethodImplOptions.NoOptimization)]
static void Main(string[] args)
{
    string name = (args.Length > 0) ? args[0] : null;
    if (name == null || name == "MyTests.Test1") MyTests.Test1();
    if (name == null || name == "OtherTests.Test2") OtherTests.Test2();
}

Everything (the tests and the auto-generated Main method) gets compiled into one test .exe as part of single Roslyn compilation.

trylek commented 2 years ago

Thanks Jan for your supportive response. Based on your feedback I believe that as a first step I should experiment with the Roslyn source generator to get an idea about its performance and ability to deal with larger groups of tests. As a first iteration I would look into partitioning corresponding to the existing xUnit wrappers i.e. two directory levels under the test binary root (e.g. JIT\Methodical). I actually think this is the biggest one with about 2K tests so its conversion should provide many insights into the subsequent work.

trylek commented 2 years ago

One other detail I mentioned higher up this thread but haven't emphasized recently is the debug / optimization matrix. For JIT tests in particular, many of these come with d / do / r / ro suffixes to denote debug vs. release and optimization switches. In my summer experiments I ended up generating four wrappers corresponding to each of the categories; I'll be happy to hear other ideas.

jkotas commented 2 years ago

I believe that as a first step I should experiment with the Roslyn source generator to get an idea about its performance and ability to deal with larger groups of tests.

For inspiration, you can also take a look at https://github.com/dotnet/runtime/tree/main/src/libraries/Common/tests/StaticTestGenerator . It does the same thing as what the source generator should do, except that it is a standalone tool.

Also, you can use the existing tests under libraries to test the source generator.

For JIT tests in particular, many of these come with d / do / r / ro suffixes to denote debug vs. release and optimization switches. In my summer experiments I ended up generating four wrappers corresponding to each of the categories; I'll be happy to hear other ideas.

Sounds reasonable. If we are building a test in 4 different ways with the current scheme, we will need to build it in 4 different ways in the new scheme as well to maintain the same coverage.

jkoritzinsky commented 2 years ago

@trylek if you have any questions about how to build a Roslyn source generator such that it runs efficiently, just shoot me a message and I should be able to help.

BruceForstall commented 2 years ago

I think most of the following concerns have already been considered above, but let me write them down just to be sure:

Extending JKotas's example above, we could also leave a Main function with every test, but put it under #ifdef:

public class MyTests
{
    [Fact]
    public static void Test1()
    {
        Assert.Equal(0, 0);
    }
#if STANDALONE
    public static int Main()
    {
        int result = 100; // default success code
        Test1();   // or `result = Test1();` for tests that return an `int`
        return result;
    }
#endif // STANDALONE
}
trylek commented 2 years ago

Thanks Bruce for your comments and suggestions.

The wrapper cmd files handle various services, such as supporting (a) ilasm round-trip testing, (b) crossgen2 testing, (c) bailing out on some test configurations, (d) IL linker testing (?), (e) setting stress environment variables before a test is run. The new mechanism will need to handle all of this.

We already assume that some tests won't be mergeable. Similarly, we could have a mode that wouldn't merge anything and would basically run tests one by one like today; we could use this, at least in the medium term, for jobs exercising some of the special behavior you describe (ILASM / ILDASM roundtrip, IL linker testing, stress runs). Our current biggest goal is to provide some headroom for our lab capacity that is currently stretched to the limit, and for this purpose it should be sufficient to optimize the most frequent runs, i.e. PR / CI and outerloop runs; jobs running a few times a week shouldn't make that much of a difference. In some cases like Crossgen2, merging tests should actually provide a perf gain of its own, as a larger Crossgen2 compilation is generally much more performant than a ton of tiny compilations incurring repeated managed startup, loading of the framework assemblies and similar semi-constant costs.

I believe the current xunit wrappers handle per-test timeouts and core dumps. Presumably the auto-generated wrapper will handle this?

I don't think the XUnit wrappers deal with timeouts and dumps by themselves; the last time I looked, they were just plain process launches marked with the [Fact] attribute. Any timeout / dump logic must be handled either by the XUnit console or by the test app itself. In this sense, introducing a new form of wrapper for the merged tests shouldn't incur any fundamental regression of its own, even though you're certainly right that timeouts and such might need further tweaks. For dumps I would expect that the dump of the merged test app should be equally usable for failure investigations as dumps for single-test apps.

Will test exclusions work the same as today? There are some per-test conditional exclusion attributes in the proj files, e.g., IlasmRoundTripIncompatible, GCStressIncompatible, UnloadabilityIncompatible. How do these get "attached" to the test? I think today these get baked into the wrapper cmd file. Do we partition tests into sets with precisely the same set of proj file attributes?

For some test project attributes we definitely do need to split differing tests, most notably for the DebugType and Optimize attributes heavily used by JIT tests. For the other properties, frankly I don't know yet. As I explained above, I can imagine that some of these tests remain running in the STANDALONE mode; gradual incorporation of these special test modes (aimed at further perf improvements and Helix optimization) would probably require adding support for them to the wrapper generator e.g. by means of conditional statements around the individual test case executions.

Will the existing issues.targets file exclusion mechanism still be used? Similarly, how will tests get marked Pri-0 versus Pri-1?

For now I expect that both issues.targets and P0 / P1 split will remain in place. I am somewhat hopeful that perhaps the optimized execution thanks to grouping may make the tests so much more performant that we'll be able to get rid of the P0 / P1 distinction or perhaps separate out just a few long running tests and run all the rest on PR / CI runs. I'm not yet 100% sure this is realistic but I believe it would be a substantial boost for our code quality monitoring. I believe longer-term discussions about upsides and downsides of issues.targets and its possible replacements are healthy but far beyond the scope of this particular proposal that is already quite complicated.

Where is the test run parallelism going to live? In the (generated) test wrapper program?

That is an interesting question. I hope not. Higher up the thread, JanK and I had somewhat different opinions about the expected final number of merged apps: IIRC I proposed several hundred and JanK several dozen. We have yet to identify the sweet spot, but "my" guess (let's say 400 wrappers) corresponds to something like 25 tests per wrapper in P1 mode and 6 tests per wrapper in P0 mode, while "JanK's" guess (let's say 40 wrappers) corresponds to about 250 tests per wrapper in P1 mode and 60 tests per wrapper in P0 mode. Assuming the sweet spot is somewhere between these two, there's still a lot of wrappers to run, so that's where I'd expect the parallelism to occur. Ideally I'd love to initially run the new wrappers using the existing xunit console, but I'm not 100% sure that is doable; maybe we'll need some new runner that will just use the xunit console for the legacy wrappers, so that would be the place for the parallelism, but I wouldn't expect it to be much more complicated than when, for instance, R2RTest compiles the framework assemblies using Crossgen2 in parallel during the population of CORE_ROOT.
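
As a rough sketch of what that outer level of parallelism could look like, assuming a simple driver that launches the merged wrapper apps as separate processes (the corerun path, wrapper list and exit-code handling are illustrative, not the actual runner design):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

static class WrapperDriver
{
    // Runs each merged wrapper app as a separate process, several at a time,
    // while the per-test execution inside each wrapper stays sequential.
    public static void RunAll(IEnumerable<string> wrapperDlls, string coreRunPath, int maxParallelism)
    {
        Parallel.ForEach(
            wrapperDlls,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
            wrapper =>
            {
                using Process process = Process.Start(new ProcessStartInfo
                {
                    FileName = coreRunPath,
                    Arguments = wrapper,
                    UseShellExecute = false
                });
                process.WaitForExit();

                // Exit code 100 is the CoreCLR test convention for success.
                Console.WriteLine($"{wrapper}: {(process.ExitCode == 100 ? "PASS" : "FAIL")}");
            });
    }
}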

If the test annotations look like xunit, could you actually use xunit to run the tests (if you wanted to?)

That would be just perfect, I think that even if we don't manage to make this work now, we shouldn't make any design decisions that would prevent implementing this in the future.

For your example with the STANDALONE conditional compilation, I'm somewhat worried that this additional bit, while mostly boilerplate code, incurs an authoring (and to some extent maintenance) cost on every single test; after all, if we spend so much energy making our test cases look "dotnet test"-runnable (even though the actual underlying infrastructure will likely be different for performance reasons), it's worth noting that such tests don't carry an #if STANDALONE Main method either.

My current thinking is that the wrapper generation needs to be a three step process.

BruceForstall commented 2 years ago

Thanks, @trylek ! I suggested the STANDALONE Main option as a way to achieve the JIT goal of having the absolute minimum amount of JIT-ing before a test case is actually run, i.e. not even running the generated wrapper Main, which I anticipate could be quite large.

Looking forward to seeing how this all progresses.

jkoritzinsky commented 2 years ago

I have a slightly different proposal for how to organize the test system that (at least to me) feels lighter weight than @trylek's proposal:

Proposal

  1. Convert each C#/IL source file to use XUnit-style [Fact] attributes for each test method.
  2. For platform-specific disable mechanisms, use the attributes defined in the Microsoft.DotNet.XUnitExtensions package (widely used by the libraries team) to disable tests, either with special traits (for general cases like GCStressIncompatible or JitStressIncompatible), or with mechanisms similar to the PlatformDetection type and ConditionalFact/ConditionalTheory in the libraries test tree.
  3. For assembly-wide problems like CrossgenIncompatible or IlasmRoundTripIncompatible, we would use assembly attributes to annotate the incompatibilities.
  4. All issues.targets test disable mechanisms should move to follow the same mechanism as the libraries tests, with [ActiveIssue] attributes and the like (see the sketch after this list).
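
To make items 1, 2 and 4 concrete, a converted test might look roughly like the following; the condition helper and the issue URL are placeholders based on how the libraries tests use Microsoft.DotNet.XUnitExtensions, not a prescription:

using System;
using Xunit;

// Local stand-in for whatever condition helper the runtime tests end up
// sharing with the libraries tests.
public static class PlatformDetection
{
    public static bool IsWindows => OperatingSystem.IsWindows();
}

public class DivRemTests
{
    // Plain test, formerly a standalone Main returning 100.
    [Fact]
    public static void I4Div_Basic()
    {
        Assert.Equal(3, 7 / 2);
    }

    // Conditional test, skipped wherever the condition evaluates to false.
    [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsWindows))]
    public static void I4Div_WindowsSpecific()
    {
        Assert.Equal(0, 1 / 2);
    }

    // Issue-based disable replacing an issues.targets entry (illustrative URL).
    [ActiveIssue("https://github.com/dotnet/runtime/issues/12345")]
    [Fact]
    public static void I4Div_CurrentlyFailing()
    {
        // test body elided in this sketch
    }
}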

At this point, we will still have all of the test assemblies as they exist today, but all tests have been changed to use XUnit-style attributes. As a result, we will automatically have the support for the Optimized/Release/Debug combinations that the JIT team uses. Now we need to define how the runner will work:

The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library. The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator that follows the following model:

  1. For each MetadataReference (or ProjectReference in MSBuild terms), discover every public method of every public class that has an XUnit test attribute (Fact/Theory/ConditionalFact/ConditionalTheory, etc.)
  2. Discover each method in the current assembly marked with an XUnit test attribute
  3. Manually generate code to validate that the test should run (i.e. the ConditionalFact/ConditionalTheory/ActiveIssue scenario), using the Roslyn APIs to generate direct calls to the different properties instead of using reflection.

As the runner itself is defined in C#, we don't need to do any post-processing of IL assemblies produced from ilproj projects.
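
Building on the converted-test sketch above, and as a rough illustration of step 3 of this model, the generated code could turn each condition into a direct call rather than a reflection lookup; the names below (DivRemTests, PlatformDetection) refer to that earlier sketch and are placeholders only:

using System;

internal static class GeneratedTestRunner
{
    private static int s_failures;

    // "filter" is an optional single-test name, mirroring the args[0] handling
    // in the auto-generated Main sketch earlier in this thread.
    public static int RunAll(string filter)
    {
        // [Fact] test: always eligible.
        if (filter == null || filter == "DivRemTests.I4Div_Basic")
            Execute("DivRemTests.I4Div_Basic", DivRemTests.I4Div_Basic);

        // [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsWindows))]:
        // the condition becomes a direct property read, no reflection involved.
        if ((filter == null || filter == "DivRemTests.I4Div_WindowsSpecific") && PlatformDetection.IsWindows)
            Execute("DivRemTests.I4Div_WindowsSpecific", DivRemTests.I4Div_WindowsSpecific);

        return s_failures == 0 ? 100 : 101;
    }

    private static void Execute(string name, Action test)
    {
        try
        {
            test();
            Console.WriteLine($"PASS: {name}");
        }
        catch (Exception ex)
        {
            s_failures++;
            Console.WriteLine($"FAIL: {name}: {ex}");
        }
    }
}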

From this point onwards, there would be at least 2 modes for the generated code to support:

There would be a post-build step to analyze each assembly in the output. This step would be used to determine which files are eligible for crossgen2/ilasm roundtrip testing. This analysis step would spit out an rsp file for each scenario where each line in the file is the name of an assembly file that supports the test scenario. Then the crossgen/ilasm round trip pre-test steps would be driven by external scripts that we call into and pass the rsp file. Once Roslyn supports generating non-source files from a source generator, these rsp files can be generated by the source generator instead of a post-build step.

To handle setting stress environment variables, we would use CoreRun's new dotenv file support. All of the COMPlus_ environment variables will be set through a dotenv file, which we will directly pass to corerun.

At this point, we will only generate the .cmd/.sh scripts by default for the new XUnit harness projects. These .cmd/.sh files should be able to handle all of our use cases with the above design.

Standalone test support

To enable running standalone tests as we do today, we would update the build system to support a BuildAsStandalone=true MSBuild property. When set, this property would cause a test assembly to reference the "harness generator" in the aforementioned "Standalone-test mode" and to have the rsp, cmd, and sh scripts generated for the harness. This should cleanly support the JIT team's scenario of "minimal IL run before a test" for dev innerloop scenarios.

To support this in il-scenarios, we will need to use a different mechanism. One option would be to have a simple "repro" project similar to the "DllImportGeneratorSample" project that enables quickly reproducing test failures by providing a simple project that can be changed to call whatever test is required.

Out-of-Proc tests

With this basic model, out-of-proc tests will require manually writing [Fact]-attributed methods that will use RemoteExecutor or a similar mechanism to support out-of-process tests. If we feel that this is not a good design, we could add support for an OutOfProcFact attribute that the source generator would recognize and automatically write the code to execute the test out-of-proc.
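
For reference, the RemoteExecutor pattern used by the libraries tests looks roughly like the sketch below; the environment variable and the test body are illustrative only, and an OutOfProcFact attribute would essentially have the generator emit this kind of code on the author's behalf:

using System;
using Microsoft.DotNet.RemoteExecutor;
using Xunit;

public class IsolationTests
{
    [Fact]
    public static void RunsWithTieredCompilationDisabled()
    {
        var options = new RemoteInvokeOptions();
        options.StartInfo.EnvironmentVariables["DOTNET_TieredCompilation"] = "0";

        // The lambda body executes in a freshly launched child process, so the
        // environment variable does not leak into the other tests in the app.
        using RemoteInvokeHandle handle = RemoteExecutor.Invoke(() =>
        {
            Assert.Equal("0", Environment.GetEnvironmentVariable("DOTNET_TieredCompilation"));
        }, options);
    }
}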

dotnet test support

If we decide that support for dotnet test is interesting, we can add another mode of the generator to output a VSTest-adapter implementation that is hard-coded to the tests that are available in the harness. This would enable easier execution in a VS/VSCode scenario, and would be an optional scenario for people who like to run tests directly. As dotnet test has built-in dump support, this might be a useful mechanism to get automatic dump reporting with timeouts without a wrapper project.

trylek commented 2 years ago

Hi Jeremy!

Thanks for your detailed feedback. I believe your "modified proposal" has many interesting characteristics we should use. More detailed comments follow:

All issues.targets test disable mechanisms should move to follow the same mechanism as the libraries tests with [ActiveIssue] attributes and the like.

I like better aligning runtime tests with library tests. I just think that refactoring of issues.targets is more or less orthogonal to the test grouping and so we can treat it as an optional step - if we don't have enough time to get everything done, this can wait for a later cleanup step. The file has 3K lines, i.e. about 1K excludes, many of which don't have issues assigned; I would say it's at least a week of work provided you can somehow automate the conversion, while doing this manually would take a month and be terribly error prone.

The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library.

This basically means that each test would continue compiling via a separate Roslyn execution into a separate library; I believe higher up this thread JanK called that out as wasteful compared to one big compilation.

The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator

I don't think we can use the Roslyn analyzers for IL source code in ILPROJ tests (there are about 4K such tests in the tree i.o.w. about 40%, just to point out they're not a tiny fraction of the test set); for these we need to figure out something else. Apart from removing the entrypoint, renaming the Main method and marking it with the Fact attribute, there's one more transformation that needs to be made in many of the ilproj tests:

The class in which Main resides is often internal and making it public sometimes has annoying ripple effects making the transformation complicated. I think that the easiest way to fix this is to add one more public static class to the bottom of the file with the new public method that would then call the pre-existing internal Main. Using this transformation we could also standardize the naming of the test method in ilproj tests - e.g. having it the same as the name of the test provided we clean them up to be unique across the tree (I think they mostly are already) - with the caveat that this would prevent having multiple test cases in a single IL source file (which is not a regression, we don't have this option today); generating code to call these test entrypoints would then become trivial.
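
A minimal sketch of this transformation, shown in C# for readability (the actual edit would be made in the IL source), with placeholder type and method names:

using Xunit;

// Pre-existing test class; note that the original entrypoint is internal.
internal static class I4DivTest
{
    internal static int Main()
    {
        // original test body, returning 100 on success
        return 100;
    }
}

// New public class appended to the bottom of the IL file: it exposes a single,
// uniquely named [Fact] method that simply forwards to the internal Main.
public static class I4Div_TestEntry
{
    [Fact]
    public static void i4div_cs_do()
    {
        Assert.Equal(100, I4DivTest.Main());
    }
}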

For resolving OS, platform and runtime flavor-incompatible tests we need to keep in mind that in the PR pipeline the managed tests are compiled only once for all combinations and it's just the test wrapper creation that today takes place in the run phase that resolves issues.targets entries pertinent to the targeting platform and filters out the disabled tests. I may be missing some fine points in your plan but the idea about the wrapper project having ProjectReferences to all the tests either means that we'd need to move managed test build to the run phase (i.e. ultimately execute it 30 times instead of once) or that we'd need to generate dynamic code in the wrapper that would perform the relevant checks on the Helix client at runtime.

For the "Standalone test support", am I right to understand you're proposing to basically emit a separate wrapper for each test? While certainly doable that would inject 10K extra C# sources (albeit tiny ones) to the build, I'm not sure about perf implications of this. For one-off local experiments with individual tests it probably shouldn't matter but if we wanted to retain standalone tests in some lab pipelines (e.g. stress ones where the tests are timing out already without merging), we'd need to measure the perf implications.

For "Dotnet test support", that sounds great to me overall; similar to my comments regarding "issues.targets" I think that for now it's sufficient to know that the planned transformation makes this doable and actually doing that can wait for later if at some point we're out of time budget for the infra revamp.

Apart from these comments and clarifying questions, your proposal (I'm not sure if I should call it a counterproposal as I believe that we agree in many aspects and the rest is mostly clarifying technical details) sounds great to me and I'll be happy to work with you on fleshing out its remaining details so that we can start the actual work ASAP.

Thanks

Tomas

jkoritzinsky commented 2 years ago

Hi Jeremy!

Thanks for your detailed feedback. I believe your "modified proposal" has many interesting characteristics we should use. More detailed comments follow:

All issues.targets test disable mechanisms should move to follow the same mechanism as the libraries tests with [ActiveIssue] attributes and the like.

I like better aligning runtime tests with library tests. I just think that refactoring of issues.targets is more or less orthogonal to the test grouping and so we can treat it as an optional step - if we don't have enough time to get everything done, this can wait for a later cleanup step. The file has 3K lines, i.e. about 1K excludes, many of which don't have issues assigned; I would say it's at least a week of work provided you can somehow automate the conversion, while doing this manually would take a month and be terribly error prone.

Yes, we could push this work off until a later date if needed.

The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library.

This basically means that each test would continue compiling via a separate Roslyn execution into a separate library; I believe higher up this thread JanK called that out as wasteful compared to one big compilation.

This was meant to be in alignment with your suggested approach. Alternatively, we can combine many tests into a single assembly since my generator suggestion also looks in the current assembly to generate the test execution calls.

The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator

I don't think we can use the Roslyn analyzers for IL source code in ILPROJ tests (there are about 4K such tests in the tree i.o.w. about 40%, just to point out they're not a tiny fraction of the test set); for these we need to figure out something else. Apart from removing the entrypoint, renaming the Main method and marking it with the Fact attribute, there's one more transformation that needs to be made in many of the ilproj tests:

The class in which Main resides is often internal and making it public sometimes has annoying ripple effects making the transformation complicated. I think that the easiest way to fix this is to add one more public static class to the bottom of the file with the new public method that would then call the pre-existing internal Main. Using this transformation we could also standardize the naming of the test method in ilproj tests - e.g. having it the same as the name of the test provided we clean them up to be unique across the tree (I think they mostly are already) - with the caveat that this would prevent having multiple test cases in a single IL source file (which is not a regression, we don't have this option today); generating code to call these test entrypoints would then become trivial.

If we use a C# assembly to be the test runner to wrap the IL tests, we can use the Roslyn source generator/analyzer APIs on the C# runner and have that runner call the assemblies that were compiled from ilproj projects. As mentioned in the standalone case, having a "directly execute the IL assembly without calling through any C#" mechanism would require some additional (likely manual in v1) work on the part of the developer who wants to run an IL test individually.

For resolving OS, platform and runtime flavor-incompatible tests, we need to keep in mind that in the PR pipeline the managed tests are compiled only once for all combinations; it's just the test wrapper creation, which today takes place in the run phase, that resolves the issues.targets entries pertinent to the targeted platform and filters out the disabled tests. I may be missing some fine points in your plan, but the idea of the wrapper project having ProjectReferences to all the tests means either that we'd need to move the managed test build to the run phase (i.e. ultimately execute it 30 times instead of once) or that we'd need to generate dynamic code in the wrapper that would perform the relevant checks on the Helix client at runtime.

My suggestion would be to generate dynamic code in the wrapper to perform the relevant checks and to keep our managed build to one platform, at least for v1. If we eventually merge the many test assemblies (as mentioned above) and get to a relatively quick managed test build, it might be worthwhile to build the managed tests in multiple legs and pre-resolve some of the OS/platform/runtime-flavor/arch checks.

This would make our "test run" phase only have to copy the native bits to the right locations and run the tests, which would help simplify those steps as well.
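
Conceptually, the generated check in the wrapper would just be ordinary runtime code, something like the sketch below (the test name and the specific condition are invented for illustration, not real generator output):

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a test compiled from an ilproj or csproj test project.
static class SampleIlprojTest
{
    public static void Run() => Console.WriteLine("test body");
}

static class GeneratedWrapperSketch
{
    static void Main()
    {
        // Skip the test on targets it does not support and run it everywhere else.
        if (!(OperatingSystem.IsWindows() && RuntimeInformation.ProcessArchitecture == Architecture.Arm64))
        {
            SampleIlprojTest.Run();
        }
    }
}
```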

For the "Standalone test support", am I right to understand you're proposing to basically emit a separate wrapper for each test? While certainly doable, that would inject 10K extra C# sources (albeit tiny ones) into the build, and I'm not sure about the perf implications. For one-off local experiments with individual tests it probably shouldn't matter, but if we wanted to retain standalone tests in some lab pipelines (e.g. stress ones where the tests are already timing out without merging), we'd need to measure the perf impact.

For the "Standalone test support" case, that is an opt-in scenario where a developer would build an individual test in that mode (using something like ./dotnet build src/tests/path/to/test.csproj /p:BuildAsStandalone=true, which is a slight modification of a common workflow within at least the Interop team). This would be off by default, so we wouldn't be injecting any extra sources into the build.

You are correct that this may have some implications in the lab, but in those cases we'd be generating all of the small source files instead of the larger wrappers (through the same, if not more involved, calculations). Doing a perf investigation sounds reasonable to me.

For "Dotnet test support", that sounds great to me overall; similar to my comments regarding "issues.targets", I think that for now it's sufficient to know that the planned transformation makes this doable, and actually doing it can wait for later if at some point we run out of time budget for the infra revamp.

Yes, this is definitely a stretch goal and is not required for v1.

Apart from these comments and clarifying questions, your proposal (I'm not sure if I should call it a counterproposal, as I believe we agree in many aspects and the rest is mostly clarifying technical details) sounds great to me and I'll be happy to work with you on fleshing out its remaining details so that we can start the actual work ASAP.

I'm looking forward to working together on this project! 😁

jkotas commented 2 years ago

The class in which Main resides is often internal, and making it public sometimes has annoying ripple effects that complicate the transformation.

What are the ripple effects that you are worried about? I would expect them to be extremely rare in src/tests.

issues.targets

Yes, we could push this work off until a later date if needed.

If we push this work off until a later date, how are the tests going to be disabled in the meantime? Is the source generator going to read issues.targets somehow?

trylek commented 2 years ago

@jkotas - Hmm, in my summer experiment it looked like making the method and class public often made additional fields or classes start complaining about inconsistent visibility. I have now tried making all Main methods in our .il source files public (not the classes for now) and the Pri1 test build passed, so hopefully I was just doing something wrong before. One way or another, having a consistent scheme for marking the test entrypoint[s] in ilproj tests might be beneficial in the absence of Roslyn analyzer capabilities.

If we push off the issues.targets refactoring, I think the easiest thing we could do in the run legs would be to just emit the list of project exclusions into some text file that would then be read by the test wrapper in Helix and used for test filtering; that should be relatively trivial, as the test build already has this information readily available.
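
As a sketch of what the wrapper-side filtering could look like (the exclusion file name, its one-test-per-line format and the delegate shape are all made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ExclusionFilterSketch
{
    // Runs only the tests not listed in the build-generated exclusion file.
    public static void RunFiltered(IEnumerable<(string Name, Action Run)> tests, string exclusionFile)
    {
        var excluded = File.Exists(exclusionFile)
            ? new HashSet<string>(File.ReadAllLines(exclusionFile), StringComparer.OrdinalIgnoreCase)
            : new HashSet<string>();

        foreach (var (name, run) in tests)
        {
            if (!excluded.Contains(name))
                run();
        }
    }
}
```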

jkoritzinsky commented 2 years ago

Another option for a half-step for issues.targets would be to run it through a mechanical transform into a set of exclusions that point at either the .dll or the .csproj files (which shouldn't be too difficult, we've generally followed the same pattern for writing all of the exclusions over the years). Then we could do the text file as @trylek mentioned. Alternatively, if we decide we still want to build the test wrappers per-target, we could have the transformed issues.targets file instead point to the .csproj files that should be excluded from the ProjectReference collection.

We have a few different options here that we can try out if the cost of generating the ActiveIssue attributes is too high.

trylek commented 2 years ago

@jkoritzinsky - I think we're mostly on the same page; comments on some of the remaining bits requiring further work:

I don't think I have any other concerns with your suggestions. I have locally refactored my summer experiment into a tool I'm now using to analyze and rewrite the ilproj tests; my hope is to provide it to the JIT team and to the TypeLoader team (which is actually most likely ours) to use for semi-automatic conversion of these tests. I'm also trying to use it to prototype merged execution of a large number of the ilproj tests to let me provide an initial estimate of the expected savings for Jeff. Please let me know if you have cycles to actually participate in the implementation work here; technically speaking I'm OOF for the second half of next week as I'm using our state holiday next Thursday for a prolonged weekend in the country with friends. Based on our finalized design I can start some scripting / coding work early next week, and if you could contribute in its second half, we might have something good enough for fanning out the conversion in the week of November 1st, which is what I promised to Jeff on Monday and still believe is doable.

jkoritzinsky commented 2 years ago

I think we're mostly on the same page; comments on some of the remaining bits requiring further work:

  • For the ilproj tests I'm still not sure if we're in agreement. You say that "If we use a C# assembly to be the test runner to wrap the IL tests, we can use the Roslyn source generator/analyzer APIs on the C# runner and have that runner call the assemblies that were compiled from ilproj projects"; I generally agree with that, but I think the problematic point is how exactly you "call the assemblies that were compiled from ilproj projects". If you add support for merging the C# code of multiple tests into a single assembly, you basically need to skip their own C# compilation step (as that will happen as part of the larger merged build). In other words, the wrapper needs generating before or instead of the managed test build. [By the "before" variant I mean that the wrapper is somehow injected into the test list, the "subtests" themselves get skipped in the compilation, and the wrapper takes the normal "managed test build" path including creation of the cmd / sh runners using the existing scripts.] It is then a question how the wrapper build locates the ilproj test entrypoints. If it were to look at the ILASM-produced MSIL, we'd need to compile the ilproj tests before the csproj tests to achieve the proper ordering; alternatively it must either carry out some manual analysis of the IL source code or count on standardized test entrypoint naming.

My idea is that the wrapper build will reference all test assemblies it includes with a regular <ProjectReference> element in the test wrapper's csproj file. When implemented in this manner, it doesn't matter if the assembly that actually defines the test is written in C# or in IL since the test wrapper project will inspect it as a .NET assembly reference (which does not disambiguate between the two). The test discovery process would be "look for all accessible methods with XUnit test attributes on them in this project's source and in all referenced assemblies", which would include the IL assemblies. As we move more in the direction of merging test suites together, we would move some of the C# tests into the wrapper project itself. For the IL tests, we would start merging them into a single IL assembly, but the runner would still be a separate C# test project.
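
To make the discovery rule concrete, it boils down to something like the following (shown with reflection purely for brevity; the actual source generator would emit direct calls at build time rather than reflecting at runtime):

```csharp
using System.Linq;
using System.Reflection;
using Xunit;

static class DiscoverySketch
{
    // Finds and invokes every public static [Fact] method in a referenced test assembly,
    // regardless of whether that assembly was compiled from C# or from IL.
    public static void RunAllFacts(Assembly testAssembly)
    {
        var factMethods = testAssembly.GetTypes()
            .SelectMany(t => t.GetMethods(BindingFlags.Public | BindingFlags.Static))
            .Where(m => m.GetCustomAttribute<FactAttribute>() != null);

        foreach (var method in factMethods)
        {
            method.Invoke(null, null);
        }
    }
}
```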

  • I agree with generating dynamic code using runtime checks and seeing whether at some later point we may be able to make the managed test build fast enough that it can be put back in the run legs, thus enabling better test and test artifact pruning before sending the run to Helix.

:+1:

  • For the "standalone mode", I think we might actually want to start with that one: implementing something very simple to just allow running the "dotnet test"-style refactored tests one by one. That would be sufficient for fanning out the test refactoring work to the different teams in early November; in parallel we could dive deeper into the test merging / splitting logic.

Yes, we could start with the standalone mode as it requires fewer features and would provide a comparable experience to the current test system as tests move to use the XUnit attributes.

One other interesting bit that occurred to me in relationship with JanK's issues.targets question is that, according to our earlier discussions with the Mono team, it would be helpful to have a more dynamic test grouping based on the targeted platform - the browser WASM interpreter, the various Android emulators and such just have different perf and memory characteristics, so it would be nice to introduce new degrees of freedom there. If we implemented logic in the generated wrappers (or rather in some of their dependencies) to use an emitted test list as the "block-list", we could likely similarly support an "allow-list". The run leg in AzDO could then generate a series of such lists representing the test splitting based on the target and use them to drive the Helix runs.

We could either use a model similar to the PlatformDetection model that the libraries team uses, or we could use XUnit Traits to implement more fine-grained control that the owners of different legs could enable or disable at their preference (allowing both "block-list" and "allow-list" behavior). If we wanted to do more of a load-balancing thing to better size our work items, then I agree, using a list of tests might be a better option.
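
As a sketch of the trait-based variant (the trait name and value are invented for illustration, not an agreed convention; a leg owner would include or exclude tests by trait when invoking the runner):

```csharp
using Xunit;

public static class SomeLongRunningTest
{
    // "TargetGroup" / "slow-interpreter" are placeholder trait names.
    [Fact]
    [Trait("TargetGroup", "slow-interpreter")]
    public static void Entrypoint()
    {
        // test body
    }
}
```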

I don't think I have any other concerns with your suggestions. I have locally refactored my summer experiment into a tool I'm now using to analyze and rewrite the ilproj tests; my hope is to provide it to the JIT team and to the TypeLoader team (which is actually most likely ours) to use for semi-automatic conversion of these tests. I'm also trying to use it to prototype merged execution of a large number of the ilproj tests to let me provide an initial estimate of the expected savings for Jeff. Please let me know if you have cycles to actually participate in the implementation work here; technically speaking I'm OOF for the second half of next week as I'm using our state holiday next Thursday for a prolonged weekend in the country with friends. Based on our finalized design I can start some scripting / coding work early next week, and if you could contribute in its second half, we might have something good enough for fanning out the conversion in the week of November 1st, which is what I promised to Jeff on Monday and still believe is doable.

I believe I'll have cycles to work on this next week. For later weeks, I should still have time to work on the Roslyn source generator for test execution (I've gotten quite familiar with the workflow for source generators with the DllImportGenerator project).

jkotas commented 2 years ago

One other interesting bit that occurred to me in relationship with JanK's issues.targets question is that, according to our earlier discussions with the Mono team, it would be helpful to have a more dynamic test grouping based on the targeted platform - the browser WASM interpreter, the various Android emulators and such just have different perf and memory characteristics, so it would be nice to introduce new degrees of freedom there.

What is the estimate for the largest group size that still works everywhere? I would expect that groups of several hundred average C# tests should be fine in any environment.

It would be best to stick with hardcoded logical groups to keep things simple.

trylek commented 2 years ago

@fanyang-mono / @naricc, can you please respond to JanK's question? I don't have sufficient experience in this space.

trylek commented 2 years ago

OK, so let's say there's a v0 comprising the standalone runner so that we can start gradually merging in PRs switching over the individual tests to the xUnit style. I believe the next step is to define what exactly that entails:

What exactly will the test wrapper generate for each test, and how is it going to be used in the cmd / sh runner scripts?

Another interesting question is how to incorporate the actual build of these wrappers into the test build pipeline. I believe that my recent work on cleaning up the test build scripts has given us a solid cross-plat foundation we can inject the new work into.

jkoritzinsky commented 2 years ago

OK, so let's say there's a v0 comprising the standalone runner so that we can start gradually merging in PRs switching over the individual tests to the xUnit style. I believe the next step is to define what exactly that entails:

  • As a first iteration (until it's proven unworkable) I propose marking the new-style tests as CLRTestKind=BuildAndRun and OutputType=Library. Please let me know if you see, off the top of your head, that this won't work.
  • For now I assume that we switch over all the tests to have some int Entrypoint(string[]); that's what I'm experimenting with for the ilproj tests anyway. Let me know if you have other suggestions.
  • The Entrypoint will be marked with the [Fact] attribute. We need to tweak the props or whatever other files to make the XUnit attributes available for each test build.

I think we should mark them as CLRTestKind=BuildAndRun (which is the default so it doesn't need to be specified) and OutputType=Exe. Naming the entrypoints Entrypoint makes sense, and for v0 we can special-case and allow the int(string[]) signature, though we should move to a void Entrypoint() signature if possible.
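
So a converted C# test under these v0 conventions would end up looking roughly like this (the class name is illustrative, and the body stands in for the original test logic, with failure reported by throwing or asserting rather than by a non-100 exit code):

```csharp
using Xunit;

public static class Test_b12345
{
    [Fact]
    public static void Entrypoint()
    {
        // original test logic goes here; Assert (or any unhandled exception) signals failure
        Assert.Equal(2, 1 + 1);
    }
}
```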

What exactly will the test wrapper generate for each test, and how is it going to be used in the cmd / sh runner scripts?

  • We can emit a three-line C# wrapper next to each test implementing the Main method by calling into the [Fact] method in the test assembly. We may need to emit a new test project comprising the generated wrapper and the ProjectReference to the original project; however, that's going to incur one extra Roslyn compilation and one extra assembly load at runtime for each test, i.e. some perf degradation. We may be able to justify this in the broader context of the regression being just temporary; in fact, we'll likely improve the grouping and perf in parallel with the test transformation process, so the gradual improvements may cancel out the temporary regressions.
  • We can emit the wrapper directly as the MSIL EXE using the System.Reflection.Metadata API; it's literally a three-line method (ldarg.0, call, ret). This would likely get rid of the additional Roslyn cost and the potential ordering issues in the managed build, but the additional assembly load at runtime would remain.
  • We can emit the wrapper project as a union of the original project and the generated wrapper source file, thus pioneering merged test execution to some extent. That would get rid of most of the extra Roslyn cost and likely all of the runtime cost.

The source generator would generate a Main method that would call every method marked with [Fact], so making it an exe makes sense. This would effectively fall into something similar to option 3, where we make the test project runnable by generating a "Main" method for the test assembly.
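
In other words, the per-assembly generator output would be along these lines (the test classes below are stand-ins for whatever [Fact] methods the generator finds in the compilation, included only so the sketch is self-contained):

```csharp
// Stand-ins for real tests.
public static class Test_b12345 { public static void Entrypoint() { /* test body */ } }
public static class Test_b67890 { public static void Entrypoint() { /* test body */ } }

internal static class GeneratedTestMain
{
    private static int Main()
    {
        Test_b12345.Entrypoint();
        Test_b67890.Entrypoint();
        return 100; // existing CoreCLR test convention: exit code 100 means "pass"
    }
}
```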

For IL tests, we'd still need to hand-write the Main method. Maybe even just mark the Main method with the [Fact] attribute instead of rewriting the method name? That would enable the tests to run as-is and automatically work when we start working on the new runners for the "v1" of this endeavor.
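
For reference, attaching the attribute in IL would be a one-line addition inside the existing Main, assuming the ilproj gains an assembly reference to xunit.core (everything else in the test stays as-is):

```
.method public hidebysig static int32 Main(string[] args) cil managed
{
  // Added line: marks the existing entrypoint as an xunit [Fact].
  .custom instance void [xunit.core]Xunit.FactAttribute::.ctor() = ( 01 00 00 00 )
  .entrypoint
  // ... existing test body unchanged ...
  ldc.i4     100
  ret
}
```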

Another interesting question is how to incorporate the actual build of these wrappers into the test build pipeline. I believe that my recent work on cleaning up the test build scripts has given us a solid cross-plat foundation we can inject the new work into.

  • As it seems likely we'll need to generate new test projects, at some point we'll probably need to replace the original projects with the transformed projects.
  • On the one hand, I think it would be great if we could delegate most of the cmd / sh runner logic for the wrappers to the existing src/tests/Common/CLRTest.*.targets scripts (just to avoid the extra work of reimplementing all of those), but there's a nesting bit I don't see how to resolve: I think the "new wrappers" cannot go through the pre-existing xUnit console logic because they emit "their xUnit results xml" in a different manner. Please let me know if you believe we might be able to consolidate these somehow, at least in the short term; in a way it would be much simpler if we could start off with something like "add-ons to the existing xUnit wrappers" rather than a completely new beast requiring immediate adjustments to various parts of the infra.
  • As I mentioned before, if we want to locate the [Fact] attributes on ILPROJ assemblies via a Roslyn analyzer, in other words probably some incarnation of PEReader, we need to sequence building the wrappers after building the primary source code (in this case building the IL source code using ILASM into the MSIL DLLs). That would however make it more challenging to reuse the existing managed test build logic to generate the cmd / sh scripts. We need to decide whether there's real value in treating the wrappers as tests themselves in the test build sense, vs. building them separately and likewise separately generating their cmd / sh runner scripts, vs. building them in a separate phase but trying to hijack the existing cmd / sh runner script generators to do their job.

For this v0 stage where we're transitioning to the new design, we don't even need to generate new test wrappers. The test behavior should be basically identical to the behavior today, as each project will still be an exe and will still run in the same manner it does today.

jkotas commented 2 years ago

Naming the entrypoints Entrypoint makes sense, and for v0 we can special-case and allow the int(string[]) signature, though we should move to a void Entrypoint() signature if possible.

Yes, it should be void and take no arguments to follow the XUnit conventions.

Nit: I would make it static void Test(). Test is what people tend to use for a generic test entrypoint if there are no better options.