trylek opened 3 years ago
Tagging subscribers to this area: @hoyosjs See info in area-owners.md if you want to be subscribed.
A search of `src\tests\JIT` for `*.csproj` / `*.ilproj` files yields 7312 matches. If we're serious about this effort, we must tackle JIT tests first. According to the proposed ideal state, we should strive to reduce the number of separate apps to about 300~400. I think that roughly corresponds to two subdirectory levels under JIT (e.g. `Methodical\divrem`) but I have yet to provide more precise numbers.
While the test aggregation is expected to solve a known set of problems (test system complexity caused by the partitioning systems, performance of test build and execution), it has the potential to introduce a new set of problems we should plan for ahead of time and work on fixing or mitigating as part of the proposal. In particular, a larger number of tests being run as a single app can complicate debugging, profiling, TTT analysis, and JIT dump analysis; a runtime failure and / or hard crash in one test tears down the subsequent tests in an aggregated test app, reducing test coverage in the presence of failures.
The counter-arguments clearly highlight sets of tests that are unsuitable for aggregation - typically interop tests where the individual tests sometimes tamper with the machine state (e.g. by registering COM classes), perhaps also the GC tests that are often lengthy and / or have the potential to tear down the app like in the case of negative OOM tests.
Even in cases where the test aggregation is expected to be benign, e.g. in the case of the JIT methodical tests, we still need to address the question of aggregation hampering developer productivity, typically in various diagnostic scenarios. @AndyAyersMS proposed a dual system where the tests would be aggregated by default in bulk testing but the developer could explicitly request the build of a single test case to mitigate the aforementioned complications.
**Proposed solution**
I have yet to run any real experiments in this space, but it seems to me that we might be able to solve much of this puzzle by introducing *group projects*. My initial thinking is that, for a particular test project, e.g. `JIT\Methodical\divrem\div\i4div_cs_do.csproj`, we would use a new property to declare that the test is part of a test group project, say, `JIT\Methodical\divrem\divrem_do.csproj` (JIT tests often come in groups that require different optimization flags so that would need preserving in the groupings). Hopefully it should be possible to tweak msbuild to normally build just the group projects; these would need to use either some form of code generators or reflection to run all the relevant test "cases" represented by the grouped projects, but that should no longer blow up msbuild as we could easily build the individual group projects serially.
I already have a work item on adding a new command-line option to `src\tests\build.cmd/sh` to let developers build just a particular test project or project subtree. It should be trivial to consolidate this option with the proposed project grouping such that in bulk testing we'd end up with just the group projects whereas targeted local scenarios would end up producing a single-test executable (as before), with the caveat that trying to build the entire tree in this "separate" mode would likely trigger an msbuild OOM or some other failure.
**Proposed sequencing**
1. I’m going to perform at least a series of local experiments to measure how much of the running time of the individual tests is coming from runtime initialization vs. actual test code execution and I’ll share them on this issue thread. I have yet to see whether this approach can be easily applied in the lab. Locally it might suffice to tweak R2RTest to use ETW mode to monitor at which point Main got executed.
2. Assuming the perf experiments do confirm a perf win in test grouping (especially for tiny tests like the JIT unit tests) and we agree on this proposal in some form, I'll look into implementing its basic underpinnings in the CoreCLR test build / execution infra scripts and I'll test the approach on a small suite of JIT tests.
3. Once the PR per (2) is merged in, we can trigger a "quality-week-like" combined effort to apply the technique to additional CoreCLR test areas. At this point we would still be using the pre-existing infrastructure including the XUnit wrappers and test partitionings; we'd just gradually reduce the number of test apps being run. (The proposed conservative approach doesn't address actual test code merging, i.e. the test build time win will likely be smaller, if any. This is further aggravated by the fact that many of the JIT unit tests come in the form of IL source code.)
4. The work per (3) should yield gradually accumulating benefits in the form of reducing the total CoreCLR test running time, both locally and in the lab. Once the work advances enough that we get under the envisioned 1K test projects, we can proceed to experimenting with removal of the test partitionings. At that point we may also be able to consider removing the Pri0 / Pri1 distinction and always run all the tests.
Thanks
Tomas
/cc @dotnet/runtime-infrastructure
| Author | Assignees | Labels | Milestone |
|---|---|---|---|
| trylek | - | `area-Infrastructure-coreclr` | - |
/cc @dotnet/jit-contrib
/cc @dotnet/gc
/cc @agocke @jkotas @janvorli @mangod9
/cc @tommcdon @hoyosjs
/cc @naricc @fanyang-mono
> we should strive to reduce the number of separate apps to about 300~400.

This still feels like way too much. I think we should be shooting for < 40.
It is common to have thousands of tests per test app in the libraries partition. Having a few hundred tests per test app would still be less than what you regularly see in libraries.
> the test aggregation

There are two independent aggregations:
I think we should deal with both types of aggregation at the same time, so that it is solved once for good. I think the ideal state is:
We would need to change how the tests are authored to make this happen. The tests cannot use the regular Main method as the entrypoint anymore since you cannot have multiple Main methods per binary.
My proposal would be:

- Use the standard XUnit `[Fact]` attributes to mark test entrypoints. We would bulk edit the tests and change their Main method to be a regular method with the `[Fact]` attribute.
- Use a source generator that finds all methods with the `[Fact]` attribute in the project and creates a Main method that calls all of them. Optionally, the Main method can take a test name and switch on it so that it is still easy to run an individual test (would it be enough to address the diagnostic concern?).
The reason for using source generator and not XUnit runner to discover the tests is debuggability. XUnit runner is a reflection stress test and thus it is not suitable as a test driver for the low-level runtime.
The nice side-effect of using the standard XUnit attributes for runtime tests is that the authoring of core runtime tests will become more similar to authoring of libraries tests.
One thing I was thinking about with this approach: does this mean catastrophic failures in one test will take down the whole work item execution? Maybe this is something the remote executor can help with. Also, with the generated `Main` approach we would probably need to work out the reporting mechanism + coredump mechanism, as what we have today would fall short and Helix wouldn't report these.
I can theoretically imagine that we might be able to tweak the test scripts such that, when the aggregate test app crashes in a catastrophic manner, we'd run it a second time to execute the individual test cases one by one as separate apps; I guess that roughly corresponds to the remote executor. For the test authoring, I guess the biggest challenge is the JIT IL tests; I was originally thinking we might be able to keep them unchanged, but if that doesn't work, I'm definitely open to other ideas.
My main request if we go a remote-executor route would be that there is some mode to have the remote executor spit out the command line required to launch the process it is starting. One of the hardest problems with RemoteExecutor is being able to figure out how to debug the child process.
Additionally, if we go the route of a source-generated xunit-esque test execution runner with RemoteExecutor-esque features for tests that require out-of-proc launching, I'd like it if we could design the support such that a test author could also reuse whatever infra we have for launching the child process and capturing diagnostics for specialized cases (like the COM tests with native entry-points that test activation)
Frankly speaking, I think we should work hard to avoid child process executions whenever possible as I believe it makes a crucial difference w.r.t. test perf. For isolation-sensitive tests like interop tests we'll add specific provisions based on auditing where process isolation is required.
We are already doing a kind of build aggregation of tests for Android and iOS tests, because it was simply impractical to package up each test as a separate app. (@fanyang-mono and @imhameed worked on this respectively.) I think this will need to be true for wasm-aot as well, because each individual wasm app takes a long time to compile.
If we do this "test group" thing, we may be able to also put each group in an app, which would simplify the design of those test lanes. But I am not sure if the tradeoffs are the same/compatible (i.e. how many tests can go in each app).
A JIT team requirement is to execute as little managed code as possible before getting to the test being debugged. It sounds like the proposal above might mostly achieve this even with aggregated tests, for most debugging scenarios. A counter-example is the libraries tests, where debugging them involves JITing and running gobs of xunit w/ reflection, which is super slow and interferes with debugging (e.g., set a JIT/GC stress mode, xunit is stressed also before even getting to the desired code). I like the proposal that tests could optionally be built standalone, if possible. Small, standalone tests help greatly in platform bring-up scenarios.
I like Jan's suggestion about mass grouping, noting that the build grouping doesn't necessarily need to reflect Helix run-time grouping: if we have X built test assemblies and want to parallelize runs on Y machines, we don't need X == Y, especially if we can choose which subset of tests in a test assembly get run in any particular invocation. E.g., copy X.dll to two Helix machines, run half of the tests in X.dll on one machine, half on the other. This might not work quite so transparently, however, for crossgen tests, which will crossgen the entire test assembly no matter what subset of tests is run.
Grouping the tests probably makes it easier/simpler to copy tests between machines, e.g., from a Linux/x64 box doing cross-compilation to a Linux/arm32 "run" box.
The "test driver" will need to be very clear about which test is being run, which has passed/failed, how to rerun a failure (or pass). Of course, we need the results surfaced to Helix/AzDO properly.
How will per-test timeouts work? Will they only be per-test-assembly? That could cause a misbehaving test early in the run to prevent getting results from tests later in the run sequence.
Thanks @BruceForstall for your detailed and insightful feedback. I don't yet have all the answers to your questions; as a first step I'm trying to collect some actual perf numbers, and as part of this task I noticed a bunch of test duplicates. Would you be fine with cleaning these up as a preparatory step, or is there some more subtle distinction to what I perceive as mere duplication? Examples:
https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/NaN/arithm32_cs_d.csproj https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/NaN/arithm32_d.csproj
(and seven other pairs in the same folder)
https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/xxobj/operand/refanyval.csproj https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/xxobj/operand/_dbgrefanyval.csproj
I see about two dozen similar cases and my local tooling should let me automate their identification. If you agree to the preparatory cleanup, I'll work on putting up a PR.
Thanks
Tomas
Those look like dups to me.
Note that src\tests\JIT\CheckProjects\CheckProjects.cs is a tool to ensure tests set various properties correctly. I haven't run it recently (and I'm not sure it runs in an automated fashion anywhere).
I have performed a simple experiment to get an initial reading on the perf implications of proposed test merging. The results seem to indicate potential for substantial build time speedup; I'm also seeing some runtime speedup but frankly not as pronounced as I expected. Most of the motivation in the issue description remains in place, I just have less hope that the change will automatically translate to drastic reduction of test running time - there are still chances the change will substantially speed up Helix execution by means of reducing the payloads but that's speculation at this point.
As the target for my initial experiment I picked the `JIT\Methodical` tests that seem well suited for merging. The Pri1 suite contains about 2K tests in this subtree, approximately half of which are csproj and the other half are ilproj projects. I have limited the initial experiment to csproj as the IL is much harder to transform. I have basically open-coded a simple managed app that uses some shortcuts to try to mimic the proposed transformation - changing the Main methods in the individual tests to be normal methods and directly calling them from a generated wrapper project / source file. I have excluded about 50 tests that use constructs incompatible with my simplistic C# / project rewriter. The runtime perf results on my laptop are as follows (x64 release):

1) Vanilla `src\tests\run release` - 33 seconds.
2) Test wrapper calling into the original test assemblies - 29 seconds.
3) All tests compiled into a single assembly - 26 seconds.
It's probably worth noting that I'm respecting the d/r/do/ro distinctions so I'm actually generating four projects and running them in sequence. As you can see, the "fastest" variant (putting all tests in the subtree in a single assembly) reduces the execution time by about 22%.
On the other hand, putting all tests in a single assembly does reduce test build time in a substantial manner. On my laptop the managed test build portion of `src\tests\build release -priority=1` (still limited to csproj projects under `JIT\Methodical`) takes about 190 seconds, while the build of the four combined projects representing `JIT\Methodical` tests in the four build combinations (d / do / r / ro) only takes about 24 seconds, i.e. about an 8 times improvement.
Summary: merging many tests together does have a drastic effect on total test build time; runtime perf improvement is also measurable but much less pronounced. Please note this could still be a big win for PR / CI runs as the test build constitutes a non-trivial portion of the total running time of these pipelines. In the latest CI run Pri0 test build took about 23 minutes; in the last outerloop run, Pri1 test build took about 47 minutes. It is also worth noting that this part is fundamental as all the Helix runs depend on it.
If, purely hypothetically, we were able to reduce test build time 8 times as the results for `JIT\Methodical` tests suggest, i.e. from 23 to ~3 minutes and from 47 to ~6 minutes, that saving would directly translate into the total running times of the CoreCLR pipelines. This estimate has many caveats, e.g. many Interop tests use native components with different build time characteristics, merging multiple ilproj tests into a single assembly requires more complex IL transformations, etc., but I believe there definitely is potential for improvement along the lines of this proposal.
Note that we build and run the tests in a Checked config almost 100% of the time, so I'd measure that instead of release. There, I'd expect a bigger improvement.
> 1) Vanilla src\tests\run release - 33 seconds.
> 2) Test wrapper calling into the original test assemblies - 29 seconds.
> 3) All tests compiled into a single assembly - 26 seconds.

Was the CPU utilization the same between the different cases?
+1 on measuring checked JIT and runtime flavor
Thanks Bruce and Jan for your additional feedback. You're right on both counts. In checked mode, the three numbers are:

1) Vanilla `src\tests\run checked` - 202 seconds.
2) Test wrapper calling into the original test assemblies - 96 seconds.
3) All tests compiled into a single assembly - 89 seconds.

For now I just observed CPU utilization in the task manager while running the tests. In (1), xUnit is obviously running the tests in parallel - after the initial test discovery, CPU utilization quickly goes to 100% and stays there for the entire test duration. In contrast, both my "new" measurements per (2) and (3) have the CPU happily sitting at 22~25% utilization, corresponding to just 1 out of my 4 cores being used. In other words, by fully leveraging parallelism we should be able to further improve cases (2) and (3) to 96/4 ~ 24 seconds (about an 8 times speedup). I assume that the difference between (2) and (3) is less pronounced in checked mode as the slower JIT and runtime in general dwarf the OS loader time needed to load the multiple assemblies in case (2).
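Just to illustrate the headroom, here is a minimal sketch of what leveraging that parallelism inside a merged wrapper could look like (my own illustration, not part of the measured prototype; the success/failure exit codes follow the usual CoreCLR test convention):

```csharp
// Illustration only: run merged test case delegates on all cores instead of
// serially. Any delegate that throws counts as a failure.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelTestWrapper
{
    public static int RunAll(Action[] testCases)
    {
        int failures = 0;
        Parallel.ForEach(
            testCases,
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            testCase =>
            {
                try
                {
                    testCase();
                }
                catch (Exception e)
                {
                    Interlocked.Increment(ref failures);
                    Console.Error.WriteLine(e);
                }
            });
        return failures == 0 ? 100 : 101; // 100 = success in CoreCLR test convention
    }
}
```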
In case anyone's interested in further experiments in this area, I have put the tool I wrote on our internal share:

`\\clrmain\public\writable\users\trylek\TestGrouping.zip`

It basically receives a path into the git clone as its command-line argument (e.g. `D:\git\runtime\src\tests\JIT\Methodical`), rewrites the C# code and projects in the subtree and generates the eight wrapper projects - four projects per (2) and four per (3) - into the folder. This way `git checkout` / `git clean -xdf` can be easily used to undo the transformations when experimenting with the tool. The wrapper projects conform to the normal CoreCLR test project style so that they can be individually built using `dotnet msbuild ...csproj` and executed using the generated cmd script. I'll be happy to carry out any additional measurements using the tool based on your suggestions; at the end of the day it's actually quite easy to use.
As a next step in my experiments I have recently managed to leverage the internal iDNA technology to measure that in the archetypal "tiny JIT" test I'm always mentioning, i4div_cs_do (on Windows x64 release), we carry out about 70M instructions before entering Main and then about 15M instructions within it. While anecdotal, I believe it further confirms that there is at least some value in test merging for lab testing purposes.
For our upcoming Quality week (the week of 8/23) I have proposed starting the initial preparatory steps, in particular cleaning up test duplicates and renaming tests to remove entrypoint name duplicates (getting rid of pairs of tests with the same qualified entrypoint name, in other words where the assembly, class and entrypoint are the same). Once this is done, I'll start working on the next step of actually converting tests to XUnit style and on support for their merging.
> This still feels like way too much. I think we should be shooting for < 40.

Just curious, why would we not just have a single assembly for all JIT tests, at that point? The hit on developer loop (even though C# compilation is fast)?
While we're at it, please consider https://github.com/dotnet/runtime/issues/59821 as an input to this effort.
> Just curious, why would we not just have a single assembly for all JIT tests, at that point?

I do not expect that there will be a large efficiency gain between 40 and 1. And if we optimize the system for 1, it has a high probability of getting us into trouble on the opposite end of the spectrum. One giant binary is as bad as thousands of tiny binaries. The sweet spot for binary size is in the >100kB and <10MB range.
I have finally managed to figure out more detail regarding the test consolidation plan. I am proposing to base the plan on the following assumptions:
It is not realistic to convert all 10K tests in a single PR. At least in the short term we need a hybrid infra that will let the various teams gradually audit and convert their tests. At some point we may become able to delete the legacy portions of the infra. This will also let us implement and demonstrate the work on a small subset of CoreCLR tests; once we agree on the structure of the change and iron out all issues, we can move towards converting larger batches of tests as appropriate.
According to my initial experiments higher up this thread, the perf difference between running tests in-process vs. out-of-process as today is much more pronounced than the perf difference between tests compiled into multiple assemblies (just executed in-proc) and tests compiled into a single assembly. For this reason I propose to limit this effort to merging in-proc test executions, not actually compiling multiple tests into a single managed assembly; we can consider that as a next step, but today the biggest perf win to be had is in Helix, especially considering that a single Roslyn compilation of the managed test components gets reused for dozens of OS / architecture variations of Helix executions.
The test conversion process will involve modifying the test source code and build projects. I believe it should be possible to write a managed tool capable of converting something like 90% of the tests, but the process will be semi-automatic; it will require manual auditing. My current thinking is that we'll just modify the projects to use `OutputType=Library` and `CLRTestKind=BuildAndRun` and we'll rename the `Main` method to some other identifier marked with the xUnit `[Fact]` attribute; thus the new test assemblies will formally conform to xUnit / dotnet test style even though the run.cmd/sh script won't be using runtime reflection for the test discovery and execution. In about 5% of cases the tests will need additional manual editing to fix qualified entrypoint clashes and internal / private entry points, most notably in ilproj projects.
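To make the proposed conversion concrete, here is a schematic before/after sketch (the type names and the test body are made up purely for illustration, not taken from an actual test):

```csharp
using Xunit;

// Before (schematic): a classic CoreCLR test that is its own executable and
// signals success by returning 100 from Main.
internal static class I4DivSample_Old
{
    private static int Main()
    {
        int result = 7 / 2 + 7 % 2;
        return result == 4 ? 100 : 101;
    }
}

// After (schematic): OutputType=Library, no process entrypoint; the test body
// becomes a public method marked with [Fact] so a generated wrapper (or a
// regular xUnit runner) can invoke it in-proc.
public static class I4DivSample_New
{
    [Fact]
    public static void TestEntryPoint()
    {
        int result = 7 / 2 + 7 % 2;
        Assert.Equal(4, result);
    }
}
```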
The `CLRTest.*.targets` scripts will identify new-style tests (Library / BuildAndRun) and adjust the cmd / sh execution script generator to use a managed wrapper app to run the test in question (as it will no longer have a Main entrypoint by itself). This will ensure that we're still able to run the tests one by one locally, as was a fundamental ask of the JIT team. Alternatively we might consider modifying the corerun app to directly support dll's without an entrypoint by looking for the [Fact] attribute.
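As a rough sketch of the kind of managed wrapper such a generated script could invoke (purely illustrative - the real infra might instead teach corerun itself to do this), assuming the converted test library exposes public static `[Fact]` methods:

```csharp
// Illustrative sketch: load one converted test library and invoke its [Fact]
// methods so the test can still be run on its own without an xUnit runner.
using System;
using System.Linq;
using System.Reflection;

public static class SingleTestRunner
{
    public static int Main(string[] args)
    {
        Assembly testAssembly = Assembly.LoadFrom(args[0]); // path to the converted test dll

        var factMethods = testAssembly.GetTypes()
            .SelectMany(t => t.GetMethods(BindingFlags.Public | BindingFlags.Static))
            .Where(m => m.GetCustomAttributesData()
                .Any(a => a.AttributeType.Name == "FactAttribute"));

        foreach (MethodInfo factMethod in factMethods)
        {
            Console.WriteLine($"Running {factMethod.DeclaringType!.FullName}.{factMethod.Name}");
            factMethod.Invoke(null, null); // an unhandled exception fails the run
        }

        return 100; // CoreCLR test success exit code
    }
}
```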
I'll write a new managed app that will scan the test output directories after the Roslyn build and create "new-style xUnit wrappers" for them. It will enumerate all dll files in the subtree, scan their managed types and methods to locate the methods marked with the [Fact] attribute, and slice them into sections (where we can optimize the size of each section per OS / architecture / runtime flavor); for each section it will generate C# source code and project scripts similar to today's xUnit wrapper scripts, just using in-process execution to launch the individual test entry points. We could define a heuristic characterizing the runtime duration of the tests to affect grouping (singling out long-running tests to be put in smaller groups or in single-test groups).
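A rough sketch of that scanning and slicing step follows (details are my own assumptions; a real implementation would more likely use MetadataLoadContext or System.Reflection.Metadata so nothing gets executed or locked while scanning):

```csharp
// Illustrative sketch: enumerate built test dlls, find public static methods
// marked [Fact], and slice them into fixed-size groups for which the wrapper
// source and projects would then be emitted.
using System;
using System.IO;
using System.Linq;
using System.Reflection;

public static class WrapperPlanner
{
    public static void Main(string[] args)
    {
        string testRoot = args[0];      // test output directory to scan
        const int TestsPerGroup = 250;  // placeholder grouping heuristic

        var facts = Directory.EnumerateFiles(testRoot, "*.dll", SearchOption.AllDirectories)
            .SelectMany(dll =>
            {
                try
                {
                    return Assembly.LoadFrom(dll).GetTypes()
                        .SelectMany(t => t.GetMethods(BindingFlags.Public | BindingFlags.Static))
                        .Where(m => m.GetCustomAttributesData()
                            .Any(a => a.AttributeType.Name == "FactAttribute"))
                        .Select(m => (dll, $"{m.DeclaringType!.FullName}.{m.Name}"));
                }
                catch
                {
                    // Native binaries or dlls with unresolvable dependencies are skipped.
                    return Enumerable.Empty<(string, string)>();
                }
            })
            .ToList();

        int groupIndex = 0;
        foreach (var group in facts.Chunk(TestsPerGroup))
        {
            Console.WriteLine($"Group {groupIndex++}: {group.Length} test cases");
            // ...emit the C# wrapper source and project for this group here...
        }
    }
}
```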
The “normal” xUnit wrapper generation will skip the new-style tests; thus once we trigger test execution, locally or in Helix, the “old-style” xUnit wrappers should blend with the “new-style” xUnit wrappers and provide for a smooth transition between the modes; at some future point we may be able to switch over all tests and delete the legacy xUnit wrapper creation logic.
My current thinking is that, assuming we agree on the general direction of this proposal, I’ll implement initial versions of the proposed tooling and script changes and I’ll demonstrate them in a PR switching over a small initial subset of JIT tests. We can further refine the design details on that PR thread; once it’s merged in, we can fan out to the individual teams owning the CoreCLR test portfolio and have them negotiate and fund the switchover of their test subsets.
My current proposal basically removes the role of the cmd / sh test execution scripts in the merged scenario. For JIT tests this should be mostly acceptable; if at some point we decide to consider converting some interop tests, for instance, we’ll need additional tricks to define test initialization requirements (like registering COM servers and such) that will ultimately end up in the new-style xUnit wrapper or in some script calling it. Similarly, if we decide to shoot for a complete conversion, we’ll need means for marking tests that cannot be merged at all and must be run one by one, this will be probably necessary for extreme cases like GC OOM negative tests.
I’m looking forward to any feedback and additional suggestions.
Thanks
Tomas
> It is not realistic to convert all 10K tests in a single PR. At least in the short term we need a hybrid infra that will let the various teams gradually audit and convert their tests.

+1
> For this reason I propose to limit this effort to merging in-proc test executions, not actually compiling multiple tests into a single managed assembly

There is significant overhead in building the thousands of little binaries too. I do not have exact numbers, but just watching the build console makes it very visible.
I think we should agree on the final shape that we want to get to and then decide about the best way to get there.
I am not convinced that doing the conversion in multiple steps will save us anything (it is more likely the opposite).
> I'll write a new managed app that will scan the test output directories after the Roslyn build

Can this be a Roslyn source generator that is usable with any xunit-based test project (i.e. not specific to the CLR test infrastructure)? Maybe this source generator can live in the http://github.com/dotnet/arcade repo, for now at least.
One thing that is not clear to me is how we would handle the test cases that use special extra script bits generated from their project files via `CLRTestBatchPreCommands` / `BashCLRTestPreCommands`, especially the per-test settings of env variables that have to be done before the test process executes.
Also, I had thought that we would preserve an optional ability to run individual tests for cases like new platform bring-ups where the runtime is not fully working yet. I am also worried about issues leaking from one test case to another when running many tests in one process. What I mean is that a test case can hit some issue in the runtime, e.g. a GC hole, that would stay hidden until some later test case triggers a GC. And this is not limited just to the GC; there can be some runtime data structure corruption triggered by one test case and revealed much later in an unrelated one. Or a test issue might not fire if the runtime was already "warmed up". It seems that the new way will make debugging and identifying such issues much more difficult. So I feel like the ability to run individual tests one by one, in a way where very little code is executed before the actual test code, is very important.
Thanks Jan & Jan for your detailed feedback. Based on your comments I'm trying to reformulate the principles of the conversion.
We seem to be in agreement that we need a hybrid model to facilitate gradual transformation of the tests, and we believe there's no easy way to mitigate Helix costs without at least some amount of test merging, as the repeated runtime startup costs, including the cost of the generated scripts participating in the out-of-proc test execution, multiplied by the number of tests, contribute substantially to the total test duration.
Regarding the related concern raised by JanV about runtime state leaking from one test to the other: I think there's just no way to completely avoid this risk if we want some test merging; all I'm proposing are mitigations - introducing options, perhaps in the form of additional attributes, to mark tests unsuitable for merging, or perhaps allowing richer hints for the test splitting. For things like GC holes the merged system might actually be more "efficient" in the sense of being more stressful, as short executions of the individual test cases can easily hide GC inconsistencies in cases where the test finishes quickly without a GC taking place.
We're in agreement that we need to retain the ability to run tests one by one; higher up this thread the JIT team emphasized the same ask. My original thinking was that, if we just transform the entrypoints and keep separate managed assembly compilation, very little change would be needed to run the tests in basically their original form. As JanK believes we should tackle C# compilation time as part of this task too, a switch over the first command-line parameter specifying the test to run in the generated wrapper would probably go a long way towards satisfying this requirement.
Just to clarify to avoid any misunderstandings, I do believe that the wrappers should be auto-generated, not just pre-generated and subsequently hand-maintained. Autogenerated wrappers remove the need for error-prone manual registration of a newly authored test and wrapper generation should let us further tune the payloads based on targeting OS, architecture and runtime flavor. I must admit I’m not yet familiar with Roslyn source generators, it remains to be seen to what extent they satisfy the outlined requirements and / or whether we’d need additional repo-specific extensions on top of “switching xUnit tests to use compile-time reflection”. In particular I’m thinking about the environment variable override raised by JanV, about rules for test grouping and so on.
For source-level merging into larger test apps, I believe we must solve three non-trivial challenges; these will likely require some amount of manual adjustments in the tests:
Dealing with IL-based projects. To my knowledge ILASM doesn’t support assembling multiple .il source files into a single module / assembly.
Cleaning up management of support assemblies. Many tests use managed support assemblies; sometimes multiple tests use the same support assembly, sometimes groups of tests use different support assemblies that just happen to be named the same (e.g. we have five different common.csproj projects in the test tree). Some tests tamper with their own or with the support assembly paths. We should make sure that tests marked as non-mergeable generate a single-test wrapper as similar as possible to the previous style of running tests as standalone apps.
Expressing external dependencies using additional attributes. For a merged test app there's no way to run a script between the test cases; in fact it's one of the main goals to get the shell out of the picture and run test cases within the merged app with as little overhead and process launching / switching as possible. It should be easy to have an attribute for setting environment variables for each test (see the sketch below); we could have attributes for registering COM servers prior to running the tests; we could also have a global initialization section that would take place once before we execute any tests in the merged app, e.g. by means of an extra generated cmd / sh script that would perform the initialization and then execute the wrapper.
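A very rough sketch of what such a per-test environment variable attribute could look like (entirely hypothetical - neither the attribute nor its name exists today; note that the wrapper or script generator would have to consume it before the test process starts, since most COMPlus_ variables cannot be applied from inside an already-running runtime):

```csharp
// Hypothetical attribute (not an existing API) that the wrapper / script
// generator could read at generation time and turn into environment variable
// assignments applied before the merged test process (or a child process) starts.
using System;

[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class | AttributeTargets.Assembly, AllowMultiple = true)]
public sealed class TestEnvironmentVariableAttribute : Attribute
{
    public TestEnvironmentVariableAttribute(string name, string value)
    {
        Name = name;
        Value = value;
    }

    public string Name { get; }
    public string Value { get; }
}

public static class StressModeSampleTests
{
    // Illustrative usage: this test case asks for tiered compilation to be off.
    [Xunit.Fact]
    [TestEnvironmentVariable("COMPlus_TieredCompilation", "0")]
    public static void RunsWithTieringDisabled()
    {
        // test body would go here
    }
}
```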
With help from Tomas Matousek I managed to get a better idea about how the Roslyn source generators work. From this perspective I now think that the test wrapper creation should involve three steps:
Using a Roslyn analyzer to scan the source code and collect the [Fact] attributed methods. I'm most worried about the performance of this step as it ultimately ends up feeding the 10K or more source files to the analyzer. We might need to introduce some caching support so that local test build innerloop doesn't incur this penalty (at least not upon each build).
Using the collected metadata to produce the test splitting and plan out the individual test wrapper compilations.
Running the Roslyn source generator on each test group to emit the corresponding test wrapper.
While my current understanding confirms that the Roslyn analyzer / generator technology is sufficient for the first and third step, I still need to get my head around the second step as I don't yet see how to integrate it into the Roslyn analyzer / generator framework apart from subjecting the entire test subtree to a single Roslyn compilation and I have no idea what will happen if I do.
> For things like GC holes the merged system might actually be more "efficient" in the sense of being more stressful as short executions of the individual test cases can easily hide GC inconsistencies

+1
> a switch over the first command-line parameter specifying the test to run in the generated wrapper would probably go a long way towards satisfying this requirement.

Yep, we should not need to build the tests differently to allow single test execution. Check the sketch of the auto-generated `Main` method below for what it may look like. We are running quite a bit of managed code during startup, so I would hope that adding one extra managed method should not be a big deal. We can iterate on what the wrapper should look like exactly to make it as unintrusive as possible and still support all required features.
> Dealing with IL-based projects.

It is ok to keep the IL-based tests in separate .dlls. However, we should allow multiple tests to exist in one .il file.
> For a merged test app there's no way to run a script between the test cases, in fact it's one of the main goals to get shell out of the picture and run test cases within the merged app with as little overhead and process launching / switching as possible.

This problem was solved for libraries tests using remote executor. It would be nice to use the same scheme here if possible. Also, we only have a few hundred of these tests that require env variables, etc. We can deal with them last during final cleanup. We do not need to agree on the exact solution now.
> I'm most worried about the performance of this step as it ultimately ends up feeding the 10K or more source files to the analyzer

The analyzer should be only running on one group at a time. I think we should be fine as long as we keep the group size under 1K.
Note that it is not unusual for libraries test .dlls to contain thousands of tests. For example, https://github.com/dotnet/runtime/tree/main/src/libraries/System.Runtime/tests contains 3_700+ tests, but it still compiles and executes in a very reasonable amount of time. I do not remember anybody complaining about System.Runtime tests compilation or execution speed hampering productivity.
> I still need to get my head around the second step as I don't yet see how to integrate it into the Roslyn analyzer / generator framework

The Roslyn generator should be responsible for generating the `Main` method of the entrypoint test .exe. Let's say my test source files look like this:
```csharp
public class MyTests
{
    [Fact]
    public static void Test1()
    {
        Assert.Equal(0, 0);
    }
}

public class OtherTests
{
    [Fact]
    public static void Test2()
    {
        Assert.Equal(42, 42);
    }
}
```
The source generator should produce a `Main` method that looks like this:
```csharp
[MethodImpl(MethodImplOptions.NoOptimization)]
static void Main(string[] args)
{
    string name = (args.Length > 0) ? args[0] : null;
    if (name == null || name == "MyTests.Test1") MyTests.Test1();
    if (name == null || name == "OtherTests.Test2") OtherTests.Test2();
}
```
Everything (the tests and the auto-generated `Main` method) gets compiled into one test .exe as part of a single Roslyn compilation.
Thanks Jan for your supportive response. Based on your feedback I believe that as a first step I should experiment with the Roslyn source generator to get an idea about its performance and ability to deal with larger groups of tests. As a first iteration I would look into partitioning corresponding to the existing xUnit wrappers i.e. two directory levels under the test binary root (e.g. JIT\Methodical). I actually think this is the biggest one with about 2K tests so its conversion should provide many insights into the subsequent work.
One other detail I mentioned higher up this thread but haven't emphasized recently is the debug / optimization matrix. For JIT tests in particular, many of these come with `d / do / r / ro` suffixes to denote debug vs. release and optimization switches. In my summer experiments I ended up generating four wrappers corresponding to each of the categories; I'll be happy to hear other ideas.
> I believe that as a first step I should experiment with the Roslyn source generator to get an idea about its performance and ability to deal with larger groups of tests.

For inspiration, you can also take a look at https://github.com/dotnet/runtime/tree/main/src/libraries/Common/tests/StaticTestGenerator . It does the same thing as what the source generator should do, except that it is a standalone tool.
Also, you can use the existing tests under libraries to test the source generator.
> For JIT tests in particular, many of these come with d / do / r / ro suffixes to denote debug vs. release and optimization switches. In my summer experiments I ended up generating four wrappers corresponding to each of the categories; I'll be happy to hear other ideas.

Sounds reasonable. If we are building a test in 4 different ways with the current scheme, we will need to build it in 4 different ways in the new scheme as well to maintain the same coverage.
@trylek if you have any questions about how to build a Roslyn source generator such that it runs efficiently, just shoot me a message and I should be able to help.
I think most of the following concerns have already been considered above, but let me write them down just to be sure:
Extending JKotas's example above, we could also leave a `Main` function with every test, but put it under `#ifdef`:
```csharp
public class MyTests
{
    [Fact]
    public static void Test1()
    {
        Assert.Equal(0, 0);
    }

#if STANDALONE
    public static int Main()
    {
        int result = 100; // default success code
        Test1(); // or `result = Test1();` for tests that return an `int`
        return result;
    }
#endif // STANDALONE
}
```
Thanks Bruce for your comments and suggestions.
> The wrapper cmd files handle various services, such as supporting (a) ilasm round-trip testing, (b) crossgen2 testing, (c) bailing out on some test configurations, (d) IL linker testing (?), (e) setting stress environment variables before a test is run. The new mechanism will need to handle all of this.

We already assume that some tests won't be mergeable. Similarly we could have a mode that wouldn't merge anything and basically run tests one by one like today; we could use this, at least in the middle term, for jobs exercising some of the special behavior you describe (ILASM / ILDASM roundtrip, IL linker testing, stress runs). Our current biggest goal is to provide some headroom to our lab capacity, which is currently stretched to the limit, and for this purpose it should be sufficient to optimize the most frequent runs, i.e. PR / CI and outerloop runs; jobs running a few times a week shouldn't make that much of a difference. In some cases like Crossgen2, merging tests should actually provide a perf gain of its own, as a larger Crossgen2 compilation is generally much more performant than a ton of tiny compilations incurring repeated managed startup, loading of the framework assemblies and similar semi-constant costs.
> I believe the current xunit wrappers handle per-test timeouts and core dumps. Presumably the auto-generated wrapper will handle this?

I don't think the XUnit wrappers deal with timeouts and dumps by themselves; the last time I looked they were just plain process launches marked with the [Fact] attributes. Any timeout / dump logic must be handled either by the XUnit console or by the test app itself. In this sense introducing a new form of wrapper for the merged tests shouldn't incur any fundamental regression of its own, even though you're certainly right that timeouts and such might need further tweaks. For dumps I would expect that the dump of the merged test app should be equally usable for failure investigations as dumps for single-test apps.
> Will test exclusions work the same as today? There are some per-test conditional exclusion attributes in the proj files, e.g., IlasmRoundTripIncompatible, GCStressIncompatible, UnloadabilityIncompatible. How do these get "attached" to the test? I think today these get baked into the wrapper cmd file. Do we partition tests into sets with precisely the same set of proj file attributes?

For some test project attributes we definitely do need to split differing tests, most notably for the DebugType and Optimize attributes heavily used by JIT tests. For the other properties, frankly I don't know yet. As I explained above, I can imagine that some of these tests remain running in the `STANDALONE` mode; gradual incorporation of these special test modes (aimed at further perf improvements and Helix optimization) would probably require adding support for them to the wrapper generator, e.g. by means of conditional statements around the individual test case executions.
> Will the existing issues.targets file exclusion mechanism still be used? Similarly, how will tests get marked Pri-0 versus Pri-1?

For now I expect that both issues.targets and P0 / P1 split will remain in place. I am somewhat hopeful that perhaps the optimized execution thanks to grouping may make the tests so much more performant that we'll be able to get rid of the P0 / P1 distinction or perhaps separate out just a few long running tests and run all the rest on PR / CI runs. I'm not yet 100% sure this is realistic but I believe it would be a substantial boost for our code quality monitoring. I believe longer-term discussions about upsides and downsides of issues.targets and its possible replacements are healthy but far beyond the scope of this particular proposal that is already quite complicated.
> Where is the test run parallelism going to live? In the (generated) test wrapper program?

That is an interesting question. I hope not. Higher up the thread JanK and I had somewhat different opinions about the expected final number of merged apps, where IIRC I proposed several hundred and JanK several dozen. We have yet to identify the sweet spot, but "my" guess (let's say 400 wrappers) corresponds to something like 25 tests per wrapper in P1 mode and 6 tests per wrapper in P0 mode, while "JanK's" guess (let's say 40 wrappers) corresponds to about 250 tests per wrapper in P1 mode and 60 tests per wrapper in P0 mode. Assuming the sweet spot is somewhere between these two, there's still a lot of wrappers to run, so that's where I'd expect the parallelism to occur. Ideally I'd love to initially run the new wrappers using the existing xunit console, but I'm not 100% sure that is doable; maybe we'll need some new runner that will just use the xunit console for the legacy wrappers, so that would be the place for the parallelism, but I wouldn't expect it to be much more complicated than when, for instance, R2RTest compiles the framework assemblies using Crossgen2 in parallel during the population of CORE_ROOT.
> If the test annotations look like xunit, could you actually use xunit to run the tests (if you wanted to)?

That would be just perfect, I think that even if we don't manage to make this work now, we shouldn't make any design decisions that would prevent implementing this in the future.
For your example with the STANDALONE conditional compilation, I'm somewhat worried that this additional bit, while mostly boilerplate code, incurs an authoring (and to some extent maintenance) cost on every single test; after all, if we spend so much energy making our test cases look "dotnet test"-runnable even though the actual underlying infrastructure will likely be different for performance reasons, it's worth noting that such dotnet-test-style tests don't have an `#if STANDALONE` Main method either.
My current thinking is that the wrapper generation needs to be a three step process.
In the first step we traverse the test tree and obtain metadata about the test projects and [Fact] methods; for C# tests we can implement a Roslyn analyzer for this purpose, for ILPROJ tests we'll either need a custom IL source analyzer or we'll just compile the tests using ILASM and analyze the output MSIL (ending up with something like "compile-time reflection on the [Fact] attribute"). In both cases we basically end up with a list of all the available test projects, methods representing their test cases and additional info about the project files (test priority, DebugType / Optimize flags and possibly additional properties). Ideally we should materialize this list somewhere in the build outputs, possibly as a JSON or XML file, and publish it as an artifact to be available for investigation of issues with the new infra.
In the second step we use the list obtained in the first step to actually partition tests into the grouped apps using the collected metadata; exact behavior of this step may be controllable by additional options like the abovementioned no-merge mode, limits for the number of merged tests per OS / architecture / stress mode etc. Again, the generated test grouping should be materialized and published as a JSON / XML metadata file.
In the third step we actually emit the new wrappers and cmd / sh scripts for running them; in some cases like the GC stress environment variables, we should easily be able to set these variables globally at the top of the scripts, much like we do in today's cmd / sh test execution scripts. In fact, while I have yet to verify if this is actually doable, we may be able to generate these scripts using exactly the same code we're using for generating the individual test runner scripts today - in other words, at some level in the test build infra we'd basically pretend that the generated wrappers are just additional tests to build and run, albeit with somewhat special properties. For C# test projects this step could probably be implemented as a Roslyn source generator in accordance with JanK's suggestion; however, I doubt that's doable for ILPROJ tests, so for now I'm thinking about just using the Roslyn analyzers for collection of the primary test metadata and using some common but custom tooling to emit the wrappers for both categories of tests.
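For the materialized test list mentioned in the first step, I imagine something along these lines (an assumed shape just to make the discussion concrete - the actual property set is to be determined):

```csharp
// Assumed shape of the per-test metadata produced by the first step and
// consumed by the grouping and wrapper-generation steps.
using System.Collections.Generic;
using System.Text.Json;

public sealed record TestCaseInfo(
    string ProjectPath,     // csproj / ilproj path relative to src\tests
    string AssemblyName,
    string TypeName,
    string MethodName,      // the [Fact]-attributed method
    int Priority,           // CLRTestPriority
    string DebugType,       // input to the d / do / r / ro build flavor
    bool Optimize);

public static class TestManifest
{
    public static string Serialize(IEnumerable<TestCaseInfo> tests) =>
        JsonSerializer.Serialize(tests, new JsonSerializerOptions { WriteIndented = true });
}
```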
Thanks, @trylek! I suggested the STANDALONE `Main` option as a way to achieve the JIT goal of having the absolute minimum amount of JIT-ing before a test case is actually run, so as to not even run the generated wrapper `Main`, which I anticipate could be quite large.

Looking forward to seeing how this all progresses.
I have a slightly different proposal for how to organize the test system that (at least to me) feels lighter weight than @trylek's proposal:
- Convert the tests to use XUnit `[Fact]` attributes for each test method.
- Use the `Microsoft.DotNet.XUnitExtensions` package (widely used by the libraries team) to disable tests, either with special traits (for general cases like `GCStressIncompatible` or `JitStressIncompatible`), or with mechanisms similar to the `PlatformDetection` type and `ConditionalFact`/`ConditionalTheory` in the libraries test tree.
- For assembly-level incompatibilities like `CrossgenIncompatible` or `IlasmRoundTripIncompatible`, we would use assembly attributes to annotate the incompatibilities.
- All `issues.targets` test disable mechanisms should move to follow the same mechanism as the libraries tests with `[ActiveIssue]` attributes and the like.

At this point, we will still have all of the test assemblies as they exist today, but all tests have been changed to use XUnit-style attributes. As a result, we will automatically have the support for the Optimized/Release/Debug combinations that the JIT team uses. Now we need to define how the runner will work:
The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library. The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator following this model:
As the runner itself is defined in C#, we don't need to do any post-processing of IL assemblies produced from ilproj projects.
From this point onwards, there would be at least 2 modes for the generated code to support:
- "Standalone-test mode": a generated `Main` method that calls all XUnit-attributed methods directly, checking any conditional states (in many cases, a simple `if` statement like "is Windows?") before calling a method. This mode does not have support for test filtering on name or for trait filtering, and would satisfy the JIT team's request to be able to run test assemblies individually, similar to today.
- A generated `Main`
method that supports test filtering based on test name, trait filtering based on trait names, and checking any conditional states. This mode would be the primary use case for CI, and the mode that the new runner project would use.There would be a post-build step to analyze each assembly in the output. This step would be used to determine which files are eligible for crossgen2/ilasm roundtrip testing. This analysis step would spit out an rsp file for each scenario where each line in the file is the name of an assembly file that supports the test scenario. Then the crossgen/ilasm round trip pre-test steps would be driven by external scripts that we call into and pass the rsp file. Once Roslyn supports generating non-source files from a source generator, these rsp files can be generated by the source generator instead of a post-build step.
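Building on the MyTests/OtherTests example earlier in the thread, the second mode's generated entrypoint might look roughly like this (a sketch under my own assumptions about how name and trait checks would be encoded, not the actual generator output; "WindowsOnly" is a made-up trait name):

```csharp
// Sketch of a generated Main supporting name filtering plus a primitive trait
// check. MyTests / OtherTests refer to the example classes shown earlier.
using System;

public static class GeneratedTestRunner
{
    public static int Main(string[] args)
    {
        string nameFilter = args.Length > 0 ? args[0] : null;
        int failures = 0;

        void Run(string fullName, string[] traits, Action test)
        {
            if (nameFilter != null && fullName != nameFilter)
                return;
            if (Array.IndexOf(traits, "WindowsOnly") >= 0 && !OperatingSystem.IsWindows())
                return;
            try { test(); }
            catch (Exception e)
            {
                failures++;
                Console.Error.WriteLine($"{fullName} failed: {e}");
            }
        }

        Run("MyTests.Test1", Array.Empty<string>(), MyTests.Test1);
        Run("OtherTests.Test2", new[] { "WindowsOnly" }, OtherTests.Test2);

        return failures == 0 ? 100 : 101;
    }
}
```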
To handle setting stress environment variables, we would use CoreRun's new dotenv file support. All of the `COMPlus_` environment variables will be set through a dotenv file, which we will directly pass to corerun.
At this point, we will only generate the .cmd/.sh scripts by default for the new XUnit harness projects. These .cmd/.sh files should be able to handle all of our use cases with the above design.
To enable running standalone tests as we do today, we would update the build system to support a `BuildAsStandalone=true` MSBuild property. When set, this property would cause a test assembly to reference the "harness generator" in the aforementioned "Standalone-test mode" and to have the rsp, cmd, and sh scripts generated for the harness. This should cleanly support the JIT team's scenario of "minimal IL run before a test" for dev innerloop scenarios.
To support this in il-scenarios, we will need to use a different mechanism. One option would be to have a simple "repro" project similar to the "DllImportGeneratorSample" project that enables quickly reproducing test failures by providing a simple project that can be changed to call whatever test is required.
With this basic model, out-of-proc tests will require manually writing `[Fact]`-attributed methods that will use `RemoteExecutor` or a similar mechanism to support out-of-process tests. If we feel that this is not a good design, we could add support for an `OutOfProcFact` attribute that the source generator would recognize and automatically write the code to execute the test out-of-proc.
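For reference, a hedged sketch of the manual pattern (RemoteExecutor.Invoke is the existing API from the Microsoft.DotNet.RemoteExecutor package used by the libraries tests; the test body here is a placeholder):

```csharp
// Illustrative out-of-proc test: the lambda runs in a freshly launched child
// process, so a hard crash or machine-state change does not take down the rest
// of the merged test app.
using System;
using Microsoft.DotNet.RemoteExecutor;
using Xunit;

public static class OutOfProcSampleTests
{
    [Fact]
    public static void RunsInChildProcess()
    {
        RemoteExecutor.Invoke(static () =>
        {
            // placeholder body executed in the child process
            Console.WriteLine(Environment.ProcessId);
        }).Dispose();
    }
}
```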
If we decide that support for `dotnet test` is interesting, we can add another mode of the generator to output a VSTest-adapter implementation that is hard-coded to the tests that are available in the harness. This would enable easier execution in a VS/VSCode scenario, and would be an optional scenario for people who like to run tests directly. As `dotnet test` has built-in dump support, this might be a useful mechanism to get automatic dump reporting with timeouts without a wrapper project.
Hi Jeremy!
Thanks for your detailed feedback. I believe your "modified proposal" has many interesting characteristics we should use. More detailed comments follow:
> All issues.targets test disable mechanisms should move to follow the same mechanism as the libraries tests with [ActiveIssue] attributes and the like.

I like better aligning runtime tests with library tests. I just think that refactoring of issues.targets is more or less orthogonal to the test grouping, so we can treat it as an optional step - if we don't have enough time to get everything done, this can wait for a later cleanup step. The file has 3K lines, i.e. about 1K excludes, many of which don't have issues assigned; I would say it's at least a week of work provided you can somehow automate the conversion - doing this manually would take a month and be terribly error prone.
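For anyone not familiar with the libraries conventions, the converted form would presumably look something like this (the attribute names come from Microsoft.DotNet.XUnitExtensions and the libraries' PlatformDetection helper, but the concrete arguments and issue URL are invented for illustration):

```csharp
// Illustration of the target style: exclusions expressed on the test itself
// instead of issues.targets entries.
using System;
using Xunit;

public static class ConvertedExclusionSamples
{
    // Equivalent of an issues.targets exclude tracked by a GitHub issue:
    [Fact]
    [ActiveIssue("https://github.com/dotnet/runtime/issues/00000")]
    public static void TestDisabledByOpenIssue() { }

    // Equivalent of a platform-conditional exclude (assumes the libraries'
    // PlatformDetection helper is also referenced by the runtime tests):
    [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsWindows))]
    public static void WindowsOnlyTest() { }
}
```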
> The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library.

This basically means that each test would continue compiling using a separate Roslyn execution into a separate library; I believe higher up this thread JanK called that out as wasteful compared to one big compilation.
> The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator

I don't think we can use the Roslyn analyzers for IL source code in ILPROJ tests (there are about 4K such tests in the tree, in other words about 40%, just to point out they're not a tiny fraction of the test set); for these we need to figure out something else. Apart from removing the entrypoint, renaming the Main method and marking it with the Fact attribute, there's one more transformation that needs to be made in many of the ilproj tests:

The class in which Main resides is often internal, and making it public sometimes has annoying ripple effects making the transformation complicated. I think that the easiest way to fix this is to add one more public static class to the bottom of the file with a new public method that would then call the pre-existing internal Main. Using this transformation we could also standardize the naming of the test method in ilproj tests - e.g. having it the same as the name of the test, provided we clean them up to be unique across the tree (I think they mostly are already) - with the caveat that this would prevent having multiple test cases in a single IL source file (which is not a regression, we don't have this option today); generating code to call these test entrypoints would then become trivial.
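Expressed in C# just to show the shape (the actual change would of course be authored in IL, and the names here are illustrative):

```csharp
// Shape of the proposed ilproj transformation: keep the original internal
// entrypoint untouched and add a public forwarder type with a [Fact] method.
using Xunit;

internal static class LegacyIlTest
{
    // Stand-in for the pre-existing internal Main of the IL test.
    internal static int Main()
    {
        return 100; // original success convention
    }
}

public static class LegacyIlTest_TestEntryPoint
{
    [Fact]
    public static void Run()
    {
        Assert.Equal(100, LegacyIlTest.Main());
    }
}
```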
For resolving OS, platform and runtime flavor-incompatible tests we need to keep in mind that in the PR pipeline the managed tests are compiled only once for all combinations and it's just the test wrapper creation that today takes place in the run phase that resolves issues.targets entries pertinent to the targeting platform and filters out the disabled tests. I may be missing some fine points in your plan but the idea about the wrapper project having ProjectReferences to all the tests either means that we'd need to move managed test build to the run phase (i.e. ultimately execute it 30 times instead of once) or that we'd need to generate dynamic code in the wrapper that would perform the relevant checks on the Helix client at runtime.
For the "Standalone test support", am I right to understand you're proposing to basically emit a separate wrapper for each test? While certainly doable that would inject 10K extra C# sources (albeit tiny ones) to the build, I'm not sure about perf implications of this. For one-off local experiments with individual tests it probably shouldn't matter but if we wanted to retain standalone tests in some lab pipelines (e.g. stress ones where the tests are timing out already without merging), we'd need to measure the perf implications.
For "Dotnet test support", that sounds great to me overall; similar to my comments regarding "issues.targets" I think that for now it's sufficient to know that the planned transformation makes this doable and actually doing that can wait for later if at some point we're out of time budget for the infra revamp.
Apart from these comments and clarifying questions, your proposal (I'm not sure if I should call it a counterproposal as I believe that we agree in many aspects and rest is mostly clarifying technical details) sounds great to me and I'll be happy to work with you on fleshing out its remaining details so that we can start the actual work ASAP.
Thanks
Tomas
> Hi Jeremy!
>
> Thanks for your detailed feedback. I believe your "modified proposal" has many interesting characteristics we should use. More detailed comments follow:
>
> > All issues.targets test disable mechanisms should move to follow the same mechanism as the libraries tests with [ActiveIssue] attributes and the like.
>
> I like better aligning runtime tests with library tests. I just think that refactoring of issues.targets is more or less orthogonal to the test grouping, so we can treat it as an optional step - if we don't have enough time to get everything done, this can wait for a later cleanup step. The file has 3K lines, i.e. about 1K excludes, many of which don't have issues assigned; I would say it's at least a week of work provided you can somehow automate the conversion - doing this manually would take a month and be terribly error prone.

Yes, we could push this work off until a later date if needed.
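For reference, a converted test using the libraries-style exclusion mechanism might look roughly like this (the class name and issue URL are placeholders; `[ActiveIssue]` and `TestPlatforms` come from the xunit extensions used by the libraries tests):

```csharp
// Hypothetical converted JIT test showing attribute-based exclusion instead of an
// issues.targets entry; the issue URL and platform filter below are placeholders.
using Xunit;

public static class I4Div_Sample
{
    [ActiveIssue("https://github.com/dotnet/runtime/issues/00000", TestPlatforms.Windows)]
    [Fact]
    public static void Test()
    {
        Assert.Equal(42, 84 / 2);
    }
}
```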
The runner will be a regular C# project with hard-coded (possibly with MSBuild item globs) ProjectReferences to each test library.
This basically means that each test would continue compiling via a separate Roslyn execution into a separate library; I believe higher up this thread JanK called that out as wasteful compared to one big compilation.
This was meant to be in alignment with your suggested approach. Alternatively, we can combine many tests into a single assembly since my generator suggestion also looks in the current assembly to generate the test execution calls.
The runner project itself will be entirely (or almost entirely) source generated using a Roslyn source generator
I don't think we can use the Roslyn analyzers for the IL source code in ILPROJ tests (there are about 4K such tests in the tree, i.e. about 40%, just to point out they're not a tiny fraction of the test set); for these we need to figure out something else. Apart from removing the entrypoint, renaming the Main method and marking it with the Fact attribute, there's one more transformation that needs to be made in many of the ilproj tests:
The class in which Main resides is often internal, and making it public sometimes has annoying ripple effects that complicate the transformation. I think the easiest fix is to add one more public static class at the bottom of the file with a new public method that calls the pre-existing internal Main. With this transformation we could also standardize the naming of the test method in ilproj tests - e.g. making it match the name of the test, provided we clean the names up to be unique across the tree (I think they mostly are already) - with the caveat that this would prevent having multiple test cases in a single IL source file (which is not a regression, we don't have this option today); generating code to call these test entrypoints would then become trivial.
If we use a C# assembly to be the test runner to wrap the IL tests, we can use the Roslyn source generator/analyzer APIs on the C# runner and have that runner call the assemblies that were compiled from ilproj projects. As mentioned in the standalone case, having a "directly execute the IL assembly without calling through any C#" mechanism would require some additional (likely manual in v1) work on the part of the developer who wants to run an IL test individually.
For resolving OS, platform and runtime flavor-incompatible tests we need to keep in mind that in the PR pipeline the managed tests are compiled only once for all combinations, and it's just the test wrapper creation, which today takes place in the run phase, that resolves the issues.targets entries pertinent to the targeted platform and filters out the disabled tests. I may be missing some fine points in your plan but the idea about the wrapper project having ProjectReferences to all the tests either means that we'd need to move the managed test build to the run phase (i.e. ultimately execute it 30 times instead of once) or that we'd need to generate dynamic code in the wrapper that would perform the relevant checks on the Helix client at runtime.
My suggestion would be to generate dynamic code in the wrapper to perform the relevant checks and to keep our managed build to one platform, at least for v1. If we eventually merge the many test assemblies (as mentioned above) and get to a relatively quick managed test build, it might be worthwhile to build the managed tests in multiple legs and pre-resolve some of the OS/platform/runtime-flavor/arch checks.
This would make our "test run" phase only have to copy the native bits to the right locations and run the tests, which would help simplify those steps as well.
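As a sketch of what such generated runtime checks could look like (the `TestExclusions` class and its members are made up for illustration, not an agreed API):

```csharp
// Illustrative wrapper-side code that resolves platform / runtime-flavor exclusions
// at runtime on the Helix client instead of at wrapper-generation time.
using System;
using System.Runtime.InteropServices;

public static class TestExclusions
{
    // One common way to detect Mono; treat it as an assumption for this sketch.
    public static bool IsMonoRuntime { get; } =
        Type.GetType("Mono.RuntimeStructs") != null;

    // The generator would emit one such predicate per conditionally-excluded test;
    // the conditions below are placeholders.
    public static bool ShouldRun_JIT_Methodical_divrem_i4div()
    {
        if (RuntimeInformation.ProcessArchitecture == Architecture.Arm)
            return false;
        if (IsMonoRuntime)
            return false;
        return OperatingSystem.IsWindows() || OperatingSystem.IsLinux();
    }
}
```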
For the "Standalone test support", am I right to understand you're proposing to basically emit a separate wrapper for each test? While certainly doable that would inject 10K extra C# sources (albeit tiny ones) to the build, I'm not sure about perf implications of this. For one-off local experiments with individual tests it probably shouldn't matter but if we wanted to retain standalone tests in some lab pipelines (e.g. stress ones where the tests are timing out already without merging), we'd need to measure the perf implications.
For the "Standalone test support" case, that is an opt-in scenario where a developer would build an individual test in that mode (using something like ./dotnet build src/tests/path/to/test.csproj /p:BuildAsStandalone=true
, which is a slight modification of a common workflow within at least the Interop team). This would be off by default, so we wouldn't be injecting any extra sources into the build.
You are correct that this may have some implications in the lab, but in those cases we'd be generating all of the small source files instead of (through the same, if not more involved, calculations) generating the larger wrappers. Doing a perf investigation sounds reasonable to me.
For "Dotnet test support", that sounds great to me overall; similar to my comments regarding "issues.targets" I think that for now it's sufficient to know that the planned transformation makes this doable and actually doing that can wait for later if at some point we're out of time budget for the infra revamp.
Yes, this is definitely a stretch goal and is not required for v1.
Apart from these comments and clarifying questions, your proposal (I'm not sure if I should call it a counterproposal as I believe that we agree in many aspects and the rest is mostly clarifying technical details) sounds great to me and I'll be happy to work with you on fleshing out its remaining details so that we can start the actual work ASAP.
I'm looking forward to working together on this project! 😁
The class in which Main resides is often internal, and making it public sometimes has annoying ripple effects that complicate the transformation.
What are the ripple effects that you are worried about? I would expect it to be extremely rare in src/tests.
issues.targets
Yes, we could push this work off until a later date if needed.
If we push this work off until a later date, how are the tests going to be disabled in the meantime? Is the source generator going to read issues.targets somehow?
@jkotas - Hmm, in my summer experiment it looked like making the method and class public often made additional fields or classes start complaining about inconsistent visibility. I have now tried making all Main methods in our `.il` source files public (not the classes for now) and the Pri1 test build passed, so hopefully I was just doing something wrong. One way or another, having a consistent scheme for marking the test entrypoint[s] in ilproj tests might be beneficial in the absence of Roslyn analyzer capabilities.
If we push off the issues.targets refactoring, I think the easiest thing we could do in the run legs would be to just emit the list of project exclusions into some text file that would then be read by the test wrapper in Helix and used for test filtering; that should be relatively trivial as the test build already has this information readily available.
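A minimal sketch of that wrapper-side filtering (the file name and helper are made up for illustration):

```csharp
// Hypothetical block-list filter applied by the generated test wrapper on the Helix
// client; "ExcludedTests.txt" is a placeholder name for the emitted exclusion list.
using System;
using System.Collections.Generic;
using System.IO;

public static class TestFilter
{
    private static readonly HashSet<string> s_excluded = LoadExclusions();

    private static HashSet<string> LoadExclusions()
    {
        string path = Path.Combine(AppContext.BaseDirectory, "ExcludedTests.txt");
        return File.Exists(path)
            ? new HashSet<string>(File.ReadAllLines(path), StringComparer.OrdinalIgnoreCase)
            : new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    }

    // The wrapper would consult this before invoking each test entrypoint, using e.g.
    // the original project path as the test identifier.
    public static bool ShouldRun(string testName) => !s_excluded.Contains(testName);
}
```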
Another option for a half-step for issues.targets would be to run it through a mechanical transform into a set of exclusions that point at either the `.dll` or the `.csproj` files (which shouldn't be too difficult, we've generally followed the same pattern for writing all of the exclusions over the years). Then we could do the text file as @trylek mentioned. Alternatively, if we decide we still want to build the test wrappers per-target, we could have the transformed `issues.targets` file instead point to the `.csproj` files that should be excluded from the ProjectReference collection.
We have a few different options here that we can try out if the cost of generating the ActiveIssue attributes is too high.
@jkoritzinsky - I think we're mostly on the same page, comments on some of the remaining bits requiring further work:
For the ilproj tests I'm still not sure if we're in agreement. You say that If we use a C# assembly to be the test runner to wrap the IL tests, we can use the Roslyn source generator/analyzer APIs on the C# runner and have that runner call the assemblies that were compiled from ilproj projects; I generally agree with that but I think the problematic point is how exactly you "call the assemblies that were compiled from ilproj projects". If you add support for merging C# code of multiple tests into a single assembly, you basically need to skip their own C# compilation step (as that will happen as part of the larger merged build). In other words, the wrapper needs generating before or instead of the managed test build. [By the "before" variant I mean that the wrapper is somehow injected into the test list, the "subtests" themselves get skipped in the compilation and the wrapper takes the normal "managed test build" path including creation of the cmd / sh runners using the existing scripts.] It is then the question how the wrapper build locates the ilproj test entrypoints. If it was to look at the ILASM-produced MSIL, we'd need to compile ilproj tests before the csproj tests to achieve the proper ordering; alternatively it must either carry out some manual analysis of the IL source code or count on standardized test entrypoint naming.
I agree with generating dynamic code using runtime checks and seeing whether at some later point we may be able to make managed test build so fast that it can be put back in the run legs, thus enabling better test and test artifact pruning before sending the run to Helix.
For the "standalone mode", I think we might actually want to start with that one - implementing something very simple to just allow running the "dotnet test"-style refactored tests one by one; that would be sufficient for fanning out the test refactoring work to the different teams in early November; in parallel we could dive deeper into the test merging / splitting logic. One other interesting bit that occurred to me in relationship with JanK's issues.targets question is that according to our earlier discussions with the Mono team it would be helpful to have a more dynamic test grouping based on the targeting platform - the browser WASM interpreter, the various Android emulators and such just have different perf and memory characteristics so it would be nice to introduce new degrees of freedom there. If we implemented logic in the generated wrappers (or rather probably in some of their dependencies) to use some emitted test list as the "block-list", we could likely similarly support some "allow-list". The run leg in AzDO could then generate a series of such lists representing the test splitting based on the target and use it to drive the Helix runs.
I don't think I have any other concerns with your suggestions. I have locally refactored my summer experiment into a tool I'm now using to analyze and rewrite the ilproj tests, my hope is to provide it to the JIT team and to the TypeLoader team (which is actually most likely ours) to use for semi-automatic conversion of these tests. I'm also trying to use it to prototype merged execution of a large number of the ilproj tests to let me provide an initial estimate of the expected savings for Jeff. Please let me know if you have cycles to actually participate in the implementation work here; technically speaking I'm OOF for the second half of next week as I'm using our state holiday next Thursday for a prolonged weekend in the country with friends. Based on our finalized design I can start some scripting / coding work early next week and if you could contribute in its second half, we might have something good enough for fanning out the conversion in the week of November 1st which is what I promised to Jeff on Monday and I still believe is doable.
I think we're mostly on the same page, comments on some of the remaining bits requiring further work:
- For the ilproj tests I'm still not sure if we're in agreement. You say that If we use a C# assembly to be the test runner to wrap the IL tests, we can use the Roslyn source generator/analyzer APIs on the C# runner and have that runner call the assemblies that were compiled from ilproj projects; I generally agree with that but I think the problematic point is how exactly you "call the assemblies that were compiled from ilproj projects". If you add support for merging C# code of multiple tests into a single assembly, you basically need to skip their own C# compilation step (as that will happen as part of the larger merged build). In other words, the wrapper needs generating before or instead of the managed test build. [By the "before" variant I mean that the wrapper is somehow injected into the test list, the "subtests" themselves get skipped in the compilation and the wrapper takes the normal "managed test build" path including creation of the cmd / sh runners using the existing scripts.] It is then the question how the wrapper build locates the ilproj test entrypoints. If it was to look at the ILASM-produced MSIL, we'd need to compile ilproj tests before the csproj tests to achieve the proper ordering; alternatively it must either carry out some manual analysis of the IL source code or count on standardized test entrypoint naming.
My idea is that the wrapper build will reference all test assemblies it includes with a regular `<ProjectReference>` element in the test wrapper's csproj file. When implemented in this manner, it doesn't matter if the assembly that actually defines the test is written in C# or in IL since the test wrapper project will inspect it as a .NET assembly reference (which does not disambiguate between the two). The test discovery process would be "look for all accessible methods with XUnit test attributes on them in this project's source and in all referenced assemblies", which would include the IL assemblies. As we move more in the direction of merging test suites together, we would move some of the C# tests into the wrapper project itself. For the IL tests, we would start merging them into a single IL assembly, but the runner would still be a separate C# test project.
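A minimal reflection-based sketch of that discover-and-invoke shape (the source generator would presumably emit direct calls instead, but the logic is the same; the assembly names are placeholders):

```csharp
// Illustrative runner: find public static [Fact] methods in a set of referenced test
// assemblies (C#- or IL-compiled alike) and invoke them. Assembly names are placeholders.
using System;
using System.Linq;
using System.Reflection;
using Xunit;

public static class MergedTestRunner
{
    public static int Main()
    {
        int failures = 0;

        foreach (string name in new[] { "i4div_cs_do", "i4rem_il_do" }) // placeholders
        {
            Assembly asm = Assembly.Load(name);
            var tests = asm.GetTypes()
                .SelectMany(t => t.GetMethods(BindingFlags.Public | BindingFlags.Static))
                .Where(m => m.GetCustomAttribute<FactAttribute>() != null);

            foreach (MethodInfo test in tests)
            {
                try
                {
                    test.Invoke(null, null);
                    Console.WriteLine($"PASS: {test.DeclaringType}.{test.Name}");
                }
                catch (TargetInvocationException ex)
                {
                    failures++;
                    Console.WriteLine($"FAIL: {test.DeclaringType}.{test.Name}: {ex.InnerException}");
                }
            }
        }

        return failures == 0 ? 100 : 1; // 100 is the conventional CoreCLR pass exit code
    }
}
```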
- I agree with generating dynamic code using runtime checks and seeing whether at some later point we may be able to make managed test build so fast that it can be put back in the run legs, thus enabling better test and test artifact pruning before sending the run to Helix.
:+1:
- For the "standalone mode", I think we might actually want to start with that one - implementing something very simple to just allow running the "dotnet test"-style refactored tests one by one; that would be sufficient for fanning out the test refactoring work to the different teams in early November; in parallel we could dive deeper into the test merging / splitting logic.
Yes, we could start with the standalone mode as it requires fewer features and would provide a comparable experience to the current test system as tests move to use the XUnit attributes.
One other interesting bit that occurred to me in relationship with JanK's issues.targets question is that according to our earlier discussions with the Mono team it would be helpful to have a more dynamic test grouping based on the targeting platform - the browser WASM interpreter, the various Android emulators and such just have different perf and memory characteristics so it would be nice to introduce new degrees of freedom there. If we implemented logic in the generated wrappers (or rather probably in some of their dependencies) to use some emitted test list as the "block-list", we could likely similarly support some "allow-list". The run leg in AzDO could then generate a series of such lists representing the test splitting based on the target and use it to drive the Helix runs.
We could either use a model similar to the `PlatformDetection` model that the libraries team uses, or we could use XUnit `Trait`s to implement more fine-grained control that the owners of different legs could enable or disable at their preference (allowing both "block-list" and "allow-list" behavior). If we wanted to do more of a load-balancing thing to better size our work items, then I agree, using a list of tests might be a better option.
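For illustration, trait-based grouping could look something like this (the trait name and values are placeholders, not a proposed taxonomy); the individual legs could then select tests via the runner's trait filters (e.g. `-trait`/`-notrait` on the xunit console runner) or an equivalent `dotnet test` filter:

```csharp
// Hypothetical use of xUnit traits to let different legs include or exclude
// categories of tests; the trait name/values below are placeholders.
using Xunit;

public static class DivRemSamples
{
    [Fact]
    [Trait("TargetClass", "Lightweight")]     // e.g. cheap enough for WASM / emulators
    public static void SmallDivTest()
    {
        Assert.Equal(3, 7 / 2);
    }

    [Fact]
    [Trait("TargetClass", "FullRuntimeOnly")] // e.g. too heavy for constrained targets
    public static void LargeDivLoopTest()
    {
        for (int i = 1; i < 1_000_000; i++)
            Assert.Equal(0, i / int.MaxValue);
    }
}
```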
I don't think I have any other concerns with your suggestions. I have locally refactored my summer experiment into a tool I'm now using to analyze and rewrite the ilproj tests, my hope is to provide it to the JIT team and to the TypeLoader team (which is actually most likely ours) to use for semi-automatic conversion of these tests. I'm also trying to use it to prototype merged execution of a large number of the ilproj tests to let me provide an initial estimate of the expected savings for Jeff. Please let me know if you have cycles to actually participate in the implementation work here; technically speaking I'm OOF for the second half of next week as I'm using our state holiday next Thursday for a prolonged weekend in the country with friends. Based on our finalized design I can start some scripting / coding work early next week and if you could contribute in its second half, we might have something good enough for fanning out the conversion in the week of November 1st which is what I promised to Jeff on Monday and I still believe is doable.
I believe I'll have cycles to work on this next week. For later weeks, I should still have time to work on the Roslyn source generator for test execution (I've gotten quite familiar with the workflow for source generators with the DllImportGenerator project).
One other interesting bit that occurred to me in relationship with JanK's issues.targets question is that according to our earlier discussions with the Mono team it would be helpful to have a more dynamic test grouping based on the targeting platform - the browser WASM interpreter, the various Android emulators and such just have different perf and memory characteristics so it would be nice to introduce new degrees of freedom there.
What is the estimate for the largest group size that still works everywhere? I would expect that groups of several hundred average C# tests should be fine in any environment.
It would be best to stick with hardcoded logical groups to keep things simple.
@fanyang-mono / @naricc, can you please respond to JanK's question? I don't have sufficient experience in this space.
OK, so let's say there's a v0 comprising the standalone runner so that we can start gradually merging in PRs switching the individual tests over to xUnit style. I believe the next step is to define what exactly that entails:
As a first iteration (before it's proven unworkable) I propose marking the new-style tests as `CLRTestKind=BuildAndRun` and `OutputType=Library`. Please let me know if you see off the top of your head that this won't work.
For now I assume that we switch over all the tests to have some `int Entrypoint(string[])`; that's what I'm experimenting with for the ilproj tests anyway. Let me know if you have other suggestions.
The Entrypoint will be marked with the [Fact] attribute. We need to tweak the props or whatever other files to make the XUnit attributes available for each test build.
What exactly will the test wrapper generate for each test, and how is it going to be used in the cmd / sh runner scripts?
We can emit a three-line C# wrapper next to each test implementing the Main method by calling into the [Fact] method in the test assembly. We may need to emit a new test project comprising the generated wrapper and the ProjectReference to the original project; however, that's going to incur one extra Roslyn compilation and one extra assembly load at runtime for each test, i.e. some perf degradation. We may be able to justify this in the broader context of the regression being just temporary; in fact we'll likely improve the grouping and perf in parallel with the test transformation process, so the gradual improvements may cancel out the temporary regressions.
We can emit the wrapper directly as the MSIL EXE using the System.Reflection.Metadata API; it's literally a three-line method (ldarg.0, call, ret). This would likely get rid of the additional Roslyn costs and potential ordering issues regarding the managed build, but the additional assembly load at runtime would remain.
We can emit the wrapper project as a union of the original project and the generated wrapper source file, thus pioneering the merged test execution to some extent. That would get rid of most of the extra Roslyn cost and likely of all the runtime costs.
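To make the first of the three options concrete, the generated per-test wrapper would be roughly this shape (all names are placeholders; the `TestClass` stub stands in for the referenced test project):

```csharp
// Hypothetical per-test standalone wrapper (option 1): a Main that simply forwards to
// the [Fact]-marked entrypoint in the test assembly. Everything here is illustrative.
using Xunit;

public static class TestClass            // stand-in for the class the test project defines
{
    [Fact]
    public static int Entrypoint(string[] args)
    {
        Assert.Equal(2, 10 / 5);
        return 100;                      // 100 conventionally means "pass"
    }
}

internal static class GeneratedTestWrapper
{
    private static int Main(string[] args) => TestClass.Entrypoint(args);
}
```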
Another interesting question is how to incorporate the actual build of these wrappers into the test build pipeline. I believe that my recent work on cleaning up the test build scripts has given us a solid cross-plat foundation we can inject the new work into.
As it seems likely we'll need to generate new test projects, at some point we'll probably need to replace the original projects with the transformed projects.
On the one hand I think it would be great if we could delegate most of the cmd / sh runner logic for the wrappers to the existing `src/tests/Common/CLRTest.*.targets` scripts (just to avoid extra work in reimplementing all of those), but there's a nesting bit I don't see how to resolve - I think that the "new wrappers" cannot go through the pre-existing xUnit console logic because they emit their xUnit results XML in a different manner; please let me know if you believe we might be able to consolidate these somehow, at least in the short term. I think that in a way it would be much simpler if we could start off with something like "add-ons to the existing xUnit wrappers" rather than a completely new beast requiring immediate adjustments to various parts of the infra.
As I mentioned before, if we want to locate the [Fact] attributes on ILPROJ assemblies via the Roslyn analyzer, in other words probably some incarnation of PEReader, we need to sequence building the wrappers after building the primary source code (in this case assembling the IL source code into the MSIL DLLs using ILASM). That would however make it more challenging to reuse the existing managed test build logic to generate the cmd / sh scripts. We need to decide whether there's real value in pretending the wrappers are tests themselves in the test build sense, vs. building them separately and likewise separately generating their cmd / sh runner scripts, vs. building them in a separate phase but trying to hijack the existing cmd / sh runner script generators to do their job.
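Rough sketch of that build-time discovery step over ILASM-produced assemblies using System.Reflection.Metadata (just an illustration of the "look at the produced MSIL" variant, not an agreed implementation):

```csharp
// Enumerate methods carrying a [Fact]-named attribute in an assembly on disk.
using System.Collections.Generic;
using System.IO;
using System.Reflection.Metadata;
using System.Reflection.PortableExecutable;

public static class FactScanner
{
    public static IEnumerable<string> FindFactMethods(string assemblyPath)
    {
        using var peReader = new PEReader(File.OpenRead(assemblyPath));
        MetadataReader md = peReader.GetMetadataReader();

        foreach (TypeDefinitionHandle typeHandle in md.TypeDefinitions)
        {
            TypeDefinition type = md.GetTypeDefinition(typeHandle);
            foreach (MethodDefinitionHandle methodHandle in type.GetMethods())
            {
                MethodDefinition method = md.GetMethodDefinition(methodHandle);
                foreach (CustomAttributeHandle caHandle in method.GetCustomAttributes())
                {
                    if (IsFactAttribute(md, md.GetCustomAttribute(caHandle)))
                    {
                        yield return $"{md.GetString(type.Namespace)}.{md.GetString(type.Name)}.{md.GetString(method.Name)}";
                        break;
                    }
                }
            }
        }
    }

    private static bool IsFactAttribute(MetadataReader md, CustomAttribute ca)
    {
        // IL-authored tests reference FactAttribute from the xunit assemblies, so the
        // attribute constructor shows up as a member reference on a type reference.
        if (ca.Constructor.Kind != HandleKind.MemberReference)
            return false;
        MemberReference ctor = md.GetMemberReference((MemberReferenceHandle)ca.Constructor);
        if (ctor.Parent.Kind != HandleKind.TypeReference)
            return false;
        TypeReference attrType = md.GetTypeReference((TypeReferenceHandle)ctor.Parent);
        return md.GetString(attrType.Name) == "FactAttribute";
    }
}
```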
OK, so let's say there's a v0 comprising the standalone runner so that we can start gradually merging in PRs switching the individual tests over to xUnit style. I believe the next step is to define what exactly that entails:
- As a first iteration (before it's proven unworkable) I propose marking the new-style tests as `CLRTestKind=BuildAndRun` and `OutputType=Library`. Please let me know if you see off the top of your head that this won't work.
- For now I assume that we switch over all the tests to have some `int Entrypoint(string[])`; that's what I'm experimenting with for the ilproj tests anyway. Let me know if you have other suggestions.
- The Entrypoint will be marked with the [Fact] attribute. We need to tweak the props or whatever other files to make the XUnit attributes available for each test build.
I think we should mark them as `CLRTestKind=BuildAndRun` (which is the default so it doesn't need to be specified) and `OutputType=Exe`. Naming the entrypoints `Entrypoint` makes sense, and for v0 we can special-case and allow the `int(string[])` signature, though we should move to a `void Entrypoint()` signature if possible.
What exactly will the test wrapper generate for each test, and how is it going to be used in the cmd / sh runner scripts?
- We can emit a three-line C# wrapper next to each test implementing the Main method by calling into the [Fact] method in the test assembly. We may need to emit a new test project comprising the generated wrapper and the ProjectReference to the original project; however, that's going to incur one extra Roslyn compilation and one extra assembly load at runtime for each test, i.e. some perf degradation. We may be able to justify this in the broader context of the regression being just temporary; in fact we'll likely improve the grouping and perf in parallel with the test transformation process, so the gradual improvements may cancel out the temporary regressions.
- We can emit the wrapper directly as the MSIL EXE using the System.Reflection.Metadata API; it's literally a three-line method (ldarg.0, call, ret). This would likely get rid of the additional Roslyn costs and potential ordering issues regarding the managed build, but the additional assembly load at runtime would remain.
- We can emit the wrapper project as a union of the original project and the generated wrapper source file, thus pioneering the merged test execution to some extent. That would get rid of most of the extra Roslyn cost and likely of all the runtime costs.
The source generator would generate a `Main` method that would call every method marked with `[Fact]`, so making it an exe makes sense. This would effectively fall into something similar to option 3, where we make the test project runnable by generating a "Main" method for the test assembly.
For IL tests, we'd still need to hard-code in the Main method. Maybe even just mark the Main method with the `[Fact]` attribute instead of re-writing the method name? That would enable the tests to just run as-is and automatically work when we start working on the new runners for the "v1" of this endeavor.
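As a sketch, the generated `Main` for a merged C# test assembly might then look something like this (the two test stubs stand in for tests that would normally live in their own source files):

```csharp
// Illustrative output of the proposed source generator: a Main that directly invokes
// every [Fact] method in the assembly (here just the two stubs). The real generator
// would also need to emit xUnit-style results reporting.
using System;
using Xunit;

public static class I4Div_Tests { [Fact] public static void TestEntrypoint() => Assert.Equal(2, 10 / 5); }
public static class I4Rem_Tests { [Fact] public static void TestEntrypoint() => Assert.Equal(1, 10 % 3); }

internal static class GeneratedTestEntrypoint
{
    private static int Main()
    {
        int failures = 0;
        failures += RunTest(nameof(I4Div_Tests), I4Div_Tests.TestEntrypoint);
        failures += RunTest(nameof(I4Rem_Tests), I4Rem_Tests.TestEntrypoint);
        return failures == 0 ? 100 : 1; // 100 is the conventional pass exit code
    }

    private static int RunTest(string name, Action test)
    {
        try { test(); Console.WriteLine($"PASS: {name}"); return 0; }
        catch (Exception ex) { Console.WriteLine($"FAIL: {name}: {ex}"); return 1; }
    }
}
```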
Another interesting question is how to incorporate the actual build of these wrappers into the test build pipeline. I believe that my recent work on cleaning up the test build scripts has given us a solid cross-plat foundation we can inject the new work into.
- As it seems likely we'll need to generate new test projects, at some point we'll probably need to replace the original projects with the transformed projects.
- On the one hand I think it would be great if we could delegate most of the cmd / sh runner logic for the wrappers to the existing `src/tests/Common/CLRTest.*.targets` scripts (just to avoid extra work in reimplementing all of those), but there's a nesting bit I don't see how to resolve - I think that the "new wrappers" cannot go through the pre-existing xUnit console logic because they emit their xUnit results XML in a different manner; please let me know if you believe we might be able to consolidate these somehow, at least in the short term. I think that in a way it would be much simpler if we could start off with something like "add-ons to the existing xUnit wrappers" rather than a completely new beast requiring immediate adjustments to various parts of the infra.
- As I mentioned before, if we want to locate the [Fact] attributes on ILPROJ assemblies via the Roslyn analyzer, in other words probably some incarnation of PEReader, we need to sequence building the wrappers after building the primary source code (in this case assembling the IL source code into the MSIL DLLs using ILASM). That would however make it more challenging to reuse the existing managed test build logic to generate the cmd / sh scripts. We need to decide whether there's real value in pretending the wrappers are tests themselves in the test build sense, vs. building them separately and likewise separately generating their cmd / sh runner scripts, vs. building them in a separate phase but trying to hijack the existing cmd / sh runner script generators to do their job.
For this v0 stage where we're transitioning to the new design, we don't even need to generate new test wrappers. The test behavior should be basically identical to the behavior today, as each project will still be an exe and will still run in the same manner as it does today.
Naming the entrypoints Entrypoint makes sense, and for v0 we can special-case and allow the int(string[]) signature, though we should move to a void Entrypoint() signature if possible.
Yes, it should be `void` and take no arguments to follow the XUnit conventions.
Nit: I would make it `static void Test()`. `Test` is what people tend to use for a generic test entrypoint if there are no better options.
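Putting the conventions above together, a converted test would then look roughly like this (the class name is a placeholder):

```csharp
// Hypothetical converted test in the agreed shape: a public static, parameterless,
// void method named Test, marked [Fact]; Main is supplied by the generated wrapper.
using Xunit;

public static class I4Div_Converted
{
    [Fact]
    public static void Test()
    {
        Assert.Equal(3, 10 / 3);
        Assert.Equal(1, 10 % 3);
    }
}
```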