Investigate ways to reduce overhead of the hardware intrinsics test matrix

tannergooding commented 2 years ago

As called out on https://github.com/dotnet/runtime/pull/53450:

src/tests is about 60% of the dotnet/runtime repo disk size footprint. Yes, src/tests sources are 1.5x more than sources for both runtimes, all libraries and all libraries tests combined.

In particular, a good portion of this is the src/tests/JIT/HardwareIntrinsics folder which provides coverage for the several thousand hardware intrinsic APIs that correspond roughly 1-to-1 with various SIMD instructions on supported platforms.

The tests for these APIs are largely auto-generated today and cover a range of patterns to try and ensure proper coverage of the intrinsics exist. The patterns attempt to ensure that the various "important" trees get created and execute interesting code paths in the JIT. However, because these are so high level there is a lot of cost required to get "high confidence" the product code is correct.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details

As called out on https://github.com/dotnet/runtime/pull/53450: > src/tests is about 60% of the dotnet/runtime repo disk size footprint. Yes, src/tests sources are 1.5x more than sources for both runtimes, all libraries and all libraries tests combined. In particular, a good portion of this is the `src/tests/JIT/HardwareIntrinsics` folder which provides coverage for the several thousand hardware intrinsic APIs that correspond roughly 1-to-1 with various SIMD instructions on supported platforms. The tests for these APIs are largely auto-generated today and cover a range of patterns to try and ensure proper coverage of the intrinsics exist. The patterns attempt to ensure that the various "important" trees get created and execute interesting code paths in the JIT. However, because these are so high level there is a lot of cost required to get "high confidence" the product code is correct.

Author:	tannergooding
Assignees:	-
Labels:	`area-System.Runtime.Intrinsics`, `untriaged`
Milestone:	-

tannergooding commented 2 years ago

The quickest thing to reduce overhead of these tests is to move many of them to be "outer loop" so they do not have to be built and run on PRs that aren't touching JIT code. PRs that do touch this code would explicitly trigger the outer loop jobs.

However, I think the ideal scenario would be:

Emitter is unit testable; we can actually write tests for scenarios like INS_addps xmm0, xmm0, INS_addps xmm0 xmm15, and INS_addps xmm0, [addr] directly
- Such tests would be significantly cheaper to run and validate; the also give a higher bar of confidence with less work for core scenarios
- We can directly test the known cases where encoding differs and is therefore important (e.g. there is no real difference for xmm0-xmm7, but there is for xmm8-xmm15)
All of the existing src/tests/JIT/HardwareIntrinsics jobs are removed and replaced with Library tests that do basic functionality validation (e.g. ensure Sse.Add is behaving as expected and does per-element addition).

This would cover the majority of bugs and issues we've hit so far at a much cheaper cost. There would still be some things like lowering, containment, and value numbering that wouldn't be covered here; but we've not hit any issues there that weren't immediately caught by general use of the intrinsics in the BCL.

tannergooding commented 2 years ago

CC. @jkotas, @BruceForstall, @echesakov, @kunalspathak

BruceForstall commented 2 years ago

The JIT has very rudimentary emitter tests with genArm64EmitterUnitTests and genAmd64EmitterUnitTests; the former are much more extensive than the latter. They've been super useful during bring-up work. However, we've never figured out how to automate these (or decided that automating them is worthwhile), as the only "oracle" to determine if the generated result is correct is comparing JIT disassembly versus windbg disassembly.

Moving all the hwintrinsics tests to some specific outerloop makes sense (leaving some "spot test" subset in pri-0 or pri-1).

If one problem is that the tests simply take too much space in the repo, since they're auto-generated, I've wondered in the past if they could be auto-generated at (pre-)build time, and not have their auto-generated source live in the repo.

tannergooding commented 2 years ago

I think time it currently takes is the larger problem overall.

The simplest thing might just be to run the individual tests in Parallel, but that might conflict with the larger parallelization that already happens across individual "test projects"
Outside of that, splitting the tests into more sub-projects avoids the timeout issues we are currently seeing and should also allow faster overall execution time; at the cost of increased build-time.
Moving most tests to be outerloop; outside maybe AVX/AVX2 would also help for x86; but not so much for ARM where everything is in "one class" (AdvSimd).

jkotas commented 2 years ago

Note that we are working towards merging the runtime tests into larger binaries in https://github.com/dotnet/runtime/issues/54512

cc @trylek

tannergooding commented 2 years ago

Note that we are working towards merging the runtime tests into larger binaries in #54512

Thanks! That would imply that the middle approach isn't desirable right now. If we are going to be merging more test binaries, the hardware intrinsic tests in particular should be relatively easy to merge together however we need given that its currently all auto-generated.

I think the two main issues then remain:

There are a lot of tests here and most only need to be run if specific paths in the JIT are touched
- This ideally gets handled by considering most tests "outer-loop"
There are enough tests in single projects that we hit timeouts when running GC Stress and similar
- I think parallelization or increasing the timeout is the only way to mitigate this

Another option here is to move all of the HWIntrinsic tests out of the JIT test folder and into the libraries side. They should still validate all the same things when running under xunit.

This likely improves build time, setting category/prioritization of tests so they are "outerloop" only, etc. But makes it slightly more involved to test/validate these changes and doesn't play well with COMPlus_* switches today.

tannergooding commented 2 years ago

@trylek, given this is an active issue and it sounds like you have some initial ideas on the approach to go (coupled with these being auto-generated and so fairly trivially updateable to different shapes), I'm happy to work with you on trying some things out.

trylek commented 2 years ago

Thanks Tanner, that indeed sounds interesting. My current proposal only deals with generating wrappers to call the individual test cases; I'm currently working on prototyping its behavior for ilproj tests that naturally cannot use the Roslyn analyzers. If you're thinking about generating test cases on the fly, that would be a new concept we'd likely need to prepare infrastructure for; if you're just thinking about a tool that would emit a bunch of tests once in a while, there are likely no changes needed. Can you be more specific as to whether you're thinking about CS / IL projects, what are the variations / degrees of freedom and such?

tannergooding commented 2 years ago

@trylek, the hardware intrinsics tests today are generated ahead of time via a script + some "template" data.

e.g. There is a GenerateTests.csx script (C# Interactive) which pulls metadata out of a dictionary to insert into various "test templates" (files that are formatted to work with C# string interpolation): https://github.com/dotnet/runtime/blob/main/src/tests/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx

Each project contains somewhere between a handful of tests and a few hundred tests; where each test covers about 5-6 key scenarios.

The projects and number of tests/scenarios are large enough today that we hit timeouts in the GCStress jobs. We aren't currently running any tests in parallel because it might conflict with the parallelization mechanics that already exist elsewhere.

BruceForstall commented 2 years ago

@tannergooding In @trylek 's proposed system, a wrapper/driver program will be running individual tests in-process, either sequentially or in parallel. Presumably each of these "individual" tests would be subject to per-test timeout. One of the issues with the hwintrinsics tests is that there are just so many tests within a single test assembly, and that test assembly is currently considered a single "individual" test. Perhaps in the propose system, it would be easier to split up execution so smaller subsets of hwintrinsics tests are considered an "individual" test.

tanveerbadar commented 2 months ago

Have you considered using a source generator in place of csx for creating these tests? That way, the tests take no space in the repo and you get all the benefits you have currently.

The problem was about the size contribution rather than run frequency, as I understood it stated in the first post.

tannergooding commented 2 months ago

They are already generated dynamically and no longer checked in. The use of a source generator vs a CSX file makes no real difference here, other than the former having not existed when the work first started and being overall more complex to setup for this fairly isolated thing we need to support.

dotnet / runtime

Investigate ways to reduce overhead of the hardware intrinsics test matrix #60233