eclipse / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Optimization Test Framework #569

Open noryb009 opened 7 years ago

noryb009 commented 7 years ago

Currently, there are 2 ways to test optimizations:

Neither of these are great tests. While the former usually ensures the output is valid, it doesn't check whether the optimization was applied correctly, or even at all. The latter addresses this shortcoming, but the process is manual, time-consuming, and one needs to understand the optimization to properly verify it was applied correctly.

This is a proposal to allow the latter approach to be run as an automated test. It is essentially a framework for running a given sequence of optimizations on an ILInjector, and it exposes a hook to allow the test to verify the IL. Doing codegen is optional.

An example test can be seen here. Essentially, the test sets a MethodInfo (which itself takes an IlInjector), a verifier, and a list of optimizations (in this case, the list is in the test fixture constructor).

Verifiers have very few restrictions - they are only expected to return a non-zero status code on an error. In test cases, they are expected to use gtest's EXPECT and ASSERT macros to fail the test, if required.

Verifiers can also be re-purposed for general IL verification - verify{Trees,Blocks} could be rewritten as verifiers, and specified when required by a downstream project. This also allows downstream projects to apply their own constraints to IL, if required.
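
For illustration, a minimal verifier might look something like the sketch below. This is only a sketch: the IlVerifier base class and the verify signature follow the description above, the opcode being checked is made up, and the usual OMR and gtest headers are assumed.

// Sketch: fail the test if any istore survives the optimization.
class NoStoreVerifier : public TR::IlVerifier
   {
   public:
   int32_t verify(TR::ResolvedMethodSymbol *sym)
      {
      for (TR::TreeTop *tt = sym->getFirstTreeTop(); tt; tt = tt->getNextTreeTop())
         {
         // gtest macros record the failure; the return code reports it to the driver.
         EXPECT_NE(tt->getNode()->getOpCodeValue(), TR::istore) << "store should have been eliminated";
         }
      return ::testing::Test::HasFailure() ? 1 : 0;
      }
   };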

The full change set can be seen in the last few commits on my opt-test branch.

I'm interested in what the community thinks about this. Thoughts and constructive criticism are welcome.

I would like to thank Filip and Mark for their ideas in #382, as well as Matthew and Luc for their suggestions made offline.

mgaudet commented 7 years ago

It's off to a really nice start I think :confetti_ball:

A couple of comments:

  1. I wonder if it's possible to dramatically reduce the number of files by moving most of the test-related headers and CPP files into the same file -- it's not clear to me that the separations you have provide benefits for writing test cases.

  2. It would be good to also see an example of how to use a NoCodegenVerifier

  3. Yay! A use of the IL iterators :tada:

Leonardo2718 commented 7 years ago

Awesome work! :+1:

I really think this is a great start.

As an item for later work, I'd like to see if it's possible to further decouple the verification from the compilation process (i.e. be able to "hide" the test compiler a little more). I think this would help clean up and reduce the size of test cases, hopefully making it easier to create better tests.

noryb009 commented 7 years ago

@mgaudet

  1. That is definitely possible. I'll switch it soon. Edit: I've amended a change to move everything into one file. It removes a lot of boilerplate code and context switching between files, which is great.
  2. The verifier is used here. Tests can choose whether or not to do codegen by using VerifyAndInvoke or Verify, respectively (rough sketch below).
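
A rough sketch of the two entry points (the fixture and test names are made up; the real calls may differ):

TEST_F(SimplifierTest, VerifyOnly)
   {
   // Run the configured optimizations and the IL verifier; skip codegen.
   Verify();
   }

TEST_F(SimplifierTest, VerifyAndRun)
   {
   // Same, but also generate code and invoke the compiled method.
   VerifyAndInvoke();
   }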

@Leonardo2718 The test cases themselves are quite short, but the test driver is more complex - I agree we need to look at making the test compiler more friendly to test cases. A lot of the complexity in these test cases (e.g. the NoCodegenVerifier) could be resolved by making the compiler more friendly to mocking - but that is an issue in itself.

mgaudet commented 7 years ago

Thanks for inlining a lot of that into one file. Helps a lot.

I think it'd be a good thing to get a lot more comments about the expected flow in the TestDriver, precisely because it is complex. I think I see most of it... but I'm not 100% sure.

noryb009 commented 7 years ago

@mgaudet FYI, I've added some inline documentation to the framework.

andrewcraik commented 7 years ago

One obvious addition to the suite would be support for an optFile style of specifying optimizations to run. Devin has used optFile to great effect to help force certain tree structures through to specific optimizations. Obviously with such great power comes great responsibility, but sometimes well meaning opts like dead store elimination can make it very hard to get the right trees to where you need them in the optimizer. Something like optFile gives you the ability to override this in a controlled way so the test pattern you want arrives at the opt you want.

As a side note optimizations have a lot of dependencies between each other. Each opt will have a normal form in which it expects the trees and it relies on the placement of the optimization in the opt strategy to ensure the canonical form is achieved. It would be more realistic, therefore, to base testing on known opt strategies and omit problematic steps like the optFile approach allows. As a specific example some opts require block extension to have run while others expect it to not have run so ensuring appropriate ordering relative to this key step in the optimizer is important.

Allowing for both functional and tree verification would be nice. Trees can fluctuate from run to run while still achieving the same overall effect. Allowing the compiled code to be invoked with test inputs would allow for some functional verification, which could be less brittle and, in some cases, an easier way to achieve the end goal. Given that we have a C linkage now, it should be relatively easy to arrange to pass some values into a compiled assembly snippet and verify the output. This is especially the case when you are testing something like idiom recognition, where feeding loops of carefully crafted data is an important way to test corner-case behaviour that is not immediately obvious in the trees.
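
To make that concrete, a hedged sketch of the kind of check being described (getEntryPoint is a stand-in for however the framework exposes the compiled body, and the wraparound expectation assumes the usual two's-complement iadd semantics):

// Call the compiled snippet through a C-linkage function pointer with crafted inputs.
typedef int32_t (*BinaryOp)(int32_t, int32_t);
BinaryOp fn = reinterpret_cast<BinaryOp>(getEntryPoint());
EXPECT_EQ(fn(2, 3), 5);
EXPECT_EQ(fn(INT32_MAX, 1), INT32_MIN); // corner case that is not obvious from the trees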

Moving the tree verification more towards something akin to the Clang AST pattern matchers would probably make the tests less brittle. Note that in the IL it is possible to have multiple different sequences of treetops that all evaluate to the same thing - whether to anchor every node under its own treetop or only under one is the kind of thing to think about when considering this question of equivalent forms.

The major thing missing from this approach is the notion of block frequencies, and this is the most significant thing that I think needs to be added. Every opt is one part tree manipulation - the actual transformation - and one part heuristics deciding when and where to do the transformation. The JIT heritage of the technology means that the global currency for these heuristics is block frequencies. Low-frequency blocks are generally ignored while high-frequency blocks are super interesting. If you only test the tree manipulation, you leave out a large part of the opt, namely the control heuristics embedded in the transformation logic. You therefore need to supply realistic block frequencies to be able to test the decisions these heuristics make and exercise the opt fully.

Overall a great start, but I think the issues of fragility, block frequencies, and functional verification need to be addressed before we can start looking to use the framework to test the optimizer more broadly. This is a great start :)

andrewcraik commented 7 years ago

One final thing: verifying the output of an opt is not just about the trees, but also the block frequencies on any new blocks generated. For example, tail duplication (aka block splitting in the OMR optimizer) can generate functionally correct trees but then screw up the rest of the optimizer by setting bad block frequencies. I have spent many weeks working through those kinds of issues; they are much harder to debug and can be much more devastating than functional issues, which usually show up quickly in testing provided the opt is being exercised.

mgaudet commented 7 years ago

One note: This can, and does, do functional testing as well (see VerifyAndInvoke, in addition to invokeTest)

EDIT: I mean the PR associated with this issue. Didn't notice this was the issue, not the pull request. Sorry!

andrewcraik commented 7 years ago

Thanks for the pointer to that - I missed it in reading the code. Nice to know a wish is granted before it is made - nice work!

noryb009 commented 7 years ago

Thank you for the feedback, Andrew!

I investigated using optFile, however it didn't work very well with the test compiler - since the compiler can only be initialized once, using optFile would require running another process, which makes debugging a bit harder. Alternatively, we could add hooks to change the optFile on the fly, which is somewhat complex due to the way we currently store and use options. Instead, I opted for a mocking system which can be controlled inside the test. Having testers write optFiles separates the optimization lists from the tests, and having the test driver convert to an optFile is more complex (however, if benefits to using optFiles arise, the test driver can switch to writing optFiles without changing its interface).

I agree, better IL comparison should be implemented to make writing verifiers easier. This has influenced some of my design decisions - you can inherit from IlVerifier, take a pattern in the constructor, and check for the pattern in verify - this reduces writing a verifier to writing a pattern. Other verification options, such as comparing to an exported version of the IL, are also possible in verifiers using the same framework. However, implementing a pattern matcher likely requires deep knowledge of the IL and the optimizations - neither of which I have.
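
As a sketch of the pattern idea (the class name is made up and the base-class signature is assumed):

// Reusable verifier: the constructor takes the expected opcode sequence,
// and verify() checks that the trees contain it in order.
class OpcodePatternVerifier : public TR::IlVerifier
   {
   std::vector<TR::ILOpCodes> _pattern;
   public:
   OpcodePatternVerifier(const std::vector<TR::ILOpCodes> &pattern) : _pattern(pattern) {}
   int32_t verify(TR::ResolvedMethodSymbol *sym)
      {
      size_t matched = 0;
      for (TR::TreeTop *tt = sym->getFirstTreeTop(); tt && matched < _pattern.size(); tt = tt->getNextTreeTop())
         {
         if (tt->getNode()->getOpCodeValue() == _pattern[matched])
            ++matched;
         }
      EXPECT_EQ(matched, _pattern.size()) << "expected IL pattern not found";
      return matched == _pattern.size() ? 0 : 1;
      }
   };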

The question of what input should be given depends on the optimization and what is being tested. If the test checks whether the optimization can look at multiple treetops at once, nodes should probably be in their own treetops. If the optimization is usually run late in the optimization process, you might want to emulate that by having multiple nodes under a few treetops. Looking at the actual IL that reaches the optimization in normal operation should answer most questions about what the input IL should look like.

I haven't heard of or used block frequencies before, so forgive me if my answer doesn't make sense. My naive answer is to simply set the frequency using block(b)->setFrequency(n). However, this could cause issues elsewhere in the compilation process, which would need to be investigated. Similar information, like recompilation info, could be provided by mocking methods (similar to selecting optimizations, or by using a system like googlemock).
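
As a sketch of that naive approach, inside whatever builds the IL (the block indices and frequency values are made up):

// Mark the loop body hot and the error path cold so that
// frequency-based heuristics have something realistic to look at.
block(1)->setFrequency(10000);  // hot loop body
block(2)->setFrequency(1);      // cold error path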

Let me know if I didn't fully address any of your concerns.

andrewcraik commented 7 years ago

I do see your mocking system. My comment about optFile is more that we would want that kind of behaviour: saying you want to run a test with opt strategy X and then skip opts a, b, and c would ensure the test is run in a consistent manner, and would let changes to the opt strategies (i.e. when things get rearranged or we consider rearranging things) reflect properly in all tests without having to update the tests.

I fully agree that a fuzzy matcher is no easy thing - I was more pointing out the need for it to reduce fragility than criticizing your eminently reasonable initial implementation, which is a fine start on this journey.

Very few opts in the optimizer proceed one treetop at a time. The point about input is that you want to write some IL which resembles the program structure you want to optimize, and you want that structure to undergo standard optimization so your opt is running in context. How nodes are pinned, commoned, etc. is very complex, which is why I advocate basing opt strategies on ones used in practice.

Setting block frequencies is a complex operation - they need to be consistent across the program and need to show the hot path you expect to be optimized. setFrequency is the means by which you can directly manipulate them, but for this framework to be usable to test things in production situations, making it easy to get the right frequencies, and to keep those frequencies consistent, is important. I agree that mocking is the way to go, since beyond block frequencies some opts like inlining rely on value profiling information and other environmental data. I just point out that without this contextual environment, tests will not work well because of the emphasis on control heuristics in production optimizations. You are certainly thinking along the same lines as me.

None of what I said is a criticism of your initial steps - it is a great start and a very cool attempt at helping solve the problem of how to test an optimizer. Someone asked me what would it take to use this in production and that is the genesis of my feedback. Thank you for the great work and keep it up! I am very happy to discuss more around how the major optimizations work and the context they require from the environment to be able to operate in a realistic manner. I look forward to seeing how this test framework evolves going forward.

smlambert commented 7 years ago

As one would suspect, googlemock pairs nicely with the googletest framework, so if you are looking for a more sophisticated mocking system, it is one to consider. We didn't pull it in when we set up the OMR tests, as there were no cases that needed it at the time.

This conversation (some of which I can't claim to follow) and the excellent initiative to test optimizations is reminiscent of some challenges in testing the GC. What do you use for test input? How do you create test input that is more meaningful or more related to real customer scenarios? For the GC input, when you set up a test config file, you can create an object model of your own design, but we also populated some by pulling object model info from field core files. They were dramatically different shapes than the ones we were building.

Would populating some form of test input with actual data (from core files or JIT logs) be useful/feasible? Or maybe you are already there and I am missing it...

mgaudet commented 7 years ago

Would populating some form of test input with actual data (from core files or JIT logs) be useful/feasible? Or maybe you are already there and I am missing it...

We're definitely nowhere near that. I actually would love it if we hit a point one day where it wouldn't be too hard to take the output from our tracelogs and convert it into a test case. There's a long list of reasons why that's hard, but it's also I think a fantastic direction to pursue.

noryb009 commented 7 years ago

Andrew, I assure you I am taking your comments positively. Thank you again for reviewing this.

My concern about running multiple optimizations in one test is that a change in one optimization, such as simplifying an extra expression, can cause many tests to fail. Caching the input to an optimization, instead of running extra optimizations, might be a better answer.

I imagine one way to address both of your points is to extend the IlInjector class to perform actions like marking frequencies and adding value profiling in a safe and easy way. Again, I have a limited knowledge of the compilation process, so this might not be possible. This would also allow you to test edge cases without other optimizations removing the edge-case-ness of it.

noryb009 commented 7 years ago

@ymanton - I'm responding to your comment here:

Are you referring to one test putting multiple functions together, for (eg.) testing inlining? Or do you mean testing multiple functions individually?

Together: I have given it some thought. For the time being, I believe it would be best to simply override compileTestMethods. The current compileTestMethods is based on removing boilerplate code - if similar boilerplate code arises from multiple methods being compiled, we can abstract it away in a similar manner.

Individually: I believe the answer would be to have multiple tests, i.e. multiple TEST_F's. If the tests are sufficiently similar, a parameter can be given to (e.g.) the Info that changes it as required. If the tests are distinct, it might be easier to write a different Info class.
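
For the "sufficiently similar" case, a value-parameterized gtest fixture is one way to express it. This is only a sketch: OptTestFixture, setMethodInfo, and LoadMethodInfo are hypothetical names standing in for whatever the framework provides.

// One compiled method and one test instance per data type.
class LoadTest : public OptTestFixture, public ::testing::WithParamInterface<TR::DataTypes>
   {
   };

TEST_P(LoadTest, SimplifiesLoad)
   {
   setMethodInfo(LoadMethodInfo(GetParam()));  // hypothetical Info parameterized on the type
   Verify();
   }

INSTANTIATE_TEST_CASE_P(AllTypes, LoadTest,
   ::testing::Values(TR::Int8, TR::Int16, TR::Int32, TR::Int64));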

ymanton commented 7 years ago

I meant individually, for tests that are sufficiently similar -- for example testing something involving loads or comparisons, where you may want to generate multiple compiled methods, one for each load data type or each type of comparison, and the tests basically do the same thing. The current (now old) framework made it obvious where the compilation was actually happening and how to generate more methods, but since you've wrapped that part, it wasn't clear to me whether it is still possible or straightforward.

ymanton commented 7 years ago

For those interested, my original comment in the pull request was:

I happened to be doing something that would benefit from this sort of testing so I gave it a try and it was pretty simple to get it working.

One comment, have you given any thought on how a test can generate multiple compiled methods to test? I can sort of see how this might work -- add some sort of interface to the Info class that influences the function it defines and what the IL injector emits and make multiple calls to that together with Verify*() in your TEST_F() -- but I haven't tried it.

Having a test be able to generate multiple compiled methods is pretty useful; you don't want to force someone to generate one giant function if they have a lot of variations to test, you would prefer that each variation be tested in its own compiled method.

mgaudet commented 7 years ago

We probably need to have a larger discussion about strategies for handling heuristics.

Thinking about it, to me it seems like there's a kind of harmony with the performTransformation machinery. While it certainly needs more flesh, what are first opinions on something like replacing naked heuristic checks like

if (block->getFrequency() > threshold) { ... } 

with

if (heuristic(block->getFrequency() > threshold)) { }

The idea here would be that heuristic would be a macro that in some circumstances might use a different check. The contract users would guarantee with heuristic would be that the heuristic is always safe to replace with true.

So perhaps you could have

#ifdef JITTEST
#define heuristic(x) true 
#endif

or

#define heuristic(x) _countedHeuristicCheckFunc( (x) )  // Returns (x) N times, then FALSE.

My thinking here is that we need to have as much ability as possible to control which branches are being exercised in the compiler in order to get test coverage where possible, and most heuristics are of the sort that exist to control for performance... but in a functional test scenario, it's more important to hit the path than it is to produce high-performance code.

In some sense, heuristic is the polar opposite of performTransformation -- the latter marks regions that can always be safely elided; the former marks regions that can always be safely performed (but that you might not want to perform for performance reasons).
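
A minimal sketch of the counted variant (the counter and how it gets configured are illustrative only):

// Return the real heuristic result the first N times, then force false,
// so a test can drive execution down both sides of the branch.
static bool _countedHeuristicCheckFunc(bool realResult)
   {
   static int remaining = 100; // in a real build this would come from an option or env var
   if (remaining > 0)
      {
      --remaining;
      return realResult;
      }
   return false;
   }

#if defined(JITTEST)
#define heuristic(x) true
#else
#define heuristic(x) _countedHeuristicCheckFunc((x))
#endif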