dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.02k stars 4.67k forks source link

JIT Hints- Parameter conditional inlining directives, or, improve call-site inlining of funcs with constant args feeding structs #10658

Open Zhentar opened 6 years ago

Zhentar commented 6 years ago

In dotnet/corefx#30934 we have this code:

        public static bool TryParse(ReadOnlySpan<byte> source, out int value, out int bytesConsumed, char standardFormat = default)
        {
            switch (standardFormat)
            {
                case default(char):
                case 'g':
                case 'G':
                case 'd':
                case 'D':
                    return TryParseInt32D(source, out value, out bytesConsumed);

                case 'n':
                case 'N':
                    return TryParseInt32N(source, out value, out bytesConsumed);

                case 'x':
                case 'X':
                    value = default;
                    return TryParseUInt32X(source, out Unsafe.As<int, uint>(ref value), out bytesConsumed);

                default:
                    return ThrowHelper.TryParseThrowFormatException(out value, out bytesConsumed);
            }
        }

If the standardFormat parameter has a constant value at the callsite, we definitely want this function inlined; it can be entirely evaluated by the JIT and inlined to a call slightly smaller than the original. But if it's not, it's not a particularly small function to be inlining; it's going to be the wrong choice in at least some cases. [MethodImpl(MethodImplOptions.AggressiveInlining)] is a rather blunt tool for this kind of scenario. It would be much better if we could do something like public static bool TryParse(ReadOnlySpan<byte> source, out int value, out int bytesConsumed, [ParamImpl(ParamImplOptions.InlineMethodWhenConstant)] char standardFormat = default) to give more nuance to the JIT hints.

category:cq theme:inlining skill-level:expert cost:extra-large

mikedn commented 6 years ago

The JIT inline heuristic already looks at constant arguments but only in connection to branches, not switches. Basically switch is a bit of a black sheep when it comes to inlining.

Zhentar commented 6 years ago

The body of that function doesn't have any actual switch statements in it (since the cases don't line up to anything remotely close to a decent jump table); as written currently it gets excluded for too many IL bytes, but it also has too many basic blocks. I was able to prevent it from being excluded from inlining outright by breaking it up into two functions (as in the linked issue), but the inner function still wasn't getting inlined (deemed unprofitable) - I don't know if the constant argument value wasn't propagated to the consideration of the inner function, or even with the constant the heuristic wasn't scoring it high enough.

Regardless, while the ideal certainly would be the JIT always figuring out the right choice in cases like this (and at no runtime cost, and also it should do my taxes and give me a pony while it's at it), we're definitely not there yet.

mikedn commented 6 years ago

I don't know if the constant argument value wasn't propagated to the consideration of the inner function, or even with the constant the heuristic wasn't scoring it high enough.

Yes, the JIT does take constant arguments into consideration but it's not a carte blanche for inlining so other factors such a method size still contribute to the final score. Also, the JIT does not attempt to do any actual optimizations on neither the inliner nor the inlinee at the time of inlining. It's just "hey, I see a constant argument used in an if - let's try a bit harder", it doesn't actually attempt to do dead code elimination to figure out that a ton of IL code will go away.

AndyAyersMS commented 6 years ago

That's not entirely true -- we do a bit of optimization on the inlinee. The jit will retype arguments and forward substitute argument expressions into the inlinee, to make propagated constants available to nested calls and to enable things like devirtualization.

As I've noted elsewhere the initial analysis of the inlinee as an inline candidate is very crude and based on a simple pass over the inlinee's IL. Because of this it is hard to project how much size savings might arise from particulars at a call site (doing this for switches is fairly tricky business anyways, even with more involved modelling). And the stack model we use when analyzing the IL is quite crude and will both miss cases and overstate cases.

The simple modelling and the underlying conservationism of the inliner are hallmarks of the historical effort to balance jit throughput versus code quality. When you have one shot at jitting a method you can't afford to go overboard in either direction.

The hope is that tiering will eventually allow us to relax some of the throughput constraints and that upper tier jitting can afford more expansive modelling, but we're not quite there yet. The remainder of the jit isn't well set up for larger method bodies yet either. So we have to work at improving it bit by bit.

Zhentar commented 6 years ago

And for every function like this example which can be entirely jitted away, there are probably two dozen others that are just checking for valid input. I'm quite excited by the rather significant progress I see being made in the right direction, but for where we are right now, better annotations seem like the more reasonable ask (and better use of resources) than figuring out the perfect balance of detecting the right choice.

AndyAyersMS commented 6 years ago

Here's a prototype change to unblock inlining methods with switches, and a bit of work on a simple profitability estimate: master...AndyAyersMS:Explore18863.

Running PMI diffs over the framework assemblies jits around 170K methods. Over all the calls in these methods, the jit sees about 180 call sites where constant caller arguments feeds a switch in the callee and inlines at maybe 50 of them:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -42 (0.00% of base)
    diff is an improvement.
Top file regressions by size (bytes):
          36 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00% of base)
          32 : Microsoft.CodeAnalysis.dasm (0.00% of base)
          17 : System.Private.DataContractSerialization.dasm (0.00% of base)
Top file improvements by size (bytes):
         -87 : System.Private.Xml.dasm (0.00% of base)
         -36 : System.Data.Common.dasm (0.00% of base)
          -4 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
6 total files with size differences (3 improved, 3 regressed), 124 unchanged.
Top method regressions by size (bytes):
          55 : System.Private.Xml.dasm - CopyNodeSetAction:Execute(ref,ref):this
          52 : System.Private.Xml.dasm - CopyAction:Execute(ref,ref):this
          37 : Microsoft.CodeAnalysis.CSharp.dasm - LanguageParser:ParseSubExpressionCore(int):ref:this
          27 : Microsoft.CodeAnalysis.dasm - SwitchIntegralJumpTableEmitter:EmitSwitchBuckets(struct,int,int):this
          27 : Microsoft.CodeAnalysis.VisualBasic.dasm - CodeGenerator:EmitConversionExpression(ref,bool):this
Top method improvements by size (bytes):
        -113 : System.Private.Xml.dasm - OptimizerPatterns:Inherit(ref,ref,int)
         -40 : Microsoft.CodeAnalysis.CSharp.dasm - SourceNamedTypeSymbol:MakeOneDeclaredBases(ref,ref,ref):ref:this
         -36 : System.Data.Common.dasm - ExpressionParser:ScanNumeric():this
         -20 : System.Private.Xml.dasm - XmlILOptimizerVisitor:VisitFilter(ref):ref:this
         -11 : System.Private.Xml.dasm - XmlILOptimizerVisitor:AddStepPattern(ref,ref):this
27 total methods with size differences (15 improved, 12 regressed), 170655 unchanged.

So at least in framework code this pattern is really not very common. If you know of any code where it might show up more frequently, please send me pointers.

As an example of what can happen: in System.Private.Xml this method now gets inlined a fair amount:

https://github.com/dotnet/corefx/blob/bffef76f6af208e2042a2f27bc081ee908bb390b/src/System.Private.Xml/src/System/Xml/Xsl/IlGen/OptimizerPatterns.cs#L208-L221

Zhentar commented 6 years ago

Running a search against a large codebase I work on (~1GB of assemblies)...

Roughly 1/3 of C# switch statements ended up as switch opcodes in the IL. Of those, about 1/3 were switching on an argument, and of those, about 1/3 were small enough to be considered for inlining, so about 3% of the total.

In that 3%, every one I looked at was effectively a lookup table for an enum, e.g.

public bool NeedsFrobulating(MyEnum value)
{
    switch(value)
    {
        case MyEnum.ValueA: return true;
        case MyEnum.ValueC: return true;
        default: return false;
    }
}

Don't know how many of them have callsites with constants yet, but a lot of them would likely benefit from inlining regardless.

Aside from that, there were a large number of generated IDisposable.Dispose functions switching on the iterator state to handle the finally clauses for iterators in nested iterator calls (so they depend on devirtualization before inlining can matter).

AndyAyersMS commented 6 years ago

If you can run on a locally built CoreCLR, feel free to use my fork above, it prints "@@@@ CONSTANT FEEDS SWITCH @@@@" to stdout if the jit sees the constant case.

You can get jfull jit coverage over your assemblies via PMI, see the jitutils repo via something like this:

  1. Build CoreCLR checked (make sure to also build tests or at least the test overlay)
  2. Clone jitutils repo, and in that repo run the bootstrap script. This will publish PMI to the bin directory
  3. Invoke PMI via your locally built corerun (found under bin\tests...\core_root)

For example:

d:\repos\coreclr\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root\corerun.exe d:\repos\jitutils\bin\PMI.dll prepall-quiet d:\repos\coreclr\bin\tests\Windows_NT.x64.Checked\tests\Core_Root\System.Private.xml.dll

@@@@ CONSTANT FEEDS SWITCH @@@@

...  etc ...

@@@@ CONSTANT FEEDS SWITCH @@@@

@@@@ CONSTANT FEEDS SWITCH @@@@
Completed assembly System.Private.Xml - #types: 1686, #methods: 18044, skipped types: 22, skipped methods: 221

If you find this useful I can augment this with the names of the root method, inline parent, and inline candidate.

AndyAyersMS commented 5 years ago

I went and took another look at this, and things are a bit more complicated now.

StandardFormat is (has become?) a struct. CSC is expressing the switch in TryFormat as a series of cascaded tests. And so the value being switched on is a field of a struct.

So to see the potential size savings a constant argument offers, the jit would need to realize the input was indeed constant (currently there is no real concept in the jit of a "constant struct") and then aggregate the potential benefit of the constituent constant fields as they feed a series of tests against other known constants in the inlinee, and get all of the accounting right (including cases that rejoin or branch to one another, etc). That is well beyond what we can do today. Amd my prototype changes above don't help as there is no switch in the IL.

We currently aggressively inline TryFormat and after struct promotion, constant prop, etc, the jit eventually realizes that if you pass in the default StandardFormat (or presumably other "constant" format options) that only one switch case is reachable and so the jit removes the code for the rest of the cases. But it's quite a bit of behind the scenes work as a lot of IR is built up and then ultimately discarded.

For non-default constant cases the IR growth before the winning path becomes clear is probaby even worse than for the default case, as the jit would start aggressively inlining down through the default case code before realizing (one hopes; I have not looked) all that inlined code is uneachable and can be tossed out. There's always a risk that pulling in a bunch of extra IR like this may confuse some analysis in the jit or cause it to hit one of its internal tripwires.

From the jit's standpoint things would have worked out more smoothly if there were two separate int or byte option fields as constants; these would feed early branch folding in the importer and guide the jit to only import at the code that is reachable. If these kind of "constant" struct option bundles become popular we may need to look for ways to recognize them earlier and act on them sooner.