
Performance improvements of individual Yul Optimizer steps #15508

Open cameel opened 1 month ago

cameel commented 1 month ago

A catch-all issue for specific ideas for performance improvements in individual optimizer steps, plus a place to post profiling results and discuss them.

cameel commented 1 month ago

Are two passes of the FullInliner necessary?

FullInliner does two passes through the code, potentially duplicating some work. The first pass (inlining only tiny functions) seems to be a subset of the second, so I decided to check whether it makes much of a difference and how it impacts running time.

Results

Removing the first pass did shave off ~1 s from the inliner's running time in my benchmarks (on a 120 s compilation of Uniswap).

Here's what it looks like in detail:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 4.183 s | 810 | run() |
| 78% | 3.085 s | 810 | ├─ run(Pass::InlineRest) |
| 13% | 0.561 s | 810 | ├─ run(Pass::InlineTiny) |
| 14% | 0.458 s | 810 | └─ FullInliner() |
| 4% | 0.160 s | 810 | &emsp;└─ ReferencesCounter::countReferences() |

So it may be closer to 0.5 s.
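For orientation, here is a minimal sketch of the two-pass shape implied by the scopes above. The pass names come from the profile; everything else is an assumed stand-in, not the actual FullInliner code:

```cpp
// Assumed shape only, reconstructed from the profile scopes above.
enum class Pass { InlineTiny, InlineRest };

void runPass(Pass); // stand-in for one inlining pass over the whole AST

void runFullInliner()
{
    runPass(Pass::InlineTiny); // pass 1: inline only tiny functions
    runPass(Pass::InlineRest); // pass 2: inline everything else eligible,
                               // revisiting much of the same code
}
```

The experiment above simply drops the first call and keeps only the second.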

Gas impact

I don't have much data on the runtime cost, but what I'm seeing is not bad. In most tests in our repo the impact is slightly negative but minimal. At the same time, there are a few specific tests where it goes down by a lot.

I created #15509 to see external test results, but runtime gas results are very sparse. It does have a very positive impact on bytecode size though.

ir-optimize-evm+yul:

| project | bytecode_size | deployment_gas | method_gas |
|:--------|:--------------|:---------------|:-----------|
| brink | 0% | | |
| colony | +0% | | |
| elementfi | -0.27% ✅ | | |
| ens | -1.77% ✅ | -0.66% ✅ | -0% |
| euler | -0.14% ✅ | | |
| gnosis | | | |
| gp2 | +0.15% ❌ | | |
| pool-together | +0.22% ❌ | | |
| uniswap | -4.62% ✅ | | |
| yield_liquidator | -1.54% ✅ | -1.62% ✅ | -0.11% ✅ |
| zeppelin | -2.62% ✅ | | |

Conclusions

cameel commented 1 month ago

FullInliner's BodyCopier takes ~60% of inliner's running time.

Profiling FullInliner::tryInlineStatement()

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 4.183 s | 810 | run() |
| 68% | 2.860 s | 1566132 | └─ tryInlineStatement() |
| 54% | 2.278 s | 36552 | &emsp;├─ performInline() |
| 48% | 1.990 s | 36552 | &emsp;│ ├─ BodyCopier |
| 25% | 1.060 s | 772025 | &emsp;│ │ └─ NameDispenser::newName() |
| 5% | 0.216 s | 62898 | &emsp;│ └─ newVariable() |
| 4% | 0.187 s | 62898 | &emsp;│ &emsp;└─ NameDispenser::newName() |
| 13% | 0.536 s | 909031 | &emsp;└─ shallInline() |
| 11% | 0.468 s | 156399 | &emsp;&emsp;└─ ReferencesCounter::countReferences() |

The numbers come from the same Uniswap run as in https://github.com/ethereum/solidity/issues/15508#issuecomment-2407974008 above. Just note that they are from both passes combined.

@ekpyron had a hypothesis that it's NameDispenser::newName() that takes most of the time in BodyCopier. This is a bit tricky to verify because a single call of this function is relatively lightweight. From the numbers above it looks like it accounts for only half of BodyCopier's running time. However, if you look at the table with steps at the bottom of this comment, you'll see that the normal running time of FullInliner is ~3 s, while here it blew up to ~4 s. I think the missing second is the overhead of the profiler probe I placed inside BodyCopier's operator(). With ~800k invocations it adds up. This would confirm the hypothesis.
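To illustrate the overhead argument: such a probe is typically a scope guard that reads the clock on entry and exit, so at ~800k invocations even a cheap implementation visibly distorts the numbers. A minimal sketch of what I mean (hypothetical; not necessarily the instrumentation used for the tables here):

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Accumulates total time and call count per named scope.
struct ScopeProbe
{
    static inline std::map<std::string, std::pair<std::chrono::nanoseconds, std::size_t>> stats;

    std::string name;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    explicit ScopeProbe(std::string _name): name(std::move(_name)) {}
    ~ScopeProbe()
    {
        auto& [total, calls] = stats[name];
        total += std::chrono::steady_clock::now() - start;
        ++calls;
    }
};

// Usage: first statement of the function being measured, e.g.
// ScopeProbe probe{"BodyCopier"};
```

Two clock reads plus a map lookup per call are negligible for coarse scopes, but they add up inside something executed hundreds of thousands of times.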

Optimizing NameDispenser::newName() the quick and dirty way

I suspected that there may be a way to address the bottleneck locally and get the speedup right away, even before we're done with the YulName stuff.

https://github.com/ethereum/solidity/blob/7893614a31fbeacd1966994e310ed4f760772658/libyul/optimiser/NameDispenser.cpp#L47-L57

Identifying which part of newName() is the heaviest was more difficult than I expected, though. At first I suspected illegalName(), but profiling led me to believe that it was the string conversion instead. I wrote a faster, optimized conversion, but it turned out to be a red herring: it was not the bottleneck. In the end I had to resort to artificially repeating each part multiple times in a loop to find out how heavy the parts are relative to each other:

| Time % | Element |
|-------:|:--------|
| 61.9% | YulString conversion |
| 22.0% | illegalName() |
| 9.4% | std::string concatenation |
| 3.7% | YulString::str() |
| 3.0% | std::to_string() |

Looks like it's mostly in YulString and we'll have to wait for #15215 after all.
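For reference, the "repeat each part in a loop" trick looks roughly like this. Everything here is an illustrative stand-in (including illegalName()), not the real NameDispenser code; the idea is just to amplify one sub-expression by a known factor and attribute the extra wall-clock time to it:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

// Illustrative stand-in for the real check; only its relative cost matters.
static bool illegalName(std::string const& _name) { return _name.empty(); }

int main()
{
    constexpr std::size_t repeat = 1'000'000; // amplification factor
    std::string base = "usr$x";

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repeat; ++i)
    {
        // Repeat only the part under test. In the real function it runs once
        // per generated name, so (extra time / repeat) approximates its cost.
        volatile bool illegal = illegalName(base + "_" + std::to_string(i));
        static_cast<void>(illegal);
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::printf(
        "%lld ns per call\n",
        static_cast<long long>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count() / repeat
        )
    );
}
```

In the actual measurement each candidate part of newName() was amplified in turn to get the relative weights in the table above.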

Impact of NameDispenser on optimizer as a whole

I measured this by replacing the implementation with this:

```cpp
// Replacement body: ignore the normal name-generation logic and just hand
// out sequentially numbered names.
static size_t counter = 0;
++counter;
return YulString("___UNIQUE_NAME___" + std::to_string(counter));
```
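Note that this shortcut also drops the illegalName() check that the real implementation performs, so it's only good enough for measuring the cost, not as an actual replacement.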

The result is that the total running time of the optimizer when compiling Uniswap went down from 60 s to 53 s. Here's a detailed per-step breakdown:

| Time % (before) | Time % (after) | Time (before) | Time (after) | Step |
|----------------:|---------------:|--------------:|-------------:|:-----|
| 1.1% | 0.9% | 0.685 s | 0.469 s | ExpressionSplitter |
| 2.8% | 2.7% | 1.693 s | 1.429 s | LoopInvariantCodeMotion |
| 3.6% | 2.1% | 2.200 s | 1.088 s | **FunctionSpecializer** |
| 3.7% | 3.7% | 2.240 s | 1.981 s | UnusedStoreEliminator |
| 3.9% | 4.1% | 2.327 s | 2.192 s | ConditionalSimplifier |
| 4.1% | 4.4% | 2.485 s | 2.307 s | LoadResolver |
| 4.6% | 4.7% | 2.795 s | 2.496 s | DeadCodeEliminator |
| 4.7% | 3.8% | 2.843 s | 2.006 s | **SSATransform** |
| 5.1% | 5.3% | 3.090 s | 2.797 s | ConditionalUnsimplifier |
| 5.3% | 3.7% | 3.193 s | 1.947 s | **FullInliner** |
| 8.2% | 8.8% | 4.963 s | 4.649 s | LiteralRematerialiser |
| 8.7% | 8.8% | 5.246 s | 4.626 s | **UnusedAssignEliminator** |
| 8.7% | 8.6% | 5.252 s | 4.525 s | **UnusedPruner** |
| 9.8% | 10.7% | 5.902 s | 5.672 s | ExpressionSimplifier |
| 20.7% | 22.7% | 12.497 s | 11.991 s | **CommonSubexpressionEliminator** |
| 100.0% | 100.0% | 60.385 s | 52.862 s | TOTAL |

The steps which became faster by 0.5 s or more are marked in bold. I omitted steps that take less than 1% of the running time, since nothing interesting is happening there.

Conclusions

cameel commented 1 month ago

Oversized time impact of UnusedPruner

Context

UnusedPruner is relatively simple, but takes a surprisingly large share of the total optimization time.

Here's what it looks like for Uniswap:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 62.251 s | 38316 | TOTAL |
| 20.5% | 12.731 s | 3281 | CommonSubexpressionEliminator |
| 9.5% | 5.898 s | 1391 | ExpressionSimplifier |
| 8.8% | 5.473 s | 1350 | UnusedAssignEliminator |
| 8.4% | 5.251 s | 2700 | UnusedPruner |

With total compilation time being ~120 s.

Top-level breakdown

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 5.286 s | 2700 | run() |
| 100% | 5.285 s | 2700 | runUntilStabilisedOnFullAST() |
| 72% | 3.793 s | 2719 | runUntilStabilised() |
| 34% | 1.821 s | 741371 | operator()(Block) |
| 27% | 1.409 s | 4551 | constructor |
| 27% | 1.408 s | 4551 | ReferencesCounter::countReferences() |
| 15% | 0.782 s | 2700 | SideEffectsPropagator::sideEffects() |
| 8% | 0.444 s | 2700 | MSizeFinder::containsMSize() |
| 5% | 0.284 s | 2700 | CallGraphGenerator::callGraph() |

Pretty much all of the time is spent in general utilities and in operator(), which contains the main logic.

Note that runUntilStabilised() here has more calls than run(). It's because it's also invoked by StackCompressor. The time spent in those calls is negligible (0.04 s), so I'm omitting it.

operator()

This is the main custom bit in the pruner.

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 5.767 s | 2700 | run() |
| 74% | 4.273 s | 2719 | runUntilStabilised() |
| 39% | 2.247 s | 741371 | operator()(Block) |
| 13% | 0.766 s | 295388 | FunctionDefinition case |
| 1% | 0.064 s | 3010287 | VariableDeclaration case |
| 0% | 0.024 s | 914453 | ExpressionStatement case |
| 0% | 0.015 s | 741371 | removeEmptyBlocks() |

The operator is relatively lightweight but invoked a lot, so profiling adds visible overhead (~0.5 s). However, even after subtracting the overhead, there's still almost 1 s unaccounted for. I did not manage to explain where it's going. It looks like it must be coming from ASTModifier or the for loop itself, but when profiled in isolation those two do not take nearly that much time.

The way the visitation is done does seem a bit redundant: we loop over the statements in every block, but then run the original operator, which visits all of that again. This is because the pruner needs to delete statements or replace them with statements of a different type, which is difficult to do in normal visitation. I spent a moment trying to rewrite it, but ultimately gave up.
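For illustration, the shape of that pattern with simplified stand-in types (not libyul's real AST or visitor):

```cpp
#include <algorithm>
#include <vector>

struct Statement { bool unused = false; };
struct Block { std::vector<Statement> statements; };

void visitChildren(Block&); // stand-in for ASTModifier's generic descent

void pruneBlock(Block& _block)
{
    // First walk: drop unused statements at this level. Deleting a statement,
    // or replacing it with one of a different type, is hard to express inside
    // a plain visitor, hence this separate loop.
    _block.statements.erase(
        std::remove_if(
            _block.statements.begin(),
            _block.statements.end(),
            [](Statement const& _s) { return _s.unused; }
        ),
        _block.statements.end()
    );
    // Second walk: the generic visitor then descends into the survivors,
    // traversing the same vector again.
    visitChildren(_block);
}
```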

Redundant reruns

One potential source of redundancy is the loop that reruns the step until the code stabilises. I checked how often that happens and how much time it takes:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 3.659 s | 2719 | runUntilStabilised() |
| 65% | 2.378 s | 2719 | first pass |
| 35% | 1.279 s | 1384 | remaining passes |

So it seems that the loop runs ~2 times on average and that the second and later passes are faster than the first one. This would indicate that the second pass rarely does anything. I did a quick experiment by hard-coding it to one pass to see if it affects the results. Unfortunately, I do see a significant number of left-over unused statements in Yul tests, so this does not seem to be a viable simplification.
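As a toy illustration of why the fixed point is needed (all names here are invented, not libyul's): removing one unused definition can orphan its inputs, and only a rerun catches those.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    // reads[i] = index of the definition that statement i reads, or -1.
    // Chain: a := ...; b := f(a); c := g(b); nothing uses c.
    std::vector<int> reads = {-1, 0, 1};
    std::vector<bool> alive = {true, true, true};

    std::size_t passes = 0;
    for (bool changed = true; changed; ++passes)
    {
        changed = false;
        for (std::size_t i = 0; i < alive.size(); ++i)
        {
            if (!alive[i])
                continue;
            bool used = false;
            for (std::size_t j = 0; j < alive.size(); ++j)
                if (alive[j] && reads[j] == static_cast<int>(i))
                    used = true;
            if (!used) // a pure definition nobody reads: prune it
            {
                alive[i] = false;
                changed = true;
            }
        }
    }
    // Each pass removes only the current tail of the chain, so this cascade
    // needs several passes plus a final one that confirms stability.
    std::printf("stabilised after %zu passes\n", passes);
}
```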

Impact of general utilities on whole compilation

I checked how much time the utility classes consume cumulatively over the whole compilation (i.e. in all of the steps and the Yul->EVM transform), to see if focusing on optimizing any single one of them would make sense.

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 60.691 s | | Optimizer steps (not including Yul->EVM) |
| 12.2% | 7.378 s | | TOTAL |
| 4.3% | 2.622 s | 8411 | SideEffectsPropagator::sideEffects() |
| 3.6% | 2.169 s | | ReferencesCounter::countReferences() |
| 2.9% | 1.781 s | 88083 | &emsp;ReferencesCounter::countReferences(Block) |
| 0.6% | 0.387 s | 156399 | &emsp;ReferencesCounter::countReferences(FunctionDefinition) |
| 0.0% | 0.001 s | 785498 | &emsp;ReferencesCounter::countReferences(Expression) |
| 2.6% | 1.595 s | 13767 | CallGraphGenerator::callGraph() |
| 1.6% | 0.992 s | 5963 | MSizeFinder::containsMSize(Block) |

The one that might be conceptually simplest to address is MSizeFinder: since the optimizer never introduces msize itself, we could save ~1 s by checking it once before optimization and just passing the flag on via the context.
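A hypothetical sketch of that idea; the field and its wiring are invented for illustration, not current libyul API:

```cpp
// Compute the msize check once, before running the step sequence, and let
// every step read the cached flag instead of re-running MSizeFinder.
struct OptimiserStepContext
{
    // ... existing members (dialect, name dispenser, etc.) ...
    bool codeContainsMSize = false; // set once up front
};

// A step that previously called MSizeFinder::containsMSize() on every run
// would just branch on the flag. This is only sound because the optimizer
// never introduces msize itself, as noted above.
```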

Overall, their combined impact seems non-negligible, but it's spread over several of them. Focusing on optimizer steps first is probably more worthwhile.

Conclusions

cameel commented 1 month ago

CSE and DataFlowAnalyzer slowness (initial investigation)

Timing of steps that use DataFlowAnalyzer

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 64.251 s | | TOTAL OPTIMIZATION TIME |
| 19.8% | 12.712 s | 3281 | CommonSubexpressionEliminator |
| 7.3% | 4.713 s | 2741 | LiteralRematerialiser |
| 3.6% | 2.288 s | 810 | LoadResolver |
| 0.9% | 0.586 s | 810 | ExpressionSplitter |
| 0.7% | 0.455 s | 270 | EqualStoreEliminator |
| 0.4% | 0.260 s | 810 | Rematerialiser |

That's 21 s in total, ~33% of the optimizer running time. The whole compilation took 109 s (Uniswap).

Structure of CSE

CommonSubexpressionEliminator inherits from DataFlowAnalyzer and does not have all that much code of its own. In neither of them does the execution time seem to be very concentrated.

These are the functions I could identify as taking the most time, but they account for less than half of the total:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 14.191 s | 3281 | CommonSubexpressionEliminator::run() |
| 17% | 2.405 s | 16564075 | DataFlowAnalyzer::handleAssignment() |
| 14% | 1.958 s | 2026470 | DataFlowAnalyzer::popScope() |
| 13% | 1.778 s | 2371866 | DataFlowAnalyzer::clearValues() |
| 9% | 1.239 s | 3281 | SideEffectsPropagator::sideEffects() |

There are also a few caveats to these numbers:

- All the other non-operator methods of DataFlowAnalyzer taken together take less than 0.2 s.
- I also tried to profile the parts of the operators in isolation, but the time does not add up. The slowest were the ones processing ForLoop and If, taking around 0.2-0.4 s each. It's as if the visitation itself was taking a substantial amount of time.

Cost of structural equality

CSE uses SyntacticallyEqual to hash the expressions it keeps track of. I checked the impact of that part alone. This includes both direct use and lookups/insertions in the map.

| Time | Calls | Scope |
|-----:|------:|:------|
| 1.094 s | 14858176 | SyntacticallyEqual |

For some context, CSE's visit(Expression), where some of this hashing happens, takes around 2.5 s (excluding the visit() call to the base class). That's in a compilation that took 91 s total.
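The pattern in question, sketched with simplified stand-in types (libyul's real SyntacticallyEqual operates on yul::Expression and is paired with a matching structural hash):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy expression: an operation name applied to argument names.
struct Expr
{
    std::string op;
    std::vector<std::string> args;
};

// Structural hash: syntactically equal expressions hash the same.
struct ExprHash
{
    std::size_t operator()(Expr const& _e) const
    {
        std::size_t h = std::hash<std::string>{}(_e.op);
        for (std::string const& arg: _e.args)
            h = h * 31 + std::hash<std::string>{}(arg);
        return h;
    }
};

// Structural equality; this plays the role of SyntacticallyEqual.
struct ExprEqual
{
    bool operator()(Expr const& _a, Expr const& _b) const
    {
        return _a.op == _b.op && _a.args == _b.args;
    }
};

// Maps an expression to a variable already known to hold its value. Every
// lookup and insertion hashes (and possibly compares) whole subexpressions,
// which is where call counts in the millions come from.
using KnownExpressions = std::unordered_map<Expr, std::string, ExprHash, ExprEqual>;
```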

Number of calls

What stands out is the sheer number of calls. Here are the numbers for AST visitation:

| Calls | Scope |
|------:|:------|
| 15070420 | VariableDeclaration (DFA) |
| 12153239 | Expression (CSE) |
| 2220705 | ExpressionStatement (DFA) |
| 1571439 | Block (DFA) |
| 1165662 | Assignment (DFA) |
| 819967 | If (DFA) |
| 455031 | FunctionDefinition (DFA) |
| 160753 | FunctionDefinition (CSE) |
| 55949 | Switch (DFA) |
| 54470 | ForLoop (DFA) |

And for the helpers of DataFlowAnalyzer:

| Calls | Scope |
|------:|:------|
| 32246804 | isSimpleLoad() |
| 19362962 | clearKnowledgeIfInvalidated(Expression) |
| 16564075 | handleAssignment() |
| 16085616 | isKeccak() |
| 14731783 | assignValue() |
| 3675605 | inScope() |
| 2371866 | clearValues() |
| 2026470 | pushScope() |
| 2026470 | popScope() |
| 986845 | joinKnowledge() |
| 913264 | isSimpleStore() |
| 183470 | joinKnowledgeHelper() |
| 156057 | valueOfIdentifier() |
| 46571 | memoryValue() |
| 19775 | storageValue() |
| 9341 | DataFlowAnalyzer() |
| 3006 | keccakValue() |

It looks like these components are just processing a lot of input. DFA is visiting e.g. 15 million variable declarations, but with 8722 invocations of DFA-based steps, that's just ~1728 declarations per Yul object on average, which seems realistic for large contracts. Also, with 3281 runs of CSE, each run takes ~4 ms, which does not seem particularly slow. It all just adds up to large numbers over many runs.

I took a look at the overall structure of loops and recursion in the DFA implementation and did not notice anything with pathological complexity. There's linear progression through AST nodes, sometimes with calls to helpers that may loop over the current variables and scopes. popScope() and clearValues() look the heaviest in that regard (with nested loops in the latter), but the loops perform simple operations. These helpers do take a non-negligible amount of time, but are still not the bottleneck. Far more time is spent elsewhere.
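A simplified sketch of that bookkeeping, with stand-in types (DataFlowAnalyzer's real state is richer):

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Knowledge
{
    std::vector<std::set<std::string>> scopes;           // variables per scope
    std::map<std::string, std::set<std::string>> values; // var -> vars its value reads

    void pushScope() { scopes.emplace_back(); }

    void popScope()
    {
        // Forget the closing scope's variables and any value referencing them.
        clearValues(scopes.back());
        scopes.pop_back();
    }

    void clearValues(std::set<std::string> const& _vars)
    {
        // Nested loops, but each step is a cheap set lookup: simple work that
        // accumulates over millions of calls rather than a single hot spot.
        for (auto it = values.begin(); it != values.end();)
        {
            bool invalidated = _vars.count(it->first) > 0;
            for (std::string const& ref: it->second)
                if (_vars.count(ref))
                    invalidated = true;
            it = invalidated ? values.erase(it) : std::next(it);
        }
    }
};
```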

My overall impression is that with so many calls small things just add up. The running time is spread over relatively simple operations and AST visitation. Speeding it up may actually require lots of tiny optimizations rather than looking for a big bottleneck. Tweaking the data structures might also be worthwhile, though we already did a lot of that in the past, so I'm assuming there's no low-hanging fruit there. Other than that, we could try to reduce the number of calls to these components or the size of input they're getting from other steps.

Conclusions

Note that these are just initial observations. I did not find anything spectacular, but given how slow these components are, it's definitely worth poking at them more, so I'm not done yet.