### Is the first pass of `FullInliner` necessary?

`FullInliner` does two passes through the code, potentially duplicating some work. The first pass (inlining only tiny functions) seems to be a subset of the second, so I decided to check whether it makes that much difference and how it impacts running time.

Removing the first pass did shave off ~1 s from the inliner's running time in my benchmarks (on a 120 s compilation of Uniswap).

Here's what it looks like in detail:
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 4.183 s | 810 | run() |
78% | 3.085 s | 810 | ├─run(Pass::InlineRest) |
13% | 0.561 s | 810 | ├─run(Pass::InlineTiny) |
14% | 0.458 s | 810 | └─FullInliner() |
4% | 0.160 s | 810 | &nbsp;&nbsp;└─ReferencesCounter::countReferences() |
So it may be closer to 0.5 s.
I don't have much data on runtime cost, but what I'm seeing is not bad. In most tests in our repo the impact is slightly negative but minimal. At the same time, there are a few specific tests where it goes down by a lot.
I created #15509 to see external test results, but runtime gas results are very sparse. It does have a very positive impact on bytecode size though.
`ir-optimize-evm+yul`:

project | bytecode_size | deployment_gas | method_gas |
---|---|---|---|
brink | 0% | | |
colony | +0% | | |
elementfi | -0.27% ✅ | | |
ens | -1.77% ✅ | -0.66% ✅ | -0% |
euler | -0.14% ✅ | | |
gnosis | | | |
gp2 | +0.15% ❌ | | |
pool-together | +0.22% ❌ | | |
uniswap | -4.62% ✅ | | |
yield_liquidator | -1.54% ✅ | -1.62% ✅ | -0.11% ✅ |
zeppelin | -2.62% ✅ | | |
### `FullInliner::tryInlineStatement()`
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 4.183 s | 810 | run() |
68% | 2.860 s | 1566132 | └─tryInlineStatement() |
54% | 2.278 s | 36552 | ├─performInline() |
48% | 1.990 s | 36552 | │ ├─BodyCopier |
25% | 1.060 s | 772025 | │ │ └─NameDispenser::newName() |
5% | 0.216 s | 62898 | │ └─newVariable() |
4% | 0.187 s | 62898 | │ └─NameDispenser::newName() |
13% | 0.536 s | 909031 | └─shallInline() |
11% | 0.468 s | 156399 | &nbsp;&nbsp;└─ReferencesCounter::countReferences() |
The numbers come from the same Uniswap run as in https://github.com/ethereum/solidity/issues/15508#issuecomment-2407974008 above. Just note that they are from both passes combined.
@ekpyron had a hypothesis that it's `NameDispenser::newName()` that takes most of the time in `BodyCopier`. This is a bit tricky to verify because a single call of this function is relatively lightweight. From the numbers above it looks like it accounts for only half of `BodyCopier`'s running time. However, if you look at the table with steps at the bottom of this comment, you'll see that the normal running time of `FullInliner` is ~3 s, while here it blew up to ~4 s. I think the missing second is the overhead of the profiler probe I placed inside `BodyCopier`'s `operator()`. With ~800k invocations it adds up. This would confirm the hypothesis.
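For illustration, the probe is conceptually something like the RAII helper below (a minimal sketch; the actual instrumentation differs and the `Probe` name is made up). Two clock reads plus a map update per call are cheap in isolation, but at ~800k invocations they can easily account for the missing second:

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical scoped probe: accumulates the time between its construction
// and destruction under a named counter. Not the real instrumentation, just
// its general shape.
class Probe
{
public:
	explicit Probe(std::string _name):
		m_name(std::move(_name)),
		m_start(std::chrono::steady_clock::now())
	{}

	~Probe()
	{
		// Two clock reads and a map lookup per call - negligible once,
		// noticeable at ~800k invocations.
		accumulated()[m_name] += std::chrono::steady_clock::now() - m_start;
	}

	static std::map<std::string, std::chrono::steady_clock::duration>& accumulated()
	{
		static std::map<std::string, std::chrono::steady_clock::duration> s_times;
		return s_times;
	}

private:
	std::string m_name;
	std::chrono::steady_clock::time_point m_start;
};

// Usage inside the measured function, e.g. BodyCopier's operator():
// Probe probe{"BodyCopier::operator()"};
```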
### `NameDispenser::newName()` the quick and dirty way

I suspected that there may be a way to address the bottleneck locally and get the speedup already, even before we're done with the YulName stuff.
Identifying which part of `newName()` is the heaviest was more difficult than I expected, though. At first I suspected `illegalName()`, but profiling led me to believe that it's the string conversion instead. I wrote a faster, optimized conversion, but it turned out to be a red herring: it was not the bottleneck. In the end I had to resort to artificially repeating each part multiple times in a loop to find out how heavy the parts are relative to each other:
Time % | Element |
---|---|
61.9% | YulString conversion |
22.0% | illegalName() |
9.4% | std::string concatenation |
3.7% | YulString::str() |
3.0% | std::to_string() |
Looks like it's mostly in `YulString`, and we'll have to wait for #15215 after all.
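For reference, here's roughly how the measured elements map onto the structure of `newName()` (paraphrased from the source; details may differ slightly):

```cpp
YulString NameDispenser::newName(YulString _nameHint)
{
	YulString name = _nameHint;
	while (illegalName(name))
	{
		m_counter++;
		// The parts measured above, heaviest first:
		// - constructing the YulString (61.9%): the string gets hashed and
		//   interned in the global string repository
		// - illegalName() (22.0%): checks against used and reserved names
		// - std::string concatenation (9.4%)
		// - YulString::str() (3.7%) and std::to_string() (3.0%)
		name = YulString(_nameHint.str() + "_" + std::to_string(m_counter));
	}
	m_usedNames.emplace(name);
	return name;
}
```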
### Impact of `NameDispenser` on the optimizer as a whole

I measured this by replacing the implementation with this:
```cpp
static size_t counter = 0;
++counter;
return YulString("___UNIQUE_NAME___" + std::to_string(counter));
```
The result is that the total running time of the optimizer when compiling Uniswap went down from 60 s to 53 s. Here's a detailed breakdown between steps:
Time % (before) | Time % (after) | Time (before) | Time (after) | Step |
---|---|---|---|---|
1.1% | 0.9% | 0.685 s | 0.469 s | ExpressionSplitter |
2.8% | 2.7% | 1.693 s | 1.429 s | LoopInvariantCodeMotion |
3.6% | 2.1% | 2.200 s | 1.088 s | **FunctionSpecializer** |
3.7% | 3.7% | 2.240 s | 1.981 s | UnusedStoreEliminator |
3.9% | 4.1% | 2.327 s | 2.192 s | ConditionalSimplifier |
4.1% | 4.4% | 2.485 s | 2.307 s | LoadResolver |
4.6% | 4.7% | 2.795 s | 2.496 s | DeadCodeEliminator |
4.7% | 3.8% | 2.843 s | 2.006 s | **SSATransform** |
5.1% | 5.3% | 3.090 s | 2.797 s | ConditionalUnsimplifier |
5.3% | 3.7% | 3.193 s | 1.947 s | **FullInliner** |
8.2% | 8.8% | 4.963 s | 4.649 s | LiteralRematerialiser |
8.7% | 8.8% | 5.246 s | 4.626 s | **UnusedAssignEliminator** |
8.7% | 8.6% | 5.252 s | 4.525 s | **UnusedPruner** |
9.8% | 10.7% | 5.902 s | 5.672 s | ExpressionSimplifier |
20.7% | 22.7% | 12.497 s | 11.991 s | **CommonSubexpressionEliminator** |
100.0% | 100.0% | 60.385 s | 52.862 s | TOTAL |
The steps which became faster by 0.5 s or more are marked in bold. I omitted steps that take less than 1% of running time since there's nothing interesting happening there.
Conclusions:

- `NameDispenser` seems to be the biggest bottleneck of the inliner.
- Optimizing `NameDispenser` will substantially speed up not only the inliner but also many other steps.
- The cost of `NameDispenser` seems to be tied mostly to `YulString` and not to some simpler things that could be quickly optimized locally. We do need #15215 for this.

### UnusedPruner

`UnusedPruner` is relatively simple, but takes a surprisingly large share of the total optimization time.
Here's what it looks like for Uniswap:
Time % | Time | Calls | Scope |
---|---|---|---|
100.0% | 62.251 s | 38316 | TOTAL |
20.5% | 12.731 s | 3281 | CommonSubexpressionEliminator |
9.5% | 5.898 s | 1391 | ExpressionSimplifier |
8.8% | 5.473 s | 1350 | UnusedAssignEliminator |
8.4% | 5.251 s | 2700 | UnusedPruner |
That's out of a total compilation time of ~120 s. Here's the breakdown inside `UnusedPruner` itself:
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 5.286 s | 2700 | run() |
100% | 5.285 s | 2700 | runUntilStabilisedOnFullAST() |
72% | 3.793 s | 2719 | runUntilStabilised() |
34% | 1.821 s | 741371 | operator()(Block) |
27% | 1.409 s | 4551 | constructor |
27% | 1.408 s | 4551 | ReferencesCounter::countReferences() |
15% | 0.782 s | 2700 | SideEffectsPropagator::sideEffects() |
8% | 0.444 s | 2700 | MSizeFinder::containsMSize() |
5% | 0.284 s | 2700 | CallGraphGenerator::callGraph() |
Pretty much all of the time is spent in general utilities and in `operator()`, which contains the main logic.

Note that `runUntilStabilised()` here has more calls than `run()`. That's because it's also invoked by `StackCompressor`. The time spent in those calls is negligible (0.04 s), so I'm omitting it.
### `operator()`

This is the main custom bit in the pruner.
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 5.767 s | 2700 | run() |
74% | 4.273 s | 2719 | runUntilStabilised() |
39% | 2.247 s | 741371 | operator()(Block) |
13% | 0.766 s | 295388 | FunctionDefinition case |
1% | 0.064 s | 3010287 | VariableDeclaration case |
0% | 0.024 s | 914453 | ExpressionStatement case |
0% | 0.015 s | 741371 | removeEmptyBlocks() |
The operator is relatively lightweight but invoked a lot, so profiling adds visible overhead (~0.5 s). However, even after subtracting the overhead, there's still almost 1 s unaccounted for. I did not manage to explain where it's going. It looks like it must be coming from `ASTModifier` or the `for` loop itself, but when profiled in isolation those two do not take nearly that much time.
The way the visitation is done does seem a bit redundant: we loop over statements in every block, but then run the original operator, which visits all of that again. This is because the pruner needs to delete statements or replace them with statements of a different type, which is difficult to do in normal visitation. I spent a moment trying to rewrite it, but ultimately gave up.
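Schematically, the pattern is something like this (heavily simplified; `isUnused()` is a hypothetical stand-in for the real reference-count checks):

```cpp
void UnusedPruner::operator()(Block& _block)
{
	// First loop: replace unused declarations/definitions at this level.
	// Statements are replaced rather than erased, because erasing while
	// iterating is awkward; empty blocks get cleaned up afterwards.
	for (auto&& statement: _block.statements)
		if (isUnused(statement)) // hypothetical helper
			statement = Block{};

	// Second traversal: the generic visitor walks the same statements
	// again to recurse into nested blocks.
	ASTModifier::operator()(_block);

	removeEmptyBlocks(_block);
}
```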
One potential source of redundancy is the code that loops until the result stabilises. I checked how often that happens and how much time it takes:
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 3.659 s | 2719 | runUntilStabilised() |
65% | 2.378 s | 2719 | first pass |
35% | 1.279 s | 1384 | remaining passes |
So it seems that the loop runs ~2 times on average and that the second and later passes are faster than the first one. This would indicate that the second pass rarely does anything. I did a quick experiment by hard-coding it to one pass to see if it affects the results. Unfortunately, I do see a significant number of left-over unused statements in Yul tests, so this does not seem to be a viable simplification.
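The loop itself is a simple fixed-point iteration. Note that each pass constructs a fresh pruner, and therefore re-runs `ReferencesCounter`, which is where the constructor cost in the earlier table comes from. Roughly (paraphrased, with a simplified signature):

```cpp
void UnusedPruner::runUntilStabilised(Dialect const& _dialect, Block& _ast)
{
	while (true)
	{
		// Fresh instance per pass - reference counts are rebuilt from
		// scratch every time.
		UnusedPruner pruner(_dialect, _ast);
		pruner(_ast);
		if (!pruner.shouldRunAgain())
			return;
	}
}
```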
I checked how much time the utility classes consume cumulatively over the whole compilation (i.e. in all of the steps and the Yul->EVM transform), to see if focusing on optimizing any single one of them would make sense.
Time % | Time | Calls | Scope |
---|---|---|---|
100.0% | 60.691 s | | Optimizer steps (not including Yul->EVM) |
12.2% | 7.378 s | | TOTAL |
4.3% | 2.622 s | 8411 | SideEffectsPropagator::sideEffects() |
3.6% | 2.169 s | | ReferencesCounter::countReferences() |
2.9% | 1.781 s | 88083 | ├─ReferencesCounter::countReferences(Block) |
0.6% | 0.387 s | 156399 | ├─ReferencesCounter::countReferences(FunctionDefinition) |
0.0% | 0.001 s | 785498 | └─ReferencesCounter::countReferences(Expression) |
2.6% | 1.595 s | 13767 | CallGraphGenerator::callGraph() |
1.6% | 0.992 s | 5963 | MSizeFinder::containsMSize(Block) |
The one that might be conceptually simplest to address is `MSizeFinder`: since the optimizer never introduces `msize` itself, we could save ~1 s by checking it once before optimization and just passing the flag on via the context.
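A minimal sketch of that idea, assuming the flag gets threaded through the optimiser's entry point (the extra parameter and the exact plumbing are hypothetical):

```cpp
// Hypothetical: scan for msize once, before any step runs. The optimizer
// never introduces msize, so a precomputed 'false' stays valid; a 'true'
// may become stale if the msize call gets pruned, but treating it as still
// present is merely conservative, not incorrect.
bool msizePresent = MSizeFinder::containsMSize(_dialect, _ast);

for (auto const& stepName: _steps)
	// Each step reads the cached flag (e.g. via OptimiserStepContext)
	// instead of re-walking the whole AST with MSizeFinder.
	runStep(stepName, _context, _ast, msizePresent); // hypothetical helper
```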
Overall, their combined impact seems non-negligible, but it's spread over several of them. Focusing on optimizer steps first is probably more worthwhile.
Conclusions:

- Almost 1 s of the pruner's running time is unaccounted for (in and around `operator()`). Could be the overhead of visitation or something else.
- We could save ~1 s by checking for `msize` once and storing a flag.

Here's the time taken by `CommonSubexpressionEliminator` and several related steps:

Time % | Time | Calls | Scope |
---|---|---|---|
100.0% | 64.251 s | TOTAL OPTIMIZATION TIME | |
19.8% | 12.712 s | 3281 | CommonSubexpressionEliminator |
7.3% | 4.713 s | 2741 | LiteralRematerialiser |
3.6% | 2.288 s | 810 | LoadResolver |
0.9% | 0.586 s | 810 | ExpressionSplitter |
0.7% | 0.455 s | 270 | EqualStoreEliminator |
0.4% | 0.260 s | 810 | Rematerialiser |
That's 21 s in total, ~33% of the optimizer running time. The whole compilation took 109 s (Uniswap).
`CommonSubexpressionEliminator` inherits from `DataFlowAnalyzer` and does not have all that much code of its own. In neither of them does the execution time seem very concentrated. These are the functions I could identify as taking the most time, but they account for less than half of the total:
Time % | Time | Calls | Scope |
---|---|---|---|
100% | 14.191 s | 3281 | CommonSubexpressionEliminator::run() |
17% | 2.405 s | 16564075 | DataFlowAnalyzer::handleAssignment() |
14% | 1.958 s | 2026470 | DataFlowAnalyzer::popScope() |
13% | 1.778 s | 2371866 | DataFlowAnalyzer::clearValues() |
9% | 1.239 s | 3281 | SideEffectsPropagator::sideEffects() |
There are also a number of caveats to these numbers:

- `DataFlowAnalyzer`'s calls here are from the whole compilation, not just from the CSE step (the profiler does not distinguish them).
- `handleAssignment()` sometimes calls `clearValues()`.

All the other non-operator methods of `DataFlowAnalyzer` taken together take less than 0.2 s. I also tried to profile the parts of the operators in isolation, but the time does not add up. The slowest were the ones processing `ForLoop` and `If`, taking around 0.2-0.4 s each. It's as if the visitation itself was taking a substantial amount of time.
CSE uses `SyntacticallyEqual` to hash the expressions it keeps track of. I checked what the impact of that part alone is. This includes both direct use and lookups/insertions in the map.
Time | Calls | Scope |
---|---|---|
1.094 s | 14858176 | SyntacticallyEqual |
For some context, CSE's `visit(Expression)`, where some of this hashing happens, takes around 2.5 s (excluding the `visit()` call to the base class). That's in a compilation that took 91 s total.
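The usage pattern is essentially a hash map keyed by expressions, with a structural hash and `SyntacticallyEqual` as the equality predicate (schematic; the hasher and map types here are illustrative, and the exact types in CSE differ):

```cpp
#include <functional>
#include <unordered_map>

// Assumes libyul's Expression, YulString and SyntacticallyEqual.
// A structural hash has to walk the whole expression tree, and on any
// collision SyntacticallyEqual walks it again - so every lookup or
// insertion costs at least one full traversal of the expression.
struct ExpressionHash
{
	size_t operator()(std::reference_wrapper<Expression const> _expr) const;
};

struct ExpressionEqual
{
	bool operator()(
		std::reference_wrapper<Expression const> _a,
		std::reference_wrapper<Expression const> _b
	) const
	{
		// SyntacticallyEqual keeps per-comparison state, hence a fresh one.
		return SyntacticallyEqual{}(_a.get(), _b.get());
	}
};

// Maps an expression to a variable already known to hold its value.
std::unordered_map<
	std::reference_wrapper<Expression const>,
	YulString,
	ExpressionHash,
	ExpressionEqual
> knownExpressions;
```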
What stands out is the sheer number of calls. Here are the numbers for AST visitation:
Calls | Scope |
---|---|
15070420 | VariableDeclaration (DFA) |
12153239 | Expression (CSE) |
2220705 | ExpressionStatement (DFA) |
1571439 | Block (DFA) |
1165662 | Assignment (DFA) |
819967 | If (DFA) |
455031 | FunctionDefinition (DFA) |
160753 | FunctionDefinition (CSE) |
55949 | Switch (DFA) |
54470 | ForLoop (DFA) |
And for the helpers of `DataFlowAnalyzer`:
Calls | Scope |
---|---|
32246804 | isSimpleLoad() |
19362962 | clearKnowledgeIfInvalidated(Expression) |
16564075 | handleAssignment() |
16085616 | isKeccak() |
14731783 | assignValue() |
3675605 | inScope() |
2371866 | clearValues() |
2026470 | pushScope() |
2026470 | popScope() |
986845 | joinKnowledge() |
913264 | isSimpleStore() |
183470 | joinKnowledgeHelper() |
156057 | valueOfIdentifier() |
46571 | memoryValue() |
19775 | storageValue() |
9341 | DataFlowAnalyzer() |
3006 | keccakValue() |
It looks like these components are just processing a lot of input. DFA is visiting e.g. 15 million variable declarations, but with 8722 invocations of DFA-based steps, that's just 1728 declarations per Yul object on average, which seems realistic for large contracts. Also, with 3281 runs of CSE, each run takes ~4 ms, which does not seem particularly slow. It all just adds up to large numbers over many runs.
I took a look at the overall structure of loops and recursion in the DFA implementation and did not notice anything with pathological complexity. There's a linear progression through AST nodes, sometimes with calls to helpers that may loop over current variables and scopes. `popScope()` and `clearValues()` look the heaviest in that regard (with nested loops in the latter), but the loops perform simple operations. These helpers do take a non-negligible amount of time, but they are still not the bottleneck. Far more time is spent elsewhere.
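For reference, `popScope()` is roughly this (paraphrased; `clearValues()` is similar in spirit but additionally cross-checks every stored value against the variables being cleared, which is the nested loop mentioned above):

```cpp
void DataFlowAnalyzer::popScope()
{
	// Linear in the number of variables declared in the closing scope:
	// drop the tracked value and reference information for each of them.
	for (auto const& name: m_variableScopes.back().variables)
	{
		m_value.erase(name);
		m_references.erase(name);
	}
	m_variableScopes.pop_back();
}
```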
My overall impression is that with so many calls, small things just add up. The running time is spread over relatively simple operations and AST visitation. Speeding it up may actually require lots of tiny optimizations rather than looking for a big bottleneck. Tweaking the data structures might also be worthwhile, though we already did a lot of that in the past, so I'm assuming there's no low-hanging fruit there. Other than that, we could try to reduce the number of calls to these components or the size of the input they're getting from other steps.
Note that these are just initial observations. I did not find anything spectacular, but given how slow these components are, it's definitely worth poking at them more, so I'm not done yet.
A catch-all issue for specific ideas for performance improvements in individual optimizer steps, plus a place to post profiling results and discuss them.
`BodyCopier` takes ~60% of the inliner's running time.