
Performance improvements of individual Yul Optimizer steps #15508

Open cameel opened 1 month ago

cameel commented 1 month ago

A catch-all issue for specific ideas for performance improvements in individual optimizer steps, plus a place to post profiling results and discuss them.

cameel commented 1 month ago

Are two passes of the FullInliner necessary?

FullInliner does two passes through the code, potentially duplicating some work. The first pass (inlining only tiny functions) seems to be a subset of the second, so I decided to check whether it makes much of a difference and how it impacts running time.

Results

Removing the first pass did shave off ~1 s from the inliner's running time in my benchmarks (on a 120 s compilation of Uniswap).

Here's what it looks like in detail:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 4.183 s | 810 | run() |
| 78% | 3.085 s | 810 | ├─ run(Pass::InlineRest) |
| 13% | 0.561 s | 810 | ├─ run(Pass::InlineTiny) |
| 14% | 0.458 s | 810 | └─ FullInliner() |
| 4% | 0.160 s | 810 | &emsp;└─ ReferencesCounter::countReferences() |

So it may be closer to 0.5 s.
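For orientation, here is a minimal sketch of the two-pass shape implied by the scopes above. The pass names come from the profile; everything else is an assumed stand-in, not the actual FullInliner code:

```cpp
// Assumed shape only, reconstructed from the profile scopes above.
enum class Pass { InlineTiny, InlineRest };

void runPass(Pass); // stand-in for one inlining pass over the whole AST

void runFullInliner()
{
    runPass(Pass::InlineTiny); // pass 1: inline only tiny functions
    runPass(Pass::InlineRest); // pass 2: inline everything else eligible,
                               // revisiting much of the same code
}
```

The experiment above simply drops the first call and keeps only the second.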

Gas impact

I don't have much data on the runtime cost, but what I'm seeing is not bad. In most tests in our repo the impact is slightly negative but minimal. At the same time, there are a few specific tests where it goes down by a lot.

I created #15509 to see external test results, but runtime gas results are very sparse. It does have a very positive impact on bytecode size though.

ir-optimize-evm+yul:

| project | bytecode_size | deployment_gas | method_gas |
|:--------|:--------------|:---------------|:-----------|
| brink | 0% | | |
| colony | +0% | | |
| elementfi | -0.27% ✅ | | |
| ens | -1.77% ✅ | -0.66% ✅ | -0% |
| euler | -0.14% ✅ | | |
| gnosis | | | |
| gp2 | +0.15% ❌ | | |
| pool-together | +0.22% ❌ | | |
| uniswap | -4.62% ✅ | | |
| yield_liquidator | -1.54% ✅ | -1.62% ✅ | -0.11% ✅ |
| zeppelin | -2.62% ✅ | | |

Conclusions

cameel commented 1 month ago

FullInliner's BodyCopier takes ~60% of inliner's running time.

Profiling FullInliner::tryInlineStatement()

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 4.183 s | 810 | run() |
| 68% | 2.860 s | 1566132 | └─ tryInlineStatement() |
| 54% | 2.278 s | 36552 | &emsp;├─ performInline() |
| 48% | 1.990 s | 36552 | &emsp;│ ├─ BodyCopier |
| 25% | 1.060 s | 772025 | &emsp;│ │ └─ NameDispenser::newName() |
| 5% | 0.216 s | 62898 | &emsp;│ └─ newVariable() |
| 4% | 0.187 s | 62898 | &emsp;│ &emsp;└─ NameDispenser::newName() |
| 13% | 0.536 s | 909031 | &emsp;└─ shallInline() |
| 11% | 0.468 s | 156399 | &emsp;&emsp;└─ ReferencesCounter::countReferences() |

The numbers come from the same Uniswap run as in https://github.com/ethereum/solidity/issues/15508#issuecomment-2407974008 above. Just note that they are from both passes combined.

@ekpyron had a hypothesis that it's NameDispenser::newName() that takes most of the time in BodyCopier. This is a bit tricky to verify because a single call of this function is relatively lightweight. From the numbers above it looks like it accounts for only half of BodyCopier's running time. However, if you look at the table with steps at the bottom of this comment, you'll see that the normal running time of FullInliner is ~3 s, while here it blew up to ~4 s. I think the missing second is the overhead of the profiler probe I placed inside BodyCopier's operator(). With ~800k invocations it adds up. This would confirm the hypothesis.
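To illustrate the overhead argument: such a probe is typically a scope guard that reads the clock on entry and exit, so at ~800k invocations even a cheap implementation visibly distorts the numbers. A minimal sketch of what I mean (hypothetical; not necessarily the instrumentation used for the tables here):

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Accumulates total time and call count per named scope.
struct ScopeProbe
{
    static inline std::map<std::string, std::pair<std::chrono::nanoseconds, std::size_t>> stats;

    std::string name;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    explicit ScopeProbe(std::string _name): name(std::move(_name)) {}
    ~ScopeProbe()
    {
        auto& [total, calls] = stats[name];
        total += std::chrono::steady_clock::now() - start;
        ++calls;
    }
};

// Usage: first statement of the function being measured, e.g.
// ScopeProbe probe{"BodyCopier"};
```

Two clock reads plus a map lookup per call are negligible for coarse scopes, but they add up inside something executed hundreds of thousands of times.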

Optimizing NameDispenser::newName() the quick and dirty way

I suspected that there may be a way to address the bottleneck locally and get the speedup right away, even before we're done with the YulName stuff.

https://github.com/ethereum/solidity/blob/7893614a31fbeacd1966994e310ed4f760772658/libyul/optimiser/NameDispenser.cpp#L47-L57

Identifying which part of newName() is the heaviest was more difficult than I expected, though. At first I suspected illegalName(), but profiling led me to believe that it was the string conversion instead. I wrote a faster, optimized conversion, but it turned out to be a red herring: it was not the bottleneck. In the end I had to resort to artificially repeating each part multiple times in a loop to find out how heavy the parts are relative to each other:

| Time % | Element |
|-------:|:--------|
| 61.9% | YulString conversion |
| 22.0% | illegalName() |
| 9.4% | std::string concatenation |
| 3.7% | YulString::str() |
| 3.0% | std::to_string() |

Looks like it's mostly in YulString and we'll have to wait for #15215 after all.
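For reference, the "repeat each part in a loop" trick looks roughly like this. Everything here is an illustrative stand-in (including illegalName()), not the real NameDispenser code; the idea is just to amplify one sub-expression by a known factor and attribute the extra wall-clock time to it:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

// Illustrative stand-in for the real check; only its relative cost matters.
static bool illegalName(std::string const& _name) { return _name.empty(); }

int main()
{
    constexpr std::size_t repeat = 1'000'000; // amplification factor
    std::string base = "usr$x";

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repeat; ++i)
    {
        // Repeat only the part under test. In the real function it runs once
        // per generated name, so (extra time / repeat) approximates its cost.
        volatile bool illegal = illegalName(base + "_" + std::to_string(i));
        static_cast<void>(illegal);
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::printf(
        "%lld ns per call\n",
        static_cast<long long>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count() / repeat
        )
    );
}
```

In the actual measurement each candidate part of newName() was amplified in turn to get the relative weights in the table above.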

Impact of NameDispenser on optimizer as a whole

I measured this by replacing the implementation with this:

```cpp
// Replacement body: ignore the normal name-generation logic and just hand
// out sequentially numbered names.
static size_t counter = 0;
++counter;
return YulString("___UNIQUE_NAME___" + std::to_string(counter));
```
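Note that this shortcut also drops the illegalName() check that the real implementation performs, so it's only good enough for measuring the cost, not as an actual replacement.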

The result is that the total running time of the optimizer when compiling Uniswap went down from 60 s to 53 s. Here's a detailed per-step breakdown:

| Time % (before) | Time % (after) | Time (before) | Time (after) | Step |
|----------------:|---------------:|--------------:|-------------:|:-----|
| 1.1% | 0.9% | 0.685 s | 0.469 s | ExpressionSplitter |
| 2.8% | 2.7% | 1.693 s | 1.429 s | LoopInvariantCodeMotion |
| 3.6% | 2.1% | 2.200 s | 1.088 s | **FunctionSpecializer** |
| 3.7% | 3.7% | 2.240 s | 1.981 s | UnusedStoreEliminator |
| 3.9% | 4.1% | 2.327 s | 2.192 s | ConditionalSimplifier |
| 4.1% | 4.4% | 2.485 s | 2.307 s | LoadResolver |
| 4.6% | 4.7% | 2.795 s | 2.496 s | DeadCodeEliminator |
| 4.7% | 3.8% | 2.843 s | 2.006 s | **SSATransform** |
| 5.1% | 5.3% | 3.090 s | 2.797 s | ConditionalUnsimplifier |
| 5.3% | 3.7% | 3.193 s | 1.947 s | **FullInliner** |
| 8.2% | 8.8% | 4.963 s | 4.649 s | LiteralRematerialiser |
| 8.7% | 8.8% | 5.246 s | 4.626 s | **UnusedAssignEliminator** |
| 8.7% | 8.6% | 5.252 s | 4.525 s | **UnusedPruner** |
| 9.8% | 10.7% | 5.902 s | 5.672 s | ExpressionSimplifier |
| 20.7% | 22.7% | 12.497 s | 11.991 s | **CommonSubexpressionEliminator** |
| 100.0% | 100.0% | 60.385 s | 52.862 s | TOTAL |

The steps which became faster by 0.5 s or more are marked in bold. I omitted steps that take less than 1% of the running time, since nothing interesting is happening there.

Conclusions

cameel commented 1 month ago

Oversized time impact of UnusedPruner

Context

UnusedPruner is relatively simple, but takes a surprisingly large share of the total optimization time.

Here's what it looks like for Uniswap:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 62.251 s | 38316 | TOTAL |
| 20.5% | 12.731 s | 3281 | CommonSubexpressionEliminator |
| 9.5% | 5.898 s | 1391 | ExpressionSimplifier |
| 8.8% | 5.473 s | 1350 | UnusedAssignEliminator |
| 8.4% | 5.251 s | 2700 | UnusedPruner |

With total compilation time being ~120 s.

Top-level breakdown

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 5.286 s | 2700 | run() |
| 100% | 5.285 s | 2700 | runUntilStabilisedOnFullAST() |
| 72% | 3.793 s | 2719 | runUntilStabilised() |
| 34% | 1.821 s | 741371 | operator()(Block) |
| 27% | 1.409 s | 4551 | constructor |
| 27% | 1.408 s | 4551 | ReferencesCounter::countReferences() |
| 15% | 0.782 s | 2700 | SideEffectsPropagator::sideEffects() |
| 8% | 0.444 s | 2700 | MSizeFinder::containsMSize() |
| 5% | 0.284 s | 2700 | CallGraphGenerator::callGraph() |

Pretty much all of the time is spent in general utilities and in operator(), which contains the main logic.

Note that runUntilStabilised() here has more calls than run(). It's because it's also invoked by StackCompressor. The time spent in those calls is negligible (0.04 s), so I'm omitting it.

operator()

This is the main custom bit in the pruner.

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 5.767 s | 2700 | run() |
| 74% | 4.273 s | 2719 | runUntilStabilised() |
| 39% | 2.247 s | 741371 | operator()(Block) |
| 13% | 0.766 s | 295388 | FunctionDefinition case |
| 1% | 0.064 s | 3010287 | VariableDeclaration case |
| 0% | 0.024 s | 914453 | ExpressionStatement case |
| 0% | 0.015 s | 741371 | removeEmptyBlocks() |

The operator is relatively lightweight but invoked a lot, so profiling adds visible overhead (~0.5 s). However, even after subtracting the overhead, there's still almost 1 s unaccounted for. I did not manage to explain where it's going. It looks like it must be coming from ASTModifier or the for loop itself, but when profiled in isolation those two do not take nearly that much time.

The way the visitation is done does seem a bit redundant: we loop over the statements in every block, but then run the original operator, which visits all of that again. This is because the pruner needs to delete statements or replace them with statements of a different type, which is difficult to do in normal visitation. I spent a moment trying to rewrite it, but ultimately gave up.
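For illustration, the shape of that pattern with simplified stand-in types (not libyul's real AST or visitor):

```cpp
#include <algorithm>
#include <vector>

struct Statement { bool unused = false; };
struct Block { std::vector<Statement> statements; };

void visitChildren(Block&); // stand-in for ASTModifier's generic descent

void pruneBlock(Block& _block)
{
    // First walk: drop unused statements at this level. Deleting a statement,
    // or replacing it with one of a different type, is hard to express inside
    // a plain visitor, hence this separate loop.
    _block.statements.erase(
        std::remove_if(
            _block.statements.begin(),
            _block.statements.end(),
            [](Statement const& _s) { return _s.unused; }
        ),
        _block.statements.end()
    );
    // Second walk: the generic visitor then descends into the survivors,
    // traversing the same vector again.
    visitChildren(_block);
}
```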

Redundant reruns

One potential source of redundancy is the loop that reruns the step until the code stabilises. I checked how often that happens and how much time it takes:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 3.659 s | 2719 | runUntilStabilised() |
| 65% | 2.378 s | 2719 | first pass |
| 35% | 1.279 s | 1384 | remaining passes |

So it seems that the loop runs ~2 times on average and that the second and later passes are faster than the first one. This would indicate that the second pass rarely does anything. I did a quick experiment by hard-coding it to one pass to see if it affects the results. Unfortunately, I do see a significant number of left-over unused statements in Yul tests, so this does not seem to be a viable simplification.
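As a toy illustration of why the fixed point is needed (all names here are invented, not libyul's): removing one unused definition can orphan its inputs, and only a rerun catches those.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    // reads[i] = index of the definition that statement i reads, or -1.
    // Chain: a := ...; b := f(a); c := g(b); nothing uses c.
    std::vector<int> reads = {-1, 0, 1};
    std::vector<bool> alive = {true, true, true};

    std::size_t passes = 0;
    for (bool changed = true; changed; ++passes)
    {
        changed = false;
        for (std::size_t i = 0; i < alive.size(); ++i)
        {
            if (!alive[i])
                continue;
            bool used = false;
            for (std::size_t j = 0; j < alive.size(); ++j)
                if (alive[j] && reads[j] == static_cast<int>(i))
                    used = true;
            if (!used) // a pure definition nobody reads: prune it
            {
                alive[i] = false;
                changed = true;
            }
        }
    }
    // Each pass removes only the current tail of the chain, so this cascade
    // needs several passes plus a final one that confirms stability.
    std::printf("stabilised after %zu passes\n", passes);
}
```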

Impact of general utilities on whole compilation

I checked how much time the utility classes consume cumulatively over the whole compilation (i.e. in all of the steps and the Yul->EVM transform), to see if focusing on optimizing any single one of them would make sense.

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 60.691 s | | Optimizer steps (not including Yul->EVM) |
| 12.2% | 7.378 s | | TOTAL |
| 4.3% | 2.622 s | 8411 | SideEffectsPropagator::sideEffects() |
| 3.6% | 2.169 s | | ReferencesCounter::countReferences() |
| 2.9% | 1.781 s | 88083 | &emsp;ReferencesCounter::countReferences(Block) |
| 0.6% | 0.387 s | 156399 | &emsp;ReferencesCounter::countReferences(FunctionDefinition) |
| 0.0% | 0.001 s | 785498 | &emsp;ReferencesCounter::countReferences(Expression) |
| 2.6% | 1.595 s | 13767 | CallGraphGenerator::callGraph() |
| 1.6% | 0.992 s | 5963 | MSizeFinder::containsMSize(Block) |

The one that might be conceptually simplest to address is MSizeFinder: since the optimizer never introduces msize itself, we could save ~1 s by checking it once before optimization and just passing the flag on via the context.
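A hypothetical sketch of that idea; the field and its wiring are invented for illustration, not current libyul API:

```cpp
// Compute the msize check once, before running the step sequence, and let
// every step read the cached flag instead of re-running MSizeFinder.
struct OptimiserStepContext
{
    // ... existing members (dialect, name dispenser, etc.) ...
    bool codeContainsMSize = false; // set once up front
};

// A step that previously called MSizeFinder::containsMSize() on every run
// would just branch on the flag. This is only sound because the optimizer
// never introduces msize itself, as noted above.
```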

Overall, their combined impact seems non-negligible, but it's spread over several of them. Focusing on optimizer steps first is probably more worthwhile.

Conclusions

cameel commented 1 month ago

CSE and DataFlowAnalyzer slowness (initial investigation)

Timing of steps that use DataFlowAnalyzer

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100.0% | 64.251 s | | TOTAL OPTIMIZATION TIME |
| 19.8% | 12.712 s | 3281 | CommonSubexpressionEliminator |
| 7.3% | 4.713 s | 2741 | LiteralRematerialiser |
| 3.6% | 2.288 s | 810 | LoadResolver |
| 0.9% | 0.586 s | 810 | ExpressionSplitter |
| 0.7% | 0.455 s | 270 | EqualStoreEliminator |
| 0.4% | 0.260 s | 810 | Rematerialiser |

That's 21 s in total, ~33% of the optimizer running time. The whole compilation took 109 s (Uniswap).

Structure of CSE

CommonSubexpressionEliminator inherits from DataFlowAnalyzer and does not have all that much code of its own. In neither of them does the execution time seem to be very concentrated.

These are the functions I could identify as taking the most time, but they account for less than half of the total:

| Time % | Time | Calls | Scope |
|-------:|-----:|------:|:------|
| 100% | 14.191 s | 3281 | CommonSubexpressionEliminator::run() |
| 17% | 2.405 s | 16564075 | DataFlowAnalyzer::handleAssignment() |
| 14% | 1.958 s | 2026470 | DataFlowAnalyzer::popScope() |
| 13% | 1.778 s | 2371866 | DataFlowAnalyzer::clearValues() |
| 9% | 1.239 s | 3281 | SideEffectsPropagator::sideEffects() |

There are also a few caveats to these numbers:

- All the other non-operator methods of DataFlowAnalyzer taken together take less than 0.2 s.
- I also tried to profile the parts of the operators in isolation, but the time does not add up. The slowest were the ones processing ForLoop and If, taking around 0.2-0.4 s each. It's as if the visitation itself was taking a substantial amount of time.

Cost of structural equality

CSE uses SyntacticallyEqual to hash the expressions it keeps track of. I checked the impact of that part alone. This includes both direct use and lookups/insertions in the map.

| Time | Calls | Scope |
|-----:|------:|:------|
| 1.094 s | 14858176 | SyntacticallyEqual |

For some context, CSE's visit(Expression), where some of this hashing happens, takes around 2.5 s (excluding the visit() call to the base class). That's in a compilation that took 91 s total.
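The pattern in question, sketched with simplified stand-in types (libyul's real SyntacticallyEqual operates on yul::Expression and is paired with a matching structural hash):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy expression: an operation name applied to argument names.
struct Expr
{
    std::string op;
    std::vector<std::string> args;
};

// Structural hash: syntactically equal expressions hash the same.
struct ExprHash
{
    std::size_t operator()(Expr const& _e) const
    {
        std::size_t h = std::hash<std::string>{}(_e.op);
        for (std::string const& arg: _e.args)
            h = h * 31 + std::hash<std::string>{}(arg);
        return h;
    }
};

// Structural equality; this plays the role of SyntacticallyEqual.
struct ExprEqual
{
    bool operator()(Expr const& _a, Expr const& _b) const
    {
        return _a.op == _b.op && _a.args == _b.args;
    }
};

// Maps an expression to a variable already known to hold its value. Every
// lookup and insertion hashes (and possibly compares) whole subexpressions,
// which is where call counts in the millions come from.
using KnownExpressions = std::unordered_map<Expr, std::string, ExprHash, ExprEqual>;
```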

Number of calls

What stands out is the sheer number of calls. Here are the numbers for AST visitation:

| Calls | Scope |
|------:|:------|
| 15070420 | VariableDeclaration (DFA) |
| 12153239 | Expression (CSE) |
| 2220705 | ExpressionStatement (DFA) |
| 1571439 | Block (DFA) |
| 1165662 | Assignment (DFA) |
| 819967 | If (DFA) |
| 455031 | FunctionDefinition (DFA) |
| 160753 | FunctionDefinition (CSE) |
| 55949 | Switch (DFA) |
| 54470 | ForLoop (DFA) |

And for the helpers of DataFlowAnalyzer:

| Calls | Scope |
|------:|:------|
| 32246804 | isSimpleLoad() |
| 19362962 | clearKnowledgeIfInvalidated(Expression) |
| 16564075 | handleAssignment() |
| 16085616 | isKeccak() |
| 14731783 | assignValue() |
| 3675605 | inScope() |
| 2371866 | clearValues() |
| 2026470 | pushScope() |
| 2026470 | popScope() |
| 986845 | joinKnowledge() |
| 913264 | isSimpleStore() |
| 183470 | joinKnowledgeHelper() |
| 156057 | valueOfIdentifier() |
| 46571 | memoryValue() |
| 19775 | storageValue() |
| 9341 | DataFlowAnalyzer() |
| 3006 | keccakValue() |

It looks like these components are just processing a lot of input. DFA is visiting e.g. 15 million variable declarations, but with 8722 invocations of DFA-based steps, that's just ~1728 declarations per Yul object on average, which seems realistic for large contracts. Also, with 3281 runs of CSE, each run takes ~4 ms, which does not seem particularly slow. It all just adds up to large numbers over many runs.

I took a look at the overall structure of loops and recursion in the DFA implementation and did not notice anything with pathological complexity. There's linear progression through AST nodes, sometimes with calls to helpers that may loop over the current variables and scopes. popScope() and clearValues() look the heaviest in that regard (with nested loops in the latter), but the loops perform simple operations. These helpers do take a non-negligible amount of time, but are still not the bottleneck. Far more time is spent elsewhere.
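A simplified sketch of that bookkeeping, with stand-in types (DataFlowAnalyzer's real state is richer):

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Knowledge
{
    std::vector<std::set<std::string>> scopes;           // variables per scope
    std::map<std::string, std::set<std::string>> values; // var -> vars its value reads

    void pushScope() { scopes.emplace_back(); }

    void popScope()
    {
        // Forget the closing scope's variables and any value referencing them.
        clearValues(scopes.back());
        scopes.pop_back();
    }

    void clearValues(std::set<std::string> const& _vars)
    {
        // Nested loops, but each step is a cheap set lookup: simple work that
        // accumulates over millions of calls rather than a single hot spot.
        for (auto it = values.begin(); it != values.end();)
        {
            bool invalidated = _vars.count(it->first) > 0;
            for (std::string const& ref: it->second)
                if (_vars.count(ref))
                    invalidated = true;
            it = invalidated ? values.erase(it) : std::next(it);
        }
    }
};
```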

My overall impression is that with so many calls small things just add up. The running time is spread over relatively simple operations and AST visitation. Speeding it up may actually require lots of tiny optimizations rather than looking for a big bottleneck. Tweaking the data structures might also be worthwhile, though we already did a lot of that in the past, so I'm assuming there's no low-hanging fruit there. Other than that, we could try to reduce the number of calls to these components or the size of input they're getting from other steps.

Conclusions

Note that these are just initial observations. I did not find anything spectacular, but given how slow these components are, it's definitely worth poking at them more, so I'm not done yet.