Open nikola-matic opened 1 year ago
Just to elaborate a bit: the current sequence transforms to SSA and back in a loop. This is partially a legacy-code-transform artifact: there, the number of variables translates rather directly to required stack slots. The new code transform does not suffer from this and should be able to generate efficient code directly from SSA.
Based on that, we should generally be able to transform to SSA and stay in SSA up until code generation. However, that would require ensuring that none of the steps that perform meaningful optimizations destroy the SSA property.
Actually changing any such steps to preserve SSA form I'd consider a separate task. This first task should merely investigate what can be achieved by modifying the sequence without actually touching the steps, keeping in mind that transforming back from SSA should no longer add value or contribute to "compilability" relative to the current code transform.
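To make "the SSA property" concrete: in SSA form every variable is assigned exactly once, at its declaration. A minimal sketch of the kind of check that would tell us whether a step destroyed SSA form (hypothetical Python over a toy IR, not solc's actual Yul AST):

```python
# Toy IR: a list of (variable, expression) assignment pairs.
# SSA form is violated as soon as any variable is assigned a second time.

def is_ssa(assignments: list[tuple[str, str]]) -> bool:
    """Return True if every variable is the target of at most one assignment."""
    seen = set()
    for var, _expr in assignments:
        if var in seen:  # reassignment: SSA form is broken
            return False
        seen.add(var)
    return True

print(is_ssa([("x_1", "calldataload(0)"), ("x_2", "add(x_1, 1)")]))  # True
print(is_ssa([("x", "0"), ("x", "add(x, 1)")]))                      # False
```

A check along these lines, run after each step, is one way to find out which steps in the sequence actually destroy SSA form.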
While we're at it, we should double-check whether the effects of the StackCompressor are actually still positive these days, or whether we should just remove it (simplifying our entire pipeline).
For future reference, here are the initial results I showed last week: solc-seqbench-intro-and-initial-results.md.
And here's a bit more analysis: `small-loops`. It's just the default one with the big loop replaced with looping over each component individually. I haven't analyzed it all yet, so here are the plots and some casual observations. I still need to go over those plots in more detail.

`deposit_contract`:
- `single-pass` takes 2x less time and gets only slightly worse results (`small-loops` beats `default`, which beats `always-ssa-min`).
- `small-loops` executes significantly fewer steps than `default` and `always-ssa-min`, but timing differs only minimally.
- `small-loops` converges visibly slower, especially in terms of bytecode size. `default` and `always-ssa-min` would beat it if we stopped as soon as they converge.
- `single-pass` … in execution, in bytecode size half to 1/4. `c`, `M`, `e`, `l`, `T` …

`ramanujan_pi`:
- `single-pass` converges very quickly and is only a little worse in terms of execution time. For bytecode size it's even the same as `small-loops`.
- `small-loops` takes 4x the time to converge and the other two 10x the time.

Since the `single-pass` sequence looks very promising, I decided to see how it would fare in our CI before I analyze the results in more detail. Here's a test PR with the changes: #14887. I posted benchmarks in comments.
Results look promising. 17-46% decrease in total compilation time. Gas benchmarks also aren't too bad. While I can see up to 25% increase in gas in a few tests, the majority is impacted very little and benchmark diffs for external tests look much more reasonable - most of them below 1.5% for both deployment and runtime. Also, in many cases costs actually decrease. ENS is an outlier with bytecode size increasing 5%, though on the other hand ElementFI had bytecode decrease by 4%.
Overall I think that we might be able to tweak this sequence enough to get results very close to the current default.
Also, it's worth noting that the running-time decrease we're seeing with this sequence is very likely close to the limit of what we can achieve by tweaking the sequence. The plots show that most of the gains happen in that first pass, so we'll need to keep most of it in one form or another.
### `no-cse` sequences

Here's the data for the sequences I analyzed yesterday. This time I tried to remove some `c` (CommonSubexpressionEliminator) and `j` (ExpressionJoiner) steps that seemed to degrade the results for `deposit_contract`. The results are mixed: it helps in some cases and makes things worse in others.
And here's some analysis of the data I gathered so far. This is more or less what I presented on the call today.
- `always-ssa-min` is definitely worse than the `default` sequence here. It usually takes just as much time, and there were cases where it was almost 2x slower for almost no additional gain.
- The `no-cse` variants are generally slightly faster than their equivalents, due to having fewer steps. This indicates that dropping some steps may be worthwhile.
- `ramanujan_pi` is the contract that converges the slowest. Bytecode size converges only after 6 repetitions of the main loop and runtime cost after 4. Stopping after just a single pass gives worse results, though it's worth noting that subsequent passes make things worse before they make them better, so ultimately the single-pass sequences do not end up being as much worse as one might expect. There may be a way to achieve results comparable with multi-pass in a more efficient way.
- The `small-loops` sequence in some cases finishes significantly faster than `default` (~60% faster for `ramanujan_pi`) and never takes much longer. Still, it spends a lot of time not improving, just like `default`, so the single-pass variants generally seem more promising.

It seems that a single-pass sequence is the way to go. Such sequences are significantly faster than `default` and very close in terms of optimization. They can probably even be better with enough tweaking.

Even if we don't manage to consistently beat `default` with such a sequence, a safe option would be to expose the `MaxRounds` constant for the main loop as an optimization parameter and set it to 1 by default. In most cases this would give enough optimization, and users could decide on their own whether they want to spend more time to achieve marginally better results.
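The `MaxRounds`-as-a-parameter idea can be sketched roughly as follows (hypothetical Python, not solc's actual implementation; `run_main_loop` stands in for one pass of the optimizer's main loop and here is a dummy that shrinks a fake "bytecode size"):

```python
# Repeat the main optimizer loop up to `max_rounds` times, stopping early
# once a fixed point is reached (a round that changes nothing).

def optimize(code_size: int, run_main_loop, max_rounds: int = 1) -> int:
    """Run the main loop at most `max_rounds` times, with early convergence exit."""
    for _ in range(max_rounds):
        new_size = run_main_loop(code_size)
        if new_size == code_size:  # converged: further rounds are wasted time
            break
        code_size = new_size
    return code_size

# Dummy pass: each round removes 10% of the remaining size, with a floor of 700.
shrink = lambda size: max(700, int(size * 0.9))

print(optimize(1000, shrink, max_rounds=1))   # single pass: 900
print(optimize(1000, shrink, max_rounds=10))  # runs to the fixed point: 700
```

With the default of `max_rounds=1` this behaves like the single-pass variants; users who want the last few percent could raise it.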
Some more loose observations from looking at seqbench results recently:
- The `jmul` part seems to have a significant effect on bytecode size (`small-loops` sequence).
- On the `ramanujan_pi` contract, `TU` steps provide a gain, but the current sequence always reverses it with `c`. Especially in the cleanup part that seems undesirable.

### `the-good-parts` sequence

This is a sequence I put together by starting with `single-pass` and then trying to remove the parts that did not contribute much to the final result. Then I refined it on the `ramanujan_pi` contract by distilling the bits of the `default` sequence that improved its bytecode size with every repetition.
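The "remove the parts that did not contribute much" process described above is essentially greedy pruning. A rough sketch of how one could automate it (hypothetical; `evaluate` stands in for actually running the optimizer and measuring the result, and is replaced by a dummy metric here):

```python
# Greedy pruning of an optimizer step sequence: repeatedly try dropping a
# single step and keep the removal whenever the measured result (lower is
# better) does not get worse.

def prune_sequence(steps: str, evaluate) -> str:
    """Drop steps whose removal does not worsen the metric returned by `evaluate`."""
    best = evaluate(steps)
    i = 0
    while i < len(steps):
        candidate = steps[:i] + steps[i + 1:]
        score = evaluate(candidate)
        if score <= best:  # removal did not make things worse: keep it removed
            steps, best = candidate, score
        else:
            i += 1         # this step matters, move on to the next one
    return steps

# Dummy metric: pretend only the steps "d", "v", "u" matter; every extra
# step adds a small cost (standing in for compilation time).
useful = set("dvu")
metric = lambda seq: -sum(ch in useful for ch in seq) + 0.01 * len(seq)

print(prune_sequence("dhfovu", metric))  # "dvu"
```

In practice the evaluation would be noisy and input-dependent, which is presumably why the refinement above was done by hand on specific contracts.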
And here's the overall comparison of all the sequences I tested so far.
(… `gas_diff_stats.py`, which does not know the original value.) The `min` column is the minimum reached at any point in the sequence. It's reported when it's lower than the final value.

**`deposit_contract`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -15.5% | | -37.0% | -39.4% | -20.1% | -21.8% | 343 ms | 207 ms | 430 ms |
| the-good-parts | -15.3% | | -39.3% | | -21.7% | | 160 ms | 203 ms | 259 ms |
| single-pass | -14.1% | -14.9% | -34.4% | -38.9% | -18.7% | -21.5% | 154 ms | 184 ms | 304 ms |
| always-ssa-min | -15.5% | -15.6% | -36.3% | -39.0% | -19.7% | -21.5% | 363 ms | 174 ms | 466 ms |
| always-ssa | -15.2% | | -36.7% | -38.7% | -20.0% | -21.3% | 236 ms | 184 ms | 377 ms |
| default-no-cse | -15.5% | | -39.4% | | -21.8% | | 298 ms | 210 ms | 394 ms |
| single-pass-no-cse | -14.8% | | -38.4% | | -21.0% | | 134 ms | 182 ms | 255 ms |
| small-loops | -15.4% | -15.6% | -36.2% | -39.2% | -19.7% | -21.6% | 282 ms | 178 ms | 441 ms |
**`FixedFeeRegistrar`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -5.6% | | -33.8% | -34.3% | -31.0% | -31.5% | 187 ms | 170 ms | 193 ms |
| the-good-parts | -5.6% | | -32.4% | -34.3% | -29.7% | -31.5% | 92 ms | 92 ms | 135 ms |
| single-pass | -2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 102 ms | 87 ms | 153 ms |
| always-ssa-min | -3.0% | | -34.1% | -35.1% | -31.2% | -32.1% | 382 ms | 158 ms | 248 ms |
| always-ssa | -2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 130 ms | 88 ms | 188 ms |
| default-no-cse | -5.6% | | -32.6% | -34.1% | -29.8% | -31.2% | 147 ms | 78 ms | 195 ms |
| single-pass-no-cse | -3.1% | | -35.1% | -35.3% | -32.2% | -32.3% | 133 ms | 94 ms | 160 ms |
| small-loops | -2.9% | -3.0% | -33.4% | -35.0% | -30.6% | -32.1% | 149 ms | 84 ms | 196 ms |
**`prbmath_unsigned`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -44.4% | | -30.8% | | -30.3% | | 550 ms | 596 ms | 600 ms |
| the-good-parts | -44.5% | | -29.6% | | -29.1% | | 249 ms | 444 ms | 409 ms |
| single-pass | -44.4% | | -30.7% | -30.8% | -30.2% | -30.3% | 259 ms | 420 ms | 410 ms |
| always-ssa-min | -44.4% | | -30.8% | | -30.3% | | 784 ms | 564 ms | 827 ms |
| always-ssa | -44.4% | | -30.8% | | -30.3% | | 455 ms | 498 ms | 620 ms |
| default-no-cse | -44.4% | | -29.5% | | -29.0% | | 510 ms | 613 ms | 638 ms |
| single-pass-no-cse | -44.3% | | -29.7% | | -29.2% | -29.3% | 241 ms | 433 ms | 614 ms |
| small-loops | -44.4% | | -30.6% | -30.7% | -30.1% | | 472 ms | 403 ms | 626 ms |
**`ramanujan_pi`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -67.2% | -67.7% | -39.2% | -39.9% | -36.2% | -36.8% | 520 ms | 129 ms | 622 ms |
| the-good-parts | -67.2% | -67.5% | -42.9% | | -39.6% | | 135 ms | 134 ms | 233 ms |
| single-pass | -64.1% | -64.8% | -30.1% | -33.5% | -27.8% | -30.9% | 109 ms | 109 ms | 170 ms |
| always-ssa-min | -66.8% | -67.3% | -40.2% | | -37.1% | | 531 ms | 114 ms | 602 ms |
| always-ssa | -65.7% | -66.4% | -30.8% | -42.2% | -28.4% | -39.0% | 160 ms | 124 ms | 232 ms |
| default-no-cse | -67.7% | | -37.3% | -39.9% | -34.5% | -36.8% | 542 ms | 129 ms | 540 ms |
| single-pass-no-cse | -65.0% | | -25.9% | -32.1% | -23.9% | -29.6% | 89 ms | 118 ms | 151 ms |
| small-loops | -66.3% | -67.0% | -28.8% | -31.9% | -26.5% | -29.5% | 303 ms | 109 ms | 379 ms |
**`strings`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -11.9% | -12.1% | -26.1% | -27.7% | -24.6% | -26.1% | 287 ms | 169 ms | 430 ms |
| the-good-parts | -11.9% | | -25.9% | -27.7% | -24.3% | -26.1% | 134 ms | 185 ms | 239 ms |
| single-pass | -11.6% | -11.8% | -23.7% | -27.7% | -22.3% | -26.1% | 121 ms | 161 ms | 219 ms |
| always-ssa-min | -11.9% | -12.1% | -25.4% | -25.9% | -23.9% | -24.4% | 330 ms | 173 ms | 406 ms |
| always-ssa | -11.8% | -12.0% | -23.1% | -29.8% | -21.7% | -28.1% | 185 ms | 178 ms | 295 ms |
| default-no-cse | -12.1% | | -26.2% | | -24.7% | | 311 ms | 182 ms | 389 ms |
| single-pass-no-cse | -11.9% | | -25.3% | | -23.9% | | 108 ms | 169 ms | 212 ms |
| small-loops | -11.6% | -11.8% | -25.4% | -27.7% | -23.9% | -26.1% | 255 ms | 164 ms | 349 ms |
### `the-good-parts` sequence

- … `default` …
- … (`ramanujan_pi.sol`)
- … (`base64.sol` and `erc20.sol`)

seqbench results (values where the new sequence is worse in bold):

| | default runtime gas | the-good-parts runtime gas | default bytecode size | the-good-parts bytecode size | default compilation time (optimized) | the-good-parts compilation time (optimized) |
|---|---|---|---|---|---|---|
| deposit_contract | -15.5% | **-15.3%** | -37.0% | -39.3% | 430 ms | 259 ms |
| FixedFeeRegistrar | -5.6% | -5.6% | -33.8% | **-32.4%** | 193 ms | 135 ms |
| prbmath_unsigned | -44.4% | -44.5% | -30.8% | **-29.6%** | 600 ms | 409 ms |
| ramanujan_pi | -67.2% | -67.2% | -39.2% | -42.9% | 622 ms | 233 ms |
| strings | -11.9% | -11.9% | -26.1% | **-25.9%** | 430 ms | 239 ms |
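For clarity, the percentage columns in these tables are relative differences against the unoptimized baseline. A minimal sketch of that calculation (hypothetical; the actual `gas_diff_stats.py` may compute it differently):

```python
# Relative change of an optimized measurement against its unoptimized baseline,
# expressed as a percentage (negative means the optimizer made it smaller/cheaper).

def relative_diff(optimized: float, unoptimized: float) -> float:
    """Percentage change of `optimized` relative to `unoptimized`."""
    return (optimized - unoptimized) / unoptimized * 100.0

# E.g. bytecode shrinking from 10000 to 6300 bytes is a -37.0% change.
print(round(relative_diff(6300, 10000), 1))  # -37.0
```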
The new sequence clearly beats the single-pass
sequence. Comparison with default
is more mixed.
The outliers seem a bit concerning. Still, they're not extreme and go both ways so the new sequence is not consistently worse in any metric. In many cases it's a little better.
Overall results seem close enough that we're probably fine using it. I'm pretty sure the sequence can still be refined a little if we spend more time on it. All the sequences I tested have cases where they're better than default
by a few percent, most of the time without doing as many repetitions. The trick is to get one that can do it consistently on most input.
The results are oddly mixed: from completely neutral to very negative.
On one hand, I see almost no change in gas usage in our semantic tests.
I also ran seqbench on the `default` and `the-good-parts` sequences. Literally zero difference other than slightly different compilation times (but still probably within the margin of error). Exactly the same bytecode sizes and execution gas. I did spot checks on the generated Yul and the files are identical. To the point that I'm questioning whether my patched solc really worked correctly, but so far I see no evidence that it did not.
On the other hand, results of external tests are terrible (`ir-optimize-evm+yul`):
| project | bytecode_size | deployment_gas | method_gas |
|---|---|---|---|
| brink | 0% | | |
| colony | 0% | | |
| ens | +0.49% ❌ | 0% | 0% |
| perpetual-pools | +9.09% ❌ | +9.19% ❌ | +1.6% ❌ |
| pool-together | +0.86% ❌ | | |
| uniswap | +4.05% ❌ | +3.59% ❌ | +5.87% ❌ |
| yield_liquidator | +2.03% ❌ | +2.29% ❌ | +0.08% ❌ |
| zeppelin | +0.01% ❌ | +0.01% ❌ | +0.02% ❌ |
In some cases zero difference, but in others up to 10% increase in bytecode size and 6% increase in gas usage. And no cases where results improved.
Finally, the change caused some new StackTooDeep
issues. Only in a few semantic/syntax tests in our test suite but also in 7 of the 15 external tests we have.
### `the-good-parts-mk2` sequence

- … `default` … `the-good-parts`. Still mostly between -1% and +3%, with outliers at -6% and 4-9%, but also has a number of cases where a 2-4% increase became 0-1%.
- … `the-good-parts` …
- … `the-good-parts` …
- … `the-good-parts` … (bytecode size changes between 0% and +0.3% and negligible runtime cost changes)

seqbench results (values that differ from `the-good-parts` in bold):

| | default runtime gas | the-good-parts-mk2 runtime gas | the-good-parts runtime gas | default bytecode size | the-good-parts-mk2 bytecode size | the-good-parts bytecode size | default compilation time | the-good-parts-mk2 compilation time | the-good-parts compilation time |
|---|---|---|---|---|---|---|---|---|---|
| deposit_contract | -15.5% | -15.3% | -15.3% | -37.0% | **-39.4%** | -39.3% | 430 ms | 248 ms | 259 ms |
| erc20 | -4.2% | **-4.2%** | -4.1% | -37.0% | **-35.9%** | -34.2% | 149 ms | 113 ms | 94 ms |
| FixedFeeRegistrar | -5.6% | -5.6% | -5.6% | -33.8% | **-32.8%** | -32.4% | 193 ms | 120 ms | 135 ms |
| prbmath_unsigned | -44.4% | -44.5% | -44.5% | -30.8% | -29.6% | -29.6% | 600 ms | 403 ms | 364 ms |
| ramanujan_pi | -67.2% | -67.2% | -67.2% | -39.2% | -42.9% | -42.9% | 622 ms | 238 ms | 233 ms |
| strings | -11.9% | -11.9% | -11.9% | -26.1% | **-27.2%** | -25.9% | 430 ms | 236 ms | 239 ms |
The new sequence is the same or better than `the-good-parts` in nearly all cases. Speed is also pretty much the same.
It's still not as good as default
but also still much faster.
### `the-good-parts` sequence variants (`cancun`)

New contracts:
- `erc20` (ERC20 stub from our repo)
- `oz-erc20` (actual ERC20 from OpenZeppelin)
- `chains` from our repo (pathologically slow contract from our timing benchmarks)

Note that I excluded `chains.sol` here because the range of values makes it hard to compare with other contracts.

Which `the-good-parts` is the best?

- Overall: mk3 >= mk2 >= mk1. But the differences are small.
- Compared to `default` they can be considered pretty much the same. The difference comes from each version having a few more steps. Overall: mk1 >= mk2 >= mk3. But again, the differences are small.
All of the sequences are pretty close. I'd recommend mk3 because it has a small edge in terms of optimization quality and the difference in timing is not significant. It's also the one I spent the most time analyzing so I don't see much point in going back to the other two. Still, it seems that improvements were marginal at best.
### `the-good-parts-mk3` vs `default`

- The `uniswap` … we've seen with mk1 is gone, but this is probably due to the switch to Cancun, not improvements in the sequence.
- seqbench: … `chains.sol`, where `default` increases the size by 252.3%, while `the-good-parts-mk3` "only" by 227.2%.
- … (`colony` being a single outlier at +6%).
- seqbench: …
The default optimizer sequence we have at the moment is quite large and thus bloats compilation times.
The outcome should hopefully be a shorter sequence that achieves the same level of optimization while reducing compile times (in the context of the optimizer pipeline). Special attention should be paid to steps that destroy the SSA form of their input (which is better suited for optimization) and thus cause subsequent steps to perform poorly.
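For readers unfamiliar with the notation used throughout this thread: sequences are strings of single-letter step abbreviations (e.g. `c`, `j`, `T`, `U`), where a segment in square brackets is repeated up to `MaxRounds` times. A simplified, hypothetical expander for that notation (nesting and the `:` cleanup separator are deliberately ignored here):

```python
# Unroll one level of [...] repetition in an optimizer step sequence string
# into a flat sequence of steps. Nested brackets are NOT handled.

def expand(sequence: str, rounds: int) -> str:
    """Replace each bracketed segment with `rounds` copies of its contents."""
    out = []
    i = 0
    while i < len(sequence):
        if sequence[i] == "[":
            end = sequence.index("]", i)
            out.append(sequence[i + 1:end] * rounds)  # repeat the bracketed part
            i = end + 1
        else:
            out.append(sequence[i])
            i += 1
    return "".join(out)

print(expand("d[vu]f", 3))  # dvuvuvuf
```

This makes the cost model behind the discussion concrete: a multi-pass sequence with a large bracketed main loop expands to many times more steps than a single-pass one, which is where the compilation-time difference comes from.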