ethereum / solidity

Solidity, the Smart Contract Programming Language
https://soliditylang.org
GNU General Public License v3.0
23.31k stars 5.77k forks source link

Improve default optimizer steps sequence in `via-ir` #14406

Open nikola-matic opened 1 year ago

nikola-matic commented 1 year ago

The default optimizer sequence we have at the moment is quite large, and thus causes bloating in compilation times.

libsolidity/interface/OptimiserSettings.h
static char constexpr DefaultYulOptimiserSteps[] =
    "dhfoDgvulfnTUtnIf"            // None of these can make stack problems worse
    "["
        "xa[r]EscLM"               // Turn into SSA and simplify
        "cCTUtTOntnfDIul"          // Perform structural simplification
        "Lcul"                     // Simplify again
        "Vcul [j]"                 // Reverse SSA

        // should have good "compilability" property here.

        "Tpeul"                    // Run functional expression inliner
        "xa[rul]"                  // Prune a bit more in SSA
        "xa[r]cL"                  // Turn into SSA again and simplify
        "gvif"                     // Run full inliner
        "CTUca[r]LSsTFOtfDnca[r]Iulc" // SSA plus simplify
    "]"
    "jmul[jul] VcTOcul jmul";      // Make source short and pretty

The outcome should hopefully be a shorter sequence, that achieves the same level of optimization, while reducing the compile times (in the context of the optimizer pipeline). Special attention should be paid to steps that destroy the SSA form of its input (better suited for optimizing), and thus cause subsequent steps to perform poorly.

ekpyron commented 1 year ago

Just to elaborate a bit: the current sequence transforms to SSA and back in a loop. This is partially based on the legacy-code-transform fact that the number of variables translates rather directly to required stack slots. The new code transform does not suffer from this and should be able to generate efficient code directly from SSA.

Based on that, we should, generally, be able to transform to SSA and stay in SSA, up until code generation. However, that would require to ensure that none of the steps that perform meaningful optimizations destroy the SSA property.

Actually changing any potential such steps to preserve SSA form I'd consider a separate tasks. This first task should merely investigate what can be achieved with modifying the sequence without actually touching the steps, keeping in mind that transforming back from SSA should not add value and contribute to "compilability" relative to the current code transform anymore.

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to inactivity for the last 90 days. It will be automatically closed in 7 days.

ekpyron commented 1 year ago

While we're at it, we should double-check if the effects of the StackCompressor these days are actually still positive or whether we should just remove it (simplifying our entire pipeline)

cameel commented 8 months ago

For future reference, here are the initial results I showed last week: solc-seqbench-intro-and-initial-results.md.

cameel commented 8 months ago

And here's a bit more analysis:

I haven't analyzed it all yet so here are the plots and some casual observations. I still need to go over those plots in more detail.

Results

Contract with different sequences

Sequence with different contracts

Observations

Contract deposit_contract
Contract ramanujan_pi
cameel commented 8 months ago

Since the single-pass sequence looks very promising, I decided to see how it would fare in our CI before I analyze the results in more detail. Here's a test PR with changes #14887 and I posted benchmarks in comments.

Results look promising. 17-46% decrease in total compilation time. Gas benchmarks also aren't too bad. While I can see up to 25% increase in gas in a few tests, the majority is impacted very little and benchmark diffs for external tests look much more reasonable - most of them below 1.5% for both deployment and runtime. Also, in many cases costs actually decrease. ENS is an outlier with bytecode size increasing 5%, though on the other hand ElementFI had bytecode decrease by 4%.

Overall I think that we might be able to tweak this sequence enough to get results very close to the current default.

Also, worth noting that the running time decrease we're seeing with this sequence may very likely be the bottom line of what we can achieve by tweaking the sequence. The plots show that most of the gains happen in that first pass so we'll need to keep most of it in one form or another.

cameel commented 8 months ago

no-cse sequences

Here's the data for sequences I analyzed yesterday. This time I tried to remove some c (CommonSubexpressionEliminator) and j (ExpressionJoiner) steps that seemed to degrade the results for deposit_contract. The results are mixed. It helps in some cases, makes things worse in others.

Contract with different sequences

Sequence with different contracts

cameel commented 8 months ago

And here's some analysis of the data I gathered so far. This is more or less what I presented on the call today.

Analysis

Conclusions

It seems that a single-pass sequence is the way to go. Such sequences are significantly faster than default and very close in terms of optimization. Probably can even be better with enough tweaking.

Even if we don't manage to consistently beat default with such a sequence, a safe option would be to expose the MaxRounds constant for the main loop as an optimization parameter and set it to 1 by default. In most cases this would give enough optimization and users could decide on their own whether they want to spend more time to achieve minimally better results.

cameel commented 8 months ago

Some more loose observations from looking at seqbench results recently:

cameel commented 8 months ago

the-good-parts sequence

This is a sequence I put together by starting with single-pass and then trying to remove the parts that did not contribute much to the final result. Then refined it on the ramanujan_pi contract by distilling the bits of the default sequence that improved its bytecode size with every repetition.

Contract with different sequences

Sequence with different contracts

cameel commented 8 months ago

And here's the overall comparison of all the sequences I tested so far.

Remarks

Contract deposit_contract

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -15.5% -37.0% -39.4% -20.1% -21.8% 343 ms 207 ms 430 ms
the-good-parts -15.3% -39.3% -21.7% 160 ms 203 ms 259 ms
single-pass -14.1% -14.9% -34.4% -38.9% -18.7% -21.5% 154 ms 184 ms 304 ms
always-ssa-min -15.5% -15.6% -36.3% -39.0% -19.7% -21.5% 363 ms 174 ms 466 ms
always-ssa -15.2% -36.7% -38.7% -20.0% -21.3% 236 ms 184 ms 377 ms
default-no-cse -15.5% -39.4% -21.8% 298 ms 210 ms 394 ms
single-pass-no-cse -14.8% -38.4% -21.0% 134 ms 182 ms 255 ms
small-loops -15.4% -15.6% -36.2% -39.2% -19.7% -21.6% 282 ms 178 ms 441 ms

Contract FixedFeeRegistrar

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -5.6% -33.8% -34.3% -31.0% -31.5% 187 ms 170 ms 193 ms
the-good-parts -5.6% -32.4% -34.3% -29.7% -31.5% 92 ms 92 ms 135 ms
single-pass -2.9% -3.0% -32.9% -34.5% -30.1% -31.6% 102 ms 87 ms 153 ms
always-ssa-min -3.0% -34.1% -35.1% -31.2% -32.1% 382 ms 158 ms 248 ms
always-ssa -2.9% -3.0% -32.9% -34.5% -30.1% -31.6% 130 ms 88 ms 188 ms
default-no-cse -5.6% -32.6% -34.1% -29.8% -31.2% 147 ms 78 ms 195 ms
single-pass-no-cse -3.1% -35.1% -35.3% -32.2% -32.3% 133 ms 94 ms 160 ms
small-loops -2.9% -3.0% -33.4% -35.0% -30.6% -32.1% 149 ms 84 ms 196 ms

Contract prbmath_unsigned

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -44.4% -30.8% -30.3% 550 ms 596 ms 600 ms
the-good-parts -44.5% -29.6% -29.1% 249 ms 444 ms 409 ms
single-pass -44.4% -30.7% -30.8% -30.2% -30.3% 259 ms 420 ms 410 ms
always-ssa-min -44.4% -30.8% -30.3% 784 ms 564 ms 827 ms
always-ssa -44.4% -30.8% -30.3% 455 ms 498 ms 620 ms
default-no-cse -44.4% -29.5% -29.0% 510 ms 613 ms 638 ms
single-pass-no-cse -44.3% -29.7% -29.2% -29.3% 241 ms 433 ms 614 ms
small-loops -44.4% -30.6% -30.7% -30.1% 472 ms 403 ms 626 ms

Contract ramanujan_pi

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -67.2% -67.7% -39.2% -39.9% -36.2% -36.8% 520 ms 129 ms 622 ms
the-good-parts -67.2% -67.5% -42.9% -39.6% 135 ms 134 ms 233 ms
single-pass -64.1% -64.8% -30.1% -33.5% -27.8% -30.9% 109 ms 109 ms 170 ms
always-ssa-min -66.8% -67.3% -40.2% -37.1% 531 ms 114 ms 602 ms
always-ssa -65.7% -66.4% -30.8% -42.2% -28.4% -39.0% 160 ms 124 ms 232 ms
default-no-cse -67.7% -37.3% -39.9% -34.5% -36.8% 542 ms 129 ms 540 ms
single-pass-no-cse -65.0% -25.9% -32.1% -23.9% -29.6% 89 ms 118 ms 151 ms
small-loops -66.3% -67.0% -28.8% -31.9% -26.5% -29.5% 303 ms 109 ms 379 ms

Contract strings

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -11.9% -12.1% -26.1% -27.7% -24.6% -26.1% 287 ms 169 ms 430 ms
the-good-parts -11.9% -25.9% -27.7% -24.3% -26.1% 134 ms 185 ms 239 ms
single-pass -11.6% -11.8% -23.7% -27.7% -22.3% -26.1% 121 ms 161 ms 219 ms
always-ssa-min -11.9% -12.1% -25.4% -25.9% -23.9% -24.4% 330 ms 173 ms 406 ms
always-ssa -11.8% -12.0% -23.1% -29.8% -21.7% -28.1% 185 ms 178 ms 295 ms
default-no-cse -12.1% -26.2% -24.7% 311 ms 182 ms 389 ms
single-pass-no-cse -11.9% -25.3% -23.9% 108 ms 169 ms 212 ms
small-loops -11.6% -11.8% -25.4% -27.7% -23.9% -26.1% 255 ms 164 ms 349 ms
cameel commented 8 months ago

Final summary for the-good-parts sequence

Comparison with default

Cost differences in CI:

External tests:

Timing of extrnal tests:

seqbench results (values where the new sequence is worse in bold): default runtime gas the-good-parts runtime gas default bytecode size the-good-parts bytecode size default compilation time (optimized) the-good-parts compilation time (optimized)
deposit_contract -15.5% -15.3% -37.0% -39.3% 430 ms 259 ms
FixedFeeRegistrar -5.6% -5.6% -33.8% -32.4% 193 ms 135 ms
prbmath_unsigned -44.4% -44.5% -30.8% -29.6% 600 ms 409 ms
ramanujan_pi -67.2% -67.2% -39.2% -42.9% 622 ms 233 ms
strings -11.9% -11.9% -26.1% -25.9% 430 ms 239 ms

Conclusions

The new sequence clearly beats the single-pass sequence. Comparison with default is more mixed.

The outliers seem a bit concerning. Still, they're not extreme and go both ways so the new sequence is not consistently worse in any metric. In many cases it's a little better.

Overall results seem close enough that we're probably fine using it. I'm pretty sure the sequence can still be refined a little if we spend more time on it. All the sequences I tested have cases where they're better than default by a few percent, most of the time without doing as many repetitions. The trick is to get one that can do it consistently on most input.

cameel commented 8 months ago

Effects of removing StackCompressor

The results are oddly mixed: from completely neutral to very negative.

On one hand, I see almost no change in gas usage in our semantic tests.

I also ran seqbench on the default and the-good-parts sequences. Literally zero difference other than slightly compilation times (but still probably within the margin of error). Exactly the same bytecode sizes and execution gas. I did spot checks on the generated Yul and files are identical. To the point that I'm questioning whether my patched solc really worked correctly, but so far I see no evidence that it did not.

On the other hand, results of external tests are terrible (ir-optimize-evm+yul):

project bytecode_size deployment_gas method_gas
brink 0%
colony 0%
ens +0.49% ❌ 0% 0%
perpetual-pools +9.09% ❌ +9.19% ❌ +1.6% ❌
pool-together +0.86% ❌
uniswap +4.05% ❌ +3.59% ❌ +5.87% ❌
yield_liquidator +2.03% ❌ +2.29% ❌ +0.08% ❌
zeppelin +0.01% ❌ +0.01% ❌ +0.02% ❌

In some cases zero difference, but in others up to 10% increase in bytecode size and 6% increase in gas usage. And no cases where results improved.

Finally, the change caused some new StackTooDeep issues. Only in a few semantic/syntax tests in our test suite but also in 7 of the 15 external tests we have.

cameel commented 8 months ago

the-good-parts-mk2 sequence

Comparison with default

Cost differences in CI:

External tests:

seqbench results (values that differ from the-good-parts in bold): default runtime gas the-good-parts-mk2 runtime gas the-good-parts runtime gas default bytecode size the-good-parts-mk2 bytecode size the-good-parts bytecode size default compilation time the-good-parts-mk2 compilation time the-good-parts compilation time
deposit_contract -15.5% -15.3% -15.3% -37.0% -39.4% -39.3% 430 ms 248 ms 259 ms
erc20 -4.2% -4.2% -4.1% -37.0% -35.9% -34.2% 149 ms 113 ms 94 ms
FixedFeeRegistrar -5.6% -5.6% -5.6% -33.8% -32.8% -32.4% 193 ms 120 ms 135 ms
prbmath_unsigned -44.4% -44.5% -44.5% -30.8% -29.6% -29.6% 600 ms 403 ms 364 ms
ramanujan_pi -67.2% -67.2% -67.2% -39.2% -42.9% -42.9% 622 ms 238 ms 233 ms
strings -11.9% -11.9% -11.9% -26.1% -27.2% -25.9% 430 ms 236 ms 239 ms

Conclusions

The new sequence is same or better than the-good-parts in nearly all cases. Speed is also pretty much the same.

It's still not as good as default but also still much faster.

cameel commented 6 months ago

Comparison of the-good-parts sequence variants

seqbench results on cancun

Contracts

New contracts:

Comparison between contracts for each sequence

Note that I excluded chains.sol here because the range of values makes it hard to compare with other contracts.

Which variant of the-good-parts is the best?

Quality of optimization

Overall: mk3 >= mk2 >= mk1. But differences are small.

Compilation time

Overall: mk1 >= mk2 >= mk3. But again, differences are small.

Conclusion

All of the sequences are pretty close. I'd recommend mk3 because it has a small edge in terms of optimization quality and the difference in timing is not significant. It's also the one I spent the most time analyzing so I don't see much point in going back to the other two. Still, it seems that improvements were marginal at best.

the-good-parts-mk3 vs default

Quality of optimization

Cost differences in CI:

External tests:

seqbench:

Compilation time

External tests:

seqbench:

Timing benchmark: