Open nikola-matic opened 1 year ago
Just to elaborate a bit: the current sequence transforms to SSA and back in a loop. This is partially a legacy-code-transform artifact: there, the number of variables translates rather directly to required stack slots. The new code transform does not suffer from this and should be able to generate efficient code directly from SSA.
Based on that, we should generally be able to transform to SSA and stay in SSA up until code generation. However, that would require ensuring that none of the steps that perform meaningful optimizations destroy the SSA property.
Actually changing any such steps to preserve SSA form I'd consider a separate task. This first task should merely investigate what can be achieved by modifying the sequence without actually touching the steps, keeping in mind that transforming back from SSA should no longer add value or contribute to "compilability" relative to the current code transform.
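To make "the SSA property" concrete: in SSA form every variable is assigned exactly once, at its declaration. A minimal sketch of the kind of check that would tell us whether a step destroyed SSA form (hypothetical Python over a toy IR, not solc's actual Yul AST):

```python
# Toy IR: a list of (variable, expression) assignment pairs.
# SSA form is violated as soon as any variable is assigned a second time.

def is_ssa(assignments: list[tuple[str, str]]) -> bool:
    """Return True if every variable is the target of at most one assignment."""
    seen = set()
    for var, _expr in assignments:
        if var in seen:  # reassignment: SSA form is broken
            return False
        seen.add(var)
    return True

print(is_ssa([("x_1", "calldataload(0)"), ("x_2", "add(x_1, 1)")]))  # True
print(is_ssa([("x", "0"), ("x", "add(x, 1)")]))                      # False
```

A check along these lines, run after each step, is one way to find out which steps in the sequence actually destroy SSA form.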
While we're at it, we should double-check whether the effects of the StackCompressor are actually still positive these days, or whether we should just remove it (simplifying our entire pipeline).
For future reference, here are the initial results I showed last week: solc-seqbench-intro-and-initial-results.md.
And here's a bit more analysis: `small-loops`. It's just the default one with the big loop replaced with looping over each component individually. I haven't analyzed it all yet, so here are the plots and some casual observations. I still need to go over those plots in more detail.

`deposit_contract`:
- `single-pass` takes 2x less time and gets only slightly worse results (`small-loops` beats `default`, which beats `always-ssa-min`).
- `small-loops` executes significantly fewer steps than `default` and `always-ssa-min`, but timing differs only minimally.
- `small-loops` converges visibly slower, especially in terms of bytecode size. `default` and `always-ssa-min` would beat it if we stopped as soon as they converge.
- `single-pass` … in execution, in bytecode size half to 1/4. `c`, `M`, `e`, `l`, `T` …

`ramanujan_pi`:
- `single-pass` converges very quickly and is only a little worse in terms of execution time. For bytecode size it's even the same as `small-loops`.
- `small-loops` takes 4x the time to converge and the other two 10x the time.

Since the `single-pass` sequence looks very promising, I decided to see how it would fare in our CI before I analyze the results in more detail. Here's a test PR with the changes: #14887. I posted benchmarks in comments.
Results look promising. 17-46% decrease in total compilation time. Gas benchmarks also aren't too bad. While I can see up to 25% increase in gas in a few tests, the majority is impacted very little and benchmark diffs for external tests look much more reasonable - most of them below 1.5% for both deployment and runtime. Also, in many cases costs actually decrease. ENS is an outlier with bytecode size increasing 5%, though on the other hand ElementFI had bytecode decrease by 4%.
Overall I think that we might be able to tweak this sequence enough to get results very close to the current default.
Also, it's worth noting that the running-time decrease we're seeing with this sequence is very likely close to the limit of what we can achieve by tweaking the sequence. The plots show that most of the gains happen in that first pass, so we'll need to keep most of it in one form or another.
### `no-cse` sequences

Here's the data for the sequences I analyzed yesterday. This time I tried to remove some `c` (CommonSubexpressionEliminator) and `j` (ExpressionJoiner) steps that seemed to degrade the results for `deposit_contract`. The results are mixed: it helps in some cases and makes things worse in others.
And here's some analysis of the data I gathered so far. This is more or less what I presented on the call today.
- `always-ssa-min` is definitely worse than the `default` sequence here. It usually takes just as much time, and there were cases where it was almost 2x slower for almost no additional gain.
- The `no-cse` variants are generally slightly faster than their equivalents, due to having fewer steps. This indicates that dropping some steps may be worthwhile.
- `ramanujan_pi` is the contract that converges the slowest. Bytecode size converges only after 6 repetitions of the main loop and runtime cost after 4. Stopping after just a single pass gives worse results, though it's worth noting that subsequent passes make things worse before they make them better, so ultimately the single-pass sequences do not end up being as much worse as one might expect. There may be a way to achieve results comparable with multi-pass in a more efficient way.
- The `small-loops` sequence in some cases finishes significantly faster than `default` (~60% faster for `ramanujan_pi`) and never takes much longer. Still, it spends a lot of time not improving, just like `default`, so the single-pass variants generally seem more promising.

It seems that a single-pass sequence is the way to go. Such sequences are significantly faster than `default` and very close in terms of optimization. They can probably even be better with enough tweaking.

Even if we don't manage to consistently beat `default` with such a sequence, a safe option would be to expose the `MaxRounds` constant for the main loop as an optimization parameter and set it to 1 by default. In most cases this would give enough optimization, and users could decide on their own whether they want to spend more time to achieve marginally better results.
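The `MaxRounds`-as-a-parameter idea can be sketched roughly as follows (hypothetical Python, not solc's actual implementation; `run_main_loop` stands in for one pass of the optimizer's main loop and here is a dummy that shrinks a fake "bytecode size"):

```python
# Repeat the main optimizer loop up to `max_rounds` times, stopping early
# once a fixed point is reached (a round that changes nothing).

def optimize(code_size: int, run_main_loop, max_rounds: int = 1) -> int:
    """Run the main loop at most `max_rounds` times, with early convergence exit."""
    for _ in range(max_rounds):
        new_size = run_main_loop(code_size)
        if new_size == code_size:  # converged: further rounds are wasted time
            break
        code_size = new_size
    return code_size

# Dummy pass: each round removes 10% of the remaining size, with a floor of 700.
shrink = lambda size: max(700, int(size * 0.9))

print(optimize(1000, shrink, max_rounds=1))   # single pass: 900
print(optimize(1000, shrink, max_rounds=10))  # runs to the fixed point: 700
```

With the default of `max_rounds=1` this behaves like the single-pass variants; users who want the last few percent could raise it.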
Some more loose observations from looking at seqbench results recently:
- The `jmul` part seems to have a significant effect on bytecode size (`small-loops` sequence).
- On the `ramanujan_pi` contract, `TU` steps provide a gain, but the current sequence always reverses it with `c`. Especially in the cleanup part that seems undesirable.

### `the-good-parts` sequence

This is a sequence I put together by starting with `single-pass` and then trying to remove the parts that did not contribute much to the final result. Then I refined it on the `ramanujan_pi` contract by distilling the bits of the `default` sequence that improved its bytecode size with every repetition.
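The "remove the parts that did not contribute much" process described above is essentially greedy pruning. A rough sketch of how one could automate it (hypothetical; `evaluate` stands in for actually running the optimizer and measuring the result, and is replaced by a dummy metric here):

```python
# Greedy pruning of an optimizer step sequence: repeatedly try dropping a
# single step and keep the removal whenever the measured result (lower is
# better) does not get worse.

def prune_sequence(steps: str, evaluate) -> str:
    """Drop steps whose removal does not worsen the metric returned by `evaluate`."""
    best = evaluate(steps)
    i = 0
    while i < len(steps):
        candidate = steps[:i] + steps[i + 1:]
        score = evaluate(candidate)
        if score <= best:  # removal did not make things worse: keep it removed
            steps, best = candidate, score
        else:
            i += 1         # this step matters, move on to the next one
    return steps

# Dummy metric: pretend only the steps "d", "v", "u" matter; every extra
# step adds a small cost (standing in for compilation time).
useful = set("dvu")
metric = lambda seq: -sum(ch in useful for ch in seq) + 0.01 * len(seq)

print(prune_sequence("dhfovu", metric))  # "dvu"
```

In practice the evaluation would be noisy and input-dependent, which is presumably why the refinement above was done by hand on specific contracts.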
And here's the overall comparison of all the sequences I tested so far.
(… `gas_diff_stats.py`, which does not know the original value.) The `min` column is the minimum reached at any point in the sequence. It's reported when it's lower than the final value.

**`deposit_contract`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -15.5% | | -37.0% | -39.4% | -20.1% | -21.8% | 343 ms | 207 ms | 430 ms |
| the-good-parts | -15.3% | | -39.3% | | -21.7% | | 160 ms | 203 ms | 259 ms |
| single-pass | -14.1% | -14.9% | -34.4% | -38.9% | -18.7% | -21.5% | 154 ms | 184 ms | 304 ms |
| always-ssa-min | -15.5% | -15.6% | -36.3% | -39.0% | -19.7% | -21.5% | 363 ms | 174 ms | 466 ms |
| always-ssa | -15.2% | | -36.7% | -38.7% | -20.0% | -21.3% | 236 ms | 184 ms | 377 ms |
| default-no-cse | -15.5% | | -39.4% | | -21.8% | | 298 ms | 210 ms | 394 ms |
| single-pass-no-cse | -14.8% | | -38.4% | | -21.0% | | 134 ms | 182 ms | 255 ms |
| small-loops | -15.4% | -15.6% | -36.2% | -39.2% | -19.7% | -21.6% | 282 ms | 178 ms | 441 ms |
**`FixedFeeRegistrar`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -5.6% | | -33.8% | -34.3% | -31.0% | -31.5% | 187 ms | 170 ms | 193 ms |
| the-good-parts | -5.6% | | -32.4% | -34.3% | -29.7% | -31.5% | 92 ms | 92 ms | 135 ms |
| single-pass | -2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 102 ms | 87 ms | 153 ms |
| always-ssa-min | -3.0% | | -34.1% | -35.1% | -31.2% | -32.1% | 382 ms | 158 ms | 248 ms |
| always-ssa | -2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 130 ms | 88 ms | 188 ms |
| default-no-cse | -5.6% | | -32.6% | -34.1% | -29.8% | -31.2% | 147 ms | 78 ms | 195 ms |
| single-pass-no-cse | -3.1% | | -35.1% | -35.3% | -32.2% | -32.3% | 133 ms | 94 ms | 160 ms |
| small-loops | -2.9% | -3.0% | -33.4% | -35.0% | -30.6% | -32.1% | 149 ms | 84 ms | 196 ms |
**`prbmath_unsigned`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -44.4% | | -30.8% | | -30.3% | | 550 ms | 596 ms | 600 ms |
| the-good-parts | -44.5% | | -29.6% | | -29.1% | | 249 ms | 444 ms | 409 ms |
| single-pass | -44.4% | | -30.7% | -30.8% | -30.2% | -30.3% | 259 ms | 420 ms | 410 ms |
| always-ssa-min | -44.4% | | -30.8% | | -30.3% | | 784 ms | 564 ms | 827 ms |
| always-ssa | -44.4% | | -30.8% | | -30.3% | | 455 ms | 498 ms | 620 ms |
| default-no-cse | -44.4% | | -29.5% | | -29.0% | | 510 ms | 613 ms | 638 ms |
| single-pass-no-cse | -44.3% | | -29.7% | | -29.2% | -29.3% | 241 ms | 433 ms | 614 ms |
| small-loops | -44.4% | | -30.6% | -30.7% | -30.1% | | 472 ms | 403 ms | 626 ms |
**`ramanujan_pi`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -67.2% | -67.7% | -39.2% | -39.9% | -36.2% | -36.8% | 520 ms | 129 ms | 622 ms |
| the-good-parts | -67.2% | -67.5% | -42.9% | | -39.6% | | 135 ms | 134 ms | 233 ms |
| single-pass | -64.1% | -64.8% | -30.1% | -33.5% | -27.8% | -30.9% | 109 ms | 109 ms | 170 ms |
| always-ssa-min | -66.8% | -67.3% | -40.2% | | -37.1% | | 531 ms | 114 ms | 602 ms |
| always-ssa | -65.7% | -66.4% | -30.8% | -42.2% | -28.4% | -39.0% | 160 ms | 124 ms | 232 ms |
| default-no-cse | -67.7% | | -37.3% | -39.9% | -34.5% | -36.8% | 542 ms | 129 ms | 540 ms |
| single-pass-no-cse | -65.0% | | -25.9% | -32.1% | -23.9% | -29.6% | 89 ms | 118 ms | 151 ms |
| small-loops | -66.3% | -67.0% | -28.8% | -31.9% | -26.5% | -29.5% | 303 ms | 109 ms | 379 ms |
**`strings`**

| | runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) |
|---|---|---|---|---|---|---|---|---|---|
| default | -11.9% | -12.1% | -26.1% | -27.7% | -24.6% | -26.1% | 287 ms | 169 ms | 430 ms |
| the-good-parts | -11.9% | | -25.9% | -27.7% | -24.3% | -26.1% | 134 ms | 185 ms | 239 ms |
| single-pass | -11.6% | -11.8% | -23.7% | -27.7% | -22.3% | -26.1% | 121 ms | 161 ms | 219 ms |
| always-ssa-min | -11.9% | -12.1% | -25.4% | -25.9% | -23.9% | -24.4% | 330 ms | 173 ms | 406 ms |
| always-ssa | -11.8% | -12.0% | -23.1% | -29.8% | -21.7% | -28.1% | 185 ms | 178 ms | 295 ms |
| default-no-cse | -12.1% | | -26.2% | | -24.7% | | 311 ms | 182 ms | 389 ms |
| single-pass-no-cse | -11.9% | | -25.3% | | -23.9% | | 108 ms | 169 ms | 212 ms |
| small-loops | -11.6% | -11.8% | -25.4% | -27.7% | -23.9% | -26.1% | 255 ms | 164 ms | 349 ms |
### `the-good-parts` sequence

- … `default` …
- … (`ramanujan_pi.sol`)
- … (`base64.sol` and `erc20.sol`)

seqbench results (values where the new sequence is worse in bold):

| | default runtime gas | the-good-parts runtime gas | default bytecode size | the-good-parts bytecode size | default compilation time (optimized) | the-good-parts compilation time (optimized) |
|---|---|---|---|---|---|---|
| deposit_contract | -15.5% | **-15.3%** | -37.0% | -39.3% | 430 ms | 259 ms |
| FixedFeeRegistrar | -5.6% | -5.6% | -33.8% | **-32.4%** | 193 ms | 135 ms |
| prbmath_unsigned | -44.4% | -44.5% | -30.8% | **-29.6%** | 600 ms | 409 ms |
| ramanujan_pi | -67.2% | -67.2% | -39.2% | -42.9% | 622 ms | 233 ms |
| strings | -11.9% | -11.9% | -26.1% | **-25.9%** | 430 ms | 239 ms |
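For clarity, the percentage columns in these tables are relative differences against the unoptimized baseline. A minimal sketch of that calculation (hypothetical; the actual `gas_diff_stats.py` may compute it differently):

```python
# Relative change of an optimized measurement against its unoptimized baseline,
# expressed as a percentage (negative means the optimizer made it smaller/cheaper).

def relative_diff(optimized: float, unoptimized: float) -> float:
    """Percentage change of `optimized` relative to `unoptimized`."""
    return (optimized - unoptimized) / unoptimized * 100.0

# E.g. bytecode shrinking from 10000 to 6300 bytes is a -37.0% change.
print(round(relative_diff(6300, 10000), 1))  # -37.0
```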
The new sequence clearly beats the single-pass
sequence. Comparison with default
is more mixed.
The outliers seem a bit concerning. Still, they're not extreme and go both ways so the new sequence is not consistently worse in any metric. In many cases it's a little better.
Overall results seem close enough that we're probably fine using it. I'm pretty sure the sequence can still be refined a little if we spend more time on it. All the sequences I tested have cases where they're better than default
by a few percent, most of the time without doing as many repetitions. The trick is to get one that can do it consistently on most input.
The results are oddly mixed: from completely neutral to very negative.
On one hand, I see almost no change in gas usage in our semantic tests.
I also ran seqbench on the `default` and `the-good-parts` sequences. Literally zero difference other than slightly different compilation times (but still probably within the margin of error). Exactly the same bytecode sizes and execution gas. I did spot checks on the generated Yul and the files are identical. To the point that I'm questioning whether my patched solc really worked correctly, but so far I see no evidence that it did not.
On the other hand, results of external tests are terrible (`ir-optimize-evm+yul`):
| project | bytecode_size | deployment_gas | method_gas |
|---|---|---|---|
| brink | 0% | | |
| colony | 0% | | |
| ens | +0.49% ❌ | 0% | 0% |
| perpetual-pools | +9.09% ❌ | +9.19% ❌ | +1.6% ❌ |
| pool-together | +0.86% ❌ | | |
| uniswap | +4.05% ❌ | +3.59% ❌ | +5.87% ❌ |
| yield_liquidator | +2.03% ❌ | +2.29% ❌ | +0.08% ❌ |
| zeppelin | +0.01% ❌ | +0.01% ❌ | +0.02% ❌ |
In some cases zero difference, but in others up to 10% increase in bytecode size and 6% increase in gas usage. And no cases where results improved.
Finally, the change caused some new StackTooDeep
issues. Only in a few semantic/syntax tests in our test suite but also in 7 of the 15 external tests we have.
### `the-good-parts-mk2` sequence

- … `default` … `the-good-parts`. Still mostly between -1% and +3%, with outliers at -6% and 4-9%, but also has a number of cases where a 2-4% increase became 0-1%.
- … `the-good-parts` …
- … `the-good-parts` …
- … `the-good-parts` … (bytecode size changes between 0% and +0.3% and negligible runtime cost changes)

seqbench results (values that differ from `the-good-parts` in bold):

| | default runtime gas | the-good-parts-mk2 runtime gas | the-good-parts runtime gas | default bytecode size | the-good-parts-mk2 bytecode size | the-good-parts bytecode size | default compilation time | the-good-parts-mk2 compilation time | the-good-parts compilation time |
|---|---|---|---|---|---|---|---|---|---|
| deposit_contract | -15.5% | -15.3% | -15.3% | -37.0% | **-39.4%** | -39.3% | 430 ms | 248 ms | 259 ms |
| erc20 | -4.2% | **-4.2%** | -4.1% | -37.0% | **-35.9%** | -34.2% | 149 ms | 113 ms | 94 ms |
| FixedFeeRegistrar | -5.6% | -5.6% | -5.6% | -33.8% | **-32.8%** | -32.4% | 193 ms | 120 ms | 135 ms |
| prbmath_unsigned | -44.4% | -44.5% | -44.5% | -30.8% | -29.6% | -29.6% | 600 ms | 403 ms | 364 ms |
| ramanujan_pi | -67.2% | -67.2% | -67.2% | -39.2% | -42.9% | -42.9% | 622 ms | 238 ms | 233 ms |
| strings | -11.9% | -11.9% | -11.9% | -26.1% | **-27.2%** | -25.9% | 430 ms | 236 ms | 239 ms |
The new sequence is the same or better than `the-good-parts` in nearly all cases. Speed is also pretty much the same.
It's still not as good as default
but also still much faster.
### `the-good-parts` sequence variants (`cancun`)

New contracts:
- `erc20` (ERC20 stub from our repo)
- `oz-erc20` (actual ERC20 from OpenZeppelin)
- `chains` from our repo (pathologically slow contract from our timing benchmarks)

Note that I excluded `chains.sol` here because the range of values makes it hard to compare with other contracts.

Which `the-good-parts` is the best?

- Overall: mk3 >= mk2 >= mk1. But the differences are small.
- Compared to `default` they can be considered pretty much the same. The difference comes from each version having a few more steps. Overall: mk1 >= mk2 >= mk3. But again, the differences are small.
All of the sequences are pretty close. I'd recommend mk3 because it has a small edge in terms of optimization quality and the difference in timing is not significant. It's also the one I spent the most time analyzing so I don't see much point in going back to the other two. Still, it seems that improvements were marginal at best.
### `the-good-parts-mk3` vs `default`

- The `uniswap` … we've seen with mk1 is gone, but this is probably due to the switch to Cancun, not improvements in the sequence.
- seqbench: … `chains.sol`, where `default` increases the size by 252.3%, while `the-good-parts-mk3` "only" by 227.2%.
- … (`colony` being a single outlier at +6%).
- seqbench: …
The default optimizer sequence we have at the moment is quite large and thus bloats compilation times.
The outcome should hopefully be a shorter sequence that achieves the same level of optimization while reducing compile times (in the context of the optimizer pipeline). Special attention should be paid to steps that destroy the SSA form of their input (which is better suited for optimization) and thus cause subsequent steps to perform poorly.
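For readers unfamiliar with the notation used throughout this thread: sequences are strings of single-letter step abbreviations (e.g. `c`, `j`, `T`, `U`), where a segment in square brackets is repeated up to `MaxRounds` times. A simplified, hypothetical expander for that notation (nesting and the `:` cleanup separator are deliberately ignored here):

```python
# Unroll one level of [...] repetition in an optimizer step sequence string
# into a flat sequence of steps. Nested brackets are NOT handled.

def expand(sequence: str, rounds: int) -> str:
    """Replace each bracketed segment with `rounds` copies of its contents."""
    out = []
    i = 0
    while i < len(sequence):
        if sequence[i] == "[":
            end = sequence.index("]", i)
            out.append(sequence[i + 1:end] * rounds)  # repeat the bracketed part
            i = end + 1
        else:
            out.append(sequence[i])
            i += 1
    return "".join(out)

print(expand("d[vu]f", 3))  # dvuvuvuf
```

This makes the cost model behind the discussion concrete: a multi-pass sequence with a large bracketed main loop expands to many times more steps than a single-pass one, which is where the compilation-time difference comes from.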