Open kivikakk opened 5 days ago
> But, I've noticed running `opt` on a design before `write_cxxrtl` can (at least) speed up a low-to-moderate complexity design (barebones RV32 core) by 1.5x in Hz, compiling everything with `-O3` in both cases.
That's remarkable! It is contrary to my own testing, which indicated that running `opt` typically had no benefit, and sometimes caused a mild performance regression. (I don't doubt your numbers, this is just context for my past decision-making.)
There are two concerns that I have re: including more stuff in `amaranth-yosys`:
None of these are insurmountable, and I have reasonably clear criteria for what's acceptable regarding these concerns and what isn't.
For (1), if you can get some numbers on how much impact it has on warm (cached) startup time on "a typical x86_64 target" (as opposed to e.g. "Apple M2", which isn't very representative), we can go from there.
For (2), there are two aspects to it.
The first one is simply the amount of Stuff that will have to go into the Makefile to ensure that `opt` builds and runs, complicated somewhat by the fact that (if I recall correctly) `opt` calls some of the sub-passes dynamically.
The second one is the cost of updating `amaranth-yosys`. This happens every time Amaranth bumps the lower Yosys version requirement (which usually happens every time we fix a bug upstream that would cause silent miscompilation, or a crash with no other reasonable workaround; in other words, not during happy times). This of course requires updating all the Stuff that goes into running `opt`.
I don't actually know how bad the cost is. I am open to consideration here, and if you're willing to commit, on an indefinite basis but with the option to stop doing this at any point (at which point I will only support it on a best-effort basis and might just remove `opt` again), to updating the `opt` `Makefile` Stuff whenever we need to bump it, I'm much more amenable to closing my eyes on large amounts of Stuff that gets added.
(I find the Yosys `Makefile` horrific. I think, many years ago, I once had a mild psychotic episode after thinking about it with the intent to refactor it for three days straight, and since then I try to look at it as little as I can...)
To add to the above, the highlight of the 0.6 milestone is (as is stated in the description) CXXRTL integration, so any additional speedup, much less a 1.5x speedup (which will completely hide the ~10-15% overhead of capturing a full view using the request/replay machinery), is very relevant to the milestone.
Thinking about this some more, I have a question or suspicion regarding the speedup, where 1.5x is a fairly round number: does that come from less time spent per delta cycle, or fewer delta cycles total? Because what `opt` can do (as well as some other passes, like `splitnets -driver`), is to eliminate feedback arcs, which can reduce the number of delta cycles per step from 3 to 2, which would be a 1.5x or so speedup.
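(The back-of-the-envelope arithmetic behind that guess, assuming the cost per delta cycle stays roughly constant:)

```python
# Cost of one step is roughly (number of delta cycles) x (cost per delta cycle).
# Eliminating a feedback arc can take a step from 3 delta cycles to 2:
deltas_before, deltas_after = 3, 2
speedup = deltas_before / deltas_after
print(speedup)  # 1.5
```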
It is a rather round number! For the record, I was having the design run at ~12MHz after warmup without `opt`, and ~18MHz with. My driver was giving the numbers to me rounded in such a way, so the exact figure is probably less clean, but it certainly looks close enough that it could be that.
I'll look into which of the two it is (time per delta cycle vs. fewer delta cycles) and do the timing analysis per (1) early next week most likely.
As for (2), well, due to all that Nix stuff I did last year ("hdx") I'm un(?)fortunately(??) familiar with the nightmare that is Yosys' `Makefile`, so it might not be too bad. Assuming the costs per (1) don't look forbidding, I'll simply give it a try in `amaranth-yosys`, and will be able to commit (or not) based on how that goes!
> To add to the above, the highlight of the 0.6 milestone is (as is stated in the description) CXXRTL integration, so any additional speedup, much less a 1.5x speedup (which will completely hide the ~10-15% overhead of capturing a full view using the request/replay machinery), is very relevant to the milestone.
Yesssss, okay, this is really good to know! Thanks and noted!
Sounds good.
> Thinking about this some more, I have a question or suspicion regarding the speedup, where 1.5x is a fairly round number: does that come from less time spent per delta cycle, or fewer delta cycles total? Because what `opt` can do (as well as some other passes, like `splitnets -driver`), is to eliminate feedback arcs, which can reduce the number of delta cycles per step from 3 to 2, which would be a 1.5x or so speedup.
I compared the return values of `module::step()` (while timing and counting steps), and for this design all steps were resolving in exactly one delta cycle under both conditions — it's just that it was managing to run 31,950,684 steps per second (or ~16MHz on the top-level clock, with the added bookkeeping on every step) with `opt`, and 22,274,592 (or ~11MHz) without.
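(A sketch of that bookkeeping, assuming `step` is a stand-in for CXXRTL's `module::step()`, whose return value is the number of delta cycles the step took to converge; the stub design here always resolves in one:)

```python
import time

def measure(step, n_steps):
    """Time n_steps calls to step() and tally delta-cycle counts per step.

    `step` stands in for CXXRTL's module::step(), which returns the
    number of delta cycles needed for the step to converge."""
    counts = {}
    t0 = time.perf_counter()
    for _ in range(n_steps):
        deltas = step()
        counts[deltas] = counts.get(deltas, 0) + 1
    elapsed = time.perf_counter() - t0
    return n_steps / elapsed, counts

# Stand-in for a design whose steps always converge in one delta cycle.
rate, counts = measure(lambda: 1, 10_000)
print(counts)  # {1: 10000}
```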
I note the generated IL is reduced quite a bit (168,099 bytes after `opt`, 211,187 bytes without, both written out by Yosys) — it's hard to compare them super cleanly, but on optimising there's a net reduction of 151 cells (mostly `$add`, `$eq`, `$sub` etc.), and a net increase of 152 connections at module level, some widths reduced in cell parameters, etc. It might just be `opt_merge` or something.
I tried a different codebase and there was very little speedup, so it's very much design-dependent!
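(A rough way to get cell tallies like those is to count `cell` lines in the two RTLIL dumps; a minimal sketch over toy fragments — the fragments are illustrative, not the actual design:)

```python
import re
from collections import Counter

def cell_counts(rtlil_text):
    """Tally cell types in RTLIL text (lines like `cell $add \\name`)."""
    return Counter(re.findall(r"^\s*cell\s+(\S+)", rtlil_text, re.M))

# Toy before/after fragments standing in for the two .il dumps.
before = """
  cell $add \\a1
  cell $add \\a2
  cell $eq \\e1
"""
after = """
  cell $add \\a1
  cell $eq \\e1
"""
diff = cell_counts(before) - cell_counts(after)
print(diff)  # Counter({'$add': 1})
```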
Thanks. All of this makes intuitive sense to me, and I've also been expecting it to be design-dependent. Now that you've confirmed that the difference comes specifically from the netlist optimization, I have no further concerns about the applicability of `opt`--only the points (1) and (2) about the practicality of shipping it.
(If one of the netlists had more delta cycles I'd have asked you to try `splitnets -driver` first, as that's a much smaller component to ship; I think it might be a part of the distribution already.)
The filesize of `yosys.wasm` increases from 24,589,466 bytes to 35,323,851 (+44%).
Here are our startup time measurements, as measured by `hyperfine -w 3 'python -m amaranth_yosys'`.
Testing on a CPX21 Hetzner cloud instance:
Status quo:
```
Benchmark 1: python -m amaranth_yosys
  Time (mean ± σ):     195.1 ms ± 11.4 ms    [User: 168.4 ms, System: 42.9 ms]
  Range (min … max):   185.1 ms … 229.0 ms    14 runs
```
With `opt` and all dependencies added (+16.9ms (9%)):
```
Benchmark 1: python -m amaranth_yosys
  Time (mean ± σ):     212.0 ms ± 15.4 ms    [User: 185.1 ms, System: 42.5 ms]
  Range (min … max):   194.6 ms … 249.3 ms    13 runs
```
Testing on a NAS with an AMD R1600 "embedded" CPU:
Status quo:
```
Benchmark 1: /volume1/homes/kivikakk/src/amaranth-yosys/venv/bin/python -m amaranth_yosys
  Time (mean ± σ):     135.4 ms ± 0.9 ms    [User: 98.1 ms, System: 37.5 ms]
  Range (min … max):   134.1 ms … 136.7 ms    21 runs
```
With `opt` etc. (+27.9ms (21%)):
```
Benchmark 1: /volume1/homes/kivikakk/src/amaranth-yosys/venv/bin/python -m amaranth_yosys
  Time (mean ± σ):     163.3 ms ± 1.3 ms    [User: 115.6 ms, System: 47.9 ms]
  Range (min … max):   161.5 ms … 166.9 ms    17 runs
```
As for (2), I'd be happy to keep `opt` building in future — it wasn't too troublesome (https://github.com/amaranth-lang/amaranth-yosys/compare/develop...kivikakk:add-opt).
I must admit, I was pretty indiscriminate with this: I just added everything in `passes/opt/Makefile.inc` (including the stuff excluded if `SMALL` is set, since we don't set it), and then added objects until I didn't get a link error. It's possible there are other dynamically-called passes that I haven't hit which are missing, in which case we'll get a runtime exception. I did test that my previous test cases do `opt` successfully!
> The filesize of `yosys.wasm` increases from 24,589,466 bytes to 35,323,851 (+44%).
It's compressed in the `*.whl`--that's a ZIP, I think. What's the compressed size like? (It's likely OK.)
> Here are our startup time measurements, as measured by `hyperfine -w 3 'python -m amaranth_yosys'`.
This is all completely fine. It should not be noticeable in any meaningful way--30 ms on an embedded CPU is below the perceptual threshold, I think.
> I must admit, I was pretty indiscriminate with this: I just added everything in `passes/opt/Makefile.inc` (including the stuff excluded if `SMALL` is set, since we don't set it), and then added objects until I didn't get a link error. It's possible there are other dynamically-called passes that I haven't hit which are missing, in which case we'll get a runtime exception. I did test that my previous test cases do `opt` successfully!
Yeah, that's fine I think.
> It's compressed in the `*.whl`--that's a ZIP, I think. What's the compressed size like? (It's likely OK.)
From 5,290,853 bytes to 7,708,881 (+46%).
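(For reference, numbers like those can be read straight out of the wheel's ZIP directory without extracting anything; a minimal sketch against a toy in-memory archive — the member path is made up:)

```python
import io
import zipfile

# A wheel is just a ZIP archive. Build a toy one in memory and read the
# stored vs. compressed size of a member from the central directory,
# the same check one would run against yosys.wasm in the real .whl.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as whl:
    whl.writestr("amaranth_yosys/yosys.wasm", b"\x00" * 100_000)

with zipfile.ZipFile(buf) as whl:
    info = whl.getinfo("amaranth_yosys/yosys.wasm")
    print(info.file_size, info.compress_size)
```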
Right, so neither the startup latency nor the download size is going to be noticeable. I think it's also possible that wasmtime got better in the meantime; back when I first made this package, the difference between running yowasp-yosys and amaranth-yosys was quite drastic. Perhaps at some point it would make sense to even deprecate amaranth-yosys entirely, if yowasp-yosys becomes just as fast--ultimately I don't want to maintain two separate, subtly different and incompatible (and differently versioned) builds of the same software.
From memory the startup latency was something like 150-300 ms for this package and 3-4 s for yowasp-yosys. So it was a little too noticeable for seemingly simple operations like `verilog.convert`. I don't think anyone would expect an HDL to be real-time-fast, but I still aim for that level of quality anyway.
Hm. On the Hetzner VPS:
```
Benchmark 1: yowasp-yosys
  Time (mean ± σ):     170.3 ms ± 8.3 ms    [User: 155.9 ms, System: 29.6 ms]
  Range (min … max):   157.0 ms … 186.7 ms    16 runs
```
That's 25ms faster than current `amaranth-yosys` (without `opt`), and indeed its `yosys.wasm` is ~2MB lighter.
On my NAS:
```
Benchmark 1: /var/services/homes/kivikakk/.local/bin/yowasp-yosys
  Time (mean ± σ):     108.2 ms ± 1.0 ms    [User: 81.8 ms, System: 25.5 ms]
  Range (min … max):   106.1 ms … 109.7 ms    26 runs
```
27ms faster.
I've sanity-checked that the `yowasp-yosys` build looks fine, actually runs correctly, and has everything we expect it to, so I'm just a little confused. Maybe the Yosys version upgrades since (0.40 vs 0.42) have improved things greatly?
Here's the thing that super confuses me — on my M3, here's current `amaranth-yosys`:
```
Benchmark 1: /Users/kivikakk/.asdf/shims/python -m amaranth_yosys
  Time (mean ± σ):     178.9 ms ± 2.0 ms    [User: 119.1 ms, System: 17.1 ms]
  Range (min … max):   176.7 ms … 184.3 ms    16 runs
```
And here's `yowasp-yosys`:
```
Benchmark 1: yowasp-yosys
  Time (mean ± σ):     521.0 ms ± 4.8 ms    [User: 461.0 ms, System: 16.6 ms]
  Range (min … max):   512.1 ms … 528.4 ms    10 runs
```
191% slower. ?????? Something in wasmtime that really struggles on arm64 which only comes out in the full distribution? Or only with WASI SDK 22?
It appears to mostly occur during exit: when running e.g. `time yowasp-yosys -p ''`, the only observable pause is after Yosys finishes and writes "Time spent: no commands executed".
Try building amaranth-yosys with wasi-sdk 22.0?
Will!
No difference there. Also confirmed the pause is after the entirety of `yowasp_runtime.run_wasm` (including the `thread.join()`, `shutil.rmtree()` etc.), but before it returns to `yosys_yowasp.run_yosys`, so there's some local being finalised in `run_wasm`?
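(One way a pause can show up at exactly that boundary, as a toy CPython sketch — `Store` here is a hypothetical stand-in for an object with costly finalization, not the actual yowasp_runtime internals: a local's finalizer runs when the frame is torn down, after the function body but before the caller resumes.)

```python
events = []

class Store:
    # Hypothetical stand-in for a local holding expensive native state
    # (e.g. a wasm runtime store); NOT the real yowasp_runtime code.
    def __del__(self):
        events.append("finalized")

def run_wasm():
    store = Store()  # local reference, never returned
    events.append("body done")
    # Returning tears down the frame and drops the last reference to
    # `store`, so in CPython __del__ runs here, before the caller resumes.

run_wasm()
print(events)  # ['body done', 'finalized']
```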
I don't know enough about Python here yet to say. I tried removing the separate thread (to bring YoWASP closer to how amaranth-yosys runs) and removing a bunch of the preopens, but no difference.
If this particular behaviour (YoWASP-yosys on darwin-arm64 hangs for ~300ms on exit for whatever reason) is what makes the difference between dropping amaranth-yosys for it and not, I'm happy to figure it out (and probably learn how to use macOS's Instruments or whatever), otherwise I'll leave it.
README states:

No qualms here, it's `amaranth-yosys`. But, I've noticed running `opt` on a design before `write_cxxrtl` can (at least) speed up a low-to-moderate complexity design (barebones RV32 core) by 1.5x in Hz, compiling everything with `-O3` in both cases.

It's not a big deal since I can just use system Yosys, but I thought I'd bring it up and see what you thought. There's no way this would be exposed as-is (and so would be a waste except for people calling Amaranth's Yosys directly like me), but e.g. maybe one day `amaranth.cli` would support enabling optimizations.

Noting here that plenty of `opt` passes don't run because we don't run `proc`, and running `proc; opt` is worse than doing nothing, presumably because CXXRTL's own proc pass does a much better job at this.

(This was unearthed when a CI build failed since I deliberately didn't install Yosys for a "units test only" job.)