amaranth-lang / amaranth-yosys

WebAssembly-based Yosys distribution for Amaranth HDL
ISC License

Thoughts on including optimization passes? #11

Open kivikakk opened 5 days ago

kivikakk commented 5 days ago

README states:

This build is aggressively optimized for binary size and startup latency, and only includes features required by Amaranth's Verilog and CXXRTL backends; it is not useful for any other purpose.

No qualms here, it's amaranth-yosys. But I've noticed that running opt on a design before write_cxxrtl can speed up a low-to-moderate-complexity design (a barebones RV32 core) by at least 1.5x in simulated clock rate, compiling everything with -O3 in both cases.

It's not a big deal since I can just use system Yosys, but I thought I'd bring it up and see what you thought. There's no way this would be exposed as-is (and so it would be a waste except for people calling Amaranth's Yosys directly, like me), but perhaps one day amaranth.cli could support enabling optimizations.

Noting here that plenty of opt's passes don't run because we don't run proc first, and running proc; opt is worse than doing nothing, presumably because CXXRTL's own proc pass does a much better job at this.
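For concreteness, a minimal sketch of the comparison I'm making (file names hypothetical; read_rtlil, opt, and write_cxxrtl are the actual Yosys commands involved):

  import subprocess

  def emit_cxxrtl(il_path, cc_path, optimize):
      # Build the Yosys script: optionally run `opt` between reading the
      # design and emitting C++ for CXXRTL.
      script = f"read_rtlil {il_path}; "
      if optimize:
          script += "opt; "
      script += f"write_cxxrtl {cc_path}"
      subprocess.run(["yosys", "-p", script], check=True)

  emit_cxxrtl("design.il", "design_noopt.cc", optimize=False)
  emit_cxxrtl("design.il", "design_opt.cc", optimize=True)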

(This was unearthed when a CI build failed because I deliberately didn't install Yosys for a "unit tests only" job.)

whitequark commented 5 days ago

But I've noticed that running opt on a design before write_cxxrtl can speed up a low-to-moderate-complexity design (a barebones RV32 core) by at least 1.5x in simulated clock rate, compiling everything with -O3 in both cases.

That's remarkable! It is contrary to my own testing, which indicated that running opt typically had no benefit and sometimes caused a mild performance regression. (I don't doubt your numbers; this is just context for my past decision-making.)

There are two concerns that I have re: including more stuff in amaranth-yosys:

  1. Code size and (more importantly, but dependent on the former) startup time.
  2. Additional maintenance overhead.

Neither of these is insurmountable, and I have reasonably clear criteria for what is and isn't acceptable on each count.

For (1): if you can get some numbers on how much impact it has on warm (cached) startup time on "a typical x86_64 target" (as opposed to e.g. "Apple M2", which isn't very representative), we can go from there.

For (2), there are two aspects to it.

The first one is simply the amount of Stuff that will have to go into the Makefile to ensure that opt builds and runs, complicated somewhat by the fact that (if I recall correctly) opt calls some of its sub-passes dynamically.

The second one is the cost of updating amaranth-yosys. This happens every time Amaranth bumps the lower Yosys version requirement, which usually happens whenever we fix a bug upstream that would cause silent miscompilation, or a crash with no other reasonable workaround; in other words, not during happy times. Each such bump of course requires updating all the Stuff that goes into running opt.

I don't actually know how bad that cost is. I am open to consideration here: if you're willing to commit, on an indefinite basis but with the option to stop at any point (at which point I will only support it on a best-effort basis and might just remove opt again), to updating the opt Makefile Stuff whenever we need to bump the version, then I'm much more amenable to closing my eyes to the large amount of Stuff that gets added.

(I find the Yosys Makefile horrific. I think, many years ago, I once had a mild psychotic episode after thinking about it with the intent to refactor it for three days straight, and since then I try to look at it as little as I can...)

whitequark commented 5 days ago

To add to the above, the highlight of the 0.6 milestone is (as is stated in the description) CXXRTL integration, so any additional speedup, much less a 1.5x speedup (which will completely hide the ~10-15% overhead of capturing a full view using the request/replay machinery), is very relevant to the milestone.

whitequark commented 5 days ago

Thinking about this some more, I have a question, or a suspicion, regarding the speedup, since 1.5x is a fairly round number: does it come from less time spent per delta cycle, or from fewer delta cycles in total? Because one thing opt can do (as can some other passes, like splitnets -driver) is eliminate feedback arcs, which can reduce the number of delta cycles per step from 3 to 2, and that alone would be a 1.5x or so speedup.

kivikakk commented 5 days ago

It is a rather round number! For the record, the design was running at ~12MHz after warmup without opt, and ~18MHz with it. My driver was rounding the numbers it reported, so the exact figures are probably less clean, but it certainly looks close enough that it could be that.

I'll look into which of the two it is (time per delta cycle vs. fewer delta cycles) and do the timing analysis for (1), most likely early next week.

As for (2): thanks to all that Nix stuff I did last year ("hdx"), I'm un(?)fortunately(??) familiar with the nightmare that is the Yosys Makefile, so it might not be too bad. Assuming the costs from (1) don't look forbidding, I'll simply give it a try in amaranth-yosys, and will be able to commit (or not) based on how that goes!

To add to the above, the highlight of the 0.6 milestone is (as is stated in the description) CXXRTL integration, so any additional speedup, much less a 1.5x speedup (which will completely hide the ~10-15% overhead of capturing a full view using the request/replay machinery), is very relevant to the milestone.

Yesssss, okay, this is really good to know! Thanks and noted!

whitequark commented 5 days ago

Sounds good.

kivikakk commented 4 days ago

Thinking about this some more, I have a question, or a suspicion, regarding the speedup, since 1.5x is a fairly round number: does it come from less time spent per delta cycle, or from fewer delta cycles in total? Because one thing opt can do (as can some other passes, like splitnets -driver) is eliminate feedback arcs, which can reduce the number of delta cycles per step from 3 to 2, and that alone would be a 1.5x or so speedup.

I compared the return values of module::step() (while timing and counting steps), and for this design every step resolved in exactly one delta cycle under both conditions — it's just that it managed 31,950,684 steps per second (or ~16MHz on the top-level clock, with the added bookkeeping on every step) with opt, and 22,274,592 (or ~11MHz) without.

I note the generated IL shrinks quite a bit (168,099 bytes after opt vs. 211,187 bytes without, both written out by Yosys). It's hard to compare the two cleanly, but optimising gives a net reduction of 151 cells (mostly $add, $eq, $sub, etc.), a net increase of 152 module-level connections, some reduced widths in cell parameters, and so on. It might just be opt_merge or something.
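In case anyone wants to reproduce the cell comparison, here's roughly how I tallied cells per kind from the two written-out IL files (a sketch; file names hypothetical, and counting "cell $kind" lines is only a crude proxy for a proper diff):

  import re
  from collections import Counter

  def cell_histogram(il_path):
      # Count `cell $kind ...` lines in a written-out RTLIL file.
      counts = Counter()
      with open(il_path) as f:
          for line in f:
              m = re.match(r"\s*cell (\$\w+)", line)
              if m:
                  counts[m.group(1)] += 1
      return counts

  before, after = cell_histogram("noopt.il"), cell_histogram("opt.il")
  for kind in sorted(set(before) | set(after)):
      print(f"{kind}: {before[kind]} -> {after[kind]}")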

I tried a different codebase and there was very little speedup, so it's very much design-dependent!

whitequark commented 4 days ago

Thanks. All of this makes intuitive sense to me, and I had also been expecting it to be design-dependent. Now that you've confirmed that the difference comes specifically from the netlist optimization, I have no further concerns about the applicability of opt--only the points (1) and (2) about the practicality of shipping it.

(If one of the netlists had more delta cycles I'd have asked you to try splitnets -driver first, as that's a much smaller component to ship; I think it might be a part of the distribution already.)

kivikakk commented 19 hours ago

The filesize of yosys.wasm increases from 24,589,466 bytes to 35,323,851 (+44%).

Here are our startup time measurements, taken with hyperfine -w 3 'python -m amaranth_yosys'.

Testing on a CPX21 Hetzner cloud instance:

Status quo:

Benchmark 1: python -m amaranth_yosys
  Time (mean ± σ):     195.1 ms ±  11.4 ms    [User: 168.4 ms, System: 42.9 ms]
  Range (min … max):   185.1 ms … 229.0 ms    14 runs

With opt and all dependencies added (+16.9ms (9%)):

Benchmark 1: python -m amaranth_yosys
  Time (mean ± σ):     212.0 ms ±  15.4 ms    [User: 185.1 ms, System: 42.5 ms]
  Range (min … max):   194.6 ms … 249.3 ms    13 runs

Testing on a NAS with an AMD R1600 "embedded" CPU:

Status quo:

Benchmark 1: /volume1/homes/kivikakk/src/amaranth-yosys/venv/bin/python -m amaranth_yosys
  Time (mean ± σ):     135.4 ms ±   0.9 ms    [User: 98.1 ms, System: 37.5 ms]
  Range (min … max):   134.1 ms … 136.7 ms    21 runs

With opt etc. (+27.9ms (21%)):

Benchmark 1: /volume1/homes/kivikakk/src/amaranth-yosys/venv/bin/python -m amaranth_yosys
  Time (mean ± σ):     163.3 ms ±   1.3 ms    [User: 115.6 ms, System: 47.9 ms]
  Range (min … max):   161.5 ms … 166.9 ms    17 runs

As for (2), I'd be happy to keep opt building in future — it wasn't too troublesome (https://github.com/amaranth-lang/amaranth-yosys/compare/develop...kivikakk:add-opt).


I must admit, I was pretty indiscriminate about this: I just added everything in passes/opt/Makefile.inc (including the parts excluded when SMALL is set, since we don't set it), and then added objects until I stopped getting link errors. It's possible there are other dynamically-called passes I haven't hit that are still missing, in which case we'll get a runtime exception. I did test that my earlier test cases opt successfully!
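(If it helps review: the objects Makefile.inc contributes can be listed mechanically. A sketch, assuming a yosys checkout next door and the usual OBJS += lines in Makefile.inc:)

  import re

  # List the object files passes/opt/Makefile.inc would add, so they can be
  # mirrored into the amaranth-yosys build.
  with open("yosys/passes/opt/Makefile.inc") as f:
      objs = re.findall(r"OBJS \+= (passes/opt/\S+\.o)", f.read())
  print("\n".join(objs))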

whitequark commented 19 hours ago

The filesize of yosys.wasm increases from 24,589,466 bytes to 35,323,851 (+44%).

It's compressed in the *.whl--that's a ZIP, I think. What's the compressed size like? (It's likely OK.)

Here are our startup time measurements, taken with hyperfine -w 3 'python -m amaranth_yosys'.

This is all completely fine. It should not be noticeable in any meaningful way--30 ms on an embedded CPU is below the perceptual threshold, I think.

whitequark commented 19 hours ago

I must admit, I was pretty indiscriminate about this: I just added everything in passes/opt/Makefile.inc (including the parts excluded when SMALL is set, since we don't set it), and then added objects until I stopped getting link errors. It's possible there are other dynamically-called passes I haven't hit that are still missing, in which case we'll get a runtime exception. I did test that my earlier test cases opt successfully!

Yeah, that's fine I think.

kivikakk commented 19 hours ago

It's compressed in the *.whl--that's a ZIP, I think. What's the compressed size like? (It's likely OK.)

From 5,290,853 bytes to 7,708,881 (+46%).
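(Measured straight from the wheel, which is just a ZIP archive; a minimal sketch of the check, with a hypothetical wheel filename:)

  import zipfile

  # Compare the stored vs. compressed size of yosys.wasm inside the wheel.
  with zipfile.ZipFile("amaranth_yosys-0.0.dev0-py3-none-any.whl") as whl:
      for info in whl.infolist():
          if info.filename.endswith("yosys.wasm"):
              print(info.filename, info.file_size, "->", info.compress_size)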

whitequark commented 19 hours ago

Right, so neither the startup latency nor the download size is going to be noticeable. I think it's also possible that wasmtime has gotten better in the meantime; back when I first made this package, the difference between running yowasp-yosys and amaranth-yosys was quite drastic. Perhaps at some point it would even make sense to deprecate amaranth-yosys entirely, if yowasp-yosys becomes just as fast--ultimately, I don't want to maintain two separate, subtly different and incompatible (and differently versioned) builds of the same software.

whitequark commented 19 hours ago

From memory, the startup latency was something like 150-300 ms for this package and 3-4 s for yowasp-yosys. That's a little too noticeable for seemingly simple operations like verilog.convert. I don't think anyone expects an HDL toolchain to be real-time-fast, but I still aim for that level of quality anyway.

kivikakk commented 18 hours ago

Hm. On the Hetzner VPS:

Benchmark 1: yowasp-yosys
  Time (mean ± σ):     170.3 ms ±   8.3 ms    [User: 155.9 ms, System: 29.6 ms]
  Range (min … max):   157.0 ms … 186.7 ms    16 runs

That's 25ms faster than current amaranth-yosys (without opt), and indeed its yosys.wasm is ~2MB lighter.

On my NAS:

Benchmark 1: /var/services/homes/kivikakk/.local/bin/yowasp-yosys
  Time (mean ± σ):     108.2 ms ±   1.0 ms    [User: 81.8 ms, System: 25.5 ms]
  Range (min … max):   106.1 ms … 109.7 ms    26 runs

27ms faster.

I've sanity-checked that yowasp-yosys looks fine, actually runs correctly, and has everything we expect it to, so I'm just a little confused. Maybe the Yosys version upgrades since then (0.40 vs. 0.42) have improved things greatly?

Here's the thing that super confuses me — on my M3, here's current amaranth-yosys:

Benchmark 1: /Users/kivikakk/.asdf/shims/python -m amaranth_yosys
  Time (mean ± σ):     178.9 ms ±   2.0 ms    [User: 119.1 ms, System: 17.1 ms]
  Range (min … max):   176.7 ms … 184.3 ms    16 runs

And here's yowasp-yosys:

Benchmark 1: yowasp-yosys
  Time (mean ± σ):     521.0 ms ±   4.8 ms    [User: 461.0 ms, System: 16.6 ms]
  Range (min … max):   512.1 ms … 528.4 ms    10 runs

191% slower. ?????? Is there something in wasmtime that really struggles on arm64 and only shows up in the full distribution? Or only with WASI SDK 22?

It appears to mostly occur during exit: when running e.g. time yowasp-yosys -p '' the only observable pause is after Yosys finishes and writes "Time spent: no commands executed".

whitequark commented 16 hours ago

Try building amaranth-yosys with wasi-sdk 22.0?

kivikakk commented 14 hours ago

Will!

kivikakk commented 1 hour ago

No difference there. I've also confirmed the pause comes after the entirety of yowasp_runtime.run_wasm (including the thread.join(), shutil.rmtree(), etc.) but before it returns to yowasp_yosys.run_yosys, so perhaps some local is being finalised in run_wasm?
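For reference, this is the kind of bracketing I'm using to localise the pause (a sketch; I'm assuming yowasp_yosys.run_yosys is the public entry point that ends up calling yowasp_runtime.run_wasm):

  import time
  import yowasp_yosys

  t0 = time.perf_counter()
  code = yowasp_yosys.run_yosys(["-p", ""])
  t1 = time.perf_counter()
  # On darwin-arm64, the ~300 ms pause falls inside this interval even though
  # Yosys has long since printed "Time spent: ...".
  print(f"run_yosys -> {code}, {t1 - t0:.3f}s")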

I don't know enough about Python here yet to say. I tried removing the separate thread (to bring YoWASP closer to how amaranth-yosys runs) and removing a bunch of the preopens, but neither made a difference.

If this particular behaviour (yowasp-yosys on darwin-arm64 hanging for ~300ms on exit for whatever reason) is what makes the difference between dropping amaranth-yosys in its favour or not, I'm happy to figure it out (and probably learn how to use macOS's Instruments or whatever); otherwise I'll leave it.