calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Toplevel `ref` cells interfere with static promotion #1956

Open ayakayorihiro opened 2 months ago

ayakayorihiro commented 2 months ago

Branching off of #1603 - one blocker for deprecating `@external` and adopting `ref` (synthesis mode) everywhere is that the Verilog currently generated by synthesis mode, which externalizes all memories, incurs a performance cost.
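(For context, a quick sketch of the two declaration styles in question; the cell name and sizes are just illustrative, assuming the standard comb_mem_d1 primitive, and this is not code from the issue:)

import "primitives/memories/comb.futil";

component main() -> () {
  cells {
    // Simulation mode today: the toplevel owns this memory, and the
    // generated harness loads/dumps it with readmemh/writememh.
    @external mem = comb_mem_d1(32, 1, 1);
    // Synthesis mode would instead declare it by reference:
    //   ref mem = comb_mem_d1(32, 1, 1);
    // and the testbench (or a calling component) supplies the real cell.
  }
  wires {}
  control {}
}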

Below is a comparison between (1) Calyx's current simulation mode (emitting readmemh/writememh; labeled Calyx) and (2) using `ref` to externalize memories (labeled CalyxRef). Time [ms] is the average Verilator simulation time in milliseconds, and Cycles is the number of cycles taken by the simulation. The cycle counts for CalyxRef are about 1.5-2x those for Calyx, which drives up the simulation time as well.

[table of Time [ms] and Cycles per benchmark for Calyx vs. CalyxRef, attached as an image]

This performance drop is most likely because of additional FSMs generated by the Calyx compiler to keep track of memory updates.

"Reproduction"/Trying out CalyxRef

You can run Calyx programs in CalyxRef by using fud2 from the fud2-refmem branch:

cd <CALYX_REPO>
git pull
git checkout fud2-refmem
cargo build
( cd fud2 ; cargo build)

Once you have fud2 set up, you can run CalyxRef by adding --through calyx-to-verilog-refmem to your fud2 simulation command. For example:

fud2 examples/tutorial/language-tutorial-iterate.futil --to dat --through calyx-to-verilog-refmem -s sim.data=`pwd`/examples/tutorial/data.json

should give you

{
  "cycles": 76,
  "memories": {
    "mem": [
      42
    ]
  }
}

whereas running the same program without externalized memories (Calyx):

fud2 examples/tutorial/language-tutorial-iterate.futil --to dat --through calyx-to-verilog -s sim.data=`pwd`/examples/tutorial/data.json

gives you

{
  "cycles": 37,
  "memories": {
    "mem": [
      42
    ]
  }
}


rachitnigam commented 2 months ago

Thanks for the explanation! I'm not sure if there is anything to do about this. The readmemh and writememh functions hide the latency to read and write IO memories and ref is just making it explicit. If you want to report only the compute part, you can subtract out the amount of time to read and write things.

Anyways, this goes to show that real accelerators need to optimize more than just compute; memory movement often dominates!

sampsyo commented 2 months ago

Indeed; thanks, @ayakayorihiro, for the detailed reproduction instructions! I could even reproduce the effect for the simpler language-tutorial-compute.futil: 3 cycles vs. 5 cycles. I admit I got a little nerdsniped by this. I don't think it should be slower, and I wanted to know why it was. Relevantly, re. @rachitnigam's comment:

The readmemh and writememh functions hide the latency to read and write IO memories and ref is just making it explicit.

I don't think this is the case here. readmemh and writememh exist in both treatments. And the ref shouldn't add any cycles, I think: we're either accessing a comb_mem_d1 locally or by reference, in the testbench. This makes me think something is wrong with the way we're compiling the ref in general, and it might not have to do with the testbench approach.
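(Aside, for anyone following along: a minimal, hypothetical example of how a ref cell is normally filled in at a Calyx call site. The names incr, m, etc. are made up, and in the refmem flow the generated Verilog testbench plays the caller's role; the point is that the group's assignments on mem are identical whether the cell is local or a ref, so the ref by itself shouldn't cost cycles.)

import "primitives/core.futil";
import "primitives/memories/comb.futil";

// Callee: only declares the memory by reference.
component incr() -> () {
  cells {
    ref mem = comb_mem_d1(32, 1, 1);
    add = std_add(32);
  }
  wires {
    // Same assignments we would write for a local comb_mem_d1.
    group upd {
      mem.addr0 = 1'b0;
      add.left = mem.read_data;
      add.right = 32'd1;
      mem.write_data = add.out;
      mem.write_en = 1'b1;
      upd[done] = mem.done;
    }
  }
  control {
    upd;
  }
}

// Caller: owns the concrete memory and passes it in via invoke.
component main() -> () {
  cells {
    @external m = comb_mem_d1(32, 1, 1);
    i = incr();
  }
  wires {}
  control {
    invoke i[mem = m]()();
  }
}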

Here are two VCDs I dumped from Icarus simulations under the two treatments: normal.vcd.txt and refmem.vcd.txt

The thing to notice in the waveforms is that the ref version has two FSM registers, fsm and fsm0, while the "normal" one gets just one:

[waveform screenshots attached: normal-waveform, ref-waveform]

I dug into the way these two are compiled (going pass-by-pass through the compilation pipeline). Critically, it looks like static promotion failed for the ref version. Its control after promotion is:

seq {
  static<2> seq  {
    read0;
    upd0;
  }
  write;
}

Whereas the "normal" version yields:

static<3> seq  {
  read0;
  upd0;
  write0;
}

(To reproduce this, try:

cargo run -- -p infer-data-path -p compile-invoke -p static-inference -p static-promotion examples/tutorial/language-tutorial-compute.futil

and add -p external-to-ref at the beginning to see the difference.)

In the end, the problem is that the program's write group is not inferred static when the underlying memory is a ref. That group looks like this, FWIW:

group write {
  mem.addr0 = 1'b0;
  mem.write_en = 1'b1;
  mem.write_data = val.out;
  write[done] = mem.done;
}

(The read group is always inferred static, probably because it's reading combinationally from a comb_mem_d1.)
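(For reference, here is roughly what the promoted version of that group looks like when inference succeeds: comb_mem_d1 completes a write in one cycle, so the group becomes a static<1> group with no done condition. This is a sketch of the expected shape, not literal compiler output.)

static<1> group write {
  mem.addr0 = 1'b0;
  mem.write_en = 1'b1;
  mem.write_data = val.out;
}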

Anyway, it seems pretty clear what's going on: top-level ref memories, unlike top-level @external memories, are interfering with static inference. That seems like the bug to fix here.

ayakayorihiro commented 2 months ago

Wow, thanks a lot for looking into this @sampsyo! It's good to know that the problem comes from a bug rather than something fundamental about externalizing memories. If we can get this fixed, then there seem to be no big obstacles to always externalizing memories via ref?

sampsyo commented 2 months ago

Indeed, that would be awesome! Then we would have only one testbench where the readmemh/writememh calls live, and we could remove a lot of intelligence from the compiler itself that currently deals with emitting this…