calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Create an AXI-interface generator implemented in Calyx #1733

Open nathanielnrn opened 8 months ago

nathanielnrn commented 8 months ago

This issue is intended to track progress on Phase 2 of Calyx Meets the Real World. This writeup gives great overarching context and what we are working towards.

Currently, we can run a limited number of programs on real FPGAs using fud. We accomplish this by generating Verilog AXI wrappers.

Unfortunately, the current state of the AXI wrappers is less than ideal. Lots of the generation code is hardcoded, and in general Verilog is not a fun language to work with. To that end, we are trying to build a generator that will take in a .yxi file and output an AXI interface -- in Calyx. The hope is that by using calyx-py we will be able to avoid some of the issues we've faced in the past (see #1071) and more easily create a more generalizable wrapper.

For reference, a dot-product.yxi (meaning the yxi-backend output of a dot-product.futil program) looks like this:

{
  "toplevel": "main",
  "memories": [
    {
      "name": "A0",
      "width": 32,
      "size": 8
    },
    {
      "name": "B0",
      "width": 32,
      "size": 8
    },
    {
      "name": "v0",
      "width": 32,
      "size": 1
    }
  ]
}
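Since the generator is driven entirely by this file, a natural first step is parsing it and deriving the AXI parameters each memory implies. A minimal Python sketch (the helper and the derived field names are illustrative, not part of any actual yxi schema):

```python
import json

def parse_yxi(text):
    """Parse a .yxi description and derive per-memory AXI parameters.

    Hypothetical helper for illustration; the real yxi output may carry
    more fields than `toplevel` and `memories`.
    """
    spec = json.loads(text)
    params = []
    for mem in spec["memories"]:
        size = mem["size"]
        params.append({
            "name": mem["name"],
            "data_width": mem["width"],
            # Address width: enough bits to index every word (clog2),
            # with a floor of 1 bit for single-entry memories like v0.
            "addr_width": max(1, (size - 1).bit_length()),
        })
    return spec["toplevel"], params
```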

The current plan is to have a separate AXI controller for each memory, similar to the current Verilog implementation.

Currently, both @evanmwilliams and I are working on getting acquainted with calyx-py. After that it probably makes sense to get together and formalize some next incremental steps, as a full AXI interface seems a bit daunting to tackle all in one go.

At that point we can list and track completion of subtasks here!

Update Nov 20 2023: Both @evanmwilliams and I have familiarized ourselves with calyx-py a bit. Work has also gone into manually creating a version of a Calyx axi-wrapper. Based on in-person discussions, it seems like the next step is to create a testbench that ensures the correctness of said axi-wrapper with cocotb, similar to what we've done in the past. The goal is to start with just the read portion of an axi-wrapper. The code we are trying to target lives in the branch axi-calyx-gen.
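One cheap ingredient for such a cocotb testbench is a pure-Python golden model to compare the hardware's results against. For the dot-product example above, that model is tiny (illustrative only; the cocotb wiring that drives A0/B0 over the read channel and reads back v0 is omitted):

```python
def dot_product_reference(a, b):
    """Golden model: what v0 should contain after the kernel runs.

    `a` and `b` are the word lists loaded into A0 and B0.
    """
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))
```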

Update Jan 2024: I've broken up work into a bunch of smaller tasks both in case we onboard someone to help work on this and also to give a clear game plan as we all get busy as the semester starts. There is a lot here but I think by chipping away at things we can make good progress.

Tasks to be completed, in order:

Some offshoot ideas that have sprung up:

sampsyo commented 8 months ago

Excellent! Sounds like a plan!

sampsyo commented 7 months ago

Expanding a little bit on the imaginary code I wrote above for how the AXI "wrapper" code might work, I think we should really use Calyx's ref cells to thread through the memories we want to expose.

That is, imagine that we have our main Calyx design, called main, that we intend to wrap:

component main() -> () {
  cells {
    @external input_mem = std_mem_d1(...);
    @external output_mem = std_mem_d1(...);
  }
}

We should first rewrite main to use ref cells instead of @external:

ref input_mem = std_mem_d1(...);
ref output_mem = std_mem_d1(...);

(In fact, we have elsewhere occasionally discussed getting rid of the @external attribute altogether and replacing it with ref. Since @external can only appear in top-level components anyway, ref would behave identically to @external in top-level components. But that's for another day; for now we can imagine that we have to do this preprocessing ourselves.)

Then, our job in this work is to generate a new top-level component, called axi_wrapper. It will "own" the memories, declaring them as "real" (non-ref) subcells:

component axi_wrapper(...) -> (...) {
  cells {
    the_main = main();
    main_input_mem = std_mem_d1(...);
    main_output_mem = std_mem_d1(...);
  }
}

The control for axi_wrapper can then use an invoke to run main, like this:

invoke the_main[input_mem=main_input_mem, output_mem=main_output_mem]();

Therefore, we can think of axi_wrapper's control program as embodying this rough "to-do" list:

  1. Receive input data from the host, putting them in my own main_input_mem.
  2. invoke the_main, as above. It has access to main_input_mem and main_output_mem during its execution.
  3. Send output data from my main_output_mem back to the host.
  4. Tell the host we are done!

…which can hopefully be implemented as a big seq that steps through those various phases!
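As a toy illustration of what the generator's output for that to-do list might look like, here is a plain-Python sketch that emits the seq skeleton as text (the group names are made up, and the real calyx-py generator builds IR objects rather than strings):

```python
def emit_wrapper_control(mems):
    """Emit a rough `seq` skeleton for axi_wrapper's control.

    `mems` maps ref-cell names in `main` to the wrapper's own cells,
    e.g. {"input_mem": "main_input_mem"}. Group names are placeholders.
    """
    binds = ", ".join(f"{r}={c}" for r, c in mems.items())
    steps = [
        "receive_inputs_from_host;",     # 1. fill our own memories
        f"invoke the_main[{binds}]();",  # 2. run the wrapped design
        "send_outputs_to_host;",         # 3. drain the result memories
        "signal_done_to_host;",          # 4. tell the host we finished
    ]
    body = "\n".join("    " + s for s in steps)
    return "seq {\n" + body + "\n}"
```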

(One minor note: the axi_wrapper thingy I'm envisioning here may also want to have subcells for individual, per-memory AXI controller components. Maybe? In which case we would define an axi_controller component, which would also have a ref cell for the memory it needs to interact with. And then axi_wrapper would use invoke axi_controller[mem=something](...) to tell it to send/receive data or whatever.)

rachitnigam commented 7 months ago

I like this idea! It is in the spirit of #1603. The idea is that Calyx is purely responsible for defining the computational interface of the component and something else can come in and provide the memory interface.

Spitballing a little more: one can imagine that once we land #1261 and have a standard memory interface with read and write done signals, the Calyx kernel can directly be connected to the AXI manager. Going a step further, this AXI module could instantiate things like memory coalescers, caches, reuse buffers, etc. and transparently improve the performance of the module. This kind of compute-memory decoupling might also be interesting to @andrewb1999 and @matth2k.

andrewb1999 commented 7 months ago

One question I have here is how the AXI interfaces will be implemented. I know the AXI interface currently reads all input values into on-chip memory and then launches the kernel. My general suggestion is that by default external memories should be fully off-chip, i.e., every time we want to read an address value we need to use the AXI interface to read it from DRAM. If we want to buffer values on chip, this should be explicit in the Calyx somewhere (either the main module or the axi wrapper module).

rachitnigam commented 7 months ago

Yeah, seconded! The goal of this project (if I understand correctly) is to express as much of the logic needed to move data around within Calyx itself. This includes the logic needed to "externalize" memory interfaces.

sampsyo commented 7 months ago

Thanks for the feedback, both of y'all!

I know the AXI interface currently reads all input values into on-chip memory and then launches the kernel. My general suggestion is that by default external memories should be fully off-chip, aka every time we want to read an address value we need to use the AXI interface to read a value from DRAM.

Yes, it is in scope in our original proposal to go beyond the "one-size-fits-all" data flow we have now. That is, aside from just changing the default (from buffer-everything to buffer-nothing/directly access host memory), it seems like there are many intermediate points you'd want to generate. For example, streaming data "blockwise" instead of requesting it on demand "wordwise" would be in scope, and would put things like AXI bursts behind the ref std_seq_mem abstraction layer.
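For the blockwise case, the generator will eventually have to split each transfer into legal AXI4 bursts: an INCR burst carries at most 256 beats (AxLEN <= 255), and no burst may cross a 4 KiB address boundary. A small Python sketch of that planning step (the function name and interface are illustrative):

```python
def plan_bursts(base_addr, num_words, word_bytes):
    """Split a transfer into AXI4-legal bursts.

    Assumes `base_addr` is word-aligned. Returns a list of
    (start_address, beats) pairs covering all `num_words` words,
    respecting the 256-beat cap and the 4 KiB boundary rule.
    """
    bursts = []
    addr, remaining = base_addr, num_words
    while remaining > 0:
        # Beats left before the next 4 KiB boundary.
        to_boundary = (0x1000 - (addr % 0x1000)) // word_bytes
        beats = min(remaining, 256, to_boundary)
        bursts.append((addr, beats))
        addr += beats * word_bytes
        remaining -= beats
    return bursts
```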

So anyway, the overall trajectory here is (1) recreate exactly what we currently have (the buffer-everything-on-chip policy) in Calyx land, and then (2) use our new, awesome, flexible, hackable, debuggable AXI generator to add new features/interface styles.

rachitnigam commented 7 months ago

Fly-by comment but there is something unsaid about the expressive power of ref in all of this. It's enabling us to do some cool things so we should eventually spend some more time thinking about extensions or other use cases.

nathanielnrn commented 5 months ago

There has been substantial progress with getting the read portion of the AXI interface to work: #1820. There is also some updated tracking in the original comment.

sampsyo commented 4 months ago

Given @nathanielnrn's awesome recent progress in #1842, I found myself mapping out a few granular steps for the medium-term future (aside from the aforementioned next step of converting this fixed-function implementation into a suitably parameterized generator). In no particular order:

And there are three "offshoot" ideas that are not that important but are kind of adjacent, to consider returning to "someday":

nathanielnrn commented 4 months ago

As the semester is coming up, I thought this seemed like a good place to stop, more concretely consider next steps, and take stock of where we are with things.

Some good progress has been made w.r.t. creating a parameterized version of our AXI implementation:

  1. Parameterized address channels (AR and AW) have an outstanding PR: #1855.
  2. Parameterized read channels have an outstanding PR: #1856.

Things left to be done for the parameterized generator:

  1. Create parameterized write channels
  2. Create parameterized write-response channels. (Worth noting this one should be especially easy, as we don't currently do much with this channel)

It should be noted that all 4 of the above are blocked by #1850, which is what I will be working on most immediately.
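All of these channels share the same AXI valid/ready handshake, so a tiny software model of it is handy when writing the parameterized channel generators and their tests. A sketch (purely illustrative; it only counts beats, and does not check that VALID is held until accepted):

```python
def count_transfers(valid, ready):
    """Count AXI handshake transfers given per-cycle valid/ready traces.

    A beat moves exactly on cycles where VALID and READY are both high,
    which is the rule for every AXI channel (AW, W, B, AR, R).
    """
    return sum(1 for v, r in zip(valid, ready) if v and r)
```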


Once the generator is done, I think it makes sense to tackle things in the following order (see comment for more detail about specific tasks):

  1. Modify the cocotb testbench to take in .yxi files. We can likely look to the Verilog cocotb testbench for some inspiration in this respect.
  2. Write a fud2 harness that works end to end. We want the harness to: a. compile the Calyx program normally; b. emit the program's interface as a .yxi file; c. generate the AXI wrapper from the .yxi spec; d. run the wrapped design using cocotb.
  3. Make the existing cocotb testbench work with runt and CI.
  4. Expand unit tests to include things like:
    • Multiple transactions
    • 0 and non-zero base addresses
    • Large (>256 transfers) data sets
  5. Work on the (hardcoded and then generated? Maybe we can skip straight to the generator) subordinate control interface in order to interface with XRT.
  6. Add a pass to the compiler that omits the go/done interface and replaces it with an ap_start/ap_done interface for toplevel components. This will likely be necessary for XRT interfacing to work. It is worth noting that there may be another option to target user-managed control instead, but it seems like that misses some of the point of creating a generalizable interface for FPGAs that gives us the benefits of using XRT.

The current offshoot ideas that are adjacent to this work, that we can continue returning to someday are:

The tracking for these has been updated above.

sampsyo commented 4 months ago

This all sounds great!!! Just one small note on the compiler hacking:

Add a pass to the compiler that omits the go/done interface and replaces it with an ap_start/ap_done interface for toplevel components.

The heart of the matter here may not actually be a new pass, nor even a new backend: I think all we need is a compiler option that omits the go/done signals on the top-level component. Then we can provide our own control interface in our wrapper, without worrying about anyone else mucking it up.