nathanielnrn opened 8 months ago
Excellent! Sounds like a plan!
Expanding a little bit on the imaginary code I wrote above for how the AXI "wrapper" code might work, I think we should really use Calyx's `ref` cells to thread through the memories we want to expose.
That is, imagine that we have our main Calyx design, called `main`, that we intend to wrap:
```
component main() -> () {
  cells {
    @external input_mem = std_mem_d1(...);
    @external output_mem = std_mem_d1(...);
  }
}
```
We should first rewrite `main` to use `ref` cells instead of `@external`:
```
ref input_mem = std_mem_d1(...);
ref output_mem = std_mem_d1(...);
```
(In fact, we have elsewhere occasionally discussed getting rid of the `@external` attribute altogether and replacing it with `ref`. Since `@external` can only appear in top-level components anyway, `ref` would behave identically to `@external` in top-level components. But that's for another day; for now we can imagine that we have to do this preprocessing ourselves.)
Then, our job in this work is to generate a new top-level component, called `axi_wrapper`. It will "own" the memories, declaring them as "real" (non-`ref`) subcells:
```
component axi_wrapper(...) -> (...) {
  cells {
    the_main = main();
    main_input_mem = std_mem_d1(...);
    main_output_mem = std_mem_d1(...);
  }
}
```
The control for `axi_wrapper` can then use an `invoke` to run `main`, like this:
```
invoke the_main[input_mem=main_input_mem, output_mem=main_output_mem]();
```
Therefore, we can think of `axi_wrapper`'s control program as embodying this rough "to-do" list:

1. Receive input data from the host into `main_input_mem`.
2. `invoke the_main`, as above. It has access to `main_input_mem` and `main_output_mem` during its execution.
3. Send `main_output_mem` back to the host.

…which can hopefully be implemented as a big `seq` that steps through those various phases!
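To make those phases concrete, here is a minimal plain-Python sketch (deliberately not using `calyx-py`) that emits Calyx source text for such a wrapper. The memory parameters are hypothetical placeholders, and the actual AXI send/receive phases are left as comments:

```python
# Sketch: emit Calyx source for an "axi_wrapper" that owns the memories and
# threads them into `main` via ref-cell bindings in an `invoke`.
# Memory names/widths here are illustrative placeholders, not a real yxi spec.

def emit_axi_wrapper(mems):
    """mems: list of (name, width, size, idx_size) tuples."""
    cells = ["    the_main = main();"]
    for name, width, size, idx in mems:
        cells.append(f"    main_{name} = std_mem_d1({width}, {size}, {idx});")
    bindings = ", ".join(f"{name}=main_{name}" for name, *_ in mems)
    return "\n".join([
        "component axi_wrapper() -> () {",
        "  cells {",
        *cells,
        "  }",
        "  control {",
        "    seq {",
        "      // 1. receive data from the host (AXI read logic goes here)",
        f"      invoke the_main[{bindings}]();",
        "      // 3. send results back to the host (AXI write logic goes here)",
        "    }",
        "  }",
        "}",
    ])

print(emit_axi_wrapper([("input_mem", 32, 8, 3), ("output_mem", 32, 8, 3)]))
```

A real generator would of course derive the cell list from the `.yxi` spec rather than a hand-written tuple list; this only shows the overall shape of the emitted wrapper.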
(One minor note: the `axi_wrapper` thingy I'm envisioning here may also want to have subcells for individual, per-memory AXI controller components. Maybe? In which case we would define an `axi_controller` component, which would also have a `ref` cell for the memory it needs to interact with. And then `axi_wrapper` would use `invoke axi_controller[mem=something](...)` to tell it to send/receive data or whatever.)
I like this idea! It is in the spirit of #1603. The idea is that Calyx is purely responsible for defining the computational interface of the component and something else can come in and provide the memory interface.
Spitballing a little more: one can imagine that once we address #1261 and have a standard memory interface with read and write `done` signals, the Calyx kernel can be connected directly to the AXI manager. Going a step further, this AXI module can instantiate things like memory coalescers, caches, and reuse buffers, and transparently improve the performance of the module. This kind of compute-memory decoupling might also be interesting to @andrewb1999 and @matth2k.
One question I have here is how the AXI interfaces will be implemented. I know the current AXI interface reads all input values into on-chip memory and then launches the kernel. My general suggestion is that by default external memories should be fully off-chip, i.e., every time we want to read an address's value, we use the AXI interface to read it from DRAM. If we want to buffer values on chip, this should be explicit in the Calyx somewhere (either the main module or the AXI wrapper module).
Yeah, seconded! The goal of this project (if I understand correctly) is to express as much of the logic needed to move data around as possible within Calyx itself. This includes the logic needed to "externalize" memory interfaces.
Thanks for the feedback, both of y'all!
> I know currently the AXI interfaces reads all inputs values to on-chip memory and then launches the kernel. My general suggestion is that by default external memories should be fully off-chip, aka every time we want to read an address value we need to use the AXI interface to read a value from DRAM.
Yes, it is in scope in our original proposal to go beyond the "one-size-fits-all" data flow we have now. That is, aside from just changing the default (from buffer-everything to buffer-nothing/directly access host memory), it seems like there are many intermediate points you'd want to generate. For example, streaming data "blockwise" instead of requesting it on demand "wordwise" would be in scope, and would put things like AXI bursts behind the `ref std_seq_mem` abstraction layer.
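As a rough illustration of the blockwise idea, here is a small Python sketch that splits a run of word addresses into bursts. The 256-beat cap matches AXI4's INCR burst-length limit; everything else (the function name, the planning-as-a-list representation) is purely illustrative:

```python
# Sketch of "blockwise" streaming: cover a run of word addresses with a few
# AXI-style bursts instead of issuing one transfer per word.

def burst_plan(base_addr, num_words, max_beats=256):
    """Return (start_address, beats) pairs covering num_words words.

    max_beats=256 reflects the AXI4 INCR burst-length limit.
    """
    bursts = []
    addr = base_addr
    remaining = num_words
    while remaining > 0:
        beats = min(remaining, max_beats)
        bursts.append((addr, beats))
        addr += beats
        remaining -= beats
    return bursts
```

For example, `burst_plan(0, 600)` covers 600 words with two full 256-beat bursts followed by one 88-beat burst, versus 600 single-word requests in the "wordwise" style.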
So anyway, the overall trajectory here is (1) recreate exactly what we currently have (the buffer-everything-on-chip policy) in Calyx land, and then (2) use our new, awesome, flexible, hackable, debuggable AXI generator to add new features/interface styles.
Fly-by comment, but there is something unsaid about the expressive power of `ref` in all of this. It's enabling us to do some cool things, so we should eventually spend some more time thinking about extensions or other use cases.
There has been substantial progress with getting the read portion of the AXI interface to work: #1820. There is also some updated tracking in the original comment.
Given @nathanielnrn's awesome recent progress in #1842, I found myself mapping out a few granular steps for the medium-term future (aside from the aforementioned next step of converting this fixed-function implementation into a suitably parameterized generator). In no particular order:

- `fud2 something.futil -s sim.data=stuff.json --to dat --through axi-cocotb` should (1) compile the Calyx program normally, (2) emit the yxi JSON file, (3) generate the AXI wrapper from the yxi spec, and (4) run the combined design using cocotb. This would ideally allow broad differential testing against "normal" (readmemh/writememh) simulation.
- Omit the go/done interface on the `@toplevel` component. Something like this will be important for when we hand this stuff off to the Xilinx toolchain, which of course will not know that it needs to do this. Morally speaking, the AXI control interface takes the place of the Calyx go/done interface, so it makes sense to omit one and keep the other.

And there are three "offshoot" ideas that are not that important but are kind of adjacent, to consider returning to "someday":

- Converting between integers and byte arrays (`int_to_bytes` and `bytes_to_int`) is surprisingly subtle and not actually all that AXI-specific. A standalone tool for generating the byte arrays necessary here would make this important functionality more reusable and testable.

As the semester is coming up, it seemed like a good place to stop, take stock of where we are with things, and more concretely consider next steps.
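For what it's worth, a standalone version of these helpers can be quite small. This Python sketch shows the shape such a tool might take; the fixed-width and little-endian choices are assumptions for illustration, not the repo's actual conventions:

```python
# Sketch of standalone int<->bytes helpers, the kind of reusable, testable
# tool suggested above. Width and endianness are illustrative assumptions.

def int_to_bytes(value, width_bits, byteorder="little"):
    """Convert an unsigned integer to a fixed-width list of bytes."""
    assert width_bits % 8 == 0, "width must be a whole number of bytes"
    assert 0 <= value < (1 << width_bits), "value out of range for width"
    return list(value.to_bytes(width_bits // 8, byteorder))

def bytes_to_int(data, byteorder="little"):
    """Inverse of int_to_bytes."""
    return int.from_bytes(bytes(data), byteorder)
```

The subtlety the comment alludes to lives in exactly these choices: endianness, data width, and signedness all have to agree between the host-side tool and the hardware interface, which is why round-trip tests on a standalone helper are so valuable.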
Some good progress has been made w.r.t. creating a parameterized version of our AXI implementation:

- The address channels (`AR` and `AW`) have an outstanding PR: #1855.

Things left to be done for the parameterized generator:
It should be noted that all 4 of the above are blocked by #1850, which is what I will be working on most immediately.
Once the generator is done, I think it makes sense to tackle things in the following order (see comment for more detail about specific tasks):
- A cocotb testbench that works on arbitrary `.yxi` files. We can likely look to the Verilog cocotb testbench for some inspiration in this respect.
- A `fud2` harness that works end to end. We want the harness to:
  a. Compile the Calyx program normally.
  b. Emit the program's interface as a `.yxi` file.
  c. Generate the AXI wrapper from the yxi spec.
  d. Run the wrapped design using cocotb.
- An option that omits the go/done interface and replaces it with an `ap_start`/`ap_done` interface for `toplevel` components. This will likely be necessary for XRT interfacing to work. It is worth noting that there may be another option to target user-managed control instead, but it seems like this misses some of the point of creating a generalizable interface for FPGAs that gives us the benefits of using XRT.

The current offshoot ideas that are adjacent to this work, and that we can continue returning to someday, are:

- Consolidate our `*_to_byte` and `byte_to_*` functions. I believe this currently exists in a number of places in the repo. I believe some of this is done in `fud` currently? But I have a vague recollection of it being duplicated in some places? Perhaps the old Verilog AXI cocotb testbench?
- Loosen the current requirements that `IDX_SIZE` must match the expected width based on `SIZE` of a memory and that multi-dimension memories be flattened to `seq_mem_d1` memories.

The tracking for these has been updated above.
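Assuming `IDX_SIZE` is the address-port width (i.e., enough bits to index `SIZE` entries), the two requirements above amount to checks like this sketch; the function names are hypothetical:

```python
import math

# Sketch of the two current requirements: the index width implied by a
# memory's SIZE, and the flattened element count for a seq_mem_d1.

def expected_idx_size(size):
    """Bits needed to address `size` entries (at least 1 bit)."""
    return max(1, math.ceil(math.log2(size)))

def flatten_dims(dims):
    """Total element count of a multi-dimensional memory, flattened to 1-D."""
    total = 1
    for d in dims:
        total *= d
    return total
```

So a `SIZE=10` memory is expected to carry `IDX_SIZE=4`, and a 4x5 two-dimensional memory would be presented to the generator as a 20-element `seq_mem_d1`.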
This all sounds great!!! Just one small note on the compiler hacking:
> Add a pass to the compiler that omits the go/done interface and replaces it with an `ap_start`/`ap_done` interface for `toplevel` components.

The heart of the matter here may not actually be a new pass, nor even a new backend: I think all we need is a compiler option that omits the `go`/`done` signals on the top-level component. Then we can provide our own control interface in our wrapper, without worrying about anyone else mucking it up.
This issue is intended to track progress on Phase 2 of Calyx Meets the Real World. This writeup gives great overarching context for what we are working towards.
Currently, we can run a limited number of programs on real FPGAs using `fud`. We accomplish this by generating Verilog AXI wrappers.
Unfortunately, the current state of the AXI wrappers is less than ideal. Lots of the generation code is hardcoded, and in general Verilog is not a fun language to work with. To that end, we are trying to build a generator that will take in a `.yxi` file and output an AXI interface -- in Calyx. The hope is that by using `calyx-py` we will be able to avoid some of the issues we've faced in the past (see #1071) and more easily create a more generalizable wrapper.

For reference, a `dot-product.yxi` (meaning the yxi-backend output of a `dot-product.futil` program) looks like this:

The current plan is to have a separate AXI controller for each memory, similar to the current Verilog implementation.
Currently, both @evanmwilliams and I are working on getting acquainted with `calyx-py`. After that, it probably makes sense to get together and formalize some next incremental steps, as a full AXI interface seems a bit daunting to tackle all in one go. At that point we can list and track completion of subtasks here!
Update Nov 20 2023: Both @evanmwilliams and I have familiarized ourselves with `calyx-py` a bit. Work has also gone into manually creating a version of a Calyx axi-wrapper. Based on in-person discussions, it seems like the next step is to create a testbench that ensures the correctness of said axi-wrapper with cocotb, similar to what we've done in the past. The goal is to start with just the read portion of an axi-wrapper. The code we are trying to target lives in the branch `axi-calyx-gen`.
Update Jan 2024: I've broken up work into a bunch of smaller tasks both in case we onboard someone to help work on this and also to give a clear game plan as we all get busy as the semester starts. There is a lot here but I think by chipping away at things we can make good progress.
Tasks to be completed, in order:

- Get a fixed version of the read portion (`AR` and `R` channels) implemented in Calyx. This is a way to better understand what we hope to eventually dynamically generate.
- Get a fixed version of the write portion (`AW`, `W`, and `B` channels) implemented in Calyx. This will hopefully be more straightforward once the infrastructure is set up from the tasks above.
- Emit the program's interface as a `yxi` file. #1994
- Generate the AXI wrapper from the `yxi` spec. #1994
- Add ~~a pass~~ an option to the relevant compiler pass that omits the go/done interface and replaces it with an `ap_start`/`ap_done` interface for `toplevel` components. This will likely be necessary for XRT interfacing to work. It is worth noting that there may be another option to target user-managed control instead, but it seems like this misses some of the point of creating a generalizable interface for FPGAs that gives us the benefits of using XRT.

Some offshoot ideas that have sprung up: