Digging into the example a bit, here are some answers:
* Calyx has a `ref` keyword to make this easier to implement and optimize, but the CIRCT compiler currently does not use it.
* Remove the `--nested` flag from the final command. This will generate wires that track the FSM signals better.
I haven't looked into your testbench yet, but let me know if these answers help? If your TB simulates, can you send me the VCD file so I can take a closer look?
Thanks a lot for the reply.
va_for_loop.mlir.txt
Probably my MLIR file is incorrect: I would like to implement a 4-element vector add, but the IO memories contain only one element each, e.g., `%arg0: memref<1xi32, 1>`, `%arg1: memref<1xi32, 1>`, `%arg2: memref<1xi32, 1>`.
I am trying to do behavioral simulation of the generated Verilog (va_for_loop_calyx_naive.sv.txt) in Vivado. It is a little vague which ports are inputs and outputs, but I found that `ext_mem1_write_data` and `ext_mem2_write_data` don't have drivers and deliver 'xxxx_xxxx'. So I think these two ports should be inputs, and `ext_mem0_write_data` is the output port.
I input one value each clock cycle, meaning I drive four values 0,1,2,3 on `ext_mem1_read_data` and 0,2,4,6 on `ext_mem2_read_data` over four clock cycles. But I couldn't get correct results from `ext_mem0_write_data`, so I wonder if I stimulated the inputs improperly. Since I am using Vivado, this is the waveform file I got; in case you cannot open it, I also attached a screenshot. va_1x4_tb_behav.zip
Looking forward to your reply.
Ah, thanks for the simulation trace. Couple more pointers for you:
* If you use `memref<4xi32>`, it should create an array with 4 elements.
* The memories are `std_mem_d1` instances. You can look at the Verilog for that module to get a sense of how the ports should be used. `read_data` is the "output" port of a memory and combinationally updates when you provide a value to `addr0`. The `write_data` port writes a value into the address specified by `addr0` only when `write_en` is also asserted. (A sketch of a memory model with this behavior follows below.)
* The reason you might be having a hard time simulating this design appropriately is that there are no memories in the program–your testbench needs to model the behavior of the memories and maintain the interface specified by `std_mem_d1`.
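For readers following along, here is a minimal sketch of a memory model with the behavior described above: a combinational read port and a write port gated by write_en. The module name, parameter names, and exact done timing are assumptions for illustration; the real std_mem_d1 definition lives in Calyx's primitives library.

module mem_d1_model #(
  parameter int WIDTH = 32,  // data width
  parameter int SIZE  = 4,   // number of words
  parameter int IDX   = 2    // address width
) (
  input  logic             clk,
  input  logic [IDX-1:0]   addr0,
  input  logic [WIDTH-1:0] write_data,
  input  logic             write_en,
  output logic [WIDTH-1:0] read_data,
  output logic             done
);
  logic [WIDTH-1:0] mem [SIZE];

  // read_data combinationally reflects the word selected by addr0
  assign read_data = mem[addr0];

  // write_data is committed only while write_en is asserted;
  // done pulses for one cycle after the write (assumed timing)
  always_ff @(posedge clk) begin
    if (write_en) mem[addr0] <= write_data;
    done <= write_en;
  end
endmodule

Declaring mem as an unpacked array lets a testbench preload or inspect it hierarchically (e.g., arg1.mem), which the testbench sketch later in this thread relies on.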
If I may, I really recommend going through the language tutorial to get a sense of how we simulate programs with Calyx's default testbench. In the meantime, I'm going to go and attempt to write a testbench for you and see if I can make your design work.
Okay, I've managed to create a test bench that works and demonstrates the correct result. To run it, I used Icarus Verilog:
iverilog -g2012 va-tb.sv && vvp ./a.out
I think you might be able to use the Vivado simulator with it too. The big things to note:
* `arg0` is the output memory and is connected to the `ext_mem0_*` ports. Its read interface (`read_data`) is unconnected and produces 'x because the memory is only used as an output memory.
* `arg1` and `arg2` are the input memories, and their write interface (`write_en`, `write_data`) is unconnected and produces 'x.
* I'm instantiating `std_mem_d1` primitives in the testbench and leaving it to the DUT to control them through their ports.
Take a look and see if you can make it work and feel free to ask questions! (A rough sketch of the wiring is below.)
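To make the structure concrete, here is a rough sketch of that wiring for the four-element vector add. It reuses the hypothetical mem_d1_model from earlier and assumes a top-level module named main with the ext_mem*_ port convention; the actual va-tb.sv (which instantiates the real std_mem_d1 primitive) is the authoritative version.

module va_tb;
  logic clk = 0, reset = 1, go = 0, done;
  always #5 clk = ~clk;  // free-running simulation clock

  // One port bundle per external memory
  logic [31:0] ext_mem0_read_data, ext_mem0_write_data;
  logic [1:0]  ext_mem0_addr0;
  logic        ext_mem0_write_en, ext_mem0_done;
  logic [31:0] ext_mem1_read_data, ext_mem1_write_data;
  logic [1:0]  ext_mem1_addr0;
  logic        ext_mem1_write_en, ext_mem1_done;
  logic [31:0] ext_mem2_read_data, ext_mem2_write_data;
  logic [1:0]  ext_mem2_addr0;
  logic        ext_mem2_write_en, ext_mem2_done;

  // The memories live in the testbench; the DUT drives their ports.
  mem_d1_model #(.WIDTH(32), .SIZE(4), .IDX(2)) arg0 (.clk(clk),
    .addr0(ext_mem0_addr0), .write_data(ext_mem0_write_data),
    .write_en(ext_mem0_write_en), .read_data(ext_mem0_read_data),
    .done(ext_mem0_done));
  mem_d1_model #(.WIDTH(32), .SIZE(4), .IDX(2)) arg1 (.clk(clk),
    .addr0(ext_mem1_addr0), .write_data(ext_mem1_write_data),
    .write_en(ext_mem1_write_en), .read_data(ext_mem1_read_data),
    .done(ext_mem1_done));
  mem_d1_model #(.WIDTH(32), .SIZE(4), .IDX(2)) arg2 (.clk(clk),
    .addr0(ext_mem2_addr0), .write_data(ext_mem2_write_data),
    .write_en(ext_mem2_write_en), .read_data(ext_mem2_read_data),
    .done(ext_mem2_done));

  // Substitute the name of your generated top-level module here.
  main dut (.*);

  initial begin
    arg1.mem = '{0, 1, 2, 3};  // first input vector
    arg2.mem = '{0, 2, 4, 6};  // second input vector
    repeat (2) @(posedge clk);
    reset = 0;
    go = 1;
    wait (done);
    $display("result = %p", arg0.mem);  // expect '{0, 3, 6, 9}
    $finish;
  end
endmodule

This would run the same way as above, e.g. iverilog -g2012 va_tb.sv plus the model and your generated Verilog, then vvp ./a.out.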
Many thanks for the help.
The Verilog code seems not to be the same as my uploaded MLIR. Can I ask what lowering pipeline you used to get the Verilog you just sent?
Oh I see, hardware primitives are gathered in one .sv file. I will try the testbench.
Thanks a lot!
Hi there,
Thanks for the previous explanation. The vector-add example works correctly now, and I have moved on to matrix multiplication.
The generated Verilog seems unable to recognize each row of the output matrix. For example, for a 4x4 matrix multiplication, the correct result should be:
0,0,0,0
0,4,8,12
0,8,16,24
0,12,24,36
But the code gives us the result:
0,0,0,0
0,4,12,24
24,32,48,72
72,84,108,144
Elements of the output matrix keep accumulating until the last one. I think the reason is that the FSM is not mapped correctly from MLIR to Verilog. I added a Verilog snippet to the module and revised the RAM in core.sv, which solved this issue for matrix multiplications of any size.
The added Verilog in the module (4x4 gemm for example):
always_ff @(posedge clk or posedge reset) begin
  if (reset)
    we_cntr <= 32'b0;
  else if (we_cntr < 'd4 && ext_mem0_write_en) // length of each MAC
    we_cntr <= we_cntr + 1'b1;
  else if (we_cntr == 'd4) // length of each MAC
    we_cntr <= 'b0;
  else
    we_cntr <= we_cntr;
end
Revised BRAM snippet:
assign read_data = we_cntr=='b0 ? 'b0 : mem[addr0];
So that the first element reads as zero before the accumulation starts.
Although this manual modification makes it work, it would hinder the fully automated pipeline. I just wonder: is there any better method to address this issue, if it is caused by Calyx? I attached the Verilog files and MLIR here.
The second question is: how can I wrap ext_mem0read and ext_mem0write with an AXI interface using the Calyx tool? Do I need to change the IO to @external in the Calyx dialect?
The questions might seem a little annoying; I appreciate you taking the time to read them.
Many thanks!
Very interesting! It would be good to narrow down whether this bug is on the CIRCT side (translating to Calyx) or on the Calyx side (compiling Calyx IL to Verilog). I see the zip archive there has the original (`affine` dialect) MLIR code and the Verilog—any chance you could generate the Calyx IL code that results from the CIRCT-side compilation so we can look at that?
If you're interested, it's possible to execute Calyx code directly with our debugger, which would help distinguish whether the Calyx code is wrong or whether it's being compiled wrong.
The second question is: how can I wrap ext_mem0read and ext_mem0write with an AXI interface using the Calyx tool? Do I need to change the IO to @external in the Calyx dialect?
Wrapping memories in an AXI interface in Calyx is currently under the umbrella of our support for the Xilinx toolchain. You can enable that backend by passing `-b xilinx` to the compiler, for example:
cargo run -- -b xilinx examples/futil/dot-product.futil
That produces Verilog with AXI wrappers for the `@external` memories.
Hi professor,
Thanks a lot for your fast reply.
Attached please see our Calyx IL and other CIRCT dialect files (memref, scf): Calyx_IL.zip
I will study the debugger to investigate where the bug occurs.
That produces Verilog with AXI wrappers for the @external memories.
Our Calyx IL does not have @external memories; that is the point I am curious about. Without '@external', this error was raised:
Error: Program has no memories marked with attribute @external. Please make sure that at least one memory is marked as @external.
when running cargo run -- -b xilinx.
Many thanks.
Our Calyx IL does not have @external memories; that is the point I am curious about. Without '@external', this error was raised:
Ah, I see now! I didn't previously quite grok that the Calyx program didn't have any memories at all; they have already been "externalized." This is a little hard to explain, but bear with me: if you look at the (native) Calyx code, you'll see that the top-level component (`mlir_funcSYCL_class_mxm_kernel`) has a bunch of input ports like `ext_mem0_read_data` and `ext_mem0_done`: several ports for each memory. The idea is that the module is assuming someone else has instantiated these memories and is handing over all their ports to access them—for example, that's exactly what @rachitnigam's testbench above provided.
In the native Calyx ecosystem, we achieve the same effect by declaring `@external` memories. These "look like" real memories but actually get turned into the same thing as above during compilation: there is no real memory, and instead someone "external" provides the interface ports to the memories. It looks like this:
https://github.com/cucapra/calyx/blob/d03d22ffb3f5b13d3091e2d3e82d8fb3d0963f54/examples/tutorial/language-tutorial-iterate.futil#L5
It looks like the `SCFToCalyx` pass is responsible for creating the former "add a bunch of ports" style of code, as described by the `MemoryPortsImpl` variant of `MemoryInterface`:
https://github.com/llvm/circt/blob/7bbf7adc882686ea337d4ebb371cf204f6b04d73/include/circt/Dialect/Calyx/CalyxLoweringUtils.h#L87-L93
It seems to me that we might want to change this pass to use the `@external` memory style instead. It would ultimately compile to the same thing, and it would let us do stuff in Calyx that is currently impossible: most relevantly here, running the debugger and using the Xilinx backend to generate AXI wrappers for the memories.
Hey @zzy666666zzy, I'm working on a pass to automate the transformation of ports into `@external` cells, which will allow you to use the Cider and `fud` tools in this repo: https://github.com/cucapra/calyx/pull/1488
I'll ping the thread once it's done with instructions for you to try it out. Hopefully, this will also make it easier to debug the bug in the program above, which was likely caused by the CIRCT lowering.
Okay, the pass I linked above is working! Once it is merged, you can run the following command to extract cells marked with `@external` from your program:
cargo run -- gemm4x4_calyx.txt -p validate -p discover-external -p validate -m file -x discover-external:default=16
The `cargo run` command rebuilds the compiler from source in the repository. If you're using the crate, you'll have to update it first and then run the compiler:
cargo install calyx
calyx gemm4x4_calyx.txt -p validate -p discover-external -p validate -m file -x discover-external:default=16
(This should update to version `0.2.1` of the compiler.)
You'll notice that you get the following warnings:
[WARN calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem0_ in std_mem_d1, defaulting to 4
[WARN calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem2_ in std_mem_d1, defaulting to 4
[WARN calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem1_ in std_mem_d1, defaulting to 4
This is because the pass has no way of inferring the size of the memory for you. We provide `-x discover-external:default=16` to say that when a parameter to a cell (in this case, the size of the memory) cannot be inferred, the value 16 should be used, which works for your example since your memory has 4x4=16 elements.
If you look at the output of the pass, you can see that we've removed a lot of the ports and added a bunch of new cells marked with `@external`:
cells {
...
@external @generated ext_mem1_ = std_mem_d1(32, 4, 4);
@external @generated ext_mem0_ = std_mem_d1(32, 4, 4);
@external @generated ext_mem2_ = std_mem_d1(32, 4, 4);
}
These are exactly the cells we need to make the program work with the AXI generator and the debugger! Next, we need to specify some data using the Calyx testbench's data format, which is just a JSON file:
{
  "ext_mem0_": {
    "data": [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    "format": {"numeric_type": "bitnum", "is_signed": false, "width": 32}
  },
  "ext_mem1_": {
    "data": [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],
    "format": {"numeric_type": "bitnum", "is_signed": false, "width": 32}
  },
  "ext_mem2_": {
    "data": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
    "format": {"numeric_type": "bitnum", "is_signed": false, "width": 32}
  }
}
Then, I ran the following `fud` command to simulate the program through icarus-verilog:
fud e -s verilog.data data.json --to dat --through icarus-verilog out.futil
Sadly, `fud` will complain that the `main` module is missing. This is because the Calyx testbench hardcodes `main` as the name of the DUT. To fix this, I renamed the top module from `mlir_funcSYCL_class_mxm_kernel` to `main`, and this allows fud to generate the output:
{"cycles":1940,"memories":{"ext_mem0_":[0,0,0,0,0,4,12,24,24,32,48,72,72,84,108,144],"ext_mem1_":[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],"ext_mem2_":[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]}}
This reproduces the erroneous output you are getting. I'm going to wrap this comment up at this point but will report back once I find the issue! Hopefully this is helpful.
Please ask questions if you run into any trouble understanding this!
Okay, I've figured out the bug! The problem is this group:
group assign_while_0_init_1 {
while_0_arg1_reg.in = load_0_reg.out; // incorrect init
while_0_arg1_reg.write_en = 1'd1;
assign_while_0_init_1[done] = while_0_arg1_reg.done;
}
Note that it resets the value of `while_0_arg1_reg` to the value of `load_0_reg`. This is incorrect; if we instead make it set the value to 0, the program generates the expected output:
group assign_while_0_init_1 {
while_0_arg1_reg.in = 32'd0;
while_0_arg1_reg.write_en = 1'd1;
assign_while_0_init_1[done] = while_0_arg1_reg.done;
}
Running fud, we get the correct answer:
{"cycles":2068,"memories":{"ext_mem0_":[0,0,0,0,0,4,8,12,0,8,16,24,0,12,24,36],"ext_mem1_":[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],"ext_mem2_":[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]}}
We'll still have to figure out where the CIRCT compilation introduces this bug...
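As a sanity check on those numbers, here is a small self-checking golden model (a sketch only): it recomputes the 4x4 product from the flat, row-major data in the JSON file above and compares it against the corrected memory dump.

module gemm_golden_check;
  localparam int N = 4;
  int A [N*N], B [N*N], C [N*N];
  // Expected contents of ext_mem0_ from the corrected fud run above
  int expected [N*N] = '{0,0,0,0, 0,4,8,12, 0,8,16,24, 0,12,24,36};
  initial begin
    foreach (A[idx]) A[idx] = idx / N; // ext_mem1_: row i holds the value i
    foreach (B[idx]) B[idx] = idx % N; // ext_mem2_: every row is 0,1,2,3
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) begin
        C[i*N + j] = 0;
        for (int k = 0; k < N; k++)
          C[i*N + j] += A[i*N + k] * B[k*N + j];
      end
    assert (C == expected) else $error("golden model mismatch");
    $display("golden model matches the corrected output");
  end
endmodule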
Can we double check the input MLIR? This is what I have from the thread:
// Is this IR correct?
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
%0 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
%1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %0) -> (i32) {
%2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
%3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
%4 = arith.muli %2, %3 : i32
%5 = arith.addi %4, %arg6 : i32
affine.store %5, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
affine.yield %5 : i32
}
}
}
return
}
Here are two attempts at rewriting it:
// The above didn't make sense to me. This is how I rewrote the gemv kernel
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
// Start the inner product at constant 0
%c0_i32 = arith.constant 0 : i32
%1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %c0_i32) -> (i32) {
%2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
%3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
%4 = arith.muli %2, %3 : i32
%5 = arith.addi %4, %arg6 : i32
affine.yield %5 : i32
}
// Assign out[%arg3][%arg4] with inner product result
affine.store %1, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
}
}
return
}
// Can use repeated loads and stores instead of carrying an iterArg
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
// Start the inner product at constant 0
%c0_i32 = arith.constant 0 : i32
affine.store %c0_i32, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
affine.for %arg5 = 0 to 4 {
%2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
%3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
%4 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
%5 = arith.muli %2, %3 : i32
%6 = arith.addi %5, %4 : i32
affine.store %6, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
}
}
}
return
}
We can use the `linalg` dialect and `-convert-linalg-to-affine-loops` to make it less confusing to check, too:
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<4x4xi32, 1>, %arg1: memref<4x4xi32, 1>, %arg2: memref<4x4xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
affine.for %arg5 = 0 to 4 {
%0 = affine.load %arg0[%arg3, %arg5] : memref<4x4xi32, 1>
%1 = affine.load %arg1[%arg5, %arg4] : memref<4x4xi32, 1>
%2 = affine.load %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
affine.store %4, %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
}
}
}
return
}
Oh hey, that's a good catch @matth2k! @zzy666666zzy, it seems the input MLIR code is wrong because it does not correctly load the accumulation variable, causing the downstream problem.
I've also written up some documentation for two new passes that should make it easier to interact with the Calyx compiler when using CIRCT generated code! https://docs.calyxir.org/fud/circt.html#using-native-tools-with-mlir-generated-calyx
Hi @matth2k,
Thanks for the corrected MLIR. I have tried the three provided MLIR versions; only
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
// Start the inner product at constant 0
%c0_i32 = arith.constant 0 : i32
%1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %c0_i32) -> (i32) {
%2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
%3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
%4 = arith.muli %2, %3 : i32
%5 = arith.addi %4, %arg6 : i32
affine.yield %5 : i32
}
// Assign out[%arg3][%arg4] with inner product result
affine.store %1, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
}
}
return
}
works for my current testbench.
This version
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
// Start the inner product at constant 0
%c0_i32 = arith.constant 0 : i32
affine.store %c0_i32, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
affine.for %arg5 = 0 to 4 {
%2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
%3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
%4 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
%5 = arith.muli %2, %3 : i32
%6 = arith.addi %5, %4 : i32
affine.store %6, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
}
}
}
return
}
writes the resultant BRAM back with all zeros.
And this version:
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<4x4xi32, 1>, %arg1: memref<4x4xi32, 1>, %arg2: memref<4x4xi32, 1>) {
affine.for %arg3 = 0 to 4 {
affine.for %arg4 = 0 to 4 {
affine.for %arg5 = 0 to 4 {
%0 = affine.load %arg0[%arg3, %arg5] : memref<4x4xi32, 1>
%1 = affine.load %arg1[%arg5, %arg4] : memref<4x4xi32, 1>
%2 = affine.load %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
affine.store %4, %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
}
}
}
return
}
seems to change the destination pointer from `%arg0` to `%arg2`, so the input and output ports in the testbench should be changed accordingly. I have managed to use the working version to generate an AXI-wrapped top module. Thanks a lot.
Thanks @rachitnigam for the supplementary documentation. Can I ask if it is possible to avoid using XRT, and instead use a standalone (bare-metal) flow with a .tcl script to generate a block design and integrate it with the ARM CPU?
(Just wanted to say AMAZING WORK to @rachitnigam for improvising those extra passes to help debug this! Really really cool that it was possible to "recover" `@external` memories from what CIRCT is currently producing.)
can I ask if it is possible to avoid using XRT, and instead use a standalone (bare-metal) flow with a .tcl script to generate a block design and integrate it with the ARM CPU?
Good question. Looking at the generated Verilog, is there anything in there that prevents this use case? That is, our current AXI-wrapping pass is indeed meant to be used from XRT, but it "mostly" just consists of AXI interfaces to the relevant memories (and control signals). It seems like this could be a good starting point for integrating with something else instead, such as a soft ARM core, if it's willing to speak AXI. But are there particular things we're generating there that prevent that for your situation?
Thanks @rachitnigam for the supplementary documentation. Can I ask if it is possible to avoid using XRT, and instead use a standalone (bare-metal) flow with a .tcl script to generate a block design and integrate it with the ARM CPU?
I think it should be possible. As @sampsyo said, the code is designed to generate a generic AXI controller but uses XRT to generate and interface with the host code. Can you take a shot at using it and see what kinds of errors show up? We can help you once we have a sense of what kinds of problems you're running into.
Thanks for your reply @sampsyo, @rachitnigam,
I have generated `main.sv`, `toplevel.sv`, and `kernel.xml` files from my gemm_calyx file with `@external` ports.
I am trying to generate the `.xo` file from the three files above using
vivado -mode batch -source my_path/gen_xo.tcl -tclargs xclbin/kernel.xo
However, the last command line in the `gen_xo.tcl` file,
package_xo -xo_path ${xoname} -kernel_name Toplevel -ip_directory ${path_to_packaged} -kernel_xml $path_to_kernel_xml/kernel.xml
launched this error:
ERROR: [Common 17-132] XML syntax error: Required attribute(s) missing (my_path/kernel.xml:2)
I also tried 'example/dot-product.futil', but got the same error. Probably it is because of my conversion steps?
calyx -l ./ -b verilog --nested --disable-verify
calyx -l ./ -b xilinx --synthesis --disable-verify --nested
calyx -l ./ -b xilinx-xml --synthesis --disable-verify --nested
Attached please see my `kernel.xml` and `gemm_calyx_external` files:
gemm4x4_git_external.txt
kernel.xml.txt
I'd appreciate any suggestions. Many thanks!
Hey @zzy666666zzy, we've gotten a little busy with things on our end for the moment. Can you dig into the error message a bit more and see what is causing it? We've documented the AXI Generation process so maybe that can be of help. We'll try to get back to you in two weeks or so.
Hi there! Debugging Xilinx toolchain problems can be a huge pain; sorry you've run into trouble here. Here are a couple of thoughts about this:
* Just to check: is your `gen_xo.tcl` script the same as [ours](https://github.com/cucapra/calyx/blob/master/fud/bitstream/gen_xo.tcl)? I'm guessing yes, but just wanted to confirm.
* It looks like you're running `vivado` manually—I imagine you got this command from a `fud` execution, right? Just wanted to make sure the same thing happens when you do, like, `fud --to xclbin` as when running the command manually.
* My best guess is that this has to do with the version of Vivado/Vitis you have. We have been testing against 2019.1. It seems plausible to me that more recent versions of Vivado are expecting a `kernel.xml` in a somewhat different format. We should conform!
* A good way to get a more specific error out of Vivado might be to split up the `kernel.xml` file onto different lines. You'll notice that we generate it with no line breaks; I've pasted a version of your file below that I prettified with `xmllint`. Vivado will give a more specific line number on this file!
Looking at the XML file directly, it seems like we're using a different XML format from what the Xilinx user guide recommends. We generate, for example:
<port>
<name>s_axi_control</name>
<mode>slave</mode>
<range>0x1000</range>
<dataWidth>32</dataWidth>
<portType>addressable</portType>
<base>0x0</base>
</port>
Whereas the Xilinx manual recommends:
<port name="s_axi_control" mode="slave" range="0x1000" dataWidth="32" portType="addressable" base="0x0"/>
That is, it uses XML attributes instead of nested tags. It is possible that Vivado got more strict about this format between 2019 and 2023. We should definitely switch to the recommended format…
I dug into this a little more, and it turns out that our `kernel.xml` generation had actually just broken in the last couple of months! #1529 should fix this.
Hi @sampsyo, thanks a lot for your reply.
I used `fud --to xclbin` to generate the .xo, but got the following error:
$ fud e examples/futil/dot-product.futil -o foo.xclbin --to xclbin
Traceback (most recent call last):
File "/home/zzy666/.local/bin/fud", line 8, in <module>
sys.exit(main())
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/main.py", line 342, in main
exec.run_fud(args, cfg)
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/exec.py", line 105, in run_fud
step()
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/__init__.py", line 51, in __call__
self.output.data = self.func(*self.args)
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/xilinx/xclbin.py", line 82, in package_xo
port_names = list(get_ports(Path(tmpdir) / "kernel.xml"))
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/xilinx/xclbin.py", line 24, in get_ports
if port.attrib["mode"] == "master":
KeyError: 'mode'
It seems this is also because of kernel.xml; for example, an args section in the attribute style looks like:
<args>
<arg name="m_ptr" addressQualifier="1" id="0" port="M_AXI_GMEM" size="0x8" offset="0x10" hostOffset="0x0" hostSize="0x8" type="void*"/>
<arg name="m_ptr1" addressQualifier="1" id="1" port="M_AXI_GMEM" size="0x8" offset="0x1C" hostOffset="0x0" hostSize="0x8" type="void*"/>
<arg name="m_ptr2" addressQualifier="1" id="2" port="M_AXI_GMEM" size="0x8" offset="0x28" hostOffset="0x0" hostSize="0x8" type="void*"/>
</args>
It looks different from the Calyx-generated XML:
<port>
<name>s_axi_control</name>
<mode>slave</mode>
<range>0x1000</range>
<dataWidth>32</dataWidth>
<portType>addressable</portType>
<base>0x0</base>
</port>
`xmllint` is very useful; I just opened the .xml file in my browser and it looks prettier. I am investigating the source code as well, but it takes time to figure out the structure.
Thanks.
Yep, I think that confirms it! The problem I investigated above in https://github.com/cucapra/calyx/pull/1529 appears to be at fault. I'll merge that now; Calyx should then generate the right `kernel.xml` for you.
Thanks for fixing that issue. Now I have a correct XML fed into Vivado and got the .xo file.
Before I input the .xo into v++, I made my .xpfm file, because I am using a different board, and set the frequency to 150 or 100 MHz for the IP core. When I run
$Vpp_PATH/v++ -O3 -t hw --platform=$xpfm_PATH --save-temps --profile.data all:all:all --profile.exec all:all:all -lo $SPTH/kernel.xclbin $SPTH/kernel.xo
I got these errors in the v++ link step about a clock frequency mismatch. Because the clock comes from the PS side, it cannot output exactly 100 MHz, so it outputs 99.99 MHz:
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP0_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m0_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /System_DPA/dpa_mon1/MON_M_AXI(99990000) and /Toplevel_1/m0_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP1_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m1_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /System_DPA/dpa_mon2/MON_M_AXI(99990000) and /Toplevel_1/m1_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP2_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m2_axi(100000000)
When I open the generated Vivado project from v++, the .sv files for the primitives were not imported either.
I tried to use `fud e --to` to run the example `dot_multiply.futil`, and got this error:
[fud] ERROR: `.futil' corresponds to multiple stages: ['futil', 'calyx']. Please provide an explicit stage using --to or --from.
I am a bit confused, as I did use `--to`, but still no luck.
Regarding the missing primitive files, I just changed the path in the .tcl and got them imported.
Another bug is that in the `Toplevel` module, the port names of the instantiated hardware are different.
The port names of my hardware module are:
input logic [31:0] ext_mem0_read_data,
input logic ext_mem0_done,
input logic [31:0] ext_mem1_read_data,
input logic ext_mem1_done,
input logic [31:0] ext_mem2_read_data,
input logic ext_mem2_done,
input logic clk,
input logic reset,
input logic go,
output logic [31:0] ext_mem0_write_data,
output logic [3:0] ext_mem0_addr0,
output logic ext_mem0_write_en,
output logic [31:0] ext_mem1_write_data,
output logic [3:0] ext_mem1_addr0,
output logic ext_mem1_write_en,
output logic [31:0] ext_mem2_write_data,
output logic [3:0] ext_mem2_addr0,
output logic ext_mem2_write_en,
output logic done
In the top level, the instantiated port names are:
.clk(ap_clk),
.done(kernel_done),
.ext_mem0__addr0(ext_mem0__addr0),
.ext_mem0__clk(),
.ext_mem0__done(ext_mem0__done),
.ext_mem0__read_data(ext_mem0__read_data),
.ext_mem0__write_data(ext_mem0__write_data),
.ext_mem0__write_en(ext_mem0__write_en),
.ext_mem1__addr0(ext_mem1__addr0),
.ext_mem1__clk(),
.ext_mem1__done(ext_mem1__done),
.ext_mem1__read_data(ext_mem1__read_data),
.ext_mem1__write_data(ext_mem1__write_data),
.ext_mem1__write_en(ext_mem1__write_en),
.ext_mem2__addr0(ext_mem2__addr0),
.ext_mem2__clk(),
.ext_mem2__done(ext_mem2__done),
.ext_mem2__read_data(ext_mem2__read_data),
.ext_mem2__write_data(ext_mem2__write_data),
.ext_mem2__write_en(ext_mem2__write_en),
.go(kernel_start),
.reset(reset || memories_sent)
There are two underscores (`__`), so the `Toplevel` cannot be synthesized either.
@zzy666666zzy I think the double underscores are coming from the `discover-external` pass. If you look at the output of:
calyx -p discover-external <file>
you'll notice that the inferred memories have an `_` at the end, which is causing this problem. This is because, when discovering the memory names, we look at port prefixes and extract the one that ends up producing all the ports on the memory primitive.
One quick solution to this is adding a `strip-suffix` flag to the discover-external pass and having it remove the extra `_`.
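Until such a flag exists, one manual workaround would be a thin adapter module that exposes the double-underscore names the generated Toplevel expects and forwards them to the kernel's single-underscore ports. This is a sketch only, assuming the kernel was renamed to main as earlier in the thread and has the port list shown above:

module kernel_shim (
  input  logic        clk,
  input  logic        reset,
  input  logic        go,
  output logic        done,
  // double-underscore names, as instantiated by the generated Toplevel
  input  logic [31:0] ext_mem0__read_data,
  input  logic        ext_mem0__done,
  output logic [31:0] ext_mem0__write_data,
  output logic [3:0]  ext_mem0__addr0,
  output logic        ext_mem0__write_en,
  input  logic [31:0] ext_mem1__read_data,
  input  logic        ext_mem1__done,
  output logic [31:0] ext_mem1__write_data,
  output logic [3:0]  ext_mem1__addr0,
  output logic        ext_mem1__write_en,
  input  logic [31:0] ext_mem2__read_data,
  input  logic        ext_mem2__done,
  output logic [31:0] ext_mem2__write_data,
  output logic [3:0]  ext_mem2__addr0,
  output logic        ext_mem2__write_en
);
  // Forward everything to the kernel's single-underscore ports.
  main kernel (
    .clk(clk), .reset(reset), .go(go), .done(done),
    .ext_mem0_read_data(ext_mem0__read_data),
    .ext_mem0_done(ext_mem0__done),
    .ext_mem0_write_data(ext_mem0__write_data),
    .ext_mem0_addr0(ext_mem0__addr0),
    .ext_mem0_write_en(ext_mem0__write_en),
    .ext_mem1_read_data(ext_mem1__read_data),
    .ext_mem1_done(ext_mem1__done),
    .ext_mem1_write_data(ext_mem1__write_data),
    .ext_mem1_addr0(ext_mem1__addr0),
    .ext_mem1_write_en(ext_mem1__write_en),
    .ext_mem2_read_data(ext_mem2__read_data),
    .ext_mem2_done(ext_mem2__done),
    .ext_mem2_write_data(ext_mem2__write_data),
    .ext_mem2_addr0(ext_mem2__addr0),
    .ext_mem2_write_en(ext_mem2__write_en)
  );
endmodule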
@zzy666666zzy I've added this feature in a pull request: https://github.com/cucapra/calyx/pull/1534. Once it's merged, you can rebuild the compiler and add the flag to remove the extra `_` from the suffix of the inferred memories.
Hi @rachitnigam,
Many thanks for the answers so far. We are very close to our objective.
We are trying to avoid using XRT because we would like to target a standalone embedded platform without an operating system. So we need to configure the AXI-wrapped IP core manually on the ARM core side (Vitis or Vivado SDK) to send data to the IP core and get the result. Our case still sticks to GEMM, where we instantiate two input matrices on the ARM side and get the multiplication result from the IP core. Now we have the block design (screenshot attached), with the mapped addresses (screenshot attached).
So I am trying to send some data to `m1_axi` and `m2_axi`. I found the address offsets in the Toplevel AXI Verilog code; for example, the offset of `m1_axi` is 32 in decimal (20 in hex).
Accordingly, in the SDK main function I use two pointers:
volatile int *slave_addr_m1 = (int *) 0x43C00020;
volatile int *slave_addr_m2 = (int *) 0x43C00028;
pointing to `m1_axi` and `m2_axi`.
Control_axi inst_control_axi (
....
.ext_mem0_(ext_mem0_),//BASE_ADDRESS of Master0
.ext_mem1_(ext_mem1_),//BASE_ADDRESS of Master1
.ext_mem2_(ext_mem2_),//BASE_ADDRESS of Master2
...
);
Now we use an ILA, and waveforms show that we can configure `s_control_axi`. But we don't know how to send our matrices to `m1_axi` and `m2_axi` from the ARM core.
We have no idea what the offsets of `m0_axi_RDATA` and `m0_axi_ARADDR` are, and couldn't configure the registers accordingly.
I would appreciate any solution for my issue.
Many thanks!
Cool; awesome that you're making progress!!
I don't know too much about constructing an AXI interface from the ARM core side, but I have two tips that might be helpful:
* The `kernel.xml` file we were discussing above is a good "source of truth" for AXI memory address offsets. In fact, the reason it exists is to communicate these details to XRT—but you can use them too when constructing your AXI interface manually. For example, you might see a port declaration that looks like `<arg name="m_ptr" addressQualifier="1" id="0" port="M_AXI_GMEM" size="0x8" offset="0x10" hostOffset="0x0" hostSize="0x8" type="void*"/>`. That means the "argument" (blob of bytes that is an input or output) named `m_ptr` goes on the AXI port named `M_AXI_GMEM`, which is declared elsewhere, and that its address range begins at `0x10`. Does reading `kernel.xml` in this way help?
Hi @rachitnigam @sampsyo,
Thanks for the suggestions; I have probably managed to configure the addresses and offsets of the IP core.
I am trying to verify a simple 4x4 GEMM. When I use the ILA to observe the data transferred from the ARM to the IP, I found two issues:
1. I initialize the integer elements of matrix A as
0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3
and matrix B as
0,1,2,3
0,1,2,3
0,1,2,3
0,1,2,3
and transfer them to the IP. Looking at the transferred data for the 2nd row of matrix A, each 32-bit datum was duplicated up to 512 bits in each handshake (screenshot attached).
Also, for the 1st row of matrix B, the duplicated data contains two elements, but I think each handshake should transfer just one element (screenshot attached).
This data transfer seems not ideal. Maybe I should modify my host code on the ARM side?
2. I try to catch the `ap_done` signal, indicating that the computation is finished, but the host program hangs at `if((*slaveaddr_ap_done)==1)` in my code. I think it is because of `SEND_TO_HOST_DONE` and related signals in the AXI Toplevel.sv.
I attached my host code and the address info from the Vivado address editor here for a better understanding of my confusion: helloworld.txt
I would really appreciate any advice. Thanks a lot!
Hi @sampsyo @rachitnigam,
Many thanks for all the suggestions, from behavioral simulation to on-board debugging.
I have finally managed to run a matrix multiplication example using bare-metal ARM code without XRT and .xclbin. I think it will somehow benefit the low-power embedded system community.
I think we can close this issue nicely, and I will keep you updated when I successfully integrate Calyx into my front-end infrastructure. Hopefully we can have more in-depth discussions soon!
Thanks a lot.
That's incredible! Thank you for pushing the Calyx design through. Please feel free to open more issues if you need any help from us!
Indeed; awesome! We would love to hear how this project proceeds if you ever have any updates to share. :smiley:
Hi there,
Thanks for taking the time to read my issue.
I am trying to use CIRCT and Calyx to lower my MLIR, which is a simple four-element vector addition.
I used the following lowering pipeline to get Verilog:
./mlir-opt --lower-affine
./circt-opt --lower-scf-to-calyx --canonicalize --verify-diagnostics
./circt-translate --export-calyx
futil -l ./ -b verilog --synthesis --disable-init --nested
After getting the Verilog, I exported it into Vivado to do the simulation. I noticed a couple of issues impeding my simulation:
I attached my MLIR, the generated Verilog, and my test bench here. Hope they are easy to understand. I am looking forward to your reply! va_for_loop.mlir.txt va_for_loop_calyx_naive.sv.txt va_1x4_tb.sv.txt
Many thanks.