Simulation Verilog - Githubissues

zzy666666zzy commented 1 year ago

Hi there,

Thanks for taking time reading my issue.

I am trying to use CIRCT and Calyx to lower my MLIR which is a simple four-element vector addition.

I used the following lowering pipeline to get Verilog:

./mlir-opt --lower-affine ./circt-opt --lower-scf-to-calyx --canonicalize --verify-diagnostics ./circt-translate --export-calyx futil -l ./ -b verilog --synthesis --disable-init --nested

After getting Verilog, I exported the Verilog code into Vivado to do the simulation. I noticed a couple of issues impeding my simulation:

There are three input and three output ports, which is supposed to be two input and one output according to my MLIR. But then I figured out only two input ports read data from outside and only one output has the driver, equivalent to two input and one output, making sense to me at this stage.
The address signal of the output port is a binary signal without bit-width, meaning the output port cannot write data to external memory. So I couldn't get the correct results from my test bench.
There are redundant finite state machine signals and assign statements, is it possible to make the Verilog more compact and readable?

I attached my MLIR, generated Verilog, and my test bench here. Hope they are easy to understand. I am looking forwards to your reply! va_for_loop.mlir.txt va_for_loop_calyx_naive.sv.txt va_1x4_tb.sv.txt

Many thanks.

rachitnigam commented 1 year ago

Digging into the example a bit, here are some answers:

When you define a memory as an argument (input or output), the CIRCT Calyx compiler will implement it by taking all the ports of a std_mem_d1 primitive and putting them into port arguments of the containing component. This is what you're seeing in the example: your three memories get turned into lots of ports which correspond to the memories. Calyx supports the ref keyword to make this easier to implement and optimize but the CIRCT compiler currently does not use it
The output signal is a binary because your memory is a 1-element memory which means a binary signal is sufficient to address it. If you change your example to have larger memories, Calyx with generate the appropriate signals.
Yes, remove the --nested flag from the final command. This will generate wires that track the FSM signals better.

I haven't looked into your testbench yet but let me know if these answers help? If your TB simulates, can you send me the VCD file so I can take a closer look?

zzy666666zzy commented 1 year ago

Thanks a lot for the reply.

va_for_loop.mlir.txt Probably my MLIR file is incorrect. As I would like to realize 4-element vector-add, but IO memories only contains one element, e.g., %arg0: memref<1xi32, 1>, %arg1: memref<1xi32, 1>, %arg2: memref<1xi32, 1>.

I am trying to do behavior simulation of the generated Verilog in Vivadova_for_loop_calyx_naive.sv.txt. It is little vague which port is input and output. But nevertheless I found 'ext_mem1_write_data' and 'ext_mem2_write_data' don't have drivers and deliver 'xxxx_xxxx'. So I think these two ports should be input, and 'ext_mem0_write_data' is the output port.

I input one value each clock cycle, meaning that I input four values '0,1,2,3' to 'ext_mem1_read_data' and '0,2,4,6' to 'ext_mem2_read_data' in four clock cycles. But I couldn't get correct results from 'ext_mem0_write_data'. So I just wonder if I didn't stimulate the input improperly. As I am using Vivado, so this is the waveform file I got. In case of you cannot open it, I also attached the screenshot here. va_1x4_tb_behav.zip

Looking forwards to your reply.

rachitnigam commented 1 year ago

Ah, thanks for the simulation trace. Couple more pointers for you:

Yup! That's exactly why you have only one element in your memory. If you change it to memref<4xi32>, it should create an array with 4 elements.
The primitives are implement using the ports in std_mem_d1. You can look at the Verilog for that module to get a sense of how the ports should be used.
read_data is the "output" port of a memory and combinationally updates when you provide a value to addr0. The write_data port writes a value into the address specified by addr0 only when write_en is also asserted.

The reason you might be having a hard time simulating this design appropriately is because there are no memories in the program–your test bench needs to be one that models the behavior of the memories and maintain the interface with specified by std_mem_d1.

If I may, I really recommend going through the language tutorial to get a sense of how we simulate programs with Calyx's default testbench. In the meantime, I'm going to go and attempt to write a testbench for you to and see if I can make your design work.

rachitnigam commented 1 year ago

Okay, I've managed to create a test bench that works and demonstrates the correct result. To run it, I used Icarus verilog:

iverilog -g2012 va-tb.sv && vvp ./a.out

I think you might be able to use the Vivado simulator with it too. The big things to note:

arg0 is the output memory and is connected to the ext_mem0_* ports. Its read interface (read_data) is unconnected and produces 'x because the memory is only used as an output memory
The other two memories, arg1 and arg2 are the input memories and their write interface (write_en, write_data) is unconnected and produces 'x.
The trick to making the design work was instantiating the std_mem_d1 primitives in the testbench and leaving it to the DUT to control it through its ports.

Take a look and see if you can make it work and feel free to ask questions!

vadd.zip

zzy666666zzy commented 1 year ago

Many thanks for the help.

The Verilog code seems not the same with my uploaded MLIR. Can I ask what is your lowering pipelines get the Verilog you sent just now?

zzy666666zzy commented 1 year ago

Oh I see, hardware primitives are gathered in one .sv file. I will try the testbench.

Thanks a lot!

zzy666666zzy commented 1 year ago

Hi there,

Thanks for the previous explanation. Vector-add examples work correctly now and move to matrix-multiplication.

The generated Verilog seems cannot recognize each row of the output matrix. For example we are trying 4x4,

x the correct result should be: ` But the code give us the result: 0,0,0,0
0,4,12,24
24,32,48,72
72,84,108,144

`

Elements of the output matrix keep accumulating until the last one. I think the reason is that fsm are not mapped correctly from MLIR to Verilog. I added a Verilog snipped in the module, and revised the RAM in core.sv and solved this issue for any size of matrix multiplications.

The added Verilog in the module (4x4 gemm for example):

always_ff @(posedge clk or posedge reset)begin
if (reset)
we_cntr<=32'b0;
else if (we_cntr<'d4 && ext_mem0_write_en) //length of each MAC we_cntr<=we_cntr+1'b1; else if (we_cntr=='d4) //length of each MAC we_cntr<='b0;
else we_cntr<=we_cntr; end

Revised BRAM snippet: assign read_data = we_cntr=='b0 ? 'b0 : mem[addr0]; So that we can make the first element to zero before the multiplication.

Although the manual modification can make it work, it would hinder the entire automation pipelines. I just wonder is there any better method to address this issue if it is because of Calyx? I attached verilog files and MLIR there.

The second question is that how can I wrap ext_mem0read and ext_mem0write with AXI interface using the Calyx tool? Do I need to change the IO to @external in calyx dialect?

The questions might seems little annoying, appreciate you taking time to read it.

Many thanks!

gemm4x4.zip

sampsyo commented 1 year ago

Very interesting! It would be good to narrow down whether this bug is on the CIRCT side (translating to Calyx) or on the Calyx side (compiling Calyx IL to Verilog). I see the zip archive there has the original (affine dialect) MLIR cod and the Verilog—any chance you could generate the Calyx IL code that results from the CIRCT-side compilation so we can look at that?

If you're interested, it's possible to execute Calyx code directly with our debugger, which would help distinguish whether the Calyx code is wrong or whether it's being compiled wrong.

The second question is that how can I wrap ext_mem0read and ext_mem0write with AXI interface using the Calyx tool? Do I need to change the IO to @external in calyx dialect?

Wrapping memories in an AXI interface in Calyx is currently under the umbrella of our support for the Xilinx toolchain. You can enable that backend by passing -b xilinx to the compiler, like this as an example:

cargo run -- -b xilinx examples/futil/dot-product.futil

That produces Verilog with AXI wrappers for the @external memories.

zzy666666zzy commented 1 year ago

Hi professor, Calyx_IL.zip

Thanks a lot for your fast reply.

Attached please see our Calyx IL and other CIRCT dialect files (memref.scf).

I will study the debugger to investigate where the bug occurs.

That produces Verilog with AXI wrappers for the @external memories.

Our Calyx IL does not have @external memories that is the point I am curious about. Without '@external', the error was launched: Error: Program has no memories marked with attribute @external. Please make sure that at least one memory is marked as @external. when running cargo run -- -b

Many thanks.

sampsyo commented 1 year ago

Our Calyx IL does not have @external memories that is the point I am curious about. Without '@external', the error was launched:

Ah, I see now! I didn't previously quite grok that the Calyx program didn't have any memories at all; they have already been "externalized." This is a little hard to explain, but bear with me: if you look at the (native) Calyx code, you'll see that the top-level component (mlir_funcSYCL_class_mxm_kernel) has a bunch of input ports like ext_mem0_read_data, ext_mem0_done: several ports for each memory. The idea is that the module is assuming someone else has instantiated these memories and is handing over all their ports to access them—for example, that's exactly what @rachitnigam's testbench above provided.

In the native Calyx ecosystem, we achieve the same effect by declaring @external memories. These "look like" real memories but actually get turned into the same thing as above during compilation: there is no real memory, and instead someone "external" provides the interface ports to the memories. It looks like this: https://github.com/cucapra/calyx/blob/d03d22ffb3f5b13d3091e2d3e82d8fb3d0963f54/examples/tutorial/language-tutorial-iterate.futil#L5

It looks like the SCFToCalyx pass is responsible for creating the former "add a bunch of ports" style of code, as described by the MemoryPortsImpl variant of MemoryInterface: https://github.com/llvm/circt/blob/7bbf7adc882686ea337d4ebb371cf204f6b04d73/include/circt/Dialect/Calyx/CalyxLoweringUtils.h#L87-L93

It seems to me that we might want to change this pass to use the @external memory style instead. It would ultimately compile to the same thing, and it would let us do stuff in Calyx that is currently impossible: most relevantly here, running the debugger and using the Xilinx backend to generate AXI wrappers for the memories.

rachitnigam commented 1 year ago

Hey @zzy666666zzy, I'm working on a pass to automate the transformation of ports into @external cells which will allow you to use the Cider and fud tools in this repo: https://github.com/cucapra/calyx/pull/1488

I'll ping the thread once its done with instructions for you to try it out. Hopefully, this will also make it easier to debug the bug in the program above which likely caused by CIRCT lowering.

rachitnigam commented 1 year ago

Okay, the pass I linked above is working! Once it is merged, you can run the following command to extract cells marked with @external from your program:

cargo run -- gemm4x4_calyx.txt  -p validate -p discover-external -p validate -m file -x discover-external:default=16

The cargo run command rebuilds the compiler from source in the repository. If you're using the crate, you'll have to update it first and then run the compiler:

cargo install calyx
calyx gemm4x4_calyx.txt  -p validate -p discover-external -p validate -m file -x discover-external:default=16

(This should update to version 0.2.1 of the compiler)

You'll notice that your get the following warnings:

[WARN  calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem0_ in std_mem_d1, defaulting to 4
[WARN  calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem2_ in std_mem_d1, defaulting to 4
[WARN  calyx_opt::passes::discover_external] Unable to infer parameter value for ext_mem1_ in std_mem_d1, defaulting to 4

This is because the pass has no way of inferring the size of the memory for you. We provide the -x discover-external:default=16 to say that when a parameter to a cell (in this case, the size of the memory) cannot be inferred, please use the value 16 which works for your example since your memory has 4x4=16 elements.

If you look at the output of the pass, you can see the we've removed a lot of the ports and added a bunch of new cells marked with @external:

cells {
    ...
    @external @generated ext_mem1_ = std_mem_d1(32, 4, 4);
    @external @generated ext_mem0_ = std_mem_d1(32, 4, 4);
    @external @generated ext_mem2_ = std_mem_d1(32, 4, 4);
}

These are exactly the cells we need to make the program work with the AXI generator and the debugger! Next, we need to specify some data using Calyx testbench's data format which is just a JSON file:

{
"ext_mem0_":{
  "data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  "format":{"numeric_type":"bitnum","is_signed":false,"width":32}},
"ext_mem1_":{
  "data":[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],
  "format":{"numeric_type":"bitnum","is_signed":false,"width":32}},
"ext_mem2_":{"data":[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
  "format":{"numeric_type":"bitnum","is_signed":false,"width":32}}
}

Then, I ran the following fud command to simulate the program through icarus-verilog:

fud e -s verilog.data data.json --to dat --through icarus-verilog out.futil

Sadly, fud will complain that the main module is missing. This is because the Calyx testbench hardcodes main as the name of the DUT. To fix this, I renamed the top module from mlir_funcSYCL_class_mxm_kernel to main and this allows fud to generate the output:

{"cycles":1940,"memories":{"ext_mem0_":[0,0,0,0,0,4,12,24,24,32,48,72,72,84,108,144],"ext_mem1_":[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],"ext_mem2_":[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]}}

This reproduces the erroneous output you are getting. I'm going wrap this comment up at this point but will report back once I find the issue! Hopefully this is helpful.

Please ask questions if you run into any trouble understanding this!

rachitnigam commented 1 year ago

Okay, I've figured out the bug! The problem is this group:

    group assign_while_0_init_1 {
      while_0_arg1_reg.in = load_0_reg.out; // incorrect init
      while_0_arg1_reg.write_en = 1'd1;
      assign_while_0_init_1[done] = while_0_arg1_reg.done;
    }

Note that it resets the values of the while_0_arg1_reg to the value of load_0_reg. This is incorrect and if we instead make it set the value to 0, the program generates the expected out:

    group assign_while_0_init_1 {
      while_0_arg1_reg.in = 32'd0;
      while_0_arg1_reg.write_en = 1'd1;
      assign_while_0_init_1[done] = while_0_arg1_reg.done;
    }

Running fud, we get the correct answer:

{"cycles":2068,"memories":{"ext_mem0_":[0,0,0,0,0,4,8,12,0,8,16,24,0,12,24,36],"ext_mem1_":[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3],"ext_mem2_":[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]}}

We'll still have to figure out where the CIRCT compilation introduces this bug...

matth2k commented 1 year ago

Can we double check the input MLIR? This is what I have from the thread:

// Is this IR correct?
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
  affine.for %arg3 = 0 to 4 {
    affine.for %arg4 = 0 to 4 {
      %0 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
      %1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %0) -> (i32) {
        %2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
        %3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %4, %arg6 : i32
        affine.store %5, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
        affine.yield %5 : i32
      }
    }
  }
  return
}

This is two attempts at rewriting:

// The above didn't make sense to me. This is how I rewrote the gemv kernel
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
  affine.for %arg3 = 0 to 4 {
    affine.for %arg4 = 0 to 4 {
      // Start the inner product at constant 0
      %c0_i32 = arith.constant 0 : i32
      %1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %c0_i32) -> (i32) {
        %2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
        %3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %4, %arg6 : i32
        affine.yield %5 : i32
      }
      // Assign out[%arg3][%arg4] with inner product result
      affine.store %1, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
    }
  }
  return
}

// Can use repeated loads and stores instead of carrying an iterArg
func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
  affine.for %arg3 = 0 to 4 {
    affine.for %arg4 = 0 to 4 {
      // Start the inner product at constant 0
      %c0_i32 = arith.constant 0 : i32
      affine.store %c0_i32, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
      affine.for %arg5 = 0 to 4 {
        %2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
        %3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
        %4 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
        %5 = arith.muli %2, %3 : i32
        %6 = arith.addi %5, %4 : i32
        affine.store %6, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
      }
    }
  }
  return
}

We can use the linalg dialect and -convert-linalg-to-affine-loops to make it less confusing to check too:

func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<4x4xi32, 1>, %arg1: memref<4x4xi32, 1>, %arg2: memref<4x4xi32, 1>) {
    affine.for %arg3 = 0 to 4 {
      affine.for %arg4 = 0 to 4 {
        affine.for %arg5 = 0 to 4 {
          %0 = affine.load %arg0[%arg3, %arg5] : memref<4x4xi32, 1>
          %1 = affine.load %arg1[%arg5, %arg4] : memref<4x4xi32, 1>
          %2 = affine.load %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
          %3 = arith.muli %0, %1 : i32
          %4 = arith.addi %2, %3 : i32
          affine.store %4, %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
        }
      }
    }
    return
  }

rachitnigam commented 1 year ago

Oh hey, thats a good catch @matth2k! @zzy666666zzy, it seems the input MLIR code is wrong because it does not correctly load the accumulation variable causing the downstream problem.

rachitnigam commented 1 year ago

I've also written up some documentation for two new passes that should make it easier to interact with the Calyx compiler when using CIRCT generated code! https://docs.calyxir.org/fud/circt.html#using-native-tools-with-mlir-generated-calyx

zzy666666zzy commented 1 year ago

Hi @matth2k,

Thanks for the corrected MLIR. I have tried the provided three MLIR, only

func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
  affine.for %arg3 = 0 to 4 {
    affine.for %arg4 = 0 to 4 {
      // Start the inner product at constant 0
      %c0_i32 = arith.constant 0 : i32
      %1 = affine.for %arg5 = 0 to 4 iter_args(%arg6 = %c0_i32) -> (i32) {
        %2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
        %3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %4, %arg6 : i32
        affine.yield %5 : i32
      }
      // Assign out[%arg3][%arg4] with inner product result
      affine.store %1, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
    }
  }
  return
}

works for my current testbench.

This version

func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<16xi32, 1>, %arg1: memref<16xi32, 1>, %arg2: memref<16xi32, 1>) {
  affine.for %arg3 = 0 to 4 {
    affine.for %arg4 = 0 to 4 {
      // Start the inner product at constant 0
      %c0_i32 = arith.constant 0 : i32
      affine.store %c0_i32, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
      affine.for %arg5 = 0 to 4 {
        %2 = affine.load %arg2[%arg5 * 4 + %arg4] : memref<16xi32, 1>
        %3 = affine.load %arg1[%arg5 + %arg3 * 4] : memref<16xi32, 1>
        %4 = affine.load %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
        %5 = arith.muli %2, %3 : i32
        %6 = arith.addi %5, %4 : i32
        affine.store %6, %arg0[%arg4 + %arg3 * 4] : memref<16xi32, 1>
      }
    }
  }
  return
}

writes back resultant BRAM with all zeros.

And this version:

func.func @mlir_funcSYCL_class_mxm_kernel(%arg0: memref<4x4xi32, 1>, %arg1: memref<4x4xi32, 1>, %arg2: memref<4x4xi32, 1>) {
    affine.for %arg3 = 0 to 4 {
      affine.for %arg4 = 0 to 4 {
        affine.for %arg5 = 0 to 4 {
          %0 = affine.load %arg0[%arg3, %arg5] : memref<4x4xi32, 1>
          %1 = affine.load %arg1[%arg5, %arg4] : memref<4x4xi32, 1>
          %2 = affine.load %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
          %3 = arith.muli %0, %1 : i32
          %4 = arith.addi %2, %3 : i32
          affine.store %4, %arg2[%arg3, %arg4] : memref<4x4xi32, 1>
        }
      }
    }
    return
  }

seems change the destination pointer from %arg0 to %arg2, so input and output ports in testbench should be changed accordingly. I have managed to use the working version to generate AXI wrapped Top-module. Thanks a lot.

Thanks @rachitnigam for the complemented document, can I ask if it is possible to avoid using XRT, but to use standalone (baremetal) and a .tcl script to generate a block design and integrate it with ARM CPU?

sampsyo commented 1 year ago

(Just wanted to say AMAZING WORK to @rachitnigam for improvising those extra passes to help debug this! Really really cool that it was possible to "recover" @external memories from what CIRCT is currently producing.)

can I ask if it is possible to avoid using XRT, but to use standalone (baremetal) and a .tcl script to generate a block design and integrate it with ARM CPU?

Good question. Looking at the generated Verilog, is there anything in there that prevents this use case? That is, our current AXI-wrapping pass is indeed meant to be used from XRT, but it "mostly" just consists of AXI interfaces to the relevant memories (and control signals). It seems like this could be a good starting point for integrating with something else instead, such as a soft ARM core, if it's willing to speak AXI. But are there particular things we're generating there that prevent that for your situation?

rachitnigam commented 1 year ago

Thanks @rachitnigam for the complemented document, can I ask if it is possible to avoid using XRT, but to use standalone (baremetal) and a .tcl script to generate a block design and integrate it with ARM CPU?

I think it should be possible. As @sampsyo said, the code is designed to generate generic AXI controller but uses XRT to generate and interface with the host code. Can you take a shot at using it and seeing what kinds of errors show up? We can help you once we have a sense of what kinds of problems you're running into.

zzy666666zzy commented 1 year ago

Thanks for your reply @sampsyo, @rachitnigam ,

I have generated main.sv, toplevel.sv, and kernel.xml files from my gemm_calyx file with @external ports. I am trying to generate .xo file using vivado -mode batch -source my_path/gen_xo.tcl -tclargs xclbin/kernel.xo from the three files above. However, the last command line package_xo -xo_path ${xoname} -kernel_name Toplevel -ip_directory ${path_to_packaged} -kernel_xml $path_to_kernel_xml/kernel.xml in the gen_xo.tcl file launched this error:

ERROR: [Common 17-132] XML syntax error: Required attribute(s) missing (my_path/kernel.xml:2)

I also tried 'example/dot-product.futil', but got the same error. Probably it is because my conversion ways? calyx -l ./ -b verilog --nested --disable-verify calyx -l ./ -b xilinx --synthesis --disable-verify --nested calyx -l ./ -b xilinx-xml --synthesis --disable-verify --nested

Attached please see my kernel.xml and gemm_calyx_external file. gemm4x4_git_external.txt kernel.xml.txt

Appreciate if there is any suggestion. Many thanks!

rachitnigam commented 1 year ago

Hey @zzy666666zzy, we've gotten a little busy with things on our end for the moment. Can you dig into the error message a bit more and see what is causing it? We've documented the AXI Generation process so maybe that can be of help. We'll try to get back to you in two weeks or so.

sampsyo commented 1 year ago

Hi there! Debugging Xilinx toolchain problems can be a huge pain; sorry you've run into trouble here. Here are a couple of thoughts about this:

Just to check: is your gen_xo.tcl script the same as ours? I'm guessing yes, but just wanted to confirm.
It looks like you're running vivado manually—I imagine you got this command from a fud execution, right? Just wanted to make sure the same thing happens when you do, like, fud --to xclbin as when running the command manually.
My best guess is that this has to do with the version of Vivado/Vitis you have. We have been testing against 2019.1. It seems plausible to me that more recent versions of Vivado are expecting a kernel.xml in a somewhat different format. We should conform!
A good way to get a more specific error out of Vivado might be to split up the kernel.xml file onto different lines. You'll notice that we generate it with no line breaks; I've pasted a version of your file below that I prettified with xmllint. Vivado will give a more specific line number on this file!

prettified kernel.xml

```xml 1 6 Toplevel ip_c capra.cs.cornell.edu:kernel:Toplevel:1.0 0 1 false ap_ctrl_hs s_axi_control slave 0x1000 32 addressable 0x0 m0_axi master 0xFFFFFFFFFFFFFFFF 512 addressable 0x0 m1_axi master 0xFFFFFFFFFFFFFFFF 512 addressable 0x0 m2_axi master 0xFFFFFFFFFFFFFFFF 512 addressable 0x0 timeout 0 0 s_axi_control 0x4 0x010 uint 0x0 0x4 ext_mem0_ 1 1 m0_axi 0x8 0x18 int* 0x0 0x8 ext_mem2_ 1 2 m1_axi 0x8 0x20 int* 0x0 0x8 ext_mem1_ 1 3 m2_axi 0x8 0x28 int* 0x0 0x8 ```

sampsyo commented 1 year ago

Looking at the XML file directly, it seems like we're using a different XML format from what the Xilinx user guide recommends. We generate, for example:

      <port>
        <name>s_axi_control</name>
        <mode>slave</mode>
        <range>0x1000</range>
        <dataWidth>32</dataWidth>
        <portType>addressable</portType>
        <base>0x0</base>
      </port>

Whereas the Xilinx manual recommends:

 <port name="s_axi_control" mode="slave" range="0x1000" dataWidth="32" portType="addressable" base="0x0"/>

That is, it uses XML attributes instead of nested tags. It is possible that Vivado got more strict about this format between 2019 and 2023. We should definitely switch to the recommended format…

sampsyo commented 1 year ago

I dug into this a little more and it turns out that our kernel.xml generation had actually just broken in the last couple of months! #1529 should fix this.

zzy666666zzy commented 1 year ago

Hi there! Debugging Xilinx toolchain problems can be a huge pain; sorry you've run into trouble here. Here are a couple of thoughts about this:

* Just to check: is your `gen_xo.tcl` script the same as [ours](https://github.com/cucapra/calyx/blob/master/fud/bitstream/gen_xo.tcl)? I'm guessing yes, but just wanted to confirm.

* It looks like you're running `vivado` manually—I imagine you got this command from a `fud` execution, right? Just wanted to make sure the same thing happens when you do, like, `fud --to xclbin` as when running the command manually.

* My best guess is that this has to do with the version of Vivado/Vitis you have. We have been testing against 2019.1. It seems plausible to me that more recent versions of Vivado are expecting a `kernel.xml` in a somewhat different format. We should conform!

* A good way to get a more specific error out of Vivado might be to split up the `kernel.xml` file onto different lines. You'll notice that we generate it with no line breaks; I've pasted a version of your file below that I prettified with `xmllint`. Vivado will give a more specific line number on this file!

prettified kernel.xml

Hi @sampsyo, thanks a lot for your reply.

Yes, I used the provided .tcl file just changed the path including our Verilog file.

Yes, I used step-by-step calyx command lines. And also used fud --to xclbin to generate .xo, but got following error:

$ fud e examples/futil/dot-product.futil -o foo.xclbin --to xclbin
Traceback (most recent call last):
File "/home/zzy666/.local/bin/fud", line 8, in <module>
sys.exit(main())
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/main.py", line 342, in main
exec.run_fud(args, cfg)
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/exec.py", line 105, in run_fud
step()
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/__init__.py", line 51, in __call__
self.output.data = self.func(*self.args)
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/xilinx/xclbin.py", line 82, in package_xo
port_names = list(get_ports(Path(tmpdir) / "kernel.xml"))
File "/home/zzy666/.local/lib/python3.9/site-packages/fud/stages/xilinx/xclbin.py", line 24, in get_ports
if port.attrib["mode"] == "master":
KeyError: 'mode'

Seems also because of kernel.xml

I agree, I checked my another xml file example

<args>
<arg name="m_ptr" addressQualifier="1" id="0" port="M_AXI_GMEM" size="0x8" offset="0x10" hostOffset="0x0" hostSize="0x8" type="void*"/>
<arg name="m_ptr1" addressQualifier="1" id="1" port="M_AXI_GMEM" size="0x8" offset="0x1C" hostOffset="0x0" hostSize="0x8" type="void*"/>
<arg name="m_ptr2" addressQualifier="1" id="2" port="M_AXI_GMEM" size="0x8" offset="0x28" hostOffset="0x0" hostSize="0x8" type="void*"/>
</args>

It looks different to calyx generated xml:

  <port>
    <name>s_axi_control</name>
    <mode>slave</mode>
    <range>0x1000</range>
    <dataWidth>32</dataWidth>
    <portType>addressable</portType>
    <base>0x0</base>
  </port>

Thanks, xmllint is very useful. I just open the .xml file in my browser and it looks prettier.

I am investigating the source code as well, but it takes time to figure our the structure.

Thanks.

sampsyo commented 1 year ago

Yep, I think that confirms it! The problem I investigated above in https://github.com/cucapra/calyx/pull/1529 appears to be at fault. I'll merge that now; Calyx should then generate the right kernel.xml for you.

zzy666666zzy commented 1 year ago

Yep, I think that confirms it! The problem I investigated above in #1529 appears to be at fault. I'll merge that now; Calyx should then generate the right kernel.xml for you.

Thanks for fixing that issue. Now I have correct xml fed into vivado and got .xo file.

Before I input .xo into v++, I made my .xpfm file because I am using a different board and set the frequency to 150 or 100 MHz for the IP core. When I run

$Vpp_PATH/v++ -O3 -t hw --platform=$xpfm_PATH --save-temps --profile.data all:all:all --profile.exec all:all:all -lo $SPTH/kernel.xclbin $SPTH/kernel.xo

I got this errors in v++ link about clock frequency mismatch. Because the clock is from PS sides, so it cannot output exact 100MHz, so output 99.99MHz.

ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP0_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m0_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /System_DPA/dpa_mon1/MON_M_AXI(99990000) and /Toplevel_1/m0_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP1_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m1_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /System_DPA/dpa_mon2/MON_M_AXI(99990000) and /Toplevel_1/m1_axi(100000000)
ERROR: [VPL 41-237] Bus Interface property FREQ_HZ does not match between /axi_ic_zynq_ultra_ps_e_0_S_AXI_HP2_FPD/s00_couplers/auto_ds/S_AXI(99990000) and /Toplevel_1/m2_axi(100000000)

When I open the generated Vivado project from v++, .sv files from primitives were not imported either.

I tried to use fun e --to to try the example dot_multiply.futil, and got this error:

[fud] ERROR: `.futil' corresponds to multiple stages: ['futil', 'calyx']. Please provide an explicit stage using --to or --from.

I am bit confused as I used --to, but still no luck.

zzy666666zzy commented 1 year ago

Regarding the primitives missing files, I just changed the path in the .tcl and got them imported.

zzy666666zzy commented 1 year ago

Other bug is that in the Toplevel module, the port of instantiated hardware is different: The ports' name of the my hardware module is:

  input logic [31:0] ext_mem0_read_data,
  input logic ext_mem0_done,
  input logic [31:0] ext_mem1_read_data,
  input logic ext_mem1_done,
  input logic [31:0] ext_mem2_read_data,
  input logic ext_mem2_done,
  input logic clk,
  input logic reset,
  input logic go,
  output logic [31:0] ext_mem0_write_data,
  output logic [3:0] ext_mem0_addr0,
  output logic ext_mem0_write_en,
  output logic [31:0] ext_mem1_write_data,
  output logic [3:0] ext_mem1_addr0,
  output logic ext_mem1_write_en,
  output logic [31:0] ext_mem2_write_data,
  output logic [3:0] ext_mem2_addr0,
  output logic ext_mem2_write_en,
  output logic done

In the top level, the instantiated ports' name is:

        .clk(ap_clk),
        .done(kernel_done),
        .ext_mem0__addr0(ext_mem0__addr0),
        .ext_mem0__clk(),
        .ext_mem0__done(ext_mem0__done),
        .ext_mem0__read_data(ext_mem0__read_data),
        .ext_mem0__write_data(ext_mem0__write_data),
        .ext_mem0__write_en(ext_mem0__write_en),
        .ext_mem1__addr0(ext_mem1__addr0),
        .ext_mem1__clk(),
        .ext_mem1__done(ext_mem1__done),
        .ext_mem1__read_data(ext_mem1__read_data),
        .ext_mem1__write_data(ext_mem1__write_data),
        .ext_mem1__write_en(ext_mem1__write_en),
        .ext_mem2__addr0(ext_mem2__addr0),
        .ext_mem2__clk(),
        .ext_mem2__done(ext_mem2__done),
        .ext_mem2__read_data(ext_mem2__read_data),
        .ext_mem2__write_data(ext_mem2__write_data),
        .ext_mem2__write_en(ext_mem2__write_en),
        .go(kernel_start),
        .reset(reset || memories_sent)

There are two underscores __. So the Toplevel cannot be synthesized as well.

rachitnigam commented 1 year ago

@zzy666666zzy I think the double underscore are coming from the discover-external pass. If you look at the output of:

calyx -p discover-external <file>

You'll notice that the inferred memories have an _ at the end which is causing this problem. This is because when discovering the memory names, we look at port prefixes and extract the one that ends up producing all the ports on the memory primitive.

One quick solution to this is adding a strip-suffix flag to the discover externals pass and have it remove the extra _.

rachitnigam commented 1 year ago

@zzy666666zzy I've added this feature in a pull request: https://github.com/cucapra/calyx/pull/1534. Once it's merged, you can rebuild the compiler and add the flag to remove the extra _ from the suffix of the inferred memories.

zzy666666zzy commented 1 year ago

Hi @rachitnigam ,

Many thanks for the answers so far. We are very close to our objective.

We are trying to avoid using XRT because we would like to target a standalone embedded platform without an operating system. So we need to configure the AXI-wrapped IP core on the ARM core side (Vitis or Vivado SDK) manually to send data to the IP core and get the result. Our case still sticks to GEMM, where we instantiate two input matrices in ARM and get the multiplication result from the IP core. Now we have the block design: 1686176409917 with mapped address: 1686176432487 So I am trying to send some data to m1_axi, m2_axi, I found the address offset in the Toplevel AXI verilog code. For example, offset of m1_axi is 32 in decimal and 20 in hex: 1686176784780 Accordingly, in SDK main function I use two pointers:

volatile int *slave_addr_m1 = (int *) 0x43C00020;
volatile int *slave_addr_m2 = (int *) 0x43C00028;

pointing to m1_axi, m2_axi.

    Control_axi inst_control_axi (
        ....
        .ext_mem0_(ext_mem0_),//BASE_ADDRESS of Master0
        .ext_mem1_(ext_mem1_),//BASE_ADDRESS of Master1
        .ext_mem2_(ext_mem2_),//BASE_ADDRESS of Master2
         ...
    );

Now, we use ILA and waveforms show that we can configure s_control_axi, now. But we don't know how to send our matrices to m1_axi and m2_axi from the ARM core? We don't have idea what the offsets of m0_axi_RDATA and m0_axi_ARADDR are, and couldn't configure the registers accordingly.

I appreciate it if there is any solution for my issue.

Many thanks!

sampsyo commented 1 year ago

Cool; awesome that you're making progress!!

I don't know too much about constructing an AXI interface from the ARM core side, but I have two tips that might be helpful:

The kernel.xml file we were discussing above is a good "source of truth" for AXI memory address offsets. In fact, the reason it exists is to communicate these details to XRT—but you can use them too when constructing your AXI interface manually. For example, you might see a port declaration that looks like <arg name="m_ptr" addressQualifier="1" id="0" port="M_AXI_GMEM" size="0x8" offset="0x10" hostOffset="0x0" hostSize="0x8" type="void*"/>. That means the "argument" (blob of bytes that is an input or output) named m_ptr goes on the AXI port named M_AXI_GMEM, which is declared elsewhere, and that its address range begins at 0x10. Does reading kernel.xml in this way help?
For an example of how to interface directly with the AXI wrapper, you might consider checking out our cocotb-based AXI test harness. Of course, that relies on cool cocotb libraries to implement the AXI communication, but seeing how the interfaces are constructed and used may still be instructive.

zzy666666zzy commented 1 year ago

Hi @rachitnigam @sampsyo ,

Thanks for the suggestions, I have probably managed to configure the address and offsets of the IP core.

I am trying to verify a simple 4x4 GEMM. When I use ILA to observe the data transferred from ARM to the IP, I found two issues: 1. I initialize integer elements in matrix A as

0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3

and matrix B as

0,1,2,3
0,1,2,3
0,1,2,3
0,1,2,3

and transfer it to the IP, Let's see the transferred data of the 2nd row in matrix A. All 32-bit data was duplicated up to 512-bit in each handshake: M1 RDATA signal M1 RDATA1 signal M1 RDATA2 signal

Also, in the 1st row in matrix B, duplicated data contains two elements, but I think each handshake should just transfer one element: M2 RDATA signal .

This data transfer seems not ideal. Maybe I should modify my host code on ARM?

2. I try to catch the ap_done signal, indicating the computing is finished, but the host program hangs in if((*slaveaddr_ap_done)==1) in my code. I think it is because of SEND_TO_HOST_DONE and related signals in AXI Toplevel.sv.

I attached my host code, and address info from Vivado address editor here for a better understanding of my confusion. helloworld.txt 1686588160573

I really appreciate it if there would be some advice. Thanks a lot!

zzy666666zzy commented 1 year ago

Hi @sampsyo @rachitnigam ,

Many thanks for all the suggestions from behavior simulation to on-board debugging.

I have finally managed to run a matrix multiplication example using bare-metal ARM code without XRT and .xlbin. I think it will somehow benefit the low-power embedded system community.

I think we can close this issue nicely and I will keep you updated when I successfully integrate Calyx into my front-end infrastructure. Hopefully, we could have more in-depth discussions soon!

Thanks a lot.

rachitnigam commented 1 year ago

That's incredible! Thank you for pushing the Calyx design through. Please feel free to open more issues if you need any help from us

sampsyo commented 1 year ago

Indeed; awesome! We would love to hear how this project proceeds if you ever have any updates to share. :smiley:

calyxir / calyx

Simulation Verilog #1470