Open zzy666666zzy opened 1 year ago
This issue can be addressed by adding the following Verilog to the destination master port in AXI top-level file
reg [3:0] shift_var;
always @(posedge ACLK) begin
if(memory_mode_state == 3) begin
if(BVALID & BREADY) shift_var <= shift_var + 'd1;
//else if (BVALID & BREADY & shift_var=='d15) shift_var<='d0;
else shift_var <= shift_var;
end
else shift_var <= 0;
end
And replace send_addr_offset
with shift_var
in
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset* 32;
assign WSTRB = {{15{4'h0}}, 4'hF} << send_addr_offset* 4;
Might this bit modification can be realized in Calyx source code.
Reopening this! @zzy666666zzy we were all traveling last week so can look into this a bit more now that we’re back.
Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt
Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0
is the output argument (output ext_mem0
), arg1
and arg2
are input matrix (ext_mem1_*
and 'extmem2'* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size.
For example: inside Control_axi
module, of 4x4 GEMM, the order of the three ports is:
12'h18 : begin
rdata[31:0] <= addr_ext_mem0_[31:0];
end
12'h1c : begin
rdata[31:0] <= addr_ext_mem0_[63:32];
end
12'h20 : begin
rdata[31:0] <= addr_ext_mem2_[31:0];
end
12'h24 : begin
rdata[31:0] <= addr_ext_mem2_[63:32];
end
12'h28 : begin
rdata[31:0] <= addr_ext_mem1_[31:0];
end
12'h2c : begin
rdata[31:0] <= addr_ext_mem1_[63:32];
end
But for 16x16 the order is:
12'h18 : begin
rdata[31:0] <= addr_ext_mem1_[31:0];
end
12'h1c : begin
rdata[31:0] <= addr_ext_mem1_[63:32];
end
12'h20 : begin
rdata[31:0] <= addr_ext_mem0_[31:0];
end
12'h24 : begin
rdata[31:0] <= addr_ext_mem0_[63:32];
end
12'h28 : begin
rdata[31:0] <= addr_ext_mem2_[31:0];
end
12'h2c : begin
rdata[31:0] <= addr_ext_mem2_[63:32];
end
The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control
.
Is there any approach to avoid the shuffled sequence of addr_ext_mem*_
?
Thanks a lot.
Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt
Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached),
arg0
is the output argument (outputext_mem0
),arg1
andarg2
are input matrix (ext_mem1_*
and 'extmem2'* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: insideControl_axi
module, of 4x4 GEMM, the order of the three ports is:12'h18 : begin rdata[31:0] <= addr_ext_mem0_[31:0]; end 12'h1c : begin rdata[31:0] <= addr_ext_mem0_[63:32]; end 12'h20 : begin rdata[31:0] <= addr_ext_mem2_[31:0]; end 12'h24 : begin rdata[31:0] <= addr_ext_mem2_[63:32]; end 12'h28 : begin rdata[31:0] <= addr_ext_mem1_[31:0]; end 12'h2c : begin rdata[31:0] <= addr_ext_mem1_[63:32]; end
But for 16x16 the order is:
12'h18 : begin rdata[31:0] <= addr_ext_mem1_[31:0]; end 12'h1c : begin rdata[31:0] <= addr_ext_mem1_[63:32]; end 12'h20 : begin rdata[31:0] <= addr_ext_mem0_[31:0]; end 12'h24 : begin rdata[31:0] <= addr_ext_mem0_[63:32]; end 12'h28 : begin rdata[31:0] <= addr_ext_mem2_[31:0]; end 12'h2c : begin rdata[31:0] <= addr_ext_mem2_[63:32]; end
The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through
s_axi_control
.Is there any approach to avoid the shuffled sequence of
addr_ext_mem*_
?Thanks a lot.
I updated the calyx repo, the shuffled sequence of ext_mem*
seems being solved.
Hi,
Following the previous issue #1470, we have managed to run a 4x4 GEMM on a Standalone (bared-metal) platform.
When we tried bigger matrix sizes (8x8, 16x16), I printed the output matrix from the ARM sides, the IP core only gave us 16 correct elements (2 rows of correct elements for 8x8, 1 row of correct elements for 16x16). The rest elements are all zeros:
The correct output should be
![1687208241004](https://github.com/cucapra/calyx/assets/35866553/e3143469-e9b4-4033-8401-93d4750fa03b)
When I delve into the generated Toplevel AXI verilog code, I found some clues.
In our case, 'm1' outputs the resultant matrix. We go to the module
Memory_controller_axi_1
, and look at this snippetbram_read_data
is the result from the core GEMM logic (ext_mem0__write_data) and always produces correct results.BUT, when we try bigger matrix sizes and
send_addr_offset equals
is bigger than 16,WDATA
is always zero because it left shifts (>16)x32 bit which is over 512-b. Previously, we could get the correct result from 4x4 because we luckily saturated the 512-b with 16 resultant elements.I am sure the core logic without the AXI wrapper can output correct results, we have tested 4x4 to 64x64. We have changed
discover-external:default=xxx
according to our matrix sizes. From the generated AXI verilog code we can tell thatdiscover-external:default=xxx
affectssend_addr_offset
and thereby affects the number of shifts. I attached the MLIR and the AXI code here for convenient reference. gemm8x8.mlir.txt gemm8x8_wrapper.sv.txtAlso the doubled under-score issue in
main kernel_inst
still exists, signal inconsistency leads to synthesis failure.Appreciate it if there is any solution for returning 0s and the doubled underscore issue.
Many thanks.