calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License
465 stars 47 forks source link

AXI returns unexpected 0s #1576

Open zzy666666zzy opened 1 year ago

zzy666666zzy commented 1 year ago

Hi,

Following the previous issue #1470, we have managed to run a 4x4 GEMM on a Standalone (bared-metal) platform.

When we tried bigger matrix sizes (8x8, 16x16), I printed the output matrix from the ARM sides, the IP core only gave us 16 correct elements (2 rows of correct elements for 8x8, 1 row of correct elements for 16x16). The rest elements are all zeros: 1687208165571 The correct output should be 1687208241004

When I delve into the generated Toplevel AXI verilog code, I found some clues.

In our case, 'm1' outputs the resultant matrix. We go to the module Memory_controller_axi_1, and look at this snippet

reg [6:0] send_addr_offset;
...
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset * 32;

bram_read_data is the result from the core GEMM logic (ext_mem0__write_data) and always produces correct results.

BUT, when we try bigger matrix sizes and send_addr_offset equals is bigger than 16, WDATA is always zero because it left shifts (>16)x32 bit which is over 512-b. Previously, we could get the correct result from 4x4 because we luckily saturated the 512-b with 16 resultant elements.

I am sure the core logic without the AXI wrapper can output correct results, we have tested 4x4 to 64x64. We have changed discover-external:default=xxx according to our matrix sizes. From the generated AXI verilog code we can tell that discover-external:default=xxx affects send_addr_offset and thereby affects the number of shifts. I attached the MLIR and the AXI code here for convenient reference. gemm8x8.mlir.txt gemm8x8_wrapper.sv.txt

Also the doubled under-score issue in main kernel_inst still exists, signal inconsistency leads to synthesis failure.

Appreciate it if there is any solution for returning 0s and the doubled underscore issue.

Many thanks.

zzy666666zzy commented 1 year ago

This issue can be addressed by adding the following Verilog to the destination master port in AXI top-level file

reg [3:0] shift_var;
always @(posedge ACLK) begin
if(memory_mode_state == 3) begin
    if(BVALID & BREADY) shift_var <= shift_var + 'd1;
    //else if (BVALID & BREADY & shift_var=='d15) shift_var<='d0;
    else shift_var <= shift_var;
end 
else shift_var <= 0;
end

And replace send_addr_offset with shift_var in

    assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset* 32;
    assign WSTRB = {{15{4'h0}}, 4'hF} << send_addr_offset* 4;

Might this bit modification can be realized in Calyx source code.

rachitnigam commented 1 year ago

Reopening this! @zzy666666zzy we were all traveling last week so can look into this a bit more now that we’re back.

zzy666666zzy commented 1 year ago

Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'extmem2'* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

zzy666666zzy commented 1 year ago

Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'extmem2'* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

I updated the calyx repo, the shuffled sequence of ext_mem* seems being solved.