SFU-HiAccel / CHIP-KNN

[TRETS'23, FPT'20] CHIP-KNN: Configurable and HIgh-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs
BSD 2-Clause "Simplified" License

Hardware emulation failed on U200 #3

Open zyt1024 opened 1 year ago

zyt1024 commented 1 year ago

Hello, I am using the U200 platform and have generated the xclbin files for the host program and hardware emulation according to the documented process. However, the following error was reported when running make check. Could you tell me why? Thanks very much! [image]

KennySLiu commented 1 year ago

Hi there, Could you let me know what configuration you were using? This would be in your config.py.

zyt1024 commented 1 year ago

OK, this is my config.py:

#!/usr/bin/python 
# -*- coding: utf-8 -*-
'''---------------------------------------
#       Basic User Configuration
---------------------------------------'''
# KNN Parameters
# N = 4194304
N = 524288
D = 2
Dist = 1 # 0 = Manhattan; 1 = Euclidean
K = 10
# FPGA Platform Specifications (U200 platform)
FPGA_part_name = 'xcu200-fsgd2104-2-e' # xcu200-fsgd2104-2-e = U200; xcu280-fsvh2892-2L-e = U280  
num_SLR = 3
SLR_resource = [{'BRAM':1390, 'DSP':2275, 'FF':746000, 'LUT':365000, 'URAM':320}, \
                {'BRAM':752,  'DSP':1317, 'FF':339000, 'LUT':162000, 'URAM':160}, \
                {'BRAM':1390, 'DSP':2275, 'FF':746000, 'LUT':365000, 'URAM':320}]
# SLR_resource_U200 = [{'BRAM':1390, 'DSP':2275, 'FF':746000, 'LUT':365000, 'URAM':320}, \
#                      {'BRAM':752,  'DSP':1317, 'FF':339000, 'LUT':162000, 'URAM':160}, \
#                      {'BRAM':1390, 'DSP':2275, 'FF':746000, 'LUT':365000, 'URAM':320}]
# SLR_resource_U280 = [{'BRAM':980,  'DSP':2733, 'FF':736000, 'LUT':360000, 'URAM':320}, \
#                      {'BRAM':980,  'DSP':2877, 'FF':710000, 'LUT':352000, 'URAM':320}, \
#                      {'BRAM':1020, 'DSP':2800, 'FF':734000, 'LUT':370000, 'URAM':320}]
memory_type = 'DDR4' # DDR4, HBM2
num_mem_banks = 4
'''---------------------------------------
#       Advanced User Configuration
---------------------------------------'''
singlePE_template_config = [{'port_width':512, 'buf_size':128*1024},\
                            {'port_width':512, 'buf_size':64*1024}] 
                            # {'port_width':512, 'buf_size':64*1024},\
                            # {'port_width':256, 'buf_size':128*1024},\
                            # {'port_width':256, 'buf_size':64*1024}]
resource_limit = 0.7
kernel_frequency = 300 #MHz

KennySLiu commented 1 year ago

Hi, sorry for the late reply. I just configured it with your build on my own version, and I found a workaround.

The workaround I found was as follows:

In host.cpp, change the code around line 217.

ORIGINAL:

    // For Allocating Buffer to specific Global Memory Bank, user has to use cl_mem_ext_ptr_t
    // and provide the Banks
    if (xcl::is_emulation()) {
        printf("Emulation Mode \n");
        for (int i = 0; i < NUM_KERNEL; i++) {
            inputSearchSpaceBufExt[i].obj = searchspace_data_part[i].data();
            inputSearchSpaceBufExt[i].param = 0;
            inputSearchSpaceBufExt[i].flags = XCL_MEM_DDR_BANK1;
        }
        outputResultDistBufExt.obj = hw_dist.data();
        outputResultDistBufExt.param = 0;
        outputResultDistBufExt.flags = XCL_MEM_DDR_BANK1;
        outputResultIdBufExt.obj = hw_id.data();
        outputResultIdBufExt.param = 0;
        outputResultIdBufExt.flags = XCL_MEM_DDR_BANK1;
    }

WORKING:

    if (xcl::is_emulation()) {
        printf("Emulation Mode \n");
        for (int i = 0; i < NUM_KERNEL; i++) {
            inputSearchSpaceBufExt[i].obj = searchspace_data_part[i].data();
            inputSearchSpaceBufExt[i].param = 0;
            inputSearchSpaceBufExt[i].flags = XCL_MEM_DDR_BANK1;
        }
        inputSearchSpaceBufExt[0].flags = XCL_MEM_DDR_BANK0;    //added this
        inputSearchSpaceBufExt[1].flags = XCL_MEM_DDR_BANK1;    //added this
        inputSearchSpaceBufExt[2].flags = XCL_MEM_DDR_BANK2;    //added this
        inputSearchSpaceBufExt[3].flags = XCL_MEM_DDR_BANK3;    //added this

        outputResultDistBufExt.obj = hw_dist.data();
        outputResultDistBufExt.param = 0;
        outputResultDistBufExt.flags = XCL_MEM_DDR_BANK1;
        outputResultIdBufExt.obj = hw_id.data();
        outputResultIdBufExt.param = 0;
        outputResultIdBufExt.flags = XCL_MEM_DDR_BANK1;
    }

I copied those few lines from the "else" statement at line 225, into the "if" statement at line 211.

zyt1024 commented 1 year ago

Yes, thanks. Now I can run hardware emulation. After that, I built for hardware by running the make build TARGET=hw command. During implementation, the following error occurred:

[19:14:59] Phase 5.1 Delay CleanUp
[19:15:30] Phase 5.2 Clock Skew Optimization
[19:16:33] Phase 6 Post Hold Fix
[19:16:33] Phase 6.1 Hold Fix Iter
[19:17:36] Phase 6.2 Additional Hold Fix
[19:19:10] Phase 7 Leaf Clock Prog Delay Opt
[19:21:47] Phase 8 Route finalize
[19:22:19] Phase 9 Verifying routed nets
[19:22:50] Phase 10 Depositing Routes
[19:32:50] Run vpl: Step impl: Failed
[19:32:52] Run vpl: FINISHED. Run Status: impl ERROR

===>The following messages were generated while processing /home/zyt/Downloads/CHIP_ZYT/KNN_2/CHIP-KNN/scripts/gen_design/build_dir.hw.xilinx_u200_gen3x16_xdma_1_202110_1/link/vivado/vpl/prj/prj.runs/impl_1 :
ERROR: [VPL 18-1000] Routing results verification failed due to partially-conflicted nets (Up to first 10 of violated nets):  level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[67] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[66] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[63] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[58] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[54] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[53] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[52] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[51] level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[49] 
level0_i/ulp/ip_cc_axi_data_h2c_00/inst/gen_clock_conv.gen_async_conv.asyncfifo_axi/inst_fifo_gen/gaxi_full_lite.gread_ch.grach2.axi_rach/grf.rf/gntv_or_sync_fifo.mem/gdm.dm_gen.dm/dout_i[48] 
ERROR: [VPL 60-773] In '/home/zyt/Downloads/CHIP_ZYT/KNN_2/CHIP-KNN/scripts/gen_design/build_dir.hw.xilinx_u200_gen3x16_xdma_1_202110_1/link/vivado/vpl/runme.log', caught Tcl error:  problem implementing dynamic region, impl_1: route_design ERROR, please look at the run log file '/home/zyt/Downloads/CHIP_ZYT/KNN_2/CHIP-KNN/scripts/gen_design/build_dir.hw.xilinx_u200_gen3x16_xdma_1_202110_1/link/vivado/vpl/prj/prj.runs/impl_1/runme.log' for more information
WARNING: [VPL 60-732] Link warning: No monitor points found for BD automation.
ERROR: [VPL 60-704] Integration error, problem implementing dynamic region, impl_1: route_design ERROR, please look at the run log file '/home/zyt/Downloads/CHIP_ZYT/KNN_2/CHIP-KNN/scripts/gen_design/build_dir.hw.xilinx_u200_gen3x16_xdma_1_202110_1/link/vivado/vpl/prj/prj.runs/impl_1/runme.log' for more information
ERROR: [VPL 60-1328] Vpl run 'vpl' failed
ERROR: [VPL 60-806] Failed to finish platform linker
INFO: [v++ 60-1442] [19:32:54] Run run_link: Step vpl: Failed
Time (s): cpu = 00:20:45 ; elapsed = 09:03:22 . Memory (MB): peak = 2007.672 ; gain = 0.000 ; free physical = 53043 ; free virtual = 81685
ERROR: [v++ 60-661] v++ link run 'run_link' failed
ERROR: [v++ 60-626] Kernel link failed to complete
ERROR: [v++ 60-703] Failed to finish linking
INFO: [v++ 60-1653] Closing dispatch client.
make: *** [Makefile:94: build_dir.hw.xilinx_u200_gen3x16_xdma_1_202110_1/knn.xclbin] Error 1

Could you continue to help me? Thanks!

KennySLiu commented 1 year ago

Hi there,

Yeah, this is a known issue. I'm not sure how experienced you are with HLS and FPGA design, so I apologize if I overexplain. Anyway, the error you're seeing, "partially-conflicted nets", happens because this design has high routing congestion. This is a consistent pain point for FPGA designs, in our experience. To solve it, you will need to regenerate the bitstream with a smaller resource usage target: in config.py, change resource_limit = 0.7 to, say, resource_limit = 0.6, and then rerun the singlePE, multiPE, and HW build flow. (Note that when running these designs, we don't usually bother with HW_EMU.)

Note that the resource limit of 0.7 generated 2 PEs in SLR0, 0 PEs in SLR1, and 2 PEs in SLR2 - i.e. [2, 0, 2]. You can verify this by manually inspecting gen_design/src/knn.ini.

So when you regenerate, you'll probably want the distribution to drop to [1, 1, 1] or [1, 0, 1] or something like that. If you try again with resource_limit = 0.6 and it still generates [2, 0, 2], you will hit the same issue.
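
To get a rough feel for what lowering resource_limit does, here is a small Python sketch using the U200 SLR_resource figures from the config.py posted above. It assumes the limit is applied as a plain fraction of each SLR's resources, which is a guess about the generator's behavior rather than something confirmed in this thread, and slr_budget is a hypothetical helper that is not part of CHIP-KNN.

```python
# U200 per-SLR resources, copied from the config.py posted above.
SLR_resource = [
    {'BRAM': 1390, 'DSP': 2275, 'FF': 746000, 'LUT': 365000, 'URAM': 320},
    {'BRAM': 752,  'DSP': 1317, 'FF': 339000, 'LUT': 162000, 'URAM': 160},
    {'BRAM': 1390, 'DSP': 2275, 'FF': 746000, 'LUT': 365000, 'URAM': 320},
]


def slr_budget(limit_percent):
    """Hypothetical helper: per-SLR budget if the limit is a plain fraction.

    The limit is passed as an integer percentage (70 for 0.7) so that the
    arithmetic stays exact and does not depend on float rounding.
    """
    return [
        {name: amount * limit_percent // 100 for name, amount in slr.items()}
        for slr in SLR_resource
    ]


# Compare the assumed budgets for resource_limit = 0.7 and 0.6.
for percent in (70, 60):
    print(f"resource_limit = {percent / 100}")
    for i, budget in enumerate(slr_budget(percent)):
        print(f"  SLR{i}: {budget}")
```

The actual PE distribution still has to be checked in gen_design/src/knn.ini after regenerating, since a smaller budget does not by itself guarantee fewer PEs per SLR.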

Could you please tell me what your goal is with using our tool?

FYI, we're planning on releasing an updated version in a few months, but this newer version will not be as well-tested for the U200. It's primarily targeted at the U280.

zyt1024 commented 8 months ago

Thank you for your answer. My goal is to take 128 points with a feature dimension of 2 and obtain the K nearest neighbors of each of these 128 points.
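
For a problem of this size, a plain software reference can be handy for checking results. The following is a minimal brute-force sketch, not the accelerator's algorithm. It assumes the 128 points serve as both the queries and the search space, uses squared Euclidean distance (which ranks neighbors the same way as Dist = 1), and takes K = 10 from the config above; knn_brute_force and the random test data are illustrative only.

```python
import heapq
import random


def knn_brute_force(queries, search_space, k):
    """Return, for each 2-D query, its k nearest (squared_distance, index) pairs."""
    results = []
    for qx, qy in queries:
        # Squared Euclidean distance from this query to every candidate point.
        candidates = (
            ((qx - px) ** 2 + (qy - py) ** 2, idx)
            for idx, (px, py) in enumerate(search_space)
        )
        # nsmallest returns the k closest candidates in ascending distance order.
        results.append(heapq.nsmallest(k, candidates))
    return results


# Illustrative data: 128 random points with feature dimension 2.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(128)]
neighbors = knn_brute_force(points, points, k=10)
print(neighbors[0][:3])
```

Because each point is also in the search space here, its closest match is itself at distance 0; skip the first entry if self-matches are not wanted.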