Xilinx / gemx

Matrix Operation Library for FPGA https://xilinx.github.io/gemx/

param:compiler.acceleratorBinaryContent="dcp" or "bitstream" #7

Open b04902036 opened 6 years ago

b04902036 commented 6 years ago

I am using an AWS EC2 F1 instance to run gemx, with [this AMI](https://aws.amazon.com/marketplace/pp/B06VVYBLZZ). However, after I compiled gemm_kernel.h into gemm.xclbin and uploaded it to AWS to create an AFI, I got this error:

{
    "FpgaImages": [
        {
            "UpdateTime": "2018-07-23T10:18:14.000Z", 
            "Name": "overlay_2", 
            "Tags": [], 
            "PciId": {
                "SubsystemVendorId": "0xfedd", 
                "VendorId": "0x1d0f", 
                "DeviceId": "0xf000", 
                "SubsystemId": "0x1d51"
            }, 
            "FpgaImageGlobalId": "agfi-066b09c66343492a7", 
            "Public": false, 
            "State": {
                "Message": "UNKNOWN_BITSTREAM_GENERATE_ERROR: An unexpected error occurred generating the bitstream", 
                "Code": "failed"
            }, 
            "ShellVersion": "0x04261818", 
            "OwnerId": "532726133341", 
            "FpgaImageId": "afi-0cb323a4335e95e92", 
            "CreateTime": "2018-07-23T10:13:06.000Z", 
            "Description": "overlay_2"
        }
    ]
}
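For anyone checking several AFIs, the state code and message can be pulled out of this JSON programmatically. The following is only a minimal sketch: it assumes the output above has been captured as text, and the helper name `afi_states` is made up for illustration. The field names (`FpgaImages`, `FpgaImageId`, `State`, `Code`, `Message`) are taken from the output shown above.

```python
import json


def afi_states(describe_output):
    """Return (FpgaImageId, state code, state message) for each image.

    `describe_output` is assumed to be the JSON text shown above.
    """
    data = json.loads(describe_output)
    results = []
    for image in data.get("FpgaImages", []):
        state = image.get("State", {})
        results.append(
            (image.get("FpgaImageId"), state.get("Code"), state.get("Message", ""))
        )
    return results


# Trimmed copy of the output above, keeping only the fields used here.
sample = """
{
    "FpgaImages": [
        {
            "FpgaImageId": "afi-0cb323a4335e95e92",
            "State": {
                "Message": "UNKNOWN_BITSTREAM_GENERATE_ERROR: An unexpected error occurred generating the bitstream",
                "Code": "failed"
            }
        }
    ]
}
"""

for image_id, code, message in afi_states(sample):
    print(image_id, code, message)
```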

Here is the log file from the AWS S3 bucket:

#-----------------------------------------------------------
# Vivado v2017.4.op (64-bit)
# SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
# IP Build 2189296 on Tue Apr 10 19:39:46 MDT 2018
# Start of session at: Mon Jul 23 01:59:17 2018
# Process ID: 1961
# Current directory: /home/builder/scripts
# Command line: vivado -mode batch -source ingest.tcl
# Log file: /home/builder/scripts/vivado.log
# Journal file: /home/builder/scripts/vivado.jou
#-----------------------------------------------------------
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# set CL_PATH WRAPPER_INST/CL
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Mon Jul 23 02:01:30 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# set_param hd.platformVerifyCachedRun false
# check_integrity $userDCP
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '-1'): /home/builder/checkpoints/SH_CL_routed.dcp
INFO: [Common 17-206] Exiting Vivado at Mon Jul 23 02:01:31 2018...
[stdout]

****** Vivado v2017.4.op (64-bit)
  **** SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
  **** IP Build 2189296 on Tue Apr 10 19:39:46 MDT 2018
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# set CL_PATH WRAPPER_INST/CL
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Mon Jul 23 02:01:30 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# set_param hd.platformVerifyCachedRun false
# check_integrity $userDCP
INFO: [Common 17-206] Exiting Vivado at Mon Jul 23 02:01:31 2018...
[stderr]
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '-1'): /home/builder/checkpoints/SH_CL_routed.dcp
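
Since the same error appears in both the log and the stderr section, it can help to filter the log down to its unique `ERROR:` lines. The following is a rough sketch that assumes messages follow the `ERROR: [Tool NN-NNNN] text` shape seen above. The function name `find_errors` is hypothetical and not part of any Xilinx or AWS tool.

```python
import re

# Matches lines such as: ERROR: [Vivado 12-5532] some message
ERROR_RE = re.compile(r"^ERROR: \[(?P<tool>\w+) (?P<code>\d+-\d+)\] (?P<text>.*)$")


def find_errors(log_text):
    """Return unique (tool, code, message) tuples in order of first appearance."""
    seen = set()
    errors = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if not match:
            continue
        entry = (match.group("tool"), match.group("code"), match.group("text"))
        if entry not in seen:
            seen.add(entry)
            errors.append(entry)
    return errors


# A few lines copied from the log above; the error is repeated as in [stderr].
sample_log = """# check_integrity $userDCP
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '-1'): /home/builder/checkpoints/SH_CL_routed.dcp
INFO: [Common 17-206] Exiting Vivado at Mon Jul 23 02:01:31 2018...
[stderr]
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '-1'): /home/builder/checkpoints/SH_CL_routed.dcp
"""

for tool, code, text in find_errors(sample_log):
    print(tool, code, text)
```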

After some trial and error, I changed a parameter in the step that compiles gemx.xo into gemx.xclbin, from param:compiler.acceleratorBinaryContent="bitstream" to param:compiler.acceleratorBinaryContent="dcp", and the AFI then built without an error. Is it possible to create an AFI from an xclbin file built with param:compiler.acceleratorBinaryContent set to bitstream? Also, although the AFI builds successfully, the example program gemx_api_gemm.exe gets stuck waiting for the kernel to return. Can somebody help me with this issue? Thank you!

lisaliu1 commented 6 years ago

Which branch of the gemx repository are you using?

b04902036 commented 6 years ago

I am using the 2017.4 branch.

lisaliu1 commented 6 years ago

You have to use the xilinx_aws-vu9p-f1-04261818_dynamic_5_0 DSA for F1, which has SDx 2017.4 support. I have updated gemx/Makefile in the 2017.4 branch to reflect this change. Please set DSA_PATH in the Makefile to point to the directory that contains the xilinx_aws-vu9p-f1-04261818_dynamic_5_0 folder, then run the following make command. It should generate a .xclbin that works on F1. Let me know if you have more problems.

make run_hw GEMX_ddrWidth=32 GEMX_XddrWidth=16 GEMX_keepMacBits=1 GEMX_argInstrWidth=1 GEMX_numKernels=1 GEMX_runGemv=0 GEMX_runGemm=1 GEMX_runTransp=0 GEMX_runSpmv=0 GEMX_gemmMBlocks=4 GEMX_gemmKBlocks=4 GEMX_gemmNBlocks=4 GEMX_splitMesh=1 GEMX_part=vu9pf1 GEN_BIN_PROGRAM="gemm 512 512 512 512 512 512 512 1 0 A05 B05 C05 X05 gemm 1024 1024 1024 1024 1024 1024 1024 1 0 A1k B1k C1k X1K gemm 1024 1024 1024 1536 2048 2560 1024 1 0 A1kld B1kld C1kld X1kld" 2>&1 | tee log

b04902036 commented 6 years ago

I ran the command, and it seems to generate the xclbin properly. However, it reported an error from xbinst, and I'm not sure what that step is for:

INFO: [XOCC 60-586] Created out_hw/gemx.xclbin
INFO: [XOCC 60-791] Total elapsed time: 5h 25m 32s
make dump_config
make[2]: Entering directory `/home/centos/src/project_data/gemx/gemx'
make[2]: Leaving directory `/home/centos/src/project_data/gemx/gemx'
Running xbinst...
/opt/Xilinx/SDx/2017.4.op/bin/xbinst --platform_repo_paths=/home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0 --platform xilinx_aws-vu9p-f1-04261818_dynamic_5_0 -d out_hw

****** xbinst v2017.4 (64-bit)
  **** SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

INFO: [XBINST 60-895]    Target platform: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xpfm
INFO: [XBINST 60-267] Packaging for PCIe...
WARNING: [XBINST 60-947] No image(s) discovered to be inserted into the '.dsabin' file.  File not created.
INFO: [XBINST 60-916] The default source directory /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/gem does not exist, so using an alternative source directory: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/classic
ERROR: [XBINST 17-53] User Exception: The source directory does not exist: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/classic

ERROR: [XBINST 60-666] xbinst command failed.

I then tried to execute out_host/gemx_host.exe with the command ./gemx_host.exe ../out_hw/gemx.xclbin app.bin app_out.bin, and this is the error message:

INFO: loading app.bin of size 29892608
INFO: loaded 29892608 bytes from app.bin
[0]user:0xf010:0x1d51:[xocl:2017.4.5:128]
xclProbe found 1 FPGA slots with xocl driver running
Linux:3.10.0-693.21.1.el7.x86_64:#1 SMP Wed Mar 7 19:03:37 UTC 2018:x86_64
Distribution: CentOS Linux release 7.4.1708 (Core) 
GLIBC: 2.17
--- 
XILINX_OPENCL="/home/centos/src/project_data/ml-suite/overlaybins/aws"
LD_LIBRARY_PATH="/home/centos/src/project_data/ml-suite/overlaybins/aws/runtime/lib/x86_64/:/home/centos/src/project_data/ml-suite/xfdnn/rt/xdnn_cpp/build/lib:/home/centos/src/project_data/ml-suite/xfdnn/rt/lib:/home/centos/src/project_data/ml-suite/ext/boost/lib:/home/centos/src/project_data/ml-suite/ext/zmq/libs:/home/centos/src/project_data"
--- 
Segmentation Fault

How can I solve this? Thank you!

lisaliu1 commented 6 years ago

Please comment out lines 307, 308, 309 and 311 in the Makefile as shown below:

307 #ifeq (${GEMX_part},vu9pf1)
308 #GEMX_fpgaDdrBanks = XCL_MEM_DDR_BANK${K2_DDR},XCL_MEM_DDR_BANK${K1_DDR},XCL_MEM_DDR_BANK${K0_DDR},XCL_MEM_DDR_BANK${K3_DDR}
309 #else
310 GEMX_fpgaDdrBanks = XCL_MEM_DDR_BANK${K0_DDR},XCL_MEM_DDR_BANK${K1_DDR},XCL_MEM_DDR_BANK${K2_DDR},XCL_MEM_DDR_BANK${K3_DDR}
311 #endif

Then run the following commands to build the host executable again:

rm -rf out_host
make host GEMX_ddrWidth=32 GEMX_XddrWidth=16 GEMX_keepMacBits=1 GEMX_argInstrWidth=1 GEMX_numKernels=1 GEMX_runGemv=0 GEMX_runGemm=1 GEMX_runTransp=0 GEMX_runSpmv=0 GEMX_gemmMBlocks=4 GEMX_gemmKBlocks=4 GEMX_gemmNBlocks=4 GEMX_splitMesh=1 GEMX_part=vu9pf1

Copy gemx_host.exe to the F1 instance and rerun the command ./gemx_host.exe ../out_hw/gemx.xclbin app.bin app_out.bin

Let me know if you have any success. We will update the Makefile later.

Mahdi89 commented 6 years ago

@lisaliu1 When I tried to run the Python test with make gemm_test_python GEMX_keepMacBits=1 GEMX_gemmNBlocks=1 GEMX_splitMesh=1, I encountered the following error (I'm using the 2017.4 version). I'm not sure why it expects gem software drivers in the platform directory. Have I set a variable incorrectly?

INFO: wrote 8192 bytes to out_host/app_gold.bin
out_host/gemx_gen_bin.exe -read out_host/app.bin > out_host/app.txt
out_host/gemx_gen_bin.exe -read out_host/app_gold.bin > out_host/app_gold.txt
Running xbinst...
/opt/Xilinx/SDx/2017.4.op/bin/xbinst --platform_repo_paths=/home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0 --platform xilinx_aws-vu9p-f1-04261818_dynamic_5_0 -d out_hw

****** xbinst v2017.4 (64-bit)
  **** SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

INFO: [XBINST 60-895] Target platform: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xpfm
INFO: [XBINST 60-267] Packaging for PCIe...
WARNING: [XBINST 60-947] No image(s) discovered to be inserted into the '.dsabin' file. File not created.
INFO: [XBINST 60-916] The default source directory /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/gem does not exist, so using an alternative source directory: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/classic
ERROR: [XBINST 17-53] User Exception: The source directory does not exist: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/sw/driver/classic

Thanks for helping.

UPDATE: sourcing sdaccel_setup.sh doesn't help either; I assumed it already does the board installation.

lisaliu1 commented 6 years ago

@Mahdi89 The README file for the 2017.4 branch has been updated to include the steps for building a gemx image for AWS F1. When building a gemx image for AWS F1, you don't need to run xbinst to generate the driver, because the F1 instance you are using should already have the driver installed. To resolve the error you encountered, please comment out the xbinst command in the Makefile.
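
If editing by hand is inconvenient, the change could be scripted along these lines. This is only a sketch under assumptions: the Makefile fragment below is invented for illustration (the real gemx Makefile may look different), and the helper `comment_out_xbinst` is hypothetical. It comments out every line that mentions xbinst, including echo lines, so review the result before using it.

```python
def comment_out_xbinst(makefile_text):
    """Prefix every not-yet-commented line that mentions xbinst with '#'.

    Leading whitespace (for example a recipe tab) is kept in front of the '#'.
    """
    out = []
    for line in makefile_text.splitlines(keepends=True):
        stripped = line.lstrip()
        if "xbinst" in stripped.lower() and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "#" + stripped)
        else:
            out.append(line)
    return "".join(out)


# Hypothetical Makefile fragment, NOT copied from the gemx repository.
sample_makefile = (
    "run_hw: xclbin\n"
    "\t@echo Running xbinst...\n"
    "\t${XBINST} --platform ${PLATFORM} -d out_hw\n"
    "\t@echo done\n"
)

print(comment_out_xbinst(sample_makefile))
```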