ReconfigureIO / reco-sdaccel


[WIP] Update to use AWS-F1 SDK v1.4.5 #240

Closed zynaptic closed 5 years ago

zynaptic commented 5 years ago

Changes

Since support for the AFI used by the v1.3.x SDK/HDK version has now been discontinued, we need to update all our AWS builds to use the latest SDK/HDK. This also updates the required Vivado version to 2018.2.

Status At Wind-Down

This PR addressed the Xilinx/AWS tooling churn that had broken our public AWS offering towards the end of 2018. Prior to wind-down, AWS had offered to maintain support for the old tooling for a limited period while the work in this PR was completed. Based on past experience, it is highly likely that further tooling churn on the part of Xilinx/AWS will eventually render this PR obsolete. However, it may still serve as a template for identifying the changes required to support future Xilinx/AWS tool and framework versions.

Outstanding issues

pwaller commented 5 years ago
CampGareth commented 5 years ago

Upgrading create_sdaccel_afi.sh didn't fix AFI generation :(

Logs from AFI generation:

#-----------------------------------------------------------
# Vivado v2017.1_sdx_AR70350 (64-bit)
# SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
# IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
# Start of session at: Thu Dec 13 16:16:05 2018
# Process ID: 1903
# Current directory: /home/builder/scripts
# Command line: vivado -mode batch -source ingest.tcl
# Log file: /home/builder/scripts/vivado.log
# Journal file: /home/builder/scripts/vivado.jou
#-----------------------------------------------------------
INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl 
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...
[stdout]

****** Vivado v2017.1_sdx_AR70350 (64-bit)
  **** SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
  **** IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl 
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...
[stderr]
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp

I've checked the .dcp file we upload against the hash stored in the manifest we also upload. It's a match, so the DCP and manifest aren't being corrupted on upload.
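For reference, a check like the one described above can be sketched in a few lines of shell. This is only an illustration: it assumes `dcp_hash` is a SHA-256 hex digest (consistent with its 64 hex digits, but not confirmed in this thread), and the function names `manifest_get` and `verify_dcp_hash` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare a DCP's digest with the dcp_hash recorded in a manifest.
# Assumption: dcp_hash is a SHA-256 hex digest of the DCP file.

# Print the value of KEY from a key=value manifest file (first match only).
manifest_get() {
    local key="$1" manifest="$2"
    sed -n "s/^${key}=//p" "$manifest" | head -n 1
}

# Return 0 if the file's SHA-256 matches the manifest's dcp_hash, 1 otherwise.
verify_dcp_hash() {
    local dcp="$1" manifest="$2"
    local expected actual
    expected="$(manifest_get dcp_hash "$manifest")"
    actual="$(sha256sum "$dcp" | cut -d ' ' -f 1)"
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

It could be used as `verify_dcp_hash 18_12_13-161259_SH_CL_routed.dcp manifest.txt && echo "hash matches"`, where `manifest.txt` is a placeholder name for the uploaded manifest.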

CampGareth commented 5 years ago

I've hacked on a manifest file. The original was this:

manifest_format_version=1
pci_vendor_id=0x1D0F
pci_device_id=0xF000
pci_subsystem_id=0x1D51
pci_subsystem_vendor_id=0xFEDD
dcp_hash=22bdf81b6dc3f6143fe447b08c20490f85ef59213cc222590ac22a6462c9506c
shell_version=0x071417d3
dcp_file_name=18_12_13-161259_SH_CL_routed.dcp
hdk_version=1.3.0
date=18_12_13-161259
clock_main_a0=250
clock_extra_b0=500
clock_extra_c0=250

The modified version looks like this:

manifest_format_version=2
pci_vendor_id=0x1D0F
pci_device_id=0xF010
pci_subsystem_id=0x1D51
pci_subsystem_vendor_id=0xFEDD
dcp_hash=22bdf81b6dc3f6143fe447b08c20490f85ef59213cc222590ac22a6462c9506c
shell_version=0x04261818
dcp_file_name=18_12_13-161259_SH_CL_routed.dcp
hdk_version=1.4.5
tool_version=v2018.2
date=18_12_13-161259
clock_main_a0=250
clock_extra_b0=500
clock_extra_c0=250

When I run AFI generation on a DCP that came out of our 2018.2 build process and include the manifest above (the hash matches the DCP), AFI generation works. This manifest is created by create_sdaccel_afi.sh, so upgrading that script did improve things. For whatever reason, the previous hardware build that failed used the old create_sdaccel_afi.sh, so I'm re-running it as PR #240 build #38 on Jenkins.
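To see at a glance which fields changed between the two manifests above, a small key-by-key comparison can help. The sketch below is an illustration only; `manifest_diff` is a hypothetical helper, and it assumes values never contain an `=` character, which holds for the manifests shown here.

```shell
#!/usr/bin/env bash
# Sketch: print keys whose values differ between two key=value manifests.
#   "~" marks a changed value, "+" a key only in the new file,
#   "-" a key only in the old file.
manifest_diff() {
    awk -F= '
        NR == FNR { old[$1] = $2; next }   # first file: remember old values
        {
            seen[$1] = 1
            if (!($1 in old)) {
                printf "+ %s=%s\n", $1, $2
            } else if (old[$1] != $2) {
                printf "~ %s: %s -> %s\n", $1, old[$1], $2
            }
        }
        END {
            for (k in old) {
                if (!(k in seen)) printf "- %s=%s\n", k, old[k]
            }
        }
    ' "$1" "$2"
}
```

Run against the original and modified manifests, this would report changes to `manifest_format_version`, `pci_device_id`, `shell_version` and `hdk_version`, plus the new `tool_version` key.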

pwaller commented 5 years ago

Awaiting result of hardware build in

pwaller commented 5 years ago

Steps for getting this into production:

pwaller commented 5 years ago

Status update: Looks like AFI generation succeeded, and now it is "just" the deploy part which has failed.

http://jenkins.nerabus-infra.com:8080/blue/organizations/jenkins/Reconfigure.io%2Freco-sdaccel/detail/PR-240/55/pipeline#step-67-log-2043

I'm not surprised something is broken there. Hopefully should be fairly easy to fix outside of CI.

+ STREAM=staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ aws logs get-log-events --log-group-name /aws/batch/job --log-stream-name staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ jq -r '.events | .[] | .message'
+ aws s3 cp --quiet s3://reconfigureio-builds/tmp/8f769704-020d-11e9-9f06-1243484308a4.dist.zip /tmp/bundle.zip --region us-east-1
+ unzip /tmp/bundle.zip -d /
Archive:  /tmp/bundle.zip
   creating: /.reco-work/sdaccel/dist/
  inflating: /.reco-work/sdaccel/dist/test-histogram  
  inflating: /.reco-work/sdaccel/dist/bench-histogram  
   creating: /.reco-work/sdaccel/dist/xclbin/
  inflating: /.reco-work/sdaccel/dist/xclbin/top_sp.ltx  
  inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin.raw  
  inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin  
+ fpga-clear-local-image -S 0
AFI          0       none                    cleared           1        ok               0       0x04261818
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-09c2a21805a8b9257
AFI          0       agfi-09c2a21805a8b9257  loaded            0        ok               0       0x0729172b
AFIDEVICE    0       0x1d0f      0xf001      0000:00:1d.0
+ fpga-clear-local-image -S 0
AFI          0       none                    cleared           1        ok               0       0x0729172b
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       none                    cleared           1        ok               0       0x0729172b
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-093f6efe5d1441a64
AFI          0       agfi-093f6efe5d1441a64  loaded            0        ok               0       0x04261818
AFIDEVICE    0       0x1d0f      0xf010      0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       agfi-093f6efe5d1441a64  loaded            0        ok               0       0x04261818
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0xf010      0000:00:1d.0
+ export XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ export RTE=/opt/Xilinx/SDx/2018.2.rte
+ RTE=/opt/Xilinx/SDx/2018.2.rte
+ compgen -G '//.reco-work/sdaccel/dist/xclbin/*-f1_1ddr-xpr-2pr_4_0.xclbin'
+ source /opt/Xilinx/SDx/2018.2.op/settings64.sh
/run.sh: line 29: /opt/Xilinx/SDx/2018.2.op/settings64.sh: No such file or directory
+ exit 1

pwaller commented 5 years ago

I thought it was a little odd that the deploy script was running with $PWD as /, but I dug out an old job and that seems to be how it always was... So I guess we'll run with that for now.
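One further detail in the deploy log above: the argument traced as `$'\342\200\223R'` is the UTF-8 encoding of an en dash (U+2013) followed by `R`, so the script is presumably passing a typographic dash where `-R` was intended. That reading of the intent is an assumption. A minimal sketch for spotting such characters in a script follows; `find_non_ascii` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print "LINE:content" for every line containing a byte that is
# neither printable ASCII nor whitespace. Returns non-zero if none is found.
find_non_ascii() {
    # LC_ALL=C makes grep work byte-wise; -a forces text mode.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

For example, `find_non_ascii run.sh` would list any lines of the deploy script that contain typographic dashes or quotes.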

pwaller commented 5 years ago

Here's what I'm currently running with on an f1 instance:

docker run -ti --rm --privileged -v /dev:/dev -v /opt/Xilinx/:/opt/Xilinx tmp
export DIST_URL=s3://reconfigureio-builds/tmp/8f769704-020d-11e9-9f06-1243484308a4.dist.zip AGFI=agfi-093f6efe5d1441a64 CMD=test-histogram
export XILINX_SDX=/opt/Xilinx/SDx/2018.2.op2258646/

aws s3 cp --quiet "$DIST_URL" /tmp/bundle.zip --region us-east-1
unzip /tmp/bundle.zip -d "$PWD"

source "${XILINX_SDX}/settings64.sh"
/.reco-work/sdaccel/dist/test-histogram

And it fails like so:

root@80f553386f74:/# /.reco-work/sdaccel/dist/test-histogram
ERROR: No devices found
Error: no platforms available or OpenCL install broken

I'm currently trying to figure out how to achieve the effects of https://github.com/ReconfigureIO/reco-sdaccel/blob/master/docker-staging-deploy/run.sh given that all of the paths have been rearranged (as far as I can ascertain, there is no .rte directory any more).

Note that I have tried sourcing the sdk_setup.sh script in aws-fpga but that doesn't help either.
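Since the install paths differ between tool versions, one option is to probe a list of candidate directories for settings64.sh before sourcing it, rather than failing with "No such file or directory" as run.sh did. The sketch below is an illustration under that assumption; `find_sdx_settings` is a hypothetical helper, and the candidate directories are the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch: print the path of the first settings64.sh found in the given
# candidate directories; print a message to stderr and return 1 if none has it.
find_sdx_settings() {
    local dir
    for dir in "$@"; do
        if [ -f "${dir%/}/settings64.sh" ]; then
            printf '%s\n' "${dir%/}/settings64.sh"
            return 0
        fi
    done
    echo "settings64.sh not found in: $*" >&2
    return 1
}
```

Usage might look like `settings="$(find_sdx_settings "$XILINX_SDX" /opt/Xilinx/SDx/2018.2.op)" && source "$settings"`.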

My next plan is to dive in with a debugger and try and figure out why it's not working.

Slack convo link: https://reconfigure.slack.com/archives/C2XQS6J6B/p1545152019101200

pwaller commented 5 years ago

I'm currently working on a new runtime image over on https://github.com/ReconfigureIO/docker-aws-fpga-runtime/pull/1/files. I successfully got things working on the host, but I struggled to get it to work in docker. It is still showing:

ERROR: No devices found
Error: no platforms available or OpenCL install broken

Last thing I've done just now is to trace XRT inside the docker container and see what's going on. We definitely enter libxrt_aws.so:

Thread 1 "test-histogram" hit Breakpoint 1, (anonymous namespace)::createHalDevices (devices=std::vector of length 0, capacity 0, dll="/opt/xilinx/xrt/lib/libxrt_aws.so", count=0)
    at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:100
100 {
(gdb) next
104   auto handle = handle_type(dlopen(dll.c_str(), RTLD_LAZY | RTLD_GLOBAL),delHandle);
(gdb) 
105   if (!handle)
(gdb) 
110   auto probeFunc = (probeFuncType)dlsym(handle.get(), propeFunc().c_str());
(gdb) 
111   if (!probeFunc)
(gdb) 
114   unsigned pmdCount = 0;
(gdb) 
116   if (count || (count = probeFunc()) || pmdCount) {
(gdb) 
Linux:3.10.0-862.11.6.el7.x86_64:#1 SMP Tue Aug 14 21:49:04 UTC 2018:x86_64
Distribution: Ubuntu 18.04.1 LTS
GLIBC: 2.27
--- 
XILINX_OPENCL=""
LD_LIBRARY_PATH="/opt/xilinx/xrt/lib:"
--- 

I'm probably going to have to pick this up after Christmas at this point.
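When a program reports "No devices found" inside a container but works on the host, one quick thing to rule out is whether the FPGA's PCI function is visible in the container's sysfs at all. The sketch below lists devices by vendor ID; it defaults to 0x1d0f, the VendorId shown in the AFIDEVICE lines earlier in this thread. `list_pci_by_vendor` is a hypothetical helper, and this check only rules out one possible cause.

```shell
#!/usr/bin/env bash
# Sketch: list PCI devices under a sysfs root whose vendor ID matches.
# Prints "<address> <device id>" per match; returns 1 if nothing matched.
list_pci_by_vendor() {
    local root="${1:-/sys/bus/pci/devices}" vendor="${2:-0x1d0f}"
    local dev found=1
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "$vendor" ]; then
            printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/device")"
            found=0
        fi
    done
    return "$found"
}
```

Running `list_pci_by_vendor` with no arguments on the host and again inside the container shows whether both see the same device.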

pwaller commented 5 years ago

So weird. It's definitely finding the device:

Thread 1 "test-histogram" hit Breakpoint 2, xcldev::pci_device_scanner::add_device (this=0x7fffffffdb10, device=...)
    at /tmp/XRT-2018.3.RC1/src/runtime_src/driver/xclng/xrt/user_aws/scan.cpp:118
118         if ( device.func == 2) {//AWS Pegasus mgmtPF is 2; On AWS F1 mgmtPF is not visible
(gdb) next
121         } else if ( device.func == 0) {
(gdb) 
123             user_devices.emplace_back(device);
(gdb) 
128         return true;
(gdb) 
129     }
(gdb) 

But then later:

xrt::hal::loadDevices () at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:164
164     bfs::path p(xrt / "lib/libxrt_aws.so");
(gdb) 

169   if (!xrt.empty() && isEmulationMode()) {
(gdb) 
183   if (!xrt.empty() && isEmulationMode()) {
(gdb) 
197   if (xrt.empty())
(gdb) 
200   return devices;
(gdb) print devices.size()
$12 = 0

I've run out of time, so gotta run for now. May poke around later.

CampGareth commented 5 years ago

I am attempting to generate an AFI from our compiled .xclbin but am hitting the following error:

/opt/create_sdaccel_afi.sh: line 105: sdaccel_setup.sh: No such file or directory
ERROR: Env variable RELEASE_VER not set, did you ?

RELEASE_VER is the version of the Vivado tooling we're using. It does get set by sdaccel_setup.sh, but since so many changes are already required to support a new Vivado version, I think I'll just hard-code the Vivado version number and remove this dependency. This will also make these scripts easier to grep.

Edit: this script worked and we now have an AFI to work with: afi-06437cf7254575b62
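A softer variant of hard-coding the version is to pin a default that an existing environment value can still override. The sketch below shows that variant, not the change actually made in this PR. The value 2018.2 comes from the Vivado version this PR targets; the exact format create_sdaccel_afi.sh expects for RELEASE_VER is an assumption, and `pin_release_ver` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: set RELEASE_VER to a pinned default only when it is unset or empty,
# so the script no longer depends on sdaccel_setup.sh having been sourced.
pin_release_ver() {
    : "${RELEASE_VER:=$1}"
    export RELEASE_VER
}

# Assumption: "2018.2" is the format the script expects.
pin_release_ver 2018.2
```

Keeping the version in one named place preserves the greppability mentioned above while still allowing an override for experiments.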

CampGareth commented 5 years ago

This PR's grown a bit long in the tooth but is full of useful changes. I'm going to merge it and then maybe create a second PR to update the sdaccel library embedded in examples.