ReconfigureIO / reco-sdaccel


[WIP] Update to use AWS-F1 SDK v1.4.5 #240

Closed zynaptic closed 5 years ago

zynaptic commented 5 years ago


Since support for the AFI used by the v1.3.x SDK/HDK version has now been discontinued, we need to update all our AWS builds to use the latest SDK/HDK. This also updates the required Vivado version to 2018.2.

Status At Wind-Down

This PR addressed the Xilinx/AWS tooling churn that had broken our public AWS offering towards the end of 2018. Prior to wind-down, AWS had offered to maintain support for the old tooling for a limited period while the work in this PR was completed. Based on past experience, it is highly likely that further tooling churn on the part of Xilinx/AWS will eventually render this PR obsolete. However, it may still serve as a template for identifying the changes required to support future Xilinx/AWS tool and framework versions.

Outstanding issues

pwaller commented 5 years ago
CampGareth commented 5 years ago
CampGareth commented 5 years ago

Upgrading didn't fix AFI generation :(

Logs from AFI generation:

# Vivado v2017.1_sdx_AR70350 (64-bit)
# SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
# IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
# Start of session at: Thu Dec 13 16:16:05 2018
# Process ID: 1903
# Current directory: /home/builder/scripts
# Command line: vivado -mode batch -source ingest.tcl
# Log file: /home/builder/scripts/vivado.log
# Journal file: /home/builder/scripts/vivado.jou
INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl 
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...

****** Vivado v2017.1_sdx_AR70350 (64-bit)
  **** SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
  **** IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl 
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP  "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT  "../reports/power_report.static.rpt"
# set timingRPT       "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT           "../reports/report_io.rpt"
# set partialBIT      "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX      "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp

I've checked the .dcp file we upload against the hash stored in the manifest we also upload. It's a match, so the DCP and manifest aren't being corrupted on upload.
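
For anyone repeating that check, here is a minimal Python sketch of comparing a checkpoint against its manifest. It assumes the manifest is a plain text file of `key=value` lines and that the hash is a SHA-256 hex digest stored under a key named `dcp_hash`. Both are assumptions and should be confirmed against the manifest actually being uploaded. The function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse 'key=value' lines into a dict, skipping blanks and lines without '='."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dcp_matches_manifest(dcp_path, manifest_path, hash_key="dcp_hash"):
    """Return True if the hash recorded in the manifest equals the DCP's SHA-256.

    `hash_key` is an assumed field name; pass a different one if the manifest differs.
    """
    fields = parse_manifest(Path(manifest_path).read_text())
    recorded = fields.get(hash_key)
    return recorded is not None and recorded.lower() == sha256_of_file(dcp_path)
```

A match only shows that the file was not altered between hashing and upload. It says nothing about whether Vivado's `check_integrity` will accept the checkpoint, which is the failure shown in the log above.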

CampGareth commented 5 years ago

I've hacked on a manifest file and while the original was this:


The modified version looks like this:


When I run AFI generation on a DCP that came out of our 2018.2 build process and include the manifest above (hash matches the DCP), AFI generation works. This manifest is created by so upgrading that did improve things. For whatever reason the previous hardware build that failed used the old so I'm re-running it, PR #240 build #38 on jenkins.

pwaller commented 5 years ago

Awaiting result of hardware build in

pwaller commented 5 years ago

Steps for getting this into production:

pwaller commented 5 years ago

Status update: Looks like AFI generation succeeded, and now it is "just" the deploy part which has failed.

I'm not surprised something is broken there. Hopefully should be fairly easy to fix outside of CI.

+ STREAM=staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ aws logs get-log-events --log-group-name /aws/batch/job --log-stream-name staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ jq -r '.events | .[] | .message'
+ aws s3 cp --quiet s3://reconfigureio-builds/tmp/ /tmp/ --region us-east-1
+ unzip /tmp/ -d /
Archive:  /tmp/
   creating: /.reco-work/sdaccel/dist/
  inflating: /.reco-work/sdaccel/dist/test-histogram  
  inflating: /.reco-work/sdaccel/dist/bench-histogram  
   creating: /.reco-work/sdaccel/dist/xclbin/
  inflating: /.reco-work/sdaccel/dist/xclbin/top_sp.ltx  
  inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin.raw  
  inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin  
+ fpga-clear-local-image -S 0
AFI          0       none                    cleared           1        ok               0       0x04261818
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-09c2a21805a8b9257
AFI          0       agfi-09c2a21805a8b9257  loaded            0        ok               0       0x0729172b
AFIDEVICE    0       0x1d0f      0xf001      0000:00:1d.0
+ fpga-clear-local-image -S 0
AFI          0       none                    cleared           1        ok               0       0x0729172b
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       none                    cleared           1        ok               0       0x0729172b
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0x1042      0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-093f6efe5d1441a64
AFI          0       agfi-093f6efe5d1441a64  loaded            0        ok               0       0x04261818
AFIDEVICE    0       0x1d0f      0xf010      0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       agfi-093f6efe5d1441a64  loaded            0        ok               0       0x04261818
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0xf010      0000:00:1d.0
+ export XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ export RTE=/opt/Xilinx/SDx/2018.2.rte
+ RTE=/opt/Xilinx/SDx/2018.2.rte
+ compgen -G '//.reco-work/sdaccel/dist/xclbin/*-f1_1ddr-xpr-2pr_4_0.xclbin'
+ source /opt/Xilinx/SDx/2018.2.op/
/ line 29: /opt/Xilinx/SDx/2018.2.op/ No such file or directory
+ exit 1

pwaller commented 5 years ago

I thought it was a little odd that the deploy script was running with $PWD as /, but I dug out an old job and that seems to be how it always was... So I guess we'll run with that for now.

pwaller commented 5 years ago

Here's what I'm currently running with on an f1 instance:

docker run -ti --rm --privileged -v /dev:/dev -v /opt/Xilinx/:/opt/Xilinx tmp
export DIST_URL=s3://reconfigureio-builds/tmp/ AGFI=agfi-093f6efe5d1441a64 CMD=test-histogram
export XILINX_SDX=/opt/Xilinx/SDx/2018.2.op2258646/

aws s3 cp --quiet "$DIST_URL" /tmp/ --region us-east-1
unzip /tmp/ -d "$PWD"

source "${XILINX_SDX}/"

And it fails like so:

root@80f553386f74:/# /.reco-work/sdaccel/dist/test-histogram
ERROR: No devices found
Error: no platforms available or OpenCL install broken

I'm currently trying to figure out how to achieve the effects of given that all of the paths have been rearranged (there is no .rte directory any more that I can ascertain).
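
Since the install layout has moved around between versions, it can help to list what is actually present before sourcing anything. The following is a rough Python sketch, not part of the deploy script. It lists install directories under a root such as `/opt/Xilinx/SDx` whose names start with a given version, and reports which candidate paths are missing. The helper names are hypothetical, and which files to probe for is left to the caller.

```python
from pathlib import Path


def find_sdx_installs(root, version):
    """Return directories directly under `root` whose names start with `version`, sorted."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        entry for entry in base.iterdir()
        if entry.is_dir() and entry.name.startswith(version)
    )


def missing_paths(paths):
    """Return the subset of `paths` that do not exist, preserving the input order."""
    return [path for path in paths if not Path(path).exists()]
```

For example, `find_sdx_installs("/opt/Xilinx/SDx", "2018.2")` would return entries such as `2018.2.op2258646` if they exist on the machine, and `missing_paths([...])` can be run over the paths a script expects before it tries to source them.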

Note that I have tried sourcing the script in aws-fpga but that doesn't help either.

My next plan is to dive in with a debugger and try and figure out why it's not working.

Slack convo link:

pwaller commented 5 years ago

I'm currently working on a new runtime image. I successfully got things working on the host, but I struggled to get it to work in Docker. It is still showing:

ERROR: No devices found
Error: no platforms available or OpenCL install broken
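
One quick thing to rule out with this error is whether the container can see the FPGA's PCI function at all. Below is a small Python sketch that walks sysfs and lists functions with a given vendor ID; `0x1d0f` is the VendorId shown in the `fpga-describe-local-image` output earlier in this thread. It assumes the standard Linux sysfs layout (`/sys/bus/pci/devices/<address>/vendor` and `device`). The function names are illustrative and not part of XRT or the AWS tools.

```python
from pathlib import Path


def read_hex_id(path):
    """Read a sysfs ID file (contents like '0x1d0f') and return it as an int."""
    return int(Path(path).read_text().strip(), 16)


def find_pci_functions(vendor_id, sysfs_root="/sys/bus/pci/devices"):
    """Return (address, device_id) for every PCI function matching `vendor_id`."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for entry in sorted(root.iterdir()):
        try:
            if read_hex_id(entry / "vendor") == vendor_id:
                matches.append((entry.name, read_hex_id(entry / "device")))
        except (OSError, ValueError):
            # Skip entries that are unreadable or not in the expected format.
            continue
    return matches
```

Calling `find_pci_functions(0x1D0F)` inside and outside the container and comparing the results shows whether the device is exposed to the container. A match only means the kernel exposes the function there; it does not show that the runtime can open it.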

Last thing I've done just now is to trace XRT inside the docker container and see what's going on. We definitely enter

Thread 1 "test-histogram" hit Breakpoint 1, (anonymous namespace)::createHalDevices (devices=std::vector of length 0, capacity 0, dll="/opt/xilinx/xrt/lib/", count=0)
    at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:100
100 {
(gdb) next
104   auto handle = handle_type(dlopen(dll.c_str(), RTLD_LAZY | RTLD_GLOBAL),delHandle);
105   if (!handle)
110   auto probeFunc = (probeFuncType)dlsym(handle.get(), propeFunc().c_str());
111   if (!probeFunc)
114   unsigned pmdCount = 0;
116   if (count || (count = probeFunc()) || pmdCount) {
Linux:3.10.0-862.11.6.el7.x86_64:#1 SMP Tue Aug 14 21:49:04 UTC 2018:x86_64
Distribution: Ubuntu 18.04.1 LTS
GLIBC: 2.27

I'm probably going to have to pick this up after Christmas at this point.

pwaller commented 5 years ago

So weird. It's definitely finding the device:

Thread 1 "test-histogram" hit Breakpoint 2, xcldev::pci_device_scanner::add_device (this=0x7fffffffdb10, device=...)
    at /tmp/XRT-2018.3.RC1/src/runtime_src/driver/xclng/xrt/user_aws/scan.cpp:118
118         if ( device.func == 2) {//AWS Pegasus mgmtPF is 2; On AWS F1 mgmtPF is not visible
(gdb) next
121         } else if ( device.func == 0) {
123             user_devices.emplace_back(device);
128         return true;
129     }

But then later:

xrt::hal::loadDevices () at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:164
164     bfs::path p(xrt / "lib/");

169   if (!xrt.empty() && isEmulationMode()) {
183   if (!xrt.empty() && isEmulationMode()) {
197   if (xrt.empty())
200   return devices;
(gdb) print devices.size()
$12 = 0

I've run out of time, so gotta run for now. May poke around later.

CampGareth commented 5 years ago

I am attempting to generate an AFI from our compiled .xclbin but am hitting the following error:

/opt/ line 105: No such file or directory
ERROR: Env variable RELEASE_VER not set, did you ?

RELEASE_VER is the version of the Vivado tooling we're using. It does get set by but since there are already so many changes required to support a new Vivado version, I think I'll just hard-code the Vivado version number and remove this dependency. This'll also increase the greppability of these scripts.

Edit: this script worked and we now have an AFI to work with: afi-06437cf7254575b62

CampGareth commented 5 years ago

This PR's grown a bit long in the tooth but is full of useful changes. I'm going to merge it and then maybe create a second PR to update the sdaccel library embedded in examples.