ami-0bf2595a8c6d6358e
Replaced reconfigure_io_sdaccel_builder_stub_0_1 with rio_sda_stub_0_1.
Upgrading create_sdaccel_afi.sh didn't fix AFI generation :(
Logs from AFI generation:
#-----------------------------------------------------------
# Vivado v2017.1_sdx_AR70350 (64-bit)
# SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
# IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
# Start of session at: Thu Dec 13 16:16:05 2018
# Process ID: 1903
# Current directory: /home/builder/scripts
# Command line: vivado -mode batch -source ingest.tcl
# Log file: /home/builder/scripts/vivado.log
# Journal file: /home/builder/scripts/vivado.jou
#-----------------------------------------------------------
INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT "../reports/power_report.static.rpt"
# set timingRPT "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT "../reports/report_io.rpt"
# set partialBIT "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...
[stdout]
****** Vivado v2017.1_sdx_AR70350 (64-bit)
**** SW Build 1933108 on Fri Jul 14 11:54:19 MDT 2017
**** IP Build 1908669 on Fri Jul 14 13:31:24 MDT 2017
** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.
INFO: [Common 17-1460] Use of init.tcl in /opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl is deprecated. Please use Vivado_init.tcl
Sourcing tcl script '/opt/Xilinx/SDx/2017.1.op/Vivado/scripts/init.tcl'
0 Beta devices matching pattern found, 0 enabled.
Loaded SDSoC Platform Tcl Library
source ingest.tcl
# set userDCP "../checkpoints/SH_CL_routed.dcp"
# set awsDCP "../checkpoints/SH_CL_BB_routed.dcp"
# set powerDefaultRPT "../reports/power_report.default.rpt"
# set powerStaticRPT "../reports/power_report.static.rpt"
# set timingRPT "../reports/SH_CL_final_timing_summary.rpt"
# set ioRPT "../reports/report_io.rpt"
# set partialBIT "../bitstreams/SH_CL_final_pblock_CL_partial.bit"
# set partialLTX "../bitstreams/SH_CL_final_pblock_CL_partial.ltx"
# puts "Ingest start time: \[[clock format [clock seconds] -format {%a %b %d %H:%M:%S %Y}]\]"
Ingest start time: [Thu Dec 13 16:18:05 2018]
# set_param hd.supportClockNetCrossDiffReconfigurablePartitions 1
# check_integrity $userDCP
INFO: [Common 17-206] Exiting Vivado at Thu Dec 13 16:18:10 2018...
[stderr]
ERROR: [Vivado 12-5532] The design checkpoint file failed integrity check (code '1'): /home/builder/checkpoints/SH_CL_routed.dcp
I've checked the .dcp file we upload against the hash stored in the manifest we also upload; it matches, so the DCP and manifest aren't being corrupted on upload.
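A minimal version of that check (file names here are illustrative, not the exact ones CI uses):
# Compare the sha256 of the uploaded DCP with the dcp_hash field in the manifest.
actual=$(sha256sum 18_12_13-161259_SH_CL_routed.dcp | awk '{print $1}')
expected=$(grep '^dcp_hash=' manifest.txt | cut -d= -f2)
[ "$actual" = "$expected" ] && echo "hash matches" || echo "hash MISMATCH"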
I've been hacking on the manifest file. The original was this:
manifest_format_version=1
pci_vendor_id=0x1D0F
pci_device_id=0xF000
pci_subsystem_id=0x1D51
pci_subsystem_vendor_id=0xFEDD
dcp_hash=22bdf81b6dc3f6143fe447b08c20490f85ef59213cc222590ac22a6462c9506c
shell_version=0x071417d3
dcp_file_name=18_12_13-161259_SH_CL_routed.dcp
hdk_version=1.3.0
date=18_12_13-161259
clock_main_a0=250
clock_extra_b0=500
clock_extra_c0=250
The modified version looks like this:
manifest_format_version=2
pci_vendor_id=0x1D0F
pci_device_id=0xF010
pci_subsystem_id=0x1D51
pci_subsystem_vendor_id=0xFEDD
dcp_hash=22bdf81b6dc3f6143fe447b08c20490f85ef59213cc222590ac22a6462c9506c
shell_version=0x04261818
dcp_file_name=18_12_13-161259_SH_CL_routed.dcp
hdk_version=1.4.5
tool_version=v2018.2
date=18_12_13-161259
clock_main_a0=250
clock_extra_b0=500
clock_extra_c0=250
When I run AFI generation on a DCP that came out of our 2018.2 build process and include the manifest above (whose hash matches the DCP), AFI generation works. This manifest is created by create_sdaccel_afi.sh, so upgrading that did improve things. For whatever reason the previous hardware build that failed used the old create_sdaccel_afi.sh, so I'm re-running it: PR #240 build #38 on Jenkins.
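For reference, the request that create_sdaccel_afi.sh ultimately makes is an aws ec2 create-fpga-image call over the tarred manifest + DCP; roughly the following (region aside, the bucket, key and name here are illustrative rather than what CI actually uses):
# Illustrative sketch of the underlying AFI request; the tarball contains the
# manifest and the routed DCP.
aws ec2 create-fpga-image \
  --region us-east-1 \
  --name histogram-test \
  --input-storage-location Bucket=reconfigureio-builds,Key=dcp/18_12_13-161259_SH_CL_routed.tar \
  --logs-storage-location Bucket=reconfigureio-builds,Key=afi-logs/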
Awaiting result of hardware build in Jenkins.
- fatal: not a git repository (or any of the parent directories): .git (need to run the script in the appropriate directory)
- /run.sh: line 29: /opt/Xilinx/SDx/2018.2.op/settings64.sh: No such file or directory
- (build succeeded, but Hardware test failed!)

Steps for getting this into production:
- reco tool
- reco tool
- reco tool

Status update: Looks like AFI generation succeeded, and now it is "just" the deploy part which has failed.
I'm not surprised something is broken there. Hopefully it should be fairly easy to fix outside of CI.
+ STREAM=staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ aws logs get-log-events --log-group-name /aws/batch/job --log-stream-name staging-deploy/default/72eea46c-f563-4b5a-878b-3b4118384a14
+ jq -r '.events | .[] | .message'
+ aws s3 cp --quiet s3://reconfigureio-builds/tmp/8f769704-020d-11e9-9f06-1243484308a4.dist.zip /tmp/bundle.zip --region us-east-1
+ unzip /tmp/bundle.zip -d /
Archive: /tmp/bundle.zip
creating: /.reco-work/sdaccel/dist/
inflating: /.reco-work/sdaccel/dist/test-histogram
inflating: /.reco-work/sdaccel/dist/bench-histogram
creating: /.reco-work/sdaccel/dist/xclbin/
inflating: /.reco-work/sdaccel/dist/xclbin/top_sp.ltx
inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin.raw
inflating: /.reco-work/sdaccel/dist/xclbin/kernel_test.hw.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin
+ fpga-clear-local-image -S 0
AFI 0 none cleared 1 ok 0 0x04261818
AFIDEVICE 0 0x1d0f 0x1042 0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-09c2a21805a8b9257
AFI 0 agfi-09c2a21805a8b9257 loaded 0 ok 0 0x0729172b
AFIDEVICE 0 0x1d0f 0xf001 0000:00:1d.0
+ fpga-clear-local-image -S 0
AFI 0 none cleared 1 ok 0 0x0729172b
AFIDEVICE 0 0x1d0f 0x1042 0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type FpgaImageSlot FpgaImageId StatusName StatusCode ErrorName ErrorCode ShVersion
AFI 0 none cleared 1 ok 0 0x0729172b
Type FpgaImageSlot VendorId DeviceId DBDF
AFIDEVICE 0 0x1d0f 0x1042 0000:00:1d.0
+ fpga-load-local-image -S 0 -I agfi-093f6efe5d1441a64
AFI 0 agfi-093f6efe5d1441a64 loaded 0 ok 0 0x04261818
AFIDEVICE 0 0x1d0f 0xf010 0000:00:1d.0
+ fpga-describe-local-image -S 0 -H $'\342\200\223R'
Type FpgaImageSlot FpgaImageId StatusName StatusCode ErrorName ErrorCode ShVersion
AFI 0 agfi-093f6efe5d1441a64 loaded 0 ok 0 0x04261818
Type FpgaImageSlot VendorId DeviceId DBDF
AFIDEVICE 0 0x1d0f 0xf010 0000:00:1d.0
+ export XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ XCL_BINDIR=//.reco-work/sdaccel/dist/xclbin
+ export RTE=/opt/Xilinx/SDx/2018.2.rte
+ RTE=/opt/Xilinx/SDx/2018.2.rte
+ compgen -G '//.reco-work/sdaccel/dist/xclbin/*-f1_1ddr-xpr-2pr_4_0.xclbin'
+ source /opt/Xilinx/SDx/2018.2.op/settings64.sh
/run.sh: line 29: /opt/Xilinx/SDx/2018.2.op/settings64.sh: No such file or directory
+ exit 1
I thought it was a little odd that the deploy script was running with $PWD as /, but I dug out an old job and that seems to be how it always was... so I guess we'll run with that for now.
Here's what I'm currently running with on an f1 instance:
docker run -ti --rm --privileged -v /dev:/dev -v /opt/Xilinx/:/opt/Xilinx tmp
export DIST_URL=s3://reconfigureio-builds/tmp/8f769704-020d-11e9-9f06-1243484308a4.dist.zip AGFI=agfi-093f6efe5d1441a64 CMD=test-histogram
export XILINX_SDX=/opt/Xilinx/SDx/2018.2.op2258646/
aws s3 cp --quiet "$DIST_URL" /tmp/bundle.zip --region us-east-1
unzip /tmp/bundle.zip -d "$PWD"
source "${XILINX_SDX}/settings64.sh"
/.reco-work/sdaccel/dist/test-histogram
And it fails like so:
root@80f553386f74:/# /.reco-work/sdaccel/dist/test-histogram
ERROR: No devices found
Error: no platforms available or OpenCL install broken
I'm currently trying to figure out how to achieve the effects of https://github.com/ReconfigureIO/reco-sdaccel/blob/master/docker-staging-deploy/run.sh given that all of the paths have been rearranged (as far as I can ascertain, there is no .rte directory any more).
Note that I have tried sourcing the sdk_setup.sh script in aws-fpga, but that doesn't help either.
My next plan is to dive in with a debugger and try and figure out why it's not working.
Slack convo link: https://reconfigure.slack.com/archives/C2XQS6J6B/p1545152019101200
I'm currently working on a new runtime image over on https://github.com/ReconfigureIO/docker-aws-fpga-runtime/pull/1/files. I successfully got things working on the host, but I've struggled to get it to work in Docker. It is still showing:
ERROR: No devices found
Error: no platforms available or OpenCL install broken
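For the next round of debugging, these are the sanity checks I plan to run inside the container; they assume the stock OpenCL ICD loader mechanism and the usual XRT install paths rather than anything specific to our image:
# Is the FPGA user PF visible from inside the container? (vendor 0x1d0f, as in
# the fpga-describe-local-image output above)
lspci -d 1d0f:
# The OpenCL ICD loader discovers platforms via /etc/OpenCL/vendors/*.icd; each
# file names a vendor library for it to dlopen.
cat /etc/OpenCL/vendors/*.icd
# Check that the Xilinx OpenCL library resolves its dependencies with the
# current LD_LIBRARY_PATH (path is the usual XRT location, assumed).
ldd /opt/xilinx/xrt/lib/libxilinxopencl.so | grep 'not found' || echo 'all deps resolved'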
The last thing I've done just now is to trace XRT inside the Docker container and see what's going on. We definitely enter libxrt_aws.so:
Thread 1 "test-histogram" hit Breakpoint 1, (anonymous namespace)::createHalDevices (devices=std::vector of length 0, capacity 0, dll="/opt/xilinx/xrt/lib/libxrt_aws.so", count=0)
at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:100
100 {
(gdb) next
104 auto handle = handle_type(dlopen(dll.c_str(), RTLD_LAZY | RTLD_GLOBAL),delHandle);
(gdb)
105 if (!handle)
(gdb)
110 auto probeFunc = (probeFuncType)dlsym(handle.get(), propeFunc().c_str());
(gdb)
111 if (!probeFunc)
(gdb)
114 unsigned pmdCount = 0;
(gdb)
116 if (count || (count = probeFunc()) || pmdCount) {
(gdb)
Linux:3.10.0-862.11.6.el7.x86_64:#1 SMP Tue Aug 14 21:49:04 UTC 2018:x86_64
Distribution: Ubuntu 18.04.1 LTS
GLIBC: 2.27
---
XILINX_OPENCL=""
LD_LIBRARY_PATH="/opt/xilinx/xrt/lib:"
---
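Given that dump, the variables I'd expect to matter for device discovery, assuming a standard XRT install (an assumption, not yet verified against this image), are roughly:
# Assumed-standard XRT environment; /opt/xilinx/xrt matches the library path
# that shows up in the gdb sessions here.
export XILINX_XRT=/opt/xilinx/xrt
export LD_LIBRARY_PATH=$XILINX_XRT/lib:$LD_LIBRARY_PATH
unset XCL_EMULATION_MODE   # we want a real hardware run, not hw_emu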
I'm probably going to have to pick this up after Christmas at this point.
So weird. It's definitely finding the device:
Thread 1 "test-histogram" hit Breakpoint 2, xcldev::pci_device_scanner::add_device (this=0x7fffffffdb10, device=...)
at /tmp/XRT-2018.3.RC1/src/runtime_src/driver/xclng/xrt/user_aws/scan.cpp:118
118 if ( device.func == 2) {//AWS Pegasus mgmtPF is 2; On AWS F1 mgmtPF is not visible
(gdb) next
121 } else if ( device.func == 0) {
(gdb)
123 user_devices.emplace_back(device);
(gdb)
128 return true;
(gdb)
129 }
(gdb)
But then later:
xrt::hal::loadDevices () at /tmp/XRT-2018.3.RC1/src/runtime_src/xrt/device/hal.cpp:164
164 bfs::path p(xrt / "lib/libxrt_aws.so");
(gdb)
169 if (!xrt.empty() && isEmulationMode()) {
(gdb)
183 if (!xrt.empty() && isEmulationMode()) {
(gdb)
197 if (xrt.empty())
(gdb)
200 return devices;
(gdb) print devices.size()
$12 = 0
I've run out of time, so gotta run for now. May poke around later.
I am attempting to generate an AFI from our compiled .xclbin but am hitting the following error:
/opt/create_sdaccel_afi.sh: line 105: sdaccel_setup.sh: No such file or directory
ERROR: Env variable RELEASE_VER not set, did you ?
RELEASE_VER is the version of the Vivado tooling we're using; it does get set by sdaccel_setup.sh, but since there are already so many changes required to support a new Vivado version, I think I'll just hard-code the Vivado version number and remove this dependency. This'll also increase the greppability of these scripts.
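A sketch of what I mean, assuming 2018.2 is the version we standardise on (exactly where the AWS scripts consume RELEASE_VER hasn't been re-checked):
# Sketch: default RELEASE_VER ourselves instead of relying on sdaccel_setup.sh
# to export it; 2018.2 is the Vivado/SDx version this PR targets.
: "${RELEASE_VER:=2018.2}"
export RELEASE_VER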
Edit: this script worked and we now have an AFI to work with: afi-06437cf7254575b62
This PR's grown a bit long in the tooth but is full of useful changes. I'm going to merge it and then maybe create a second PR to update the sdaccel library embedded in examples.
Changes
Since support for the AFI used by the v1.3.x SDK/HDK version has now been discontinued, we need to update all our AWS builds to use the latest SDK/HDK. This also updates the required Vivado version to 2018.2.
Status At Wind-Down
This PR was addressing the issue of Xilinx/AWS tooling churn which had broken our public AWS offering towards the end of 2018. Prior to wind-down AWS had offered to maintain support for the old tooling for a limited period while the work in this PR was completed. Based on past experience it is highly likely that there will be further tooling churn on the part of Xilinx/AWS which will eventually render this PR obsolete. However, it may still be used as a template for identifying the changes required to support future Xilinx/AWS tool and framework versions.
Outstanding issues
- Add set -eux to install_platform.sh to catch errors copying files.
- sdaccel-builder cmds gives: /usr/bin/ld: cannot find -llmx6.0 (it lives in /opt/Xilinx/SDx/2017.4.op/lib/lnx64.o/liblmx6.0.so; fix is to give the correct LIBRARY_PATH, see the sketch after this list).
- build fails with make_xo: line 13: vivado: command not found; fix is to update PATH (/Vivado/), see the sketch after this list.
- export DEVICE=xilinx_aws-vu9p-f1-04261818_dynamic_5_0 DEVICE_FULL=xilinx_aws-vu9p-f1-04261818_dynamic_5_0 PART=xcvu9p-flgb2104-2-i PART_FAMILY=virtexuplus
- XO_NAME and setting make_xo -vendor, -library etc. for maximum symbol length.
- dpkg-reconfigure dash
- /opt/Xilinx/Vivado/2018.2.op2258646/lib/lnx64.o/../../tps/lnx64/gcc-6.2.0/bin/../../binutils-2.26/bin/ld: cannot find crti.o: No such file or directory - the issue here may be that we're using an unsupported OS.
- root@fac70f49ea51:/opt/Xilinx/Vivado/2018.2.op2258646/tps/lnx64/gcc-6.2.0/libexec/gcc/x86_64-pc-linux-gnu/6.2.0/install-tools# ./mkheaders /opt/Xilinx/Vivado/2018.2.op2258646/tps/lnx64/gcc-6.2.0/ - see https://stackoverflow.com/a/53178106/465384
- jq: error: Could not open file /tmp/workspace/.reco-work/sdaccel/build/reports/reconfigure_io_sdaccel_builder_stub_0_1_util.json: No such file or directory (need to deal with the xo being renamed).
- /opt/sdaccel-builder/sdaccel-builder right before executing the simulation, but make is failing before then when it tries to invoke emconfigutil). Note this is now done for simulate but might need to happen elsewhere too (e.g. image building).
- build (for some reason it has built .hw_emu. instead of .hw.
- XCL_EMULATION_MODE=true (https://github.com/ReconfigureIO/sdaccel/pull/9)
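A minimal sketch of the LIBRARY_PATH and PATH fixes mentioned above; the install paths are the ones that appear in the errors in this list and may differ per image:
# Let the SDx linker find liblmx6.0.so and put vivado on PATH.
export LIBRARY_PATH=/opt/Xilinx/SDx/2017.4.op/lib/lnx64.o:$LIBRARY_PATH
export PATH=/opt/Xilinx/Vivado/2018.2.op2258646/bin:$PATH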