GaloisInc / BESSPIN-CloudGFE

The AWS cloud deployment of the BESSPIN GFE platform.
Apache License 2.0

AWSteria GFE AWS FPGA build #58

Closed rsnikhil closed 4 years ago

rsnikhil commented 4 years ago

AWSteria GFE Steps 1 and 2 were to get things running in standard AWS XSIM flow and in a verilator/Bluesim flow. This step is to take it through the flow for FPGA build/deploy/run on actual AWS cloud hardware (load and run all ISA tests).

rsnikhil commented 4 years ago

@swm11 suggested this should be tackled next, now that the Step 1 AWSteria XSIM flow is working. I am now turning my attention to this.

rsnikhil commented 4 years ago

As a first step, I'm just trying to build and run the standard HDK Hello World example. 12 hours of battling broken Python scripts, Vivado licensing issues, etc. Finally succeeded in creating a DCP (Design Checkpoint), uploading it to a bucket, and submitting it for an AFI build. Note: the DCP build took 1h33m. If it works (tomorrow), I will then try the AWSteria build and run.
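For readers repeating the AFI submission step, here is a minimal sketch of how the command line could be assembled. It assumes the standard `aws ec2 create-fpga-image` CLI command with its `--name`, `--input-storage-location` and `--logs-storage-location` options; the bucket names, keys, and the helper name `create_fpga_image_cmd` are hypothetical placeholders, not the values used in this thread.

```python
import shlex


def create_fpga_image_cmd(name, dcp_bucket, dcp_key, logs_bucket, logs_key):
    """Build the argv for submitting an uploaded DCP tarball for an AFI build.

    All argument values are placeholders supplied by the caller; nothing here
    contacts AWS, it only assembles the command.
    """
    return [
        "aws", "ec2", "create-fpga-image",
        "--name", name,
        "--input-storage-location", f"Bucket={dcp_bucket},Key={dcp_key}",
        "--logs-storage-location", f"Bucket={logs_bucket},Key={logs_key}",
    ]


if __name__ == "__main__":
    # Hypothetical bucket and key names, for illustration only.
    cmd = create_fpga_image_cmd(
        "hello-world", "my-dcp-bucket", "dcp/hello_world.tar",
        "my-dcp-bucket", "logs/",
    )
    # Print a copy-pasteable shell command rather than running it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without AWS credentials.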

kiniry commented 4 years ago

Do you have any learnings for others who might want to attempt that battle themselves that come from your point of view, @rsnikhil?

rsnikhil commented 4 years ago

Yes, a bit more than that: I am creating a Makefile that has the steps I took, so I can repeat the process conveniently. I'm writing extensive comments on what each step does, the issues I faced, and how I got around them.

I'm happy to share this.

One of the frustrations is that I was running in an official 'FPGA Developer AMI-1.8.1' provided by Amazon, that ought to work out-of-the-box, but didn't.

rsnikhil commented 4 years ago

Today, successful run of the Hello World example on FPGA on an F1 instance. Now working on taking the AWSteria first-demo (which loads and runs ISA tests) through the flow.

jrtc27 commented 4 years ago

You will likely need to merge https://github.com/bluespec/Flute/pull/24 if you want to have any hope of passing timing, given you're also using that deburster in a synthesised design (within Flute it's only used in the simulated testbench). The Connectal-based design already has that fix in its local copy.

charlie-bluespec commented 4 years ago

progress update from @rsnikhil:

I've made some progress on the AWSteria bringup on FPGA (running it at 87 MHz):

Background: this first test setup is supposed to do the following (this whole sequence works fine in XSIM simulation):

1   DMA 16 GB to DDR A
2   DMA 16 GB to DDR B
3   DMA 16 GB to DDR C
4   DMA 16 GB to DDR D
5   DMA 16 GB back from DDR A and check it
6   DMA 16 GB back from DDR B and check it
7   DMA 16 GB back from DDR C and check it
8   DMA 16 GB back from DDR D and check it
9   DMA a Mem.Hex file into DDR A, representing the code for ISA test rv64ui-p-add
10  Read back 128 bytes of that just to cross-check

11  Write through the OCL port to set verbosity
12  Write through the OCL port to set watch_tohost and tohost_addr
13  Write through the OCL port to "allow" Flute to access DDR (although
    it came up running, it's stuck so far, waiting for the response to
    its first fetch from the DDR).
14  Poll the HW for a response status word indicating the final value
    written to the 'tohost' location

As mentioned before, Step 8 was failing (readback DMA from DDR D).

I changed the program to omit steps 4 and 8 (DMA to and from DDR D).

I get past 9 and 10 (the read-back in 10 indicates it actually DMA'd the data correctly into DDR A).

Up to 10, we're using the DMA_PCIS channel. From 11 onwards we're using the OCL channel.

I get past 11-13; then in step 14, I eventually timeout waiting for Flute's status response.
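
The step 14 polling with a timeout might look like the sketch below. The `read_status` callable is a hypothetical stand-in for whatever OCL read the host program performs, and the assumption that the status word reads as 0 until 'tohost' has been written is mine, not something stated in this thread. The decoding follows the riscv-tests convention, where a 'tohost' value of 1 means pass and any other value carries the failing test number in the bits above bit 0.

```python
import time


def decode_tohost(value):
    """Interpret a final 'tohost' value using the riscv-tests convention."""
    if value == 1:
        return "PASS"
    return f"FAIL (test number {value >> 1})"


def poll_status(read_status, timeout_s=10.0, interval_s=0.1):
    """Call read_status() until it returns a nonzero word or time runs out.

    read_status is a hypothetical callable wrapping the hardware read.
    Raises TimeoutError if no nonzero status arrives within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = read_status()
        if value != 0:
            return value
        time.sleep(interval_s)
    raise TimeoutError(f"no status response within {timeout_s} s")


if __name__ == "__main__":
    # Fake hardware: reports 'not done' twice, then a passing tohost value.
    responses = iter([0, 0, 1])
    status = poll_status(lambda: next(responses), timeout_s=1.0, interval_s=0.01)
    print(decode_tohost(status))
```

Raising an explicit `TimeoutError` separates "the hardware never answered" from "the test ran and failed", which is the distinction that matters in the situation described here.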

Note: although we get past 11-13, these are pure writes, so there's no feedback that the hardware reacted correctly. So, somewhere in the OCL interaction, we're going 'silent'.

I'm pondering my next move on this.

joestoy commented 4 years ago

I'm also trying to run AWSteria on an AWS instance. I'm having trouble installing the xdma drivers. I get

[centos@ip-172-31-59-216 xdma]$ sudo make install
Makefile:25: XVC_FLAGS: .
make -C /lib/modules/3.10.0-1127.8.2.el7.x86_64/build M=/home/centos/git-repos/aws-fpga/sdk/linux_kernel_drivers/xdma modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-1127.8.2.el7.x86_64'
/home/centos/git-repos/aws-fpga/sdk/linux_kernel_drivers/xdma/Makefile:25: XVC_FLAGS: .
  Building modules, stage 2.
/home/centos/git-repos/aws-fpga/sdk/linux_kernel_drivers/xdma/Makefile:25: XVC_FLAGS: .
  MODPOST 1 modules
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1127.8.2.el7.x86_64'
make -C /lib/modules/3.10.0-1127.8.2.el7.x86_64/build M=/home/centos/git-repos/aws-fpga/sdk/linux_kernel_drivers/xdma modules_install
make[1]: Entering directory `/usr/src/kernels/3.10.0-1127.8.2.el7.x86_64'
  INSTALL /home/centos/git-repos/aws-fpga/sdk/linux_kernel_drivers/xdma/xdma.ko
Can't read private key
  DEPMOD  3.10.0-1127.8.2.el7.x86_64
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1127.8.2.el7.x86_64'
depmod -a
install -m 644 10-xdma.rules /etc/udev/rules.d
rmmod -s xdma || true
modprobe xdma
modprobe: ERROR: could not insert 'xdma': Exec format error
make: [install] Error 1 (ignored)
[centos@ip-172-31-59-216 xdma]$ 

Does anyone know what private key it's talking about?

joestoy commented 4 years ago

Googling suggests that this might have to do with signed kernels/drivers and the fact that I'm not a kernel developer (who would know the keys). I note that @rsnikhil doesn't seem to have encountered this problem. One difference is that he's running on us-west-2 whereas I'm on us-east-1.

jameyhicks commented 4 years ago

Look at the output from dmesg.

Most likely the driver was not compiled to match the kernel. If the kernel version was updated since you compiled it you could see this error message.
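
One way to check for such a mismatch is to compare the module's "vermagic" string with the running kernel release. The sketch below assumes `modinfo -F vermagic` is available on the instance; the helper names are hypothetical. The comparison itself is a pure function, so it can be tried without the driver installed.

```python
import platform
import subprocess


def vermagic_matches(vermagic, kernel_release):
    """Return True if a module's vermagic string names this kernel release.

    The first whitespace-separated field of vermagic is the kernel release
    the module was built against; the remaining fields are build options.
    """
    fields = vermagic.split()
    return bool(fields) and fields[0] == kernel_release


def module_matches_running_kernel(module):
    """Query modinfo for `module` (a name or a .ko path) and compare its
    vermagic against the kernel currently running on this machine."""
    out = subprocess.run(
        ["modinfo", "-F", "vermagic", module],
        check=True, capture_output=True, text=True,
    ).stdout
    return vermagic_matches(out, platform.release())


if __name__ == "__main__":
    # Example string in the usual vermagic layout; no driver is needed here.
    example = "3.10.0-1127.8.2.el7.x86_64 SMP mod_unload modversions"
    print(vermagic_matches(example, "3.10.0-1127.8.2.el7.x86_64"))
```

A `False` result would support the stale-build explanation; a `True` result points elsewhere, and `dmesg` remains the place to look.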

joestoy commented 4 years ago

@jameyhicks I've just done make clean; make; sudo make install and get the same failure.

joestoy commented 4 years ago

Panic over. dmesg suggested that xocl had come back (perhaps due to the reboot after updating the kernel). After removing it again with rmmod, the "Can't read private key" message still appears, but now xdma does get installed.
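
A small check for a loaded xocl module before installing xdma could look like this sketch. It parses text in the `/proc/modules` layout, where the first field of each line is the module name. Treating xocl as the only conflicting module is an assumption drawn from this thread, and the sample text in the demo is made up for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def conflicting_modules(loaded, conflicts=("xocl",)):
    """List which of the assumed-conflicting modules are currently loaded."""
    return sorted(set(conflicts) & set(loaded))


if __name__ == "__main__":
    # Illustrative sample only; on a real instance, read /proc/modules.
    sample = (
        "xocl 100000 0 - Live 0xffffffffc0000000\n"
        "ena 90000 0 - Live 0xffffffffc0100000\n"
    )
    found = conflicting_modules(loaded_modules(sample))
    if found:
        print("remove before loading xdma:", ", ".join(found))
```

On the instance itself, the input would come from `open("/proc/modules").read()`, and any module reported could then be removed with `rmmod` as described above.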

kiniry commented 4 years ago

How did today's experiments go, @joestoy and @rsnikhil?

rsnikhil commented 4 years ago

I don't have any progress/results beyond what I discussed in today's call. I am waiting to hear @joestoy's results. For today I turned my attention towards getting the rest of the system up in simulation, using Bluesim/Verilator sim.

rsnikhil commented 4 years ago

We are now successfully running, on FPGA, the AWSteria test described in a previous message in this issues thread. Briefly:

The technical problem seems to have been the '1-character typo' in the name of a clock signal for DDR4 D in the top-level SystemVerilog shim. @joestoy mentioned this discovery in last Friday's stand-up. This was causing DDR4 D never to assert its 'ready' signal which, in turn, was causing everything to get stuck. With this fix, everything (DMA PCIS and OCL) has started working.