NVIDIA / jetson-rdma-picoevb

Minimal HW-based demo of GPUDirect RDMA on NVIDIA Jetson AGX Xavier running L4T

Working with a KCU105 #2

Closed by jascondley 4 years ago

jascondley commented 4 years ago

Hi, I've spent this week attempting to get this driver up and running with a KCU105 and a Jetson Xavier AGX with very mixed results.

I've built a number of variants and have been seeing some very strange behavior.

Variant 1: 128 kB BRAM attached to the XDMA core. I modified the size in the kernel driver, and when running the rdma-malloc client application the PCIe bus appears to hit some significant difficulty:

[ 1174.411610] pcieport 0005:00:00.0: Root Port link has been reset
[ 1174.411633] pcieport 0005:00:00.0: AER: Device recovery failed
[ 1174.411642] pcieport 0005:00:00.0: AER: Uncorrected (Fatal) error received: id=0000
[ 1174.411690] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0000(Receiver ID)
[ 1174.411912] pcieport 0005:00:00.0:   device [10de:1ad0] error status/mask=00040020/00400000
[ 1174.412053] pcieport 0005:00:00.0:    [ 5] Surprise Down Error   
[ 1174.412159] pcieport 0005:00:00.0:    [18] Malformed TLP          (First)
[ 1174.412272] pcieport 0005:00:00.0:   TLP Header: 20002080 010006fd 00000004 ff800a00
[ 1174.412429] pcieport 0005:00:00.0: broadcast error_detected message
[ 1174.412433] picoevb-rdma 0005:01:00.0: device has no AER-aware driver
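For what it's worth, the "Malformed TLP" report includes the raw TLP header dump, and the first DWORD carries the fmt/type fields that say what kind of request faulted. A minimal decode sketch, assuming the dump prints header DW0 first with byte 0 in the most significant position (as the kernel's AER handler does):

```shell
# Decode fmt/type from the first DWORD of the AER TLP header dump above.
# Assumption: the first word printed is header DW0, byte 0 in the top bits.
dw0=0x20002080
fmt=$(( (dw0 >> 29) & 0x7 ))        # 001 = 4DW header, no data
tlp_type=$(( (dw0 >> 24) & 0x1f ))  # 00000 with this fmt = Memory Read
echo "fmt=$fmt type=$tlp_type"
```

Under that assumption this is fmt=1, type=0, i.e. a 64-bit Memory Read request, which would be consistent with a DMA read issued by the FPGA going wrong.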

Variant 2: FPGA built with a DDR MIG. The KCU105 has 2 GB of on-board DDR. This does not suffer from the same PCIe bus failure, but instead a large chunk of the data ends up corrupted in host memory.

I've been able to debug on the FPGA side using an ILA, and can see that the data going in is correct and the data coming out of the FPGA's DDR is correct, but somewhere between there and the host it becomes corrupted.

I'm thoroughly stumped; both of these designs appear to work well in desktop PCs but suffer from these problems on the Xavier. Any thoughts on what might be going on here?

Thanks!

swarren commented 4 years ago

Did you have a look at the version for the XCKU060 that I pushed a few days ago? That might be closer to the FPGA you're using than the original PicoEVB version.

When synthesizing/implementing the FPGA, were there any errors? My first guess here would be a timing closure or logic problem in the FPGA; are you sure all the relevant module ports are hooked together properly in the new design?

IIUC, both of your modified designs work well in a PC simply by swapping the FPGA card over, with no HW modifications? What power requirements does the FPGA card have; have you checked the Jetson specs to make sure it can provide that much power, or have you provided an external power source? Even for the smaller XCKU060 card in the HTG-K800 we used an external PSU for the FPGA. I don't know whether that was required; whoever set up the HW (which is remote to me) just did that from the start.

jascondley commented 4 years ago

We definitely meet timing (with a huge amount of margin).

We've been using a very similar design in house for hardware accelerators for about a year now without issues in a desktop PC.

Power was another question I ran down. With this particular card the Jetson isn't required to supply any power; 100% of it comes from an external 12 V supply, very similar to your HTG-K800 configuration.

At first glance it certainly feels like an FPGA issue, but the fact that this works in a desktop (and correct, no modifications between the two) is what's leaving me baffled. We've now tried this across two Xaviers with two separate FPGAs and see identical results.

Even things like the internal bit-width of the bus seem to change the behavior: in exactly one configuration everything magically "just works" (64 bits wide, 64 kB BRAM). Any modification to this causes failure. That does sound like a timing issue, but I'm starting to wonder if this may be a PCIe/Tegra timing issue instead of an FPGA one.

I have hardware arriving Monday to test out the PicoEVB implementation, I'll report any findings once that is complete.

swarren commented 4 years ago

Do you have a PCIe bus analyzer (ideally one that can produce eye diagrams too) that could help show signal integrity, or reveal what causes the AER errors? This seems most relevant for variant 1, but perhaps variant 2 too.

Before the failure occurs (or perhaps after too), does lspci show any corrected errors on the bus? That'd perhaps indicate a signal integrity issue too.
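As a quick reference, the relevant status lines can be pulled out of verbose lspci output like this (the BDF in the usage line is the FPGA endpoint from your AER log; root is needed so lspci can read the capability registers):

```shell
# Filter the link/error status lines from `lspci -vvv` output.
# DevSta carries the CorrErr/FatalErr summary flags; CESta/UESta are the
# AER correctable/uncorrectable status registers; LnkSta shows the
# negotiated link speed and width.
pcie_status() {
    grep -E 'DevSta:|CESta:|UESta:|LnkSta:'
}

# Usage (endpoint BDF taken from the log above):
#   sudo lspci -vvv -s 0005:01:00.0 | pcie_status
```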

Does running at gen1/2 rather than gen3, or at a lower PCIe lane width, help the issue? What if you synthesize the FPGA to only support a slower, narrower PCIe bus?

I don't have a lot of good ideas what might be causing this issue; it feels like a HW related issue more than SW, and I'm primarily in the SW domain.

You might be able to engage more HW-oriented support via https://developer.nvidia.com/embedded/community/support-resources. Also, make sure you've read any PCIe-related design docs or sections in the HW manuals from the link below (is this a custom FPGA board, or off-the-shelf?): https://developer.nvidia.com/embedded/downloads

jascondley commented 4 years ago

Hey, just wanted to close the loop on this one. The core issue wound up being an inappropriately constrained clock in the FPGA, which led to the bizarre sometimes-working, sometimes-not behavior.
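For anyone hitting the same symptoms: a missing or wrong clock constraint lets the tools report timing closure while the real datapath is effectively untimed. A hypothetical XDC fragment of the kind of fix involved (port name and period are placeholders, not from the actual design):

```tcl
# Hypothetical constraint: declare the period of an external clock so the
# paths it drives are actually analyzed. Port name and period are assumed.
create_clock -name user_clk -period 8.000 [get_ports user_clk_p]
```

Vivado's `check_timing` report flags unconstrained clocks and endpoints, which is worth running whenever behavior changes with bus width or RAM size like this.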

swarren commented 4 years ago

That makes sense. Thanks for following up.