Xilinx / ROCm-air-platforms

A POC platform and example for an experimental ROCm runtime release for the AMD AI Engine
MIT License

XRT compatibility #8

Open gabrielrodcanal opened 6 months ago

gabrielrodcanal commented 6 months ago

Hi there. Is the VCK5000 AIR platform compatible with XRT? If not, what would be the easiest way to reset the card without access to xbutil?

eddierichter-amd commented 6 months ago

Hi Gabriel,

Thanks for the note. You are not able to use xbutil or other XRT management commands with the VCK5000 AIR platform. What exactly do you mean by resetting the card? When the host is rebooted, we reset the PCIe functionality. We also perform certain resets on the AI Engine array when the firmware is loaded and when we run an application.

Thanks,

Eddie Richter

gabrielrodcanal commented 6 months ago

Hi Eddie, I would like to be able to reboot the card from the host in the same way I could with an official XRT platform by running xbutil reset --device [...]. I need this functionality in case the card hangs while I'm playing around with it. Is there any way to do what that xbutil command does on this platform?

eddierichter-amd commented 6 months ago

The ARM can be reset via the xsdb tool, which is supported on the platform via USB JTAG. When the firmware is loaded, it also resets all of the AIEs in the device. This is not exactly the same functionality, but since the primary functionality of our current platform is in the ARM and the AIEs, I have found it to be sufficient. Let me know if this works for you. If not, I'm happy to discuss your use case.

mesham commented 6 months ago

Thanks very much for this. Is there a specific command you use to reset the ARM here? We have tried rst -por, but that didn't seem to help. I guess JTAGing the PDI onto the card again won't necessarily help, as it could reset the PCIe endpoint and require a reboot? Seeing what your flow is here would be most helpful for us.

Thanks, Nick

eddierichter-amd commented 6 months ago

If you run rst -processor, then download the ELF (dow <filename>.elf), and then run the firmware (con), that should reset the ARM firmware, which resets the AIEs when it is first launched. You can use a software package like screen or minicom to see the output of the firmware. Thanks for the questions. It is clear that this should be described in the firmware README, and I will update the documentation with instructions on how to do this.

If you could explain the issue that you are running into that would be helpful. Are you running into issues after programming the card? Or are you running into issues with the firmware or AIEs and trying to reset them?

gabrielrodcanal commented 6 months ago

Hi Eddie, Thank you for your answers. I'll tell you what I'm doing so that hopefully you can reproduce the issue. I'm running this test in the mlir-air repository https://github.com/Xilinx/mlir-air/tree/main/test/airhost/06_air_link_shared with a modification to access an out-of-bounds memory position. This is the aie.mlir code with the modification:

module {

func.func @graph(%arg0 : memref<32x16xi32>, %arg1 : memref<32x16xi32>) -> () {
  %herd_cols = arith.constant 1 : index
  %herd_rows = arith.constant 1 : index
  air.herd tile(%tx, %ty) in (%size_x = %herd_cols, %size_y = %herd_rows) args(%ext0 = %arg0, %ext1 = %arg1) : memref<32x16xi32>, memref<32x16xi32> attributes { sym_name="copyherd"} {
    %c0 = arith.constant 0 : index
    %c32 = arith.constant 32 : index
    %c16 = arith.constant 16 : index
    %c8 = arith.constant 9 : index
    %buf0 = memref.alloc() {sym_name = "scratch"}: memref<16x8xi32, 2>
    %buf1 = memref.alloc() {sym_name = "scratch_copy"}: memref<16x8xi32, 2>
    air.dma_memcpy_nd (%buf0[][][], %ext0[%c0, %c0][%c8, %c16][%c32, %c0]) {id = 1 : i32} : (memref<16x8xi32, 2>, memref<32x16xi32>)
    affine.for %j = 0 to 8 { 
      affine.for %i = 0 to 16 {
        %0 = affine.load %buf0[%i, %j] : memref<16x8xi32, 2>
        affine.store %0, %buf1[%i, %j] : memref<16x8xi32, 2>
      }   
    }   
    air.dma_memcpy_nd (%ext1[%c0, %c0][%c8, %c16][%c32, %c0], %buf1[][][]) {id = 2 : i32} : (memref<32x16xi32>, memref<16x8xi32, 2>) 
    memref.dealloc %buf1 : memref<16x8xi32, 2>
    memref.dealloc %buf0 : memref<16x8xi32, 2>
    air.herd_terminator
  }
  return
}

}

Note that I changed the value of %c8 from 8 to 9. When I run this code it never finishes, and from then on the same happens with any code I try to run, even code that is correct.
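To illustrate why the change from 8 to 9 can overrun the local buffer, here is a minimal Python sketch of the arithmetic. It assumes that the second bracket group of air.dma_memcpy_nd lists the per-dimension transfer sizes and that the transfer moves their product in elements. The helper name dma_fits is hypothetical and not part of mlir-air.

```python
from math import prod


def dma_fits(sizes, buffer_shape):
    """Return True if the element count moved by the DMA fits in the buffer.

    Assumption: the transfer moves prod(sizes) elements in total.
    """
    return prod(sizes) <= prod(buffer_shape)


buffer_shape = (16, 8)    # memref<16x8xi32, 2>, i.e. %buf0 and %buf1
original_sizes = (8, 16)  # [%c8, %c16] with %c8 = 8 (unmodified test)
modified_sizes = (9, 16)  # [%c8, %c16] with %c8 = 9 (this reproduction)

print(prod(buffer_shape))                                            # 128
print(prod(original_sizes), dma_fits(original_sizes, buffer_shape))  # 128 True
print(prod(modified_sizes), dma_fits(modified_sizes, buffer_shape))  # 144 False
```

Under that assumption, the modified transfer asks for 144 elements while the 16x8 buffer holds only 128, which matches the out-of-bounds access described above.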

mesham commented 6 months ago

And just to add here, the process hangs (and needs to be killed) and then further (correct) AIE executables don't run at all. So we think that the card is locking up (and hence the question to reset it) - but equally it might be the amdair kernel module which is getting into a funny state (we tried to unload and reload it, but then get segfaults on running correct executables).

eddierichter-amd commented 6 months ago

This is interesting. The example doesn't hang for me. It just reports that the output is incorrect, and additional tests still work. Anyway, can you try the following? In XSDB, reset the firmware as follows:

xsdb
connect
ta 6 (This should be APU #0. If you have multiple cards in the system this might be different)
rst -processor
dow airrt_cpp.elf (Can download this from the release)
con

That should reset the firmware. If you use screen (screen /dev/ttyUSB<#, mine is 2> 115200) in another terminal you should be able to see the firmware get loaded when you run the final command.

After reloading the firmware, reload the driver, and run the test again. This is not an ideal flow, but if this works, that at least lets us know that resetting the AIE tiles fixes the problem.
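For repeated use, the XSDB steps above could be collected into a script. The following is a minimal Python sketch and not part of the platform release. The helper names build_xsdb_reset_script and reset_firmware are hypothetical. The target ID 6 comes from the steps above and may differ on your system. The sketch assumes that xsdb is on PATH and accepts a Tcl script file as its argument. It writes targets, the long form of ta, on the assumption that command abbreviations may not expand in script mode.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_xsdb_reset_script(elf_path: str, target_id: int = 6) -> str:
    """Return the XSDB (Tcl) commands from the steps above as one script."""
    commands = [
        "connect",
        f"targets {target_id}",  # APU #0 in the thread; the ID may differ
        "rst -processor",        # reset the selected processor
        f"dow {elf_path}",       # download the firmware ELF
        "con",                   # resume execution, which starts the firmware
    ]
    return "\n".join(commands) + "\n"


def reset_firmware(elf_path: str, target_id: int = 6) -> int:
    """Run the generated script with xsdb and return its exit code."""
    xsdb = shutil.which("xsdb")
    if xsdb is None:
        raise FileNotFoundError("xsdb not found on PATH")
    with tempfile.NamedTemporaryFile("w", suffix=".tcl", delete=False) as f:
        f.write(build_xsdb_reset_script(elf_path, target_id))
        script = f.name
    try:
        return subprocess.run([xsdb, script], check=False).returncode
    finally:
        Path(script).unlink(missing_ok=True)
```

Calling reset_firmware("airrt_cpp.elf") would then run the same sequence in one step. Reloading the driver and rerunning the test remain separate manual steps.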

Is there a reason you are accessing an out-of-bounds address? Is this by design?

gabrielrodcanal commented 6 months ago

@eddierichter-amd I've tried the steps you described in your last comment and I get the following message in xsdb after running rst -processor:

xsdb% rst -processor
WARNING: If the reset is being triggered after powering on the device,
         write bootloop at reset vector address (0xffff0000), or use
         -clear-registers option, to avoid unpredictable behavior.
         Further warnings will be suppressed
WARNING: Default system will be activated before triggering reset.
         Use skip-activate-subsystem to skip this.
         Further warnings will be suppressed
WARNING: Cannot activate default subsystem. This may cause runtime issues if PM
         API is used.
         Memory read error at 0xFF380004. AP transaction timeout
Cannot reset Cortex-A72 #0. AXI AP transaction error, DAP status 0xF0000021

And when I try dow airrt_cpp.elf I get: AXI AP transaction error, DAP status 0x30000021

There's no specific reason why we want to access an out-of-bounds address. I just wanted to make sure the test wouldn't pass with a slight modification and by chance discovered the card locks up. Now I'm interested in knowing how to recover the card from that state.

eddierichter-amd commented 6 months ago

Hmmm that seems like the system is in a bad state, more so than just the AIE tile needing to be reset from the out-of-bounds access. Are you able to reprogram the board and warm reboot to get it in a healthy state? If so, can you do that and then try programming the ARM? This will tell us if there is a problem with the system or just the state that the card is currently in. With a fresh design you should be able to program the ARM if USB JTAG is connected.