Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.

std::runtime_error -- what(): qds_device::wait() unexpected command state #1751

Open hecmay opened 2 months ago

hecmay commented 2 months ago

I was running the cascade matrix multiplication example from the programming examples: https://github.com/Xilinx/mlir-aie/tree/main/programming_examples/basic/matrix_multiplication/cascade

I noticed that the code only works when M or m is greater than 64. When I set them to smaller values, say M = 32, n = 32, the host code throws the following error:

Running Kernel (iteration 0).
terminate called after throwing an instance of 'std::runtime_error'
  what():  qds_device::wait() unexpected command state
Aborted (core dumped)
make: *** [/home/ubuntu/mlir-aie-test/programming_examples/basic/matrix_multiplication/cascade/../makefile-common:114: run] Error 134

The XCLBIN compilation process reports no errors, so the problem seems to be in the runtime. Any idea how this can be fixed?
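For readers of the log above: the `Error 134` that make reports follows the usual shell convention of 128 plus a signal number, so it corresponds to SIGABRT (signal 6), which is what an uncaught C++ exception ends in. A minimal sketch of that decoding, where `describe_exit` is a hypothetical helper name and not part of mlir-aie or XRT:

```python
import signal


def describe_exit(code: int) -> str:
    """Interpret a shell-style exit status (128 + N means 'killed by signal N')."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
            return f"terminated by {name}"
        except ValueError:
            # Not a known signal number on this platform; fall through.
            pass
    return f"exited with status {code}"


print(describe_exit(134))
```

This only confirms that the host program aborted on the exception shown in `what()`; it says nothing about why the runtime reported an unexpected command state.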

hecmay commented 2 months ago

I run into exactly the same error in the matrix-vector sample code if n_cores is > 1: https://github.com/Xilinx/mlir-aie/blob/main/programming_examples/basic/matrix_multiplication/matrix_vector/aie2.py#L21

hecmay commented 2 months ago

I am not sure whether this error is caused by my environment setup. I am using a Phoenix Point mini PC (Minisforum UM790 Pro, AMD Ryzen™ 9 7940HS). I followed every single step in this README: https://github.com/Xilinx/mlir-aie/blob/main/docs/buildHostLin.md

Linux kernel: 6.10
Vitis 2023.2
AMDXDNA: 2.18.0_20240825, 537a509a3ab1b698c9c9f6ebcd88035b2fe8359b

Can anyone reproduce the issue? Any help would be highly appreciated. Thanks! @stephenneuendorffer @fifield @hunhoffe @Yu-Zhewen @makslevental

PisonJay commented 1 month ago

Me too. Getting the same error with a Ryzen AI 9 365.

PisonJay commented 1 month ago

I guess that the current infrastructure only supports the XDNA1 (AIE2) architecture. Strix Point, XDNA2 (code name AIEP), is not supported yet. Peano likewise lists XDNA2 as "coming soon". The only usable runtime is the ONNX runtime from the Ryzen AI SDK, which is currently available only on Windows.

hecmay commented 1 month ago

> I guess that the current infrastructure only supports the XDNA1 (AIE2) architecture. Strix Point, XDNA2 (code name AIEP), is not supported yet. Peano likewise lists XDNA2 as "coming soon". The only usable runtime is the ONNX runtime from the Ryzen AI SDK, which is currently available only on Windows.

I am not so sure that's the cause. On my side, the program still runs in some cases, as long as the M/N/K values happen to suit the runtime.

I also do not think the ONNX runtime is used in these examples. It is more likely a problem in the Xilinx Runtime (XRT): https://github.com/Xilinx/mlir-aie/blob/main/programming_examples/basic/matrix_multiplication/test.cpp#L23-L25
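Since the failure seems to depend on the M/N/K values, one way to narrow it down is to sweep a few sizes and record which combinations exit cleanly. The sketch below is only a rough harness: `sweep` and `make_cmd` are hypothetical helper names, and `make_cmd` assumes that the example's Makefile accepts `M`, `K` and `N` as variable overrides, which should be checked against the actual makefile before relying on it.

```python
import subprocess
import sys
from itertools import product


def sweep(sizes, build_cmd, timeout_s=600):
    """Run one command per (M, K, N) combination and record its exit code.

    A value of None means the command did not finish within timeout_s.
    """
    results = {}
    for M, K, N in product(sizes, repeat=3):
        cmd = build_cmd(M, K, N)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            results[(M, K, N)] = proc.returncode
        except subprocess.TimeoutExpired:
            results[(M, K, N)] = None
    return results


def make_cmd(M, K, N):
    # ASSUMPTION: the Makefile takes M, K and N as overridable variables.
    return ["make", "run", f"M={M}", f"K={K}", f"N={N}"]


# Example (run from the example directory; not executed here):
#   for dims, code in sorted(sweep([32, 64, 128], make_cmd).items()):
#       print(dims, "ok" if code == 0 else f"failed ({code})")
```

Posting the resulting table of passing and failing sizes together with the kernel, Vitis and AMDXDNA versions would make it easier for others to tell whether the error is tied to the problem size or to the environment.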