amd / RyzenAI-SW


Smaller models do not run on the NPU. #126

Open Montzsuma opened 1 day ago

Montzsuma commented 1 day ago

Hi all,

I modified the hello_world model to perform a single MatMul operation instead of the Conv2d/Relu operations, and I'm unable to make it run on the NPU.

The code is mostly the same; the main changes are in the model:

import torch
import torch.nn as nn

class MatMulModel(nn.Module):
    def __init__(self):
        super(MatMulModel, self).__init__()

    def forward(self, x1, x2):
        # Multiply the two 2x2 matrices, dropping the batch/channel dimensions.
        x = torch.matmul(x1[0][0], x2[0][0])
        return x

The dummy inputs:

batch_size = 1
input_channels = 1
input_size = 2
dummy_input1 = torch.Tensor([[[[1., 2.], [3., 4.]]]])
dummy_input2 = torch.Tensor([[[[5., 6.], [7., 8.]]]])

The actual inputs:

input_data1 = [[[[1., 2.], [3., 4.]]]]
input_data2 = [[[[5., 6.], [7., 8.]]]]

I am also looping the inference a few hundred times to check the NPU usage.
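For reference, the inference loop itself looks roughly like the sketch below. The session setup follows the hello_world example (the config_file, cacheDir, and cacheKey values are just what I use locally):

import numpy as np
import onnxruntime as ort

# Provider options follow the hello_world example; config_file points at the
# vaip_config.json shipped with Ryzen AI.
session = ort.InferenceSession(
    "models/matmul_model_quantized.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=[{
        "config_file": "vaip_config.json",
        "cacheDir": "cache",
        "cacheKey": "matmul_cache",
    }],
)

input_feed = {
    "input1": np.array([[[[1., 2.], [3., 4.]]]], dtype=np.float32),
    "input2": np.array([[[[5., 6.], [7., 8.]]]], dtype=np.float32),
}

# Loop a few hundred times so any NPU activity is visible in Task Manager.
for _ in range(500):
    outputs = session.run(None, input_feed)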

I also tried adding the following to vaip_config.json:

        "minimum_num_of_conv": 0,
        "minimum_num_of_fc": 0,
        "minimum_num_of_matmul": 0,
        "minimum_num_of_ops": 0,
        "enable_all_ops": true

To be honest, ChatGPT suggested these properties; I'm not sure where they come from, and I couldn't find them in the documentation.

Output:

C:\hello_world\>python matmul.py
MatMulModel()
[VAI_Q_ONNX_INFO]: Time information:
2024-10-07 06:36:22.413222
[VAI_Q_ONNX_INFO]: OS and CPU information:
                                        system --- Windows
                                          node --- LAPTOP-2660VGSA
                                       release --- 10
                                       version --- 10.0.26100
                                       machine --- AMD64
                                     processor --- AMD64 Family 26 Model 36 Stepping 0, AuthenticAMD
[VAI_Q_ONNX_INFO]: Tools version information:
                                        python --- 3.10.0
                                          onnx --- 1.16.2
                                   onnxruntime --- 1.17.0
                                    vai_q_onnx --- 1.17.0+511d6f4
[VAI_Q_ONNX_INFO]: Quantized Configuration information:
                                   model_input --- models/matmul_model.onnx
                                  model_output --- models/matmul_model_quantized.onnx
                       calibration_data_reader --- None
                         calibration_data_path --- None
                                  quant_format --- QDQ
                                   input_nodes --- []
                                  output_nodes --- []
                          op_types_to_quantize --- []
                random_data_reader_input_shape --- []
                                   per_channel --- False
                                  reduce_range --- False
                               activation_type --- QUInt8
                                   weight_type --- QInt8
                             nodes_to_quantize --- []
                              nodes_to_exclude --- []
                                optimize_model --- True
                      use_external_data_format --- False
                              calibrate_method --- PowerOfTwoMethod.MinMSE
                           execution_providers --- ['CPUExecutionProvider']
                                enable_ipu_cnn --- True
                        enable_ipu_transformer --- False
                     specific_tensor_precision --- False
                                    debug_mode --- False
                          convert_fp16_to_fp32 --- False
                          convert_nchw_to_nhwc --- False
                                   include_cle --- False
                                    include_sq --- False
                               include_fast_ft --- False
                                 extra_options --- {'ActivationSymmetric': True}
INFO:vai_q_onnx.quantize:calibration_data_reader is None, using random data for calibration
INFO:vai_q_onnx.quant_utils:The input ONNX model models/matmul_model.onnx can create InferenceSession successfully
INFO:vai_q_onnx.quant_utils:Random input name input1 shape [1, 1, 2, 2] type <class 'numpy.float32'>
INFO:vai_q_onnx.quant_utils:Random input name input2 shape [1, 1, 2, 2] type <class 'numpy.float32'>
INFO:vai_q_onnx.quant_utils:Obtained calibration data with 1 iters
INFO:vai_q_onnx.quantize:Removed initializers from input
INFO:vai_q_onnx.quantize:Simplified model sucessfully
INFO:vai_q_onnx.quantize:Loading model...
INFO:vai_q_onnx.quant_utils:The input ONNX model C:/Temp/vai.simp.30h79mjk/model_simp.onnx can run inference successfully
INFO:vai_q_onnx.quantize:optimize the model for better hardware compatibility.
INFO:vai_q_onnx.quantize:Start calibration...
INFO:vai_q_onnx.quantize:Start collecting data, runtime depends on your model size and the number of calibration dataset.
INFO:vai_q_onnx.calibrate:Finding optimal threshold for each tensor using PowerOfTwoMethod.MinMSE algorithm ...
INFO:vai_q_onnx.calibrate:Use all calibration data to calculate min mse
Computing range: 100%|█████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 7000.51tensor/s]
INFO:vai_q_onnx.quantize:Finished the calibration of PowerOfTwoMethod.MinMSE which costs 0.0s
INFO:vai_q_onnx.qdq_quantizer:Remove QuantizeLinear & DequantizeLinear on certain operations(such as conv-relu).
INFO:vai_q_onnx.refine:Adjust the quantize info to meet the compiler constraints
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Op Type              ┃ Float Model                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Constant             │ 1                                  │
│ Gather               │ 4                                  │
│ MatMul               │ 1                                  │
├──────────────────────┼────────────────────────────────────┤
│ Quantized model path │ models/matmul_model_quantized.onnx │
└──────────────────────┴────────────────────────────────────┘
Calibrated and quantized model saved at: models/matmul_model_quantized.onnx
APU Type: STX
Setting environment for STX
XLNX_VART_FIRMWARE= C:\Program Files\RyzenAI\1.2.0\voe-4.0-win_amd64\xclbins\strix\AMD_AIE2P_Nx4_Overlay.xclbin
NUM_OF_DPU_RUNNERS= 1
XLNX_TARGET_NAME= AMD_AIE2_Nx4_Overlay
Directory 'C:\hello_world\cache\matmul_cache' deleted successfully.
[Vitis AI EP] No. of Operators :   CPU    11
[Vitis AI EP] No. of Subgraphs :   CPU     3
[array([[0.9921875, 0.9921875],
       [0.9921875, 0.9921875]], dtype=float32)]
NPU Execution Time: 5.287397000007331

Is there a limitation on simpler models, or is some property missing?

I also just noticed that the quantized model doesn't actually return the matrix multiplication, just a random array every time I run the script.
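For reference, the float MatMul of those inputs should be:

import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(torch.matmul(a, b))
# tensor([[19., 22.],
#         [43., 50.]])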

uday610 commented 1 day ago

Hi @Montzsuma,

Yes, the software is tuned for real models. I suggest you try actual CNN models with convolution layers (ResNet, MobileNet, etc.). It is not intended for experimenting with a single operator.

Thanks Uday

Montzsuma commented 8 hours ago

Hi @uday610 !

Thanks for the quick reply.

Do you know how Ryzen AI decides whether a model will run on the NPU? You mentioned that one operator alone won't run on the NPU, but is there a specific number of operations the software checks for before deciding where it will run? For example, would running MatMul recursively a large number of times be enough for it to run on the NPU, or do only very specific operations qualify?

In addition, I saw this link, specifically the part that says "This graph partitioning and deployment technique across CPU and NPU is fully automated by the VAI EP and is totally transparent to the end-user." Does that mean the execution of a single model can have its load shared between the CPU and the NPU?

Thanks a lot!

uday610 commented 7 hours ago

Ryzen AI's official installer flow supports CNN-based models, so the model must have a few convolution layers (I think at least two).

Yes, the execution of a single model can be shared between NPU and CPU, depending on the operators. If some operators cannot run on NPU, they will run on CPU, and this happens automatically.
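In ONNX Runtime terms this just means listing the Vitis AI EP first: ORT assigns each node to the first execution provider in the list that supports it, and the rest fall back to the CPU EP. A rough sketch (the model path is a placeholder):

import onnxruntime as ort

# Provider order sets priority: nodes the Vitis AI EP can compile go to the
# NPU, everything else automatically falls back to the CPU EP.
session = ort.InferenceSession(
    "model_quantized.onnx",  # placeholder path
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}, {}],
)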

Montzsuma commented 4 hours ago

Then, if I'm understanding it right: I can't run single-operator models on the NPU, and even if a model is complex enough to run on the NPU, there is no guarantee that everything in it will actually run on the NPU, even when only operations listed in the Model Compatibility page are used, because the framework can decide to optimize performance and share the load between the CPU and the NPU. Is that so?

What I was actually trying to do was benchmark single-operator performance on the NPU: basically, run the operators from the previous link with varying input sizes and compare performance between the CPU and the NPU.
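Concretely, the comparison I have in mind is something like the sketch below (the model path is a placeholder for whichever single-operator model is being tested, and the provider options again follow the hello_world example):

import time

import numpy as np
import onnxruntime as ort

def time_session(session, feeds, iters=300):
    # Warm-up run so compilation/cache loading is not counted.
    session.run(None, feeds)
    start = time.perf_counter()
    for _ in range(iters):
        session.run(None, feeds)
    return (time.perf_counter() - start) / iters

feeds = {
    "input1": np.random.rand(1, 1, 2, 2).astype(np.float32),
    "input2": np.random.rand(1, 1, 2, 2).astype(np.float32),
}

cpu_session = ort.InferenceSession(
    "models/op_quantized.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)
npu_session = ort.InferenceSession(
    "models/op_quantized.onnx",  # placeholder path
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}],
)

print("CPU average s/inference:", time_session(cpu_session, feeds))
print("NPU average s/inference:", time_session(npu_session, feeds))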