amd / RyzenAI-SW

All operators are computed on CPU with VitisAIExecutionProvider #85

Closed qz233 closed 4 months ago

qz233 commented 4 months ago

Hi team, I am trying to deploy my model on an AMD NPU device using VitisAIExecutionProvider. I thought that all supported operators could be computed on the NPU, but I often encounter this notice:

I20240527 12:49:21.131315 16676 pass_main.cpp:245] [VITIS AI EP] This model is not a supported CNN model which will not be compiled with DPU.

So does this tool only support pure CNN architectures? And how can I use it for other kinds of models (such as Transformers)?

Here are the tests I did:

(which is weird because the ResNet test case has a similar structure)

My test code:

import os
import shutil
import logging
import torch
import torch.nn as nn
import torch.nn.functional as F
import onnxruntime
import vai_q_onnx
from onnxruntime.quantization import shape_inference, CalibrationDataReader

# global variable
OUTPUT_PATH = "./tmp"
VITIS_CONFIG_PATH = "./ryzen-ai-sw-1.1/ryzen-ai-sw-1.1/voe-4.0-win_amd64/vaip_config.json"

if not os.path.exists(OUTPUT_PATH):
    os.mkdir(OUTPUT_PATH)

logging.basicConfig(level=logging.INFO)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Conv2d(1, 10, 3, padding=1)  # nn.Linear(2, 100)
        self.layer2 = nn.Linear(10, 1)                # nn.Linear(100, 1)

    def forward(self, x):
        h1 = F.relu(self.layer1(x))
        h1 = F.adaptive_avg_pool2d(h1, (1,1)).flatten(1)
        return F.sigmoid(self.layer2(h1))

    def onnx_export(self, x, path):
        torch.onnx.export(
            self,
            x,
            path, 
            opset_version=13,   # suggested by the tutorial 
            input_names=["input"],
            output_names=["out"],
            dynamic_axes={
                "input": {0 : "batch_size"},
            },
        )

class RandomDataGenerator(CalibrationDataReader):
    def __init__(self):
        super().__init__()
        # 10000 random calibration samples, each of shape (1, 1, 10, 10)
        self.data = iter(torch.randn((10000, 1, 1, 10, 10)))

    def get_next(self):
        # Returning None signals the end of the calibration data.
        try:
            return {"input": next(self.data).numpy()}
        except StopIteration:
            return None

dataloader = RandomDataGenerator()

def integrate_quantitize(source_path, target_path):
    shape_inference.quant_pre_process(
        input_model_path=source_path,
        output_model_path=target_path, 
        skip_optimization=False,
        skip_onnx_shape=False,
        skip_symbolic_shape=False,
        auto_merge=False,
        int_max=2**31 - 1,
        guess_output_rank=False,
        verbose=3,
        save_as_external_data=False,
        all_tensors_to_one_file=False,
        external_data_location="./",
    )
    vai_q_onnx.quantize_static(
        model_input=target_path,
        model_output=target_path,
        calibration_data_reader=dataloader,
        quant_format=vai_q_onnx.QuantFormat.QDQ,
        calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
        activation_type=vai_q_onnx.QuantType.QUInt8,
        execution_providers=['CPUExecutionProvider'],
        enable_dpu=True,
        extra_options={'ActivationSymmetric':True},
    )

def run_quant_model(path, model_name, inputs, device="npu"):
    provider = "VitisAIExecutionProvider" if device == "npu" else "CPUExecutionProvider"
    cache_dir = os.path.join(OUTPUT_PATH, model_name)
    if device == "npu" and os.path.exists(cache_dir):
        print("=== Clean cache ===")
        # delete the compile cache so the model is recompiled on every run (for debugging)
        shutil.rmtree(cache_dir)
    sess = onnxruntime.InferenceSession(
        path,
        providers=[provider],
        provider_options=[{"config_file": VITIS_CONFIG_PATH,
                           "cacheDir": OUTPUT_PATH,
                           "cacheKey": model_name}]
    )
    out = sess.run(None, {
        "input": inputs.numpy()
    })[0]
    return out

if __name__ == "__main__":
    inputs = torch.from_numpy(next(dataloader)["input"])
    model = Model()
    model.onnx_export(inputs, f"{OUTPUT_PATH}/MLP.onnx")

    with torch.no_grad():
        print(f"original out {model(inputs)}")

    integrate_quantitize(f"{OUTPUT_PATH}/MLP.onnx", f"{OUTPUT_PATH}/MLP_Q.onnx")

    out = run_quant_model(f"{OUTPUT_PATH}/MLP.onnx", "mlp", inputs, "cpu")
    print("cpu(fp32) result: ", out)
    out = run_quant_model(f"{OUTPUT_PATH}/MLP_Q.onnx", "mlp", inputs)
    print("npu result: ", out)

    # sanity check: the official quicktest model should compile and run on the NPU
    #out = run_quant_model(f"./ryzen-ai-sw-1.1/ryzen-ai-sw-1.1/quicktest/test_model.onnx", "resnet", torch.randn([1,3,32,32]))
savitha-srinivasan commented 4 months ago

Hi @qz233,

You can find the list of NPU-supported ops here. Your MLP and the model with the Conv2d -> flatten -> Linear pattern should work. The provider config file (vaip_config.json) currently has a field named "minimum_num_of_conv", which sets the threshold on the number of conv layers a model must contain before it is offloaded to the NPU. The default value is 2. Since your models have fewer than 2 conv layers, you see the "not a supported CNN model" message. You can lower this value in the config file, and your models will then run on the NPU.
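
For anyone who prefers not to hand-edit the file, here is a minimal sketch of lowering that threshold programmatically. It assumes vaip_config.json is plain JSON and simply rewrites every "minimum_num_of_conv" entry wherever it appears, since the exact nesting may differ between Ryzen AI releases; the output path is just an example.

import json

CONFIG_IN = "./ryzen-ai-sw-1.1/ryzen-ai-sw-1.1/voe-4.0-win_amd64/vaip_config.json"
CONFIG_OUT = "./tmp/vaip_config_min_conv_0.json"  # example output path

def lower_min_conv(node, new_value=0):
    # Recursively walk the parsed JSON and overwrite every "minimum_num_of_conv"
    # field, so we do not need to know exactly where it lives in the file.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "minimum_num_of_conv":
                node[key] = new_value
            else:
                lower_min_conv(value, new_value)
    elif isinstance(node, list):
        for item in node:
            lower_min_conv(item, new_value)

with open(CONFIG_IN) as f:
    config = json.load(f)
lower_min_conv(config, new_value=0)
with open(CONFIG_OUT, "w") as f:
    json.dump(config, f, indent=2)

# Then point the InferenceSession at the modified copy:
# provider_options=[{"config_file": CONFIG_OUT, "cacheDir": ..., "cacheKey": ...}]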

qz233 commented 4 months ago

Hi @savitha-srinivasan

Thank you very much! I will try again once I get home. May I also ask whether Conv1d is supported by the VitisAIExecutionProvider? My Ryzen AI project works on audio, so it uses quite a lot of these operators.

rhenry74 commented 4 months ago

Did you get it working, @savitha-srinivasan? I'm having similar problems.

savitha-srinivasan commented 4 months ago

@qz233 @rhenry74 We do have support for Conv1d. We were able to reproduce the issue with your model. Could you try setting XLNX_ENABLE_CONV1D=1 in your environment and retrying? That should fix it. Please also delete the cache before retrying.
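
For reference, a minimal way to apply this from the test script above. Assuming the variable is read when the InferenceSession is created, set it before constructing the session and clear the old compile cache; the cache path below matches the OUTPUT_PATH / cacheKey used earlier in the script.

import os
import shutil

# Enable Conv1d offload in the Vitis AI EP; set this before the session is created.
os.environ["XLNX_ENABLE_CONV1D"] = "1"

# Remove the previous compile cache so the model is recompiled with the new setting.
cache_dir = os.path.join(OUTPUT_PATH, "mlp")  # OUTPUT_PATH and cacheKey from the script above
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)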

qz233 commented 4 months ago

@savitha-srinivasan Yep, all my problems are solved. Thanks.