apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] how to fix rocm platform bugs? #13768

Closed wangzy0327 closed 1 year ago

wangzy0327 commented 1 year ago

My workmate and I are currently using an AMD GPU to add SYCL support to TVM, and we want to run comparative evaluations of OpenCL, ROCm, and SYCL on the AMD GPU. I have the SYCL platform working on TVM, but there are still problems with the ROCm platform. Can you help solve this problem?

Expected behavior

I tried to run an MNIST model with TVM on the ROCm platform (ROCm 5.2) and expected correct results, but the execution result is wrong.

Actual behavior

The output on the ROCm platform does not match the output on the OpenCL platform.

Environment

Operating System: Ubuntu 20.04
TVM version: https://github.com/apache/tvm/commit/7f1856d34f03113dc3a7733c010be43446161944
Platform: ROCm 5.2

Steps to reproduce

Here is my test example.

onnx_rocm.py

```python
import logging
import warnings

import numpy as np
import onnx
import tvm
import tvm.relay as relay
from tvm.contrib import graph_executor

logging.basicConfig(level=logging.ERROR)
warnings.filterwarnings("ignore")

dtype = "float32"
common_prefix_str = "onnx-model/vision/classification/"
tol_paras = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
input_name = "Input3"
input_size = (1, 1, 28, 28)
output_size = (1, 10)


def build(target: str, mod: tvm.IRModule, params: dict, input_name: str,
          input_data: np.ndarray, input_shape: tuple, output: tuple) -> np.ndarray:
    tgt = tvm.target.Target(target=target, host="llvm")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=tgt, params=params)
    dev = tvm.device(target, 0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input(input_name, input_data)
    module.run()
    return module.get_output(0, tvm.nd.empty(output)).numpy()


def main():
    np.random.seed(0)
    I_np = np.random.uniform(size=input_size).astype(dtype)
    print(I_np[0][0][0][:10])
    onnx_model = onnx.load(common_prefix_str + "mnist/model/mnist-7.onnx")
    mod, params = relay.frontend.from_onnx(onnx_model, {input_name: I_np.shape})
    rocm_output = build("rocm", mod=mod, params=params, input_name=input_name,
                        input_data=I_np, input_shape=I_np.shape, output=output_size)
    opencl_output = build("opencl", mod=mod, params=params, input_name=input_name,
                          input_data=I_np, input_shape=I_np.shape, output=output_size)
    print(rocm_output[0][:10])
    print(opencl_output[0][:10])


main()
```
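The script defines `tol_paras` but never uses it. A minimal sketch (pure NumPy; `first_passing_tol` is a hypothetical helper name, not TVM API) of how the two backend outputs could be compared across that tolerance list:

```python
import numpy as np

tol_paras = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]


def first_passing_tol(a: np.ndarray, b: np.ndarray, tols):
    """Return the smallest tolerance at which the two outputs agree, or None."""
    for tol in tols:
        if np.allclose(a, b, rtol=tol, atol=tol):
            return tol
    return None


# Usage with the script above (assumes rocm_output / opencl_output exist):
#   tol = first_passing_tol(rocm_output, opencl_output, tol_paras)
#   print("outputs diverge beyond 1e-2" if tol is None else f"agree at tol={tol}")
```

This gives a single number to report instead of eyeballing printed slices of the two outputs.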

The output is as follows.

[screenshot: printed rocm_output vs opencl_output values, which differ]

Here is the execution result of my own implementation using the SYCL target.

[screenshot: sycl target output]

Triage

sycl

masahi commented 1 year ago

Don't create a new issue when you already have https://github.com/apache/tvm/issues/13768

We didn't change any rocm code recently and I'm sure we had a working rocm backend before. So I have no idea what broke our backend.

wangzy0327 commented 1 year ago

> Don't create a new issue when you already have #13768
>
> We didn't change any rocm code recently and I'm sure we had a working rocm backend before. So I have no idea what broke our backend.

OK. Can you tell me which TVM version of the ROCm backend used to be available, and which ROCm version is known to work?

masahi commented 1 year ago

I don't remember, but we didn't touch any rocm stuff. So I think recent rocm introduced some incompatibility with TVM. I'd try rocm 4 or even 3.
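Before downgrading, it may help to confirm which ROCm release is actually installed. On standard installs the version string lives in `/opt/rocm/.info/version` (that path is an assumption about the install layout; adjust if ROCm lives elsewhere). A minimal sketch:

```python
from pathlib import Path


def parse_rocm_major(version_text: str) -> int:
    """Parse the major version from a ROCm version string like '5.2.3-50200'."""
    return int(version_text.strip().split(".")[0])


def installed_rocm_major(info_path: str = "/opt/rocm/.info/version"):
    """Return the installed ROCm major version, or None if the file is missing."""
    p = Path(info_path)
    if not p.exists():
        return None
    return parse_rocm_major(p.read_text())


# Example: parse_rocm_major("5.2.3-50200") -> 5
```

This makes it easy to check, after a downgrade, that TVM is really running against ROCm 4 rather than a leftover 5.x install.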

wangzy0327 commented 1 year ago

OK, thank you. I will try ROCm 4.

masahi commented 1 year ago

cc @mvermeulen , is rocm 5 known to work with TVM?

wangzy0327 commented 1 year ago

> cc @mvermeulen , is rocm 5 known to work with TVM?

Can you clarify which version of rocm has been verified under tvm? @masahi @mvermeulen

mvermeulen commented 1 year ago

@masahi and @wangzy0327 I have built ROCm with TVM periodically, last time was a few months ago with ROCm 5.2.3. When doing that, I ran through basic unit tests but haven't done full verification.

wangzy0327 commented 1 year ago

Could you help me resolve the problem in the example above? @mvermeulen

mvermeulen commented 1 year ago

@wangzy0327 - it will take a little while before I have time, and then I'll first try updating to ROCm 5.4 for my testing. However, a few things can be tried in the interim:

  1. I posted the ROCm 5.2.3 + TVM image I verified to Docker Hub (https://hub.docker.com/r/mevermeulen/rocm-tvm/tags), so I am curious whether you also see the problem with that image.
  2. Does the issue happen with both target "rocm" and target "rocm -libs=miopen"? Depending on the answer, that might reveal an interesting difference.
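The two-target experiment in point 2 could be driven by a small loop over target strings, reusing the `build` helper from the script above. The comparison part below is pure NumPy; the TVM calls are left as comments because they need a machine with both backends (target strings are taken from the comment above; `pairwise_max_abs_diff` is a hypothetical helper name):

```python
import numpy as np

targets = ["rocm", "rocm -libs=miopen", "opencl"]


def pairwise_max_abs_diff(outputs: dict) -> dict:
    """Max absolute difference between every pair of backend outputs."""
    names = sorted(outputs)
    return {
        (a, b): float(np.max(np.abs(outputs[a] - outputs[b])))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# On a machine with ROCm + OpenCL, outputs would come from the script's build():
#   outputs = {t: build(t, mod=mod, params=params, input_name=input_name,
#                       input_data=I_np, input_shape=I_np.shape,
#                       output=output_size) for t in targets}
#   print(pairwise_max_abs_diff(outputs))
```

If "rocm -libs=miopen" agrees with "opencl" while plain "rocm" does not, that would point at TVM's own ROCm codegen rather than the runtime.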

wangzy0327 commented 1 year ago

I didn't try `rocm -libs=miopen`, just plain `rocm`.