Closed: wllqwzx closed this issue 1 year ago.
Can you examine what the function https://github.com/apache/tvm/blob/7fd0cdb230ac58f2311b07a6fbea3ff7cb98aa07/python/tvm/topi/cuda/tensorcore_alter_op.py#L133 returns for this input? For such a shape I expect it to be padded so that it can be consumed by the tensorcore.
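A minimal sketch of the padding behavior this question expects, assuming uniform 16-alignment; `pad_for_tensorcore` is a hypothetical helper for illustration, not the actual TVM function:

```python
# Hedged sketch (not the real tensorcore_alter_op implementation): round each
# matmul extent up to a multiple of 16 so the shape fits the m16n16k16
# tensorcore intrinsic. Real candidates also include m8n32k16 and m32n8k16.
def pad_for_tensorcore(N, M, K, granularity=16):
    def round_up(x):
        return (x + granularity - 1) // granularity * granularity
    return round_up(N), round_up(M), round_up(K)

# The workload discussed in this issue:
print(pad_for_tensorcore(1, 1000, 512))  # (16, 1008, 512)
```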
No, this function seems to be invoked in Relay's Legalize pass, while this input is a prim_func.
cc @vinx13
I'll take a look. There was a previous attempt #14030 to solve this; it alleviates most cases, but it's not a full solution. The problem is that the current arithmetic analysis can't handle arbitrary block predicates; it can only handle simple bounds like `min < var` or `var < max` where `min`/`max` are constants. We also updated the search space in #14108. I'll double-check whether this still occurs in the new search space.
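A minimal sketch, assuming a recent TVM build, of the distinction described above, using `tvm.arith.Analyzer` (the public arithmetic-analysis entry point):

```python
# Hedged sketch: the analyzer can exploit a simple constant bound on a single
# variable, but a compound predicate over several variables gives it nothing.
from tvm import arith, tir

analyzer = arith.Analyzer()
i = tir.Var("i", "int32")
j = tir.Var("j", "int32")

# Simple constant bound (var < max): usable by the analyzer.
with analyzer.constraint_scope(i < 16):
    print(analyzer.can_prove(i < 32))  # True

# Compound predicate over two variables: with no known range for j,
# no bound on i follows, so nothing can be proven about i.
with analyzer.constraint_scope(i * 8 + j < 1000):
    print(analyzer.can_prove(i < 1000))  # False
```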
I met the following error when trying to reproduce:
```
(base) root@356a70204ac9:/workspace/tvm/debug/issue14137# python main.py
error: module 'tvm.target._ffi_api' has no attribute 'llvm_lookup_intrinsic_id'
 --> /root/anaconda3/lib/python3.9/site-packages/tvm-0.12.dev387+gccc0b9162-py3.9-linux-x86_64.egg/tvm/tir/tensor_intrin/arm_cpu.py:65:13
    |
 65 | T.llvm_lookup_intrinsic_id("llvm.aarch64.neon.smull.v8i16"),
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: run with TVM_BACKTRACE=1 environment variable to display a backtrace.
```
Environment:
- CPU: AMD EPYC 7543
- GPU: NVIDIA A100
- TVM commit id: ccc0b9162f2e983a8810e99c903c7141dbec81b6
@lileidev, it looks like you are trying to build an ARM module on an x86 CPU, which does not work.
I compiled TVM with the default cmake/config.cmake file and didn't specify an ARM platform. You can also see that the package path "tvm-0.12.dev387+gccc0b9162-py3.9-linux-x86_64.egg" is x86_64. This error can be reproduced just by `import tvm.tir.tensor_intrin`.
> error: module 'tvm.target._ffi_api' has no attribute 'llvm_lookup_intrinsic_id'

`T` should be imported via `from tvm.script import tir as T`, but it looks like it becomes `tvm.target` for some reason. I have no specific idea about it, but the branch works on my env and in CI.
Both Module1 and Module0 run and pass on my machine:

Total trials: 10 Total latency (us): 6.96011
Total trials: 10 Total latency (us): 6.52825
I found that when tuning the fp16 tensorcore `dense_add` kernel, the tuning fails on some shapes, and the reported error is non-deterministic. For example, when the workload is `N=1, M=1000, K=512`, the tuning fails. There are two kinds of reported errors. From my observation, the following error is reported more frequently:
```
2023-02-27 14:11:46 [INFO] Logging directory: /tmp/tmp71o3_ldv/logs
2023-02-27 14:11:46 [INFO] LocalBuilder: max_workers = 11
2023-02-27 14:11:47 [INFO] LocalRunner: max_workers = 1
2023-02-27 14:11:48 [INFO] [task_scheduler.cc:159] Initializing Task #0: "main"
2023-02-27 14:11:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "main"
Traceback (most recent call last):
  File "bug_tune_dense_add.py", line 507, in
```

and it may report this error with a lower frequency:
```
2023-02-27 14:20:13 [INFO] Logging directory: /tmp/tmputfxvrl5/logs
2023-02-27 14:20:13 [INFO] LocalBuilder: max_workers = 11
2023-02-27 14:20:14 [INFO] LocalRunner: max_workers = 1
2023-02-27 14:20:15 [INFO] [task_scheduler.cc:159] Initializing Task #0: "main"
Traceback (most recent call last):
  File "bug_tune_dense_add.py", line 507, in
```

I tried different `N` and found that for `N=2, 4, 8, 12, 17, 18, 24` the tuning still fails, but for `N=16, 32` it succeeds. I guess it may be because of the alignment requirement of the `m16n16k16` tensor core.

Expected behavior

The tuning succeeds.
Environment
Steps to reproduce
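The reporter's actual script is not shown here; below is a hedged sketch of one way to reproduce the tuning run described above. The target string, trial budget, and use of the `tune_relay` entry point are assumptions based on the MetaSchedule API around this TVM version:

```python
# Hedged reproduction sketch: tune an fp16 dense + add workload with
# MetaSchedule on a CUDA tensorcore target.
import tvm
from tvm import relay
from tvm import meta_schedule as ms

N, M, K = 1, 1000, 512
data = relay.var("data", shape=(N, K), dtype="float16")
weight = relay.var("weight", shape=(M, K), dtype="float16")
bias = relay.var("bias", shape=(M,), dtype="float16")
dense_add = relay.nn.bias_add(relay.nn.dense(data, weight), bias)
mod = tvm.IRModule.from_expr(relay.Function([data, weight, bias], dense_add))

target = tvm.target.Target("nvidia/nvidia-a100")  # assumed target string
database = ms.relay_integration.tune_relay(
    mod=mod,
    params={},
    target=target,
    work_dir="./ms_work_dir",
    max_trials_global=1000,  # assumed trial budget
)
```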
Triage
cc @ibsidorenko