apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Tracking][Vulkan] Extending topi/relay tests to run on Vulkan #8903

Closed Lunderberg closed 1 month ago

Lunderberg commented 3 years ago

Summary

Some unit tests currently fail when run on the Vulkan runtime. PRs https://github.com/apache/tvm/pull/8903 and https://github.com/apache/tvm/pull/8947 parametrized the failing tests so that the Vulkan target can be marked as xfail without affecting any other runtime. The Vulkan runtime should be improved so that these unit tests pass on Vulkan as well.
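The idea of marking only the Vulkan cases as expected failures can be sketched in plain Python. This is a minimal illustration, not the code from the PRs: the names `KNOWN_VULKAN_FAILURES`, `target_kind`, and `should_xfail` are hypothetical, and the entries are a small sample taken from the status table in this issue.

```python
# Hypothetical registry of known Vulkan failures: (test name, parameter value).
# The entries are a sample from the status table, not the full list.
KNOWN_VULKAN_FAILURES = {
    ("test_ewise", "tan"),
    ("test_ewise", "erf"),
    ("test_ewise", "isnan"),
    ("test_reduce_map", "sum"),
}


def target_kind(target):
    """Return the first token of a target string, e.g. "vulkan -from_device=0" -> "vulkan"."""
    return target.split()[0]


def should_xfail(test_name, param, target):
    """True only for a known-failing (test, parameter) pair on a Vulkan target."""
    return (
        target_kind(target) == "vulkan"
        and (test_name, param) in KNOWN_VULKAN_FAILURES
    )
```

Inside a parametrized test, a true result would typically be followed by a call such as `pytest.xfail("Known failing test for vulkan")`, so the same test body keeps running unchanged on every other target.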

Status

| File | Test | Parameters | Failure Step | Observed on | Status | Owner | PR |
|---|---|---|---|---|---|---|---|
| test_topi_math.py | test_ewise | topi_name="tan" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_math.py | test_ewise | topi_name="erf" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_math.py | test_ewise | topi_name="isnan" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_math.py | test_ewise | topi_name="isfinite" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_math.py | test_ewise | topi_name="isinf" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_reduce.py | test_reduce_map | reduce_type="sum" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_reduce.py | test_reduce_map | reduce_type="any" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_reduce.py | test_reduce_map | reduce_type="all" | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_vision.py | test_proposal | | Codegen | NVIDIA/AMD | TODO | | |
| test_topi_conv1d_transpose | test_conv1d_transpose_ncw | | Numeric Output | NVIDIA only | TODO | | |
| test_topi_softmax.py | test_softmax | dtype="float64" | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_vm.py | test_cond | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_vm.py | test_simple_if | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level4.py | test_reduce_functions | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level3.py | test_sparse_reshape | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_any.py | test_any_reduce | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level5.py | TestResize1D | | Numeric Output | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level5.py | TestResize2D | | Numeric Output | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level5.py | TestCropAndResize | | Numeric Output | NVIDIA only | TODO | | |
| tests/python/relay/test_op_level3.py | test_take | | Numeric Output | NVIDIA only | TODO | | |
| tests/python/relay/test_op_level2.py | test_conv2d_run | | Codegen | NVIDIA/AMD | Fixed | | #9014 |
| tests/python/relay/test_op_level3.py | test_segment_sum | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level3.py | test_scatter_add | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level1.py | test_unary_op | relay_op=erf | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level1.py | test_unary_op | relay_op=tan | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_level1.py | test_unary_op | relay_op=atan | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_grad_level10.py | test_cross_entropy_grad | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_grad_level1.py | test_log_softmax_grad | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_grad_level1.py | test_softmax_grad | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_op_grad_level1.py | test_unary_op | Several | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_any.py | test_any_batch_matmul | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_any.py | test_any_conv2d_NCHWc | | Codegen | NVIDIA/AMD | TODO | | |
| tests/python/relay/test_any.py | test_any_dense | | Codegen | NVIDIA/AMD | TODO | | |

Lunderberg commented 3 years ago

@mbrookhart Regarding your comments that several of the failing unit tests had run correctly on vulkan in the past, the main breaking point was in #8127, which reads the device parameters from the physical device when the target is "vulkan -from_device=0". Several of the unit tests had a hard-coded target of "vulkan", tried to run with the minimum vulkan capabilities, and failed at codegen because the capability requested (e.g. 64-bit float support) wasn't listed in the target. Those fixes came along for free by parametrizing the topi tests, since the default vulkan test target uses the device query.

That said, at some point I want to ensure all tests either run correctly or have an appropriate xfail for the minimum vulkan feature set, but that will be a different issue.
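The difference between the two target strings mentioned above can be illustrated with a small stand-alone sketch. It does not use TVM and is not how TVM parses targets: `parse_target` and `float64_status` are hypothetical helpers, and the `supports_float64` option name is an assumption used here only to represent a capability that is listed explicitly in the target.

```python
import shlex


def parse_target(target_str):
    """Split a target string such as "vulkan -from_device=0" into (kind, options)."""
    tokens = shlex.split(target_str)
    kind, options = tokens[0], {}
    for token in tokens[1:]:
        # "-from_device=0" -> key "from_device", value "0"
        key, _, value = token.lstrip("-").partition("=")
        options[key] = value
    return kind, options


def float64_status(target_str):
    """Describe where 64-bit float support would come from for this target string."""
    _, options = parse_target(target_str)
    if "from_device" in options:
        # Capabilities are read from the physical device at target creation.
        return "queried from device"
    if options.get("supports_float64") == "1":  # assumed option name
        return "declared in target"
    # A bare "vulkan" target lists nothing, so only the minimum feature set applies.
    return "not listed"
```

With this sketch, `float64_status("vulkan")` reports "not listed", which corresponds to the hard-coded targets that failed at codegen, while `float64_status("vulkan -from_device=0")` reports "queried from device".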

masahi commented 3 years ago

Are these results from an NVIDIA driver only, or do the tests also fail on AMD?

Lunderberg commented 3 years ago

Thank you for checking. All of the failures except test_conv1d_transpose_ncw occur on AMD as well. It is the only numerical failure; the rest are errors that occur during codegen. I'll update the table with that information.

Lunderberg commented 3 years ago

Following #8947, I added the failing relay tests to the tracking issue.

masahi commented 3 years ago

@Lunderberg Are these two test cases any different? One has pytest.xfail("Known failing test for vulkan") but the other does not.

https://github.com/apache/tvm/blob/548675fddcf9e2ad7203fc7610189d0e94e68bc6/tests/python/relay/test_op_level2.py#L199

https://github.com/apache/tvm/blob/548675fddcf9e2ad7203fc7610189d0e94e68bc6/tests/python/relay/test_op_level2.py#L360

Lunderberg commented 3 years ago

Thank you for that catch. When refactoring the tests in #8947, I added the updated version of test_conv2d_run, but didn't remove the original. I have https://github.com/apache/tvm/pull/8993 open to remove the redundant test_conv2d_run, and have double-checked that there aren't any others that snuck in.

masahi commented 3 years ago

@Lunderberg The last three items in test_any.py are not specific to vulkan (they fail on cuda as well), so I think we should drop them from the list.

They don't work on gpu targets since we don't support dynamic height or width in conv2d, for example.