[Tracking][Vulkan] Extending topi/relay tests to run on Vulkan

Lunderberg commented 3 years ago

Summary

Currently, some unit tests fail when running on the Vulkan runtime. PRs https://github.com/apache/tvm/pull/8903 and https://github.com/apache/tvm/pull/8947 parametrized the tests that are currently failing, so that the vulkan target can be marked as xfail without impacting any other runtimes. The Vulkan runtime should be improved so that these unit tests can pass on vulkan as well.
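As a rough illustration of the idea (not the code from either PR), the known failures can be kept in a lookup keyed by test and parameter, so that only the vulkan case is treated as an expected failure and other runtimes are unaffected. The names `KNOWN_VULKAN_FAILURES`, `target_kind`, and `expected_to_fail` are hypothetical. The entries are a small subset copied from the status table.

```python
# Hypothetical sketch: (test, parameter) pairs expected to fail on vulkan.
# Entries are a subset of the status table; helper names are illustrative only.
KNOWN_VULKAN_FAILURES = {
    ("test_ewise", "tan"),
    ("test_ewise", "erf"),
    ("test_reduce_map", "sum"),
}


def target_kind(target: str) -> str:
    """Return the target kind, e.g. "vulkan" for "vulkan -from_device=0"."""
    return target.split()[0]


def expected_to_fail(target: str, test_name: str, param: str) -> bool:
    """True only for vulkan targets and cases listed as known failures."""
    return (
        target_kind(target) == "vulkan"
        and (test_name, param) in KNOWN_VULKAN_FAILURES
    )


for target in ["llvm", "cuda", "vulkan -from_device=0"]:
    status = "xfail" if expected_to_fail(target, "test_ewise", "tan") else "run"
    print(f"{target}: {status}")
```

In a pytest-based suite, a helper like this would decide where to apply the xfail, leaving the same parametrized test to run normally on every other target.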

Status

File	Test	Parameters	Failure Step	Observed on	Status	PR
test_topi_math.py	test_ewise	topi_name="tan"	Codegen	NVIDIA/AMD	TODO
test_topi_math.py	test_ewise	topi_name="erf"	Codegen	NVIDIA/AMD	TODO
test_topi_math.py	test_ewise	topi_name="isnan"	Codegen	NVIDIA/AMD	TODO
test_topi_math.py	test_ewise	topi_name="isfinite"	Codegen	NVIDIA/AMD	TODO
test_topi_math.py	test_ewise	topi_name="isinf"	Codegen	NVIDIA/AMD	TODO
test_topi_reduce.py	test_reduce_map	reduce_type="sum"	Codegen	NVIDIA/AMD	TODO
test_topi_reduce.py	test_reduce_map	reduce_type="any"	Codegen	NVIDIA/AMD	TODO
test_topi_reduce.py	test_reduce_map	reduce_type="all"	Codegen	NVIDIA/AMD	TODO
test_topi_vision.py	test_proposal		Codegen	NVIDIA/AMD	TODO
test_topi_conv1d_transpose	test_conv1d_transpose_ncw		Numeric Output	NVIDIA only	TODO
test_topi_softmax.py	test_softmax	dtype="float64"	Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_vm.py	test_cond		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_vm.py	test_simple_if		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level4.py	test_reduce_functions		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level3.py	test_sparse_reshape		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_any.py	test_any_reduce		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level5.py	TestResize1D		Numeric Output	NVIDIA/AMD	TODO
tests/python/relay/test_op_level5.py	TestResize2D		Numeric Output	NVIDIA/AMD	TODO
tests/python/relay/test_op_level5.py	TestCropAndResize		Numeric Output	NVIDIA only	TODO
tests/python/relay/test_op_level3.py	test_take		Numeric Output	NVIDIA only	TODO
tests/python/relay/test_op_level2.py	test_conv2d_run		Codegen	NVIDIA/AMD	Fixed	#9014
tests/python/relay/test_op_level3.py	test_segment_sum		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level3.py	test_scatter_add		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level1.py	test_unary_op	relay_op=erf	Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level1.py	test_unary_op	relay_op=tan	Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_level1.py	test_unary_op	relay_op=atan	Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_grad_level10.py	test_cross_entropy_grad		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_grad_level1.py	test_log_softmax_grad		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_grad_level1.py	test_softmax_grad		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_op_grad_level1.py	test_unary_op	Several	Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_any.py	test_any_batch_matmul		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_any.py	test_any_conv2d_NCHWc		Codegen	NVIDIA/AMD	TODO
tests/python/relay/test_any.py	test_any_dense		Codegen	NVIDIA/AMD	TODO

Lunderberg commented 3 years ago

@mbrookhart Regarding your comments that several of the failing unit tests had run correctly on vulkan in the past, the main breaking point was in #8127, which reads the device parameters from the physical device when the target is "vulkan -from_device=0". Several of the unit tests had a hard-coded target of "vulkan", tried to run with the minimum vulkan capabilities, and failed at codegen because the capability requested (e.g. 64-bit float support) wasn't listed in the target. Those fixes came along for free by parametrizing the topi tests, since the default vulkan test target uses the device query.

That said, at some point I want to ensure all tests either run correctly or have an appropriate xfail for the minimum vulkan feature set, but that will be a different issue.
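The codegen failure described above can be modelled with a minimal sketch. This is not TVM's implementation: the attribute name `supports_float64`, the two dictionaries, and the `CodegenError` class are assumptions made for illustration. It only shows why a bare "vulkan" target rejects a float64 kernel while a target populated from the device accepts it.

```python
# Hypothetical model of the capability check; not TVM's actual implementation.
# The attribute name "supports_float64" and both dictionaries are assumptions.
MINIMUM_VULKAN = {"supports_float64": False}  # stand-in for a bare "vulkan" target
QUERIED_DEVICE = {"supports_float64": True}   # stand-in for "vulkan -from_device=0"


class CodegenError(Exception):
    """Raised when a kernel needs a capability the target does not list."""


def check_dtype(target_attrs: dict, dtype: str) -> None:
    """Reject float64 kernels unless the target advertises float64 support."""
    if dtype == "float64" and not target_attrs.get("supports_float64", False):
        raise CodegenError(f"{dtype} requested but not listed in the target")


def can_codegen(target_attrs: dict, dtype: str) -> bool:
    """Return True if the capability check passes for this dtype."""
    try:
        check_dtype(target_attrs, dtype)
    except CodegenError:
        return False
    return True


print(can_codegen(MINIMUM_VULKAN, "float64"))  # False
print(can_codegen(QUERIED_DEVICE, "float64"))  # True
```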

masahi commented 3 years ago

Are these results from an NVIDIA driver, or do the tests also fail on AMD?

Lunderberg commented 3 years ago

Thank you for checking. All except test_conv1d_transpose_ncw occur on AMD as well. It's the only one that is a numerical failure, while the rest are errors that occur during codegen. I'll update the table with that information.
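The distinction between the two failure steps can be sketched as follows. The `build` and `run` callables are hypothetical stand-ins for compiling and executing a test case, and the tolerance is an arbitrary choice for the example.

```python
# Hypothetical harness: `build` and `run` stand in for compiling and executing a case.
def failure_step(build, run, expected, tol=1e-5):
    """Return "Codegen", "Numeric Output", or None if the case passes."""
    try:
        compiled = build()
    except Exception:
        # Compilation itself failed, so the kernel never ran.
        return "Codegen"
    actual = run(compiled)
    if any(abs(a - e) > tol for a, e in zip(actual, expected)):
        # The kernel compiled and ran but produced wrong values.
        return "Numeric Output"
    return None


def broken_build():
    raise RuntimeError("unsupported op")


print(failure_step(broken_build, lambda c: [], []))
print(failure_step(lambda: "kernel", lambda c: [1.0, 2.5], [1.0, 2.0]))
print(failure_step(lambda: "kernel", lambda c: [1.0, 2.0], [1.0, 2.0]))
```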

Lunderberg commented 3 years ago

Following #8947, I added the failing relay tests to the tracking issue.

masahi commented 3 years ago

@Lunderberg Are these two test cases any different? One has pytest.xfail("Known failing test for vulkan") but the other does not.

https://github.com/apache/tvm/blob/548675fddcf9e2ad7203fc7610189d0e94e68bc6/tests/python/relay/test_op_level2.py#L199

https://github.com/apache/tvm/blob/548675fddcf9e2ad7203fc7610189d0e94e68bc6/tests/python/relay/test_op_level2.py#L360

Lunderberg commented 3 years ago

Thank you for that catch. When refactoring the tests in #8947, I added the updated version of test_conv2d_run, but didn't remove the original. I have https://github.com/apache/tvm/pull/8993 open to remove the redundant test_conv2d_run, and have double-checked that there aren't any others that snuck in.

masahi commented 3 years ago

@Lunderberg The last three items in test_any.py are not specific to vulkan (they fail on cuda as well), so I think we should drop them from the list.

They don't work on gpu targets since we don't support dynamic height or width in conv2d, for example.
