TyraVex opened 1 year ago
I ended up modifying ./include/ck/ck.hpp: I added || defined(__gfx1010__) so that gfx1010 uses the gfx1030 value of CK_BUFFER_RESOURCE_3RD_DWORD, since it's the closest architecture I can think of. Here is what it looks like:
// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx908__) || \
defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || \
defined(__gfx942__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx1030__) || defined(__gfx1010__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31004000
#endif
Then, the build was successful. I haven't tested it yet.
@TyraVex Thanks for reporting this issue. Please let us know if it works or not. We do not have Navi10 GPUs to test it.
Sorry for the late response. Back when I was using Ubuntu for ROCm compilation it didn't work; I was getting GPU invalid-instruction errors. But cloning the project today and applying my modifications led to a successful build, surprisingly.
Here are the test results:
Running tests...
Test project /home/tyra/composable_kernel/build
Start 1: example_gemm_dl_fp32
1/119 Test #1: example_gemm_dl_fp32 ...................................................... Passed 2.95 sec
Start 2: example_gemm_dl_fp16
2/119 Test #2: example_gemm_dl_fp16 ...................................................... Passed 2.84 sec
Start 3: example_gemm_dl_int8
3/119 Test #3: example_gemm_dl_int8 ...................................................... Passed 2.73 sec
Start 4: example_gemm_xdl_fp16
4/119 Test #4: example_gemm_xdl_fp16 ..................................................... Passed 2.76 sec
Start 5: example_gemm_xdl_wavelet_fp16
5/119 Test #5: example_gemm_xdl_wavelet_fp16 ............................................. Passed 2.77 sec
Start 6: example_gemm_xdl_bf16
6/119 Test #6: example_gemm_xdl_bf16 ..................................................... Passed 2.66 sec
Start 7: example_gemm_xdl_int8
7/119 Test #7: example_gemm_xdl_int8 ..................................................... Passed 2.66 sec
Start 8: example_gemm_xdl_skip_b_lds_fp16
8/119 Test #8: example_gemm_xdl_skip_b_lds_fp16 .......................................... Passed 0.28 sec
Start 9: example_convnd_fwd_dl_fp16
9/119 Test #9: example_convnd_fwd_dl_fp16 ................................................ Passed 5.37 sec
Start 10: example_convnd_fwd_dl_fp32
10/119 Test #10: example_convnd_fwd_dl_fp32 ................................................ Passed 4.92 sec
Start 11: example_convnd_fwd_dl_int8
11/119 Test #11: example_convnd_fwd_dl_int8 ................................................ Passed 4.80 sec
Start 12: example_reduce_blockwise
12/119 Test #12: example_reduce_blockwise .................................................. Passed 6.04 sec
Start 13: example_reduce_multiblock_atomic_add
13/119 Test #13: example_reduce_multiblock_atomic_add ...................................... Passed 3.71 sec
Start 14: example_reduce_blockwise_two_call
14/119 Test #14: example_reduce_blockwise_two_call ......................................... Passed 25.22 sec
Start 15: example_pool2d_fwd_fp16
15/119 Test #15: example_pool2d_fwd_fp16 ................................................... Passed 2.93 sec
Start 16: example_pool2d_fwd_fp32
16/119 Test #16: example_pool2d_fwd_fp32 ................................................... Passed 2.81 sec
Start 17: example_gemm_dl_quantization_int8
17/119 Test #17: example_gemm_dl_quantization_int8 .........................................Subprocess aborted***Exception: 0.64 sec
Start 18: example_grouped_gemm_xdl_fp32
18/119 Test #18: example_grouped_gemm_xdl_fp32 ............................................. Passed 0.01 sec
Start 19: example_grouped_gemm_xdl_fp16
19/119 Test #19: example_grouped_gemm_xdl_fp16 ............................................. Passed 0.01 sec
Start 20: example_grouped_gemm_xdl_bfp16
20/119 Test #20: example_grouped_gemm_xdl_bfp16 ............................................ Passed 0.01 sec
Start 21: example_grouped_gemm_xdl_int8
21/119 Test #21: example_grouped_gemm_xdl_int8 ............................................. Passed 0.01 sec
Start 22: example_grouped_gemm_multiple_d_dl_fp16
22/119 Test #22: example_grouped_gemm_multiple_d_dl_fp16 ................................... Passed 0.01 sec
Start 23: example_grouped_gemm_xdl_splitk_fp16
23/119 Test #23: example_grouped_gemm_xdl_splitk_fp16 ...................................... Passed 0.01 sec
Start 24: example_convnd_bwd_data_dl_fp16
24/119 Test #24: example_convnd_bwd_data_dl_fp16 ........................................... Passed 1.65 sec
Start 25: example_broadcast_add_2d_amn_bn
25/119 Test #25: example_broadcast_add_2d_amn_bn ........................................... Passed 0.28 sec
Start 26: example_broadcast_add_3d_am_bmnk
26/119 Test #26: example_broadcast_add_3d_am_bmnk .......................................... Passed 0.25 sec
Start 27: example_elementwise_add_1d
27/119 Test #27: example_elementwise_add_1d ................................................ Passed 0.25 sec
Start 28: example_elementwise_add_4d
28/119 Test #28: example_elementwise_add_4d ................................................ Passed 0.25 sec
Start 29: example_grouped_conv_bwd_weight_dl_fp16
29/119 Test #29: example_grouped_conv_bwd_weight_dl_fp16 ................................... Passed 0.26 sec
Start 30: example_cgemm_xdl_bf16
30/119 Test #30: example_cgemm_xdl_bf16 .................................................... Passed 0.01 sec
Start 31: example_cgemm_xdl_fp16
31/119 Test #31: example_cgemm_xdl_fp16 .................................................... Passed 0.01 sec
Start 32: example_cgemm_xdl_fp32
32/119 Test #32: example_cgemm_xdl_fp32 .................................................... Passed 0.01 sec
Start 33: example_cgemm_xdl_int8
33/119 Test #33: example_cgemm_xdl_int8 .................................................... Passed 0.01 sec
Start 34: example_softmax_blockwise
34/119 Test #34: example_softmax_blockwise ................................................. Passed 1.53 sec
Start 35: example_batched_gemm_xdl_fp32
35/119 Test #35: example_batched_gemm_xdl_fp32 ............................................. Passed 0.01 sec
Start 36: example_batched_gemm_xdl_fp16
36/119 Test #36: example_batched_gemm_xdl_fp16 ............................................. Passed 0.01 sec
Start 37: example_batched_gemm_xdl_bfp16
37/119 Test #37: example_batched_gemm_xdl_bfp16 ............................................ Passed 0.01 sec
Start 38: example_batched_gemm_xdl_int8
38/119 Test #38: example_batched_gemm_xdl_int8 ............................................. Passed 0.01 sec
Start 39: example_gemm_bias_e_permute_g1m3n2k1_xdl_fp16
39/119 Test #39: example_gemm_bias_e_permute_g1m3n2k1_xdl_fp16 ............................. Passed 1.68 sec
Start 40: example_gemm_bias_e_permute_g1m2n3k1_xdl_fp16
40/119 Test #40: example_gemm_bias_e_permute_g1m2n3k1_xdl_fp16 ............................. Passed 0.34 sec
Start 41: example_contraction_bilinear_xdl_fp32
41/119 Test #41: example_contraction_bilinear_xdl_fp32 ..................................... Passed 0.75 sec
Start 42: example_contraction_scale_xdl_fp32
42/119 Test #42: example_contraction_scale_xdl_fp32 ........................................ Passed 0.56 sec
Start 43: example_contraction_bilinear_xdl_fp64
43/119 Test #43: example_contraction_bilinear_xdl_fp64 ..................................... Passed 0.79 sec
Start 44: example_contraction_scale_xdl_fp64
44/119 Test #44: example_contraction_scale_xdl_fp64 ........................................ Passed 0.59 sec
Start 45: example_layernorm_fp16
45/119 Test #45: example_layernorm_fp16 .................................................... Passed 0.29 sec
Start 46: example_layernorm_splitk_fp16
46/119 Test #46: example_layernorm_splitk_fp16 ............................................. Passed 0.29 sec
Start 47: example_grouped_gemm_bias_e_permute_xdl_fp16
47/119 Test #47: example_grouped_gemm_bias_e_permute_xdl_fp16 .............................. Passed 0.01 sec
Start 48: example_batched_gemm_bias_e_permute_xdl_fp16
48/119 Test #48: example_batched_gemm_bias_e_permute_xdl_fp16 .............................. Passed 0.40 sec
Start 49: example_batched_gemm_scale_softmax_gemm_xdl_fp16
49/119 Test #49: example_batched_gemm_scale_softmax_gemm_xdl_fp16 .......................... Passed 0.27 sec
Start 50: example_batched_gemm_scale_softmax_gemm_xdl_bf16
50/119 Test #50: example_batched_gemm_scale_softmax_gemm_xdl_bf16 .......................... Passed 0.27 sec
Start 51: example_batched_gemm_scale_softmax_gemm_permute_xdl_fp16
51/119 Test #51: example_batched_gemm_scale_softmax_gemm_permute_xdl_fp16 .................. Passed 0.75 sec
Start 52: example_batched_gemm_scale_softmax_gemm_permute_xdl_bf16
52/119 Test #52: example_batched_gemm_scale_softmax_gemm_permute_xdl_bf16 .................. Passed 0.68 sec
Start 53: example_grouped_gemm_scale_softmax_gemm_permute_xdl_fp16
53/119 Test #53: example_grouped_gemm_scale_softmax_gemm_permute_xdl_fp16 .................. Passed 0.36 sec
Start 54: example_batched_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
54/119 Test #54: example_batched_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16 ... Passed 0.75 sec
Start 55: example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
55/119 Test #55: example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16 ... Passed 0.36 sec
Start 56: example_dual_reduce_multiblock
56/119 Test #56: example_dual_reduce_multiblock ............................................ Passed 3.35 sec
Start 57: example_dual_reduce_threadwise
57/119 Test #57: example_dual_reduce_threadwise ............................................ Passed 0.27 sec
Start 58: example_batchnorm_forward_training
58/119 Test #58: example_batchnorm_forward_training ........................................ Passed 1.12 sec
Start 59: example_batchnorm_forward_training_obsolete
59/119 Test #59: example_batchnorm_forward_training_obsolete ............................... Passed 1.10 sec
Start 60: example_batchnorm_forward_inferring
60/119 Test #60: example_batchnorm_forward_inferring ....................................... Passed 2.53 sec
Start 61: example_batchnorm_backward
61/119 Test #61: example_batchnorm_backward ................................................ Passed 3.34 sec
Start 62: example_sparse_embedding3_forward_layernorm
62/119 Test #62: example_sparse_embedding3_forward_layernorm ............................... Passed 47.00 sec
Start 63: example_batched_gemm_add_add_relu_gemm_add_xdl_fp16
63/119 Test #63: example_batched_gemm_add_add_relu_gemm_add_xdl_fp16 ....................... Passed 0.48 sec
Start 64: example_permute_1xHxW_fp16
64/119 Test #64: example_permute_1xHxW_fp16 ................................................ Passed 0.47 sec
Start 65: example_permute_NxHxW_fp16
65/119 Test #65: example_permute_NxHxW_fp16 ................................................ Passed 0.88 sec
Start 66: example_permute_HxWx4_fp16
66/119 Test #66: example_permute_HxWx4_fp16 ................................................ Passed 1.13 sec
Start 67: example_conv2d_fwd_dl_perlayer_quantization_int8
67/119 Test #67: example_conv2d_fwd_dl_perlayer_quantization_int8 ..........................Subprocess aborted***Exception: 0.70 sec
Start 68: example_conv2d_fwd_dl_perchannel_quantization_int8
68/119 Test #68: example_conv2d_fwd_dl_perchannel_quantization_int8 ........................Subprocess aborted***Exception: 0.68 sec
Start 69: example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8
69/119 Test #69: example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8 ................Subprocess aborted***Exception: 0.71 sec
Start 70: example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8
70/119 Test #70: example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8 ..............Subprocess aborted***Exception: 0.68 sec
Start 71: example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8
71/119 Test #71: example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8 ................Subprocess aborted***Exception: 0.71 sec
Start 72: example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8
72/119 Test #72: example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8 ..............Subprocess aborted***Exception: 0.69 sec
Start 73: example_groupnorm_sigmoid_mul_fp16
73/119 Test #73: example_groupnorm_sigmoid_mul_fp16 ........................................ Passed 7.67 sec
Start 74: example_groupnorm_splitk_fp16
74/119 Test #74: example_groupnorm_splitk_fp16 ............................................. Passed 7.51 sec
Start 75: example_groupnorm_swish_fp16
75/119 Test #75: example_groupnorm_swish_fp16 .............................................. Passed 7.55 sec
Start 76: example_splitk_gemm_bias_e_permute_xdl_fp16
76/119 Test #76: example_splitk_gemm_bias_e_permute_xdl_fp16 ............................... Passed 0.41 sec
Start 77: example_splitk_gemm_bias_e_permute_xdl_fp32
77/119 Test #77: example_splitk_gemm_bias_e_permute_xdl_fp32 ............................... Passed 0.40 sec
Start 78: example_elementwise_permute_4D_fp16
78/119 Test #78: example_elementwise_permute_4D_fp16 ....................................... Passed 0.37 sec
Start 79: example_elementwise_permute_4D_fp16_2d
79/119 Test #79: example_elementwise_permute_4D_fp16_2d .................................... Passed 12.55 sec
Start 80: example_elementwise_layernorm_blockwise
80/119 Test #80: example_elementwise_layernorm_blockwise ................................... Passed 1.10 sec
Start 81: example_gemm_add_multiply_dl_fp16
81/119 Test #81: example_gemm_add_multiply_dl_fp16 ......................................... Passed 1.50 sec
Start 82: example_gemm_add_multiply_xdl_fp16
82/119 Test #82: example_gemm_add_multiply_xdl_fp16 ........................................ Passed 1.49 sec
Start 83: example_pool3d_fwd_fp16
83/119 Test #83: example_pool3d_fwd_fp16 ................................................... Passed 0.30 sec
Start 84: example_maxpool2d_bwd_bf16
84/119 Test #84: example_maxpool2d_bwd_bf16 ................................................ Passed 0.26 sec
Start 85: example_maxpool2d_bwd_fp16
85/119 Test #85: example_maxpool2d_bwd_fp16 ................................................ Passed 0.26 sec
Start 86: example_maxpool2d_bwd_fp32
86/119 Test #86: example_maxpool2d_bwd_fp32 ................................................ Passed 0.25 sec
Start 87: example_put_element_fp16
87/119 Test #87: example_put_element_fp16 .................................................. Passed 0.25 sec
Start 88: test_magic_number_division
88/119 Test #88: test_magic_number_division ................................................ Passed 1.26 sec
Start 89: test_space_filling_curve
89/119 Test #89: test_space_filling_curve .................................................. Passed 0.01 sec
Start 90: test_conv_util
90/119 Test #90: test_conv_util ............................................................ Passed 0.01 sec
Start 91: test_reference_conv_fwd
91/119 Test #91: test_reference_conv_fwd ................................................... Passed 0.01 sec
Start 92: test_gemm_fp32
92/119 Test #92: test_gemm_fp32 ............................................................ Passed 46.32 sec
Start 93: test_gemm_fp16
93/119 Test #93: test_gemm_fp16 ............................................................ Passed 378.44 sec
Start 94: test_gemm_bf16
94/119 Test #94: test_gemm_bf16 ............................................................ Passed 18.93 sec
Start 95: test_gemm_int8
95/119 Test #95: test_gemm_int8 ............................................................ Passed 63.21 sec
Start 96: test_gemm_standalone_xdl_fp16
96/119 Test #96: test_gemm_standalone_xdl_fp16 ............................................. Passed 363.97 sec
Start 97: test_gemm_reduce_fp16
97/119 Test #97: test_gemm_reduce_fp16 ..................................................... Passed 0.35 sec
Start 98: test_reduce_no_index
98/119 Test #98: test_reduce_no_index ...................................................... Passed 3.31 sec
Start 99: test_reduce_with_index
99/119 Test #99: test_reduce_with_index .................................................... Passed 5.96 sec
Start 100: test_grouped_convnd_fwd
100/119 Test #100: test_grouped_convnd_fwd ................................................... Passed 905.83 sec
Start 101: test_block_to_ctile_map
101/119 Test #101: test_block_to_ctile_map ................................................... Passed 0.01 sec
Start 102: test_softmax_rank3
102/119 Test #102: test_softmax_rank3 ........................................................ Passed 56.83 sec
Start 103: test_softmax_rank4
103/119 Test #103: test_softmax_rank4 ........................................................ Passed 123.09 sec
Start 104: test_softmax_interface
104/119 Test #104: test_softmax_interface .................................................... Passed 0.01 sec
Start 105: test_layernorm2d_fp32
105/119 Test #105: test_layernorm2d_fp32 ..................................................... Passed 0.81 sec
Start 106: test_layernorm2d_fp16
106/119 Test #106: test_layernorm2d_fp16 ..................................................... Passed 1.24 sec
Start 107: test_groupnorm_fp16
107/119 Test #107: test_groupnorm_fp16 ....................................................... Passed 1.04 sec
Start 108: test_groupnorm_fp32
108/119 Test #108: test_groupnorm_fp32 ....................................................... Passed 0.48 sec
Start 109: test_fp8
109/119 Test #109: test_fp8 .................................................................. Passed 0.01 sec
Start 110: test_elementwise_layernorm_fp16
110/119 Test #110: test_elementwise_layernorm_fp16 ........................................... Passed 7.02 sec
Start 111: test_batchnorm_fwd_rank_4
111/119 Test #111: test_batchnorm_fwd_rank_4 ................................................. Passed 69.16 sec
Start 112: test_batchnorm_bwd_rank_4
112/119 Test #112: test_batchnorm_bwd_rank_4 ................................................. Passed 98.97 sec
Start 113: test_batchnorm_infer_rank_4
113/119 Test #113: test_batchnorm_infer_rank_4 ............................................... Passed 21.39 sec
Start 114: test_contraction
114/119 Test #114: test_contraction .......................................................... Passed 14.43 sec
Start 115: test_avg_pool2d_fwd
115/119 Test #115: test_avg_pool2d_fwd ....................................................... Passed 0.26 sec
Start 116: test_avg_pool3d_fwd
116/119 Test #116: test_avg_pool3d_fwd ....................................................... Passed 1.00 sec
Start 117: test_max_pool2d_fwd
117/119 Test #117: test_max_pool2d_fwd ....................................................... Passed 0.27 sec
Start 118: test_max_pool3d_fwd
118/119 Test #118: test_max_pool3d_fwd ....................................................... Passed 1.83 sec
Start 119: test_batched_gemm_multi_d
119/119 Test #119: test_batched_gemm_multi_d ................................................. Passed 1.05 sec
94% tests passed, 7 tests failed out of 119
Total Test time (real) = 2384.13 sec
The following tests FAILED:
17 - example_gemm_dl_quantization_int8 (Subprocess aborted)
67 - example_conv2d_fwd_dl_perlayer_quantization_int8 (Subprocess aborted)
68 - example_conv2d_fwd_dl_perchannel_quantization_int8 (Subprocess aborted)
69 - example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8 (Subprocess aborted)
70 - example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8 (Subprocess aborted)
71 - example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8 (Subprocess aborted)
72 - example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8 (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /home/tyra/composable_kernel/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:91: test] Error 8
In the 5.7 branch there is missing support for gfx941, gfx942, gfx1012, and gfx1030, so the only generally useful set is gfx11. Folks will be unhappily surprised when MIOpen and similar users of CK fail.
Our experiments at Solus have found the following: gfx803, gfx900, and gfx906 can all build fine; just remember to build with BUILD_DEV=OFF, because compiling those architectures may produce warnings that get treated as errors by -Werror under BUILD_DEV=ON. We haven't run any tests on those platforms (yet), so for now we only know that they build :sweat_smile:
gfx1010 is probably best treated as gfx900, because even though gfx1030 is theoretically the closest, there are optimizations and LLVM builtins that gfx1030 can use while gfx1010 cannot, such as __builtin_amdgcn_fdot2.
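For reference, a configure along these lines should reproduce that setup (a sketch, not the exact Solus recipe: BUILD_DEV and GPU_TARGETS are existing CK CMake options, but the compiler path, install prefix, and target list here are example assumptions to adjust for your ROCm install):

```shell
# Configure CK for older architectures. BUILD_DEV=OFF keeps -Werror from
# turning the extra warnings on these targets into hard errors.
# Paths and the target list are examples; adjust for your setup.
mkdir -p build && cd build
cmake \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++ \
  -DBUILD_DEV=OFF \
  -DGPU_TARGETS="gfx803;gfx900;gfx906;gfx1010" \
  ..
make -j"$(nproc)"
```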
@GZGavinZhao It's interesting that the __builtin_amdgcn_fdot2 and __builtin_amdgcn_sdot4 features are not supported; maybe we can just add them to the features list (I dug it up here): https://github.com/llvm/llvm-project/blob/8a79dc7e6f765f3f49c5dd9330fc0826d3362858/llvm/lib/Target/AMDGPU/AMDGPU.td#L1496
So I'd like to understand: are we hitting a hardware limit with gfx1010 (RX 5700 and the like, the Navi10 processor), so that it cannot do what gfx1030 (RX 6900 and the like, Navi21/23) is doing? Or is it just someone's business decision that gfx1010 is too old to support? If it's a hardware limitation, where exactly did we hit a dead end? Why isn't the software shipped for gfx1010 as well? I'm just trying to understand what is going on and who decides this stuff.
I know there has been a tremendous amount of effort all over the place that I was not part of, and I appreciate everyone's contributions to where things have gone so far, but we need to fix this issue. Old GPUs are not going to the trash, man (and if the plan is to send them to the trash, that's too bad a waste)!
I just read some of the main instruction-set sections for memory, shaders, and some other registers: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna-shader-instruction-set-architecture.pdf. Both AMD-GCN-GFX10-RDNA2 and AMD-GCN-GFX10-RDNA1 seem to have identical instruction sets. I understand there are details I missed, but from a hardware point of view everything seems to already exist in gfx1010, so our problem is software. Probably just copying the gfx1030 code and using it for gfx1010, maybe without modifying anything (or with modifying a few things, which is a nightmare :)), would get us better support for Navi10. That GDDR6 monster can do crazy stuff, so people are using older software by adapting to the gfx900 arch; well, it's still a toy in comparison to the MI300X.
The problem is that this is not an easy job. It should have been considered earlier, but maybe we can still do something all together. The projects involved may be beyond my capabilities: 1) LLVM, 2) rocBLAS, 3) CK, 4) MIOpen.
I just hit this error while trying to add the gfx1030 features to gfx1010:
Cached asm caps differ from derived asm caps for (10, 1, 0)
This is expected, as I modified the AMDGPU target for that arch. So now my question: which one is the cached asm caps, and which one is derived, and from where?
I got past the error by forking the Tensile repo and ignoring the asm caps mismatch.
Could you raise an issue in Tensile for this? Fedora will use the upstream Tensile, not a copy.
Thanks @trixirt, yes, I will also add this issue to it.
Below is the compileArgs command; the first element is None, which I guess is the cxx_compiler (it should not be None):
printing our args as we have a None Type error at the moment [None, '-D__HIP_HCC_COMPAT_MODE__=1', '--cuda-device-only', '-x', 'hip', '-O3', '-I', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile', '-Xoffload-linker', '--build-id', '--offload-arch=gfx1010', '--offload-arch=gfx803', '--offload-arch=gfx1030', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile/TensileLibrary_Type_ZZ_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback.cpp', '-c', '-o', '/home/bargo/projects/rocm-setup/rocBLAS/build/library/src/build_tmp/TENSILE/code_object_tmp/TensileLibrary_Type_ZZ_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback.o'] launcher: [] hipFlags: ['-D__HIP_HCC_COMPAT_MODE__=1', '--cuda-device-only', '-x', 'hip', '-O3', '-I', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile', '-Xoffload-linker', '--build-id'] archFlags: ['--offload-arch=gfx1010', '--offload-arch=gfx803', '--offload-arch=gfx1030']
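A defensive guard on the args-building side would catch this earlier. Below is a sketch; build_compile_args is a hypothetical helper, not Tensile's actual function, and the amdclang++/hipcc fallback is an assumption. The point is just to fail loudly, or fall back to a compiler on PATH, before a None ever reaches the front of the args list:

```python
import shutil


def build_compile_args(cxx_compiler, hip_flags, arch_flags, src, out):
    """Compose a device-compile command, refusing to emit a None compiler."""
    if cxx_compiler is None:
        # Assumption: either amdclang++ or hipcc on PATH is an acceptable
        # stand-in when no compiler was configured.
        cxx_compiler = shutil.which("amdclang++") or shutil.which("hipcc")
    if cxx_compiler is None:
        raise RuntimeError(
            "no HIP-capable C++ compiler found; set CMAKE_CXX_COMPILER"
        )
    return [cxx_compiler, *hip_flags, *arch_flags, src, "-c", "-o", out]
```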
I agree this issue/discussion should continue in Tensile, but just to add my 2 cents:
I just read some of the main instruction-set sections for memory, shaders, and some other registers: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna-shader-instruction-set-architecture.pdf. Both AMD-GCN-GFX10-RDNA2 and AMD-GCN-GFX10-RDNA1 seem to have identical instruction sets
RDNA1 and RDNA2 are not the same instruction set. For example, consult section 6.3 of the RDNA1 and RDNA2 ISA references: you will see that RDNA2 has V_DOT* instructions like V_DOT2_F32_F16 that RDNA1 doesn't have. Moreover, aside from instruction differences, there are fundamentally different designs in RDNA1 and RDNA2 hardware that make code compiled for the latter unable to run on the former, for example the different number of registers. It's possible you get lucky and the code you compiled against RDNA2 doesn't use the extra registers at all, so it runs fine on RDNA1, but that's pure luck and there's no way we can guarantee it.
So I'd like to understand: are we hitting a hardware limit with gfx1010 (RX 5700 and the like, the Navi10 processor), so that it cannot do what gfx1030 (RX 6900 and the like, Navi21/23) is doing? Or is it just someone's business decision that gfx1010 is too old to support? If it's a hardware limitation, where exactly did we hit a dead end? Why isn't the software shipped for gfx1010 as well? I'm just trying to understand what is going on and who decides this stuff.
As per my reasoning above, my understanding is that it is a hardware limit: gfx1010 cannot do the same things as gfx1030. This makes sense anyway, because why would RDNA2 even exist in the first place if it were just more FLOPS and no new instructions/optimizations?
Therefore, any attempt at compiling and running code intended for RDNA2 on RDNA1 will eventually fail. The best we can do (and what has already been done at Solus) is to patch libraries like rocBLAS, Tensile, and CK so that they can compile and run RDNA1 code on a best-effort basis. I believe Solus's support for RDNA1 hardware is complete. I assume you own RDNA1 hardware, so feel free to grab a Solus ISO, install the pytorch package, give it a try (it should work out of the box, no environment variables or anything needed), and let me know or open an issue at the Solus repo if anything doesn't work.
Thanks for having a look @GZGavinZhao, I appreciate all your efforts.
RDNA1 and RDNA2 are not the same instruction set. For example, consult section 6.3 of the RDNA1 and RDNA2 ISA references: you will see that RDNA2 has V_DOT* instructions like V_DOT2_F32_F16 that RDNA1 doesn't have. Moreover, aside from instruction differences, there are fundamentally different designs in RDNA1 and RDNA2 hardware that make code compiled for the latter unable to run on the former, for example the different number of registers. It's possible you get lucky and the code you compiled against RDNA2 doesn't use the extra registers at all, so it runs fine on RDNA1, but that's pure luck and there's no way we can guarantee it.
Actually, regarding the above, I have a feeling we can build these instructions for gfx1010. You referred to the VOP3P instructions, which can also be built for that target. It could be that everything I'm doing is nonsense, but I like trying stuff, so let me see where that takes me :) I will keep you posted either way.
Therefore, any attempt at compiling and running code intended for RDNA2 on RDNA1 will eventually fail. The best we can do (and what has already been done at Solus) is to patch libraries like rocBLAS, Tensile, and CK so that they can compile and run RDNA1 code on a best-effort basis. I believe Solus's support for RDNA1 hardware is complete. I assume you own RDNA1 hardware, so feel free to grab a Solus ISO, install the pytorch package, give it a try (it should work out of the box, no environment variables or anything needed), and let me know or open an issue at the Solus repo if anything doesn't work.
I think Solus is definitely helpful, but currently running gfx1010 as gfx900 is a downgrade. It would be my last resort; nevertheless, I really like what you guys are doing at Solus, it looks like a good approach for better GPU support.
I have plenty of GPUs, as I work as a repairman for myself, and thus I ended up with plenty of graphics cards from all generations. So I need to be fair to myself :)
I think Solus is definitely helpful, but currently running gfx1010 as gfx900 is a downgrade.
Note that only for CK does gfx1010 take the same code path as gfx900. For Tensile/rocBLAS, gfx1010 is compiled normally, so it receives all the optimizations that gfx1010 should get by default.
I do agree that for CK specifically, we may be able to treat gfx1010 as gfx1030 in general and then work around the specific cases where VDOT instructions are used. However, for all other libraries you shouldn't need to do anything; gfx1010 should just compile and run normally, at least that's the case on Solus with ROCm 5.7, 6.0, and 6.1.
Yes, I noticed that, and that's what brought me here :) I think there is room for improvement in general :) The repos are huge and there are a lot of them, like 20 repos :D
Right now I'm just reading some weird stuff like this:
defm V_DOT2_F32_F16 : VOP3P_Real_gfx10_gfx11_gfx12_Triple<0x13>;
This is in some LLVM file, I think this one: llvm-project/llvm/lib/Target/AMDGPU/VOP3PInstructions.td. It shows that V_DOT2_F32_F16 is already part of the VOP3P instruction set. I don't understand that language very well; it's very complex for my heart.
That file is LLVM TableGen. The problem is, again, that V_DOT2_F32_F16 is not part of the RDNA1 instruction set, so even if you could make LLVM accept it, you'll very likely have problems running that instruction on the hardware.
By saying "then work around specific cases where VDOT instructions are used", I meant that instead of relying on hardware-accelerated instructions, we write portable HIP code that is the equivalent of the VDOT operation.
I think I understand what you mean, but you did not understand what I mean :)
It worked! Here are my results for building CK for gfx1030;gfx1010;gfx803, with all develop branches, using LLVM 19 and Clang 19.
a fresh make looks like this:
make -j 8
[ 0%] Built target device_max_pool_bwd_instance
[ 0%] Built target device_transpose_instance
[ 0%] Built target device_elementwise_normalization_instance
[ 0%] Built target device_normalization_bwd_data_instance
[ 0%] Built target device_normalization_bwd_gamma_beta_instance
[ 0%] Built target device_avg_pool3d_bwd_instance
[ 14%] Built target device_elementwise_instance
[ 14%] Built target device_column_to_image_instance
[ 14%] Built target device_image_to_column_instance
[ 14%] Built target device_pool3d_fwd_instance
[ 14%] Built target device_normalization_fwd_instance
[ 28%] Built target utility
[ 42%] Built target device_permute_scale_instance
[ 57%] Built target device_softmax_instance
[ 57%] Built target example_gemm_dpp_fp16
[ 57%] Built target example_reduce_blockwise
[100%] Built target example_put_element_fp16
[100%] Built target example_avgpool3d_bwd_bf16
[100%] Built target example_avgpool3d_bwd_fp16
[100%] Built target example_avgpool3d_bwd_fp32
[100%] Built target example_image_to_column_f32
[100%] Built target example_column_to_image_f32
[100%] Built target example_layernorm2d_bwd_fp32
[100%] Built target example_groupnorm_bwd_fp32
[100%] Built target example_layernorm4d_fwd_fp16
[100%] Built target example_layernorm4d_fwd_splitk_fp16
[100%] Built target ckProfiler
#Complete man
I guess now we will fail in some areas which we need to look at more closely ;)
I have no idea how bad this is, but it looks good to my eyes
bargo@beta:~/projects/rocm-setup/composable_kernel/build/bin$ ./ckProfiler groupnorm 1 0 2 1 1 --length 1 16 16 32 40
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
found 39 instances
Perf: 0.0956515 ms, 27.5132 GB/s, DeviceNormalizationFwdImpl<64,Cluster_MK_1_64,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0547736 ms, 48.0465 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0424971 ms, 61.9261 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0379853 ms, 69.2816 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0457218 ms, 57.5585 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0262616 ms, 100.21 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_2,XYSrcVectorDim_1,VectorSize_X2_Gamma2_Beta2_Y2>
Perf: 0.0248361 ms, 105.962 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0246625 ms, 106.708 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0268792 ms, 97.9077 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.032927 ms, 79.9246 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0194474 ms, 135.323 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0191178 ms, 137.656 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0225913 ms, 116.491 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.032919 ms, 79.9441 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_2_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0284743 ms, 92.4229 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0181155 ms, 145.273 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.018889 ms, 139.323 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0209594 ms, 125.561 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_2_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0203058 ms, 129.603 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0246889 ms, 106.594 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0954451 ms, 27.5727 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0864766 ms, 30.4323 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0890941 ms, 29.5382 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0908229 ms, 28.976 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf: 0.0523808 ms, 50.2413 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_2,XYSrcVectorDim_1,VectorSize_X2_Gamma2_Beta2_Y2>
Perf: 0.0371093 ms, 70.917 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0310343 ms, 84.7991 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0274904 ms, 95.7309 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0293671 ms, 89.6132 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0351702 ms, 74.8271 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0300303 ms, 87.6342 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0311927 ms, 84.3685 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0290031 ms, 90.7378 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_2_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0326374 ms, 80.6338 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0358837 ms, 73.3391 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0326766 ms, 80.5371 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0352086 ms, 74.7455 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_2_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0421652 ms, 62.4136 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf: 0.0389533 ms, 67.5599 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
length = 1,16,16,32,40
best perf = 0.0181155 ms, 145.273 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
I thank the Lord of Skies and Earth for what He gives and teaches, and I also thank everyone for their help and contributions.
We need more documentation; I do not even know how to run the ckProfiler, man.
Really nice! I'm curious, what did you change? Did you only update CK, or did you also update the TableGen files in LLVM as well?
Really nice! I'm curious, what did you change? Did you only update CK, or did you also update the TableGen files in LLVM as well?
Thanks! I actually changed a few things in the target code for that, and only made one change in CK's CMakeLists to treat gfx1010 as gfx1030. Now I would need to continue the process; I think I will leave it here for now. But at least you know now that this is not impossible as you thought earlier, but rather possible, even if it took a lot of my energy. I need to provide better quality service for my future customers at https://ropotato.com
I think there will be more work to be done generally, but at least we have progress.
Interesting. By "impossible", I meant it's impossible to compile CK as gfx1030
(i.e. only --offload-arch=gfx1030
) and run the resulting library directly on gfx1010
hardware. CK is already runnable on gfx1010
since its inception, with a small set of patches.
No man, gfx1010 is not runnable without these changes, do you get it now?
+++ b/include/ck/ck.hpp
@@ -78,7 +78,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
-#elif defined(__gfx103__)
+#elif defined(__gfx103__) || defined(__gfx1010__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(__gfx11__) || defined(__gfx12__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31004000
@@ -88,7 +88,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
#ifndef __HIP_DEVICE_COMPILE__ // for host code, define nothing
#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
#define CK_USE_AMD_V_MAC_F32
-#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
+#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) || defined(__gfx1010__) // for GPU code
Unless you modify the stuff like I did. But you keep changing what you said, and I don't like that; I will quit the discussion.
My apologies and I think there are misunderstandings from the start of this conversation. In my first response to you, I said
Therefore, any attempt at trying to compile and run code intended for RDNA2 on RDNA1 will eventually fail.
This still holds. If you compile code intended for RDNA2 (gfx1030
) as gfx1010
by passing --offload-arch=gfx1010
and attempt to run the result on RDNA1 hardware, then due to the difference in ISA instructions you cannot guarantee the code will always compile or work on RDNA1 hardware. Note that I'm saying always. There have been cases where people on RDNA1 hardware try to run code compiled for RDNA2 through HSA_OVERRIDE_GFX_VERSION=10.3.0
; sometimes it works, sometimes it doesn't.
What you seem to be trying to do is to make CK somehow run on RDNA1. This is a solved problem, see the diff below.
As to why the changes you posted don't work: I never said those were the changes needed to make it work. For ROCm 6.0, all you need is a simple sed -i 's/defined(__gfx900__)/defined(__gfx900__) || defined(__gfx1010__)/g' include/ck/ck.hpp
to make it work. In practice, the equivalent changes on the current develop
branch would be:
diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..32551a8da 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx103__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -86,7 +86,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// FMA instruction
#ifndef __HIP_DEVICE_COMPILE__ // for host code, define nothing
-#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx101__) // for GPU code
#define CK_USE_AMD_V_MAC_F32
#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
#define CK_USE_AMD_V_FMAC_F32
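To see what that sed one-liner does without touching a real checkout, here is a self-contained demo against a scratch copy of the relevant line (the path /tmp/ck_snippet.hpp is just for illustration):

```shell
# Scratch copy of the kind of preprocessor line found in include/ck/ck.hpp
printf '#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code\n' > /tmp/ck_snippet.hpp

# The same substitution suggested above for ROCm 6.0
sed -i 's/defined(__gfx900__)/defined(__gfx900__) || defined(__gfx1010__)/g' /tmp/ck_snippet.hpp

cat /tmp/ck_snippet.hpp
# #elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx1010__) // for GPU code
```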
If you want to be really precise on using all the optimizations available, it would be this:
diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..4eaeefdae 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx103__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -88,10 +88,12 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
#ifndef __HIP_DEVICE_COMPILE__ // for host code, define nothing
#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
#define CK_USE_AMD_V_MAC_F32
-#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
+#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) || defined(__gfx1011__) || defined(__gfx1012__) // for GPU code
#define CK_USE_AMD_V_FMAC_F32
#define CK_USE_AMD_V_DOT2_F32_F16
#define CK_USE_AMD_V_DOT4_I32_I8
+#elif defined(__gfx101__)
+#define CK_USE_AMD_V_MAC_F32
#elif defined(__gfx11__) || defined(__gfx12__)
#define CK_USE_AMD_V_FMAC_F32
#define CK_USE_AMD_V_DOT2_F32_F16
No changes in LLVM should be necessary. You can do everything in CK. The only concern I have is I don't understand how CK_BUFFER_RESOURCE_3RD_DWORD
is calculated, so I'm not sure whether we're assigning the right value to it for gfx1010
, but in practice I haven't seen that causing any trouble.
No problem at all, @GZGavinZhao. I think I understand your point of view now. I much appreciate all your good work out there, man; don't get me wrong, and have a pleasant time.
All my best,
Ok, I found out that I messed up a few things. Also, the diff above is missing one additional file:
include/ck_tile/core/config.hpp
also needs to be adjusted
@GZGavinZhao
No changes in LLVM should be necessary. You can do everything in CK. The only concern I have is I don't understand how CK_BUFFER_RESOURCE_3RD_DWORD is calculated,
Depending on what you want to do: if you want to go a bit deeper than a shallow clone, then you would need to have a look and maybe modify a few things over there.
so I'm not sure whether we're assigning the right value to it for gfx1010, but in practice I haven't seen that causing any trouble.
The correct CK_BUFFER_RESOURCE_3RD_DWORD
for gfx1010 should be equivalent to the gfx1030 one, according to the architecture docs. So 0x00020000
is not the correct value, but it will not be a disaster :) The correct value seems to be
0x31014000
I hope that helps
Depending on what you want to do: if you want to go a bit deeper than a shallow clone, then you would need to have a look and maybe modify a few things over there.
If you're targeting the amd-staging
branch of the ROCm LLVM, which is unstable and changing as the AMD engineers add features that may not have been thoroughly tested, you may get crashes that are not present in released versions, and consequently you may need to change more code. I didn't want to deal with that, so all the changes I made were on top of a tagged and released ref like rocm-6.1.2
. If you don't have a particular reason not to, I would suggest that you also work on a released ROCm version like 6.0.0 or 6.1.2. At least for ROCm 6.0, I found that I didn't need to adjust LLVM; just changing CK was enough to make everything (including PyTorch) compile.
The correct
CK_BUFFER_RESOURCE_3RD_DWORD
for gfx1010
should be equivalent to the gfx1030
according to the architecture docs. ... the correct value seems to be 0x31014000
Would you mind telling me where you found this information? I assume you're referencing the RDNA ISA manuals here? If so, could you please tell me the section number of the Instruction Set Architecture reference guide in which you found it? I agree that it's highly likely CK_BUFFER_RESOURCE_3RD_DWORD
is the same for RDNA1 and RDNA2, since table 37 in section 8.1.8 is equivalent in both, but I'm not sure if that would qualify as concrete evidence.
If you're targeting the
amd-staging
branch of the ROCm LLVM, which is unstable and changing as the AMD engineers add features that may not have been thoroughly tested, you may get crashes that are not present in released versions, and consequently you may need to change more code. I didn't want to deal with that, so all the changes I made were on top of a tagged and released ref like rocm-6.1.2
. If you don't have a particular reason not to, I would suggest that you also work on a released ROCm version like 6.0.0 or 6.1.2. At least for ROCm 6.0, I found that I didn't need to adjust LLVM; just changing CK was enough to make everything (including PyTorch) compile.
Well, it's good to find bugs or issues as early as possible, so I'm trying to help myself and possibly others. So I would go for amd-staging, as I'm also in development mode and I need to change stuff anyway. But I get your point; it's definitely much easier to deal with a released version, because then you may not need to build the whole stack (a big difference in time, for sure).
Would you mind telling me where you found this information? I assume you're referencing the RDNA ISA manuals here? If so, could you please tell me the section number of the Instruction Set Architecture reference guide in which you found it? I agree that it's highly likely CK_BUFFER_RESOURCE_3RD_DWORD is the same for RDNA1 and RDNA2, since table 37 in section 8.1.8 is equivalent in both, but I'm not sure if that would qualify as concrete evidence.
I got the information from the RDNA1 and RDNA2 architecture docs you posted above; obviously I'm not guessing, and FYI both have the same section number:
8.1.8, check it out ;)
Ok, I obviously did not read the last sentence in your reply, so updating: section 8.1.8 describes the 128-bit BUFFER_RESOURCE structure, which is identical in both architectures, and the 3RD_DWORD is derived from that, so they are equivalent (of course, I assumed a few other things are also equivalent, which might not be true, but it's the closest to it).
Adding more explanation:
So, the third DWORD (bits 96-127) contains the following fields:
Num_records (bits 95:64)
Dst_sel_x, Dst_sel_y, Dst_sel_z, Dst_sel_w (bits 98:96, 101:99, 104:102, 107:105)
Format (bits 114:108)
Index stride (bits 118:117)
Add tid enable (bit 119)
Resource Level (bit 120)
OOB_SELECT (bits 125:124)
Type (bits 127:126)
So I may be totally incorrect in my assumptions, but I may also be correct in all of them except for Num_records. I had a feeling that one needed to be set to 0x7800 instead of 0x14000, but no, I think it should be 0x14000; I was comparing the wrong values, never mind.
That's it for today :)
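To make the field layout above concrete, here is a small sketch that unpacks a third-dword value into those fields. The struct and function names are mine, not CK's; the bit offsets follow the list above, and Num_records is omitted because by its own range (bits 95:64) it sits in the previous dword.

```cpp
#include <cstdint>

// Fields of the third dword (descriptor bits 127:96) of the 128-bit
// buffer resource, with offsets taken relative to the dword itself.
struct BufferResourceDword3 {
    uint32_t dst_sel;        // descriptor bits 107:96  -> dword bits 11:0
    uint32_t format;         // descriptor bits 114:108 -> dword bits 18:12
    uint32_t index_stride;   // descriptor bits 118:117 -> dword bits 22:21
    uint32_t add_tid_enable; // descriptor bit  119     -> dword bit  23
    uint32_t resource_level; // descriptor bit  120     -> dword bit  24
    uint32_t oob_select;     // descriptor bits 125:124 -> dword bits 29:28
    uint32_t type;           // descriptor bits 127:126 -> dword bits 31:30
};

inline BufferResourceDword3 decode_dword3(uint32_t d)
{
    return BufferResourceDword3{
        (d >> 0) & 0xFFFu, // dst_sel_x/y/z/w packed together
        (d >> 12) & 0x7Fu, // format
        (d >> 21) & 0x3u,  // index stride
        (d >> 23) & 0x1u,  // add tid enable
        (d >> 24) & 0x1u,  // resource level
        (d >> 28) & 0x3u,  // OOB_SELECT
        (d >> 30) & 0x3u,  // type
    };
}
```

Decoding 0x31014000 this way gives oob_select = 3, resource_level = 1, and format = 0x14, which can then be compared field by field against the tables in section 8.1.8 of both ISA guides.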
And here is more detailed documentation; this one was hidden :D https://doc-july-17.readthedocs.io/en/latest/GCN_ISA_Manuals/testdocbook.html?highlight=vector-memory-buffer-instructions#vector-memory-buffer-instructions
Ok, I'm partially done. Here are my results, now running on the latest develop as of two days ago :) I think the performance is not bad, but I don't know how it compares to the HSA_OVERRIDE to 1030. If anyone has a gfx1010 running on the latest official ROCm versions, I would like to see your results (7.3 seconds for a standard no-half generation on SD):
The torch version is actually main as of yesterday, so it's 2.5.0rc0 I think; the 2.3.1 is just the number I chose to let pip3 manage the stuff correctly.
I think there is still high disk latency, as I'm using a physical HDD from 2009 for this test machine :D
I applied patches to the ROCm 6.2 release to make Composable Kernel compile for the 5500 XT (gfx1012) as described, and ran the unit tests for verification. It seems to work, but there's a lot more fine-tuning to do, since several tests failed.
Test project /root/rocm-build/composable-kernel/test
Start 1: test_magic_number_division
1/43 Test #1: test_magic_number_division ................... Passed 1.79 sec
Start 2: test_space_filling_curve
2/43 Test #2: test_space_filling_curve ..................... Passed 0.02 sec
Start 3: test_conv_util
3/43 Test #3: test_conv_util ............................... Passed 0.02 sec
Start 4: test_reference_conv_fwd
4/43 Test #4: test_reference_conv_fwd ...................... Passed 0.02 sec
Start 5: test_gemm_fp32
5/43 Test #5: test_gemm_fp32 ............................... Passed 0.04 sec
Start 6: test_gemm_fp16
6/43 Test #6: test_gemm_fp16 ............................... Passed 0.04 sec
Start 7: test_gemm_bf16
7/43 Test #7: test_gemm_bf16 ............................... Passed 0.04 sec
Start 8: test_gemm_int8
8/43 Test #8: test_gemm_int8 ............................... Passed 0.04 sec
Start 9: test_reduce_no_index
9/43 Test #9: test_reduce_no_index .........................***Failed 1.91 sec
Start 10: test_reduce_with_index
10/43 Test #10: test_reduce_with_index .......................***Failed 3.82 sec
Start 11: test_grouped_convnd_fwd_multi_ab_interface
11/43 Test #11: test_grouped_convnd_fwd_multi_ab_interface ... Passed 0.04 sec
Start 12: test_block_to_ctile_map
12/43 Test #12: test_block_to_ctile_map ...................... Passed 0.02 sec
Start 13: test_softmax_rank3
13/43 Test #13: test_softmax_rank3 ...........................***Failed 52.44 sec
Start 14: test_softmax_rank4
14/43 Test #14: test_softmax_rank4 ...........................***Failed 113.14 sec
Start 15: test_softmax_interface
15/43 Test #15: test_softmax_interface ....................... Passed 0.06 sec
Start 16: test_layernorm2d_fwd_fp32
16/43 Test #16: test_layernorm2d_fwd_fp32 .................... Passed 0.83 sec
Start 17: test_groupnorm_fwd_fp32
17/43 Test #17: test_groupnorm_fwd_fp32 ...................... Passed 0.51 sec
Start 18: test_layernorm2d_fwd_fp16
18/43 Test #18: test_layernorm2d_fwd_fp16 ....................***Failed 0.66 sec
Start 19: test_layernorm4d_fwd_fp16
19/43 Test #19: test_layernorm4d_fwd_fp16 ....................***Failed 0.33 sec
Start 20: test_groupnorm_fwd_fp16
20/43 Test #20: test_groupnorm_fwd_fp16 ......................***Failed 0.64 sec
Start 21: test_layernorm2d_bwd_data_fp32
21/43 Test #21: test_layernorm2d_bwd_data_fp32 ............... Passed 0.52 sec
Start 22: test_groupnorm_bwd_data_fp32
22/43 Test #22: test_groupnorm_bwd_data_fp32 ................. Passed 0.45 sec
Start 23: test_layernorm2d_bwd_gamma_beta_fp32
23/43 Test #23: test_layernorm2d_bwd_gamma_beta_fp32 ......... Passed 0.42 sec
Start 24: test_groupnorm_bwd_gamma_beta_fp32
24/43 Test #24: test_groupnorm_bwd_gamma_beta_fp32 ........... Passed 0.40 sec
Start 25: test_fp8
25/43 Test #25: test_fp8 ..................................... Passed 0.02 sec
Start 26: test_bf8
26/43 Test #26: test_bf8 ..................................... Passed 0.02 sec
Start 27: test_type_convert_const
27/43 Test #27: test_type_convert_const ...................... Passed 0.02 sec
Start 28: test_elementwise_layernorm_fp16
28/43 Test #28: test_elementwise_layernorm_fp16 ..............***Failed 3.15 sec
Start 29: test_batchnorm_fwd_rank_4
29/43 Test #29: test_batchnorm_fwd_rank_4 ....................***Failed 46.27 sec
Start 30: test_batchnorm_bwd_rank_4
30/43 Test #30: test_batchnorm_bwd_rank_4 ....................***Failed 67.20 sec
Start 31: test_batchnorm_infer_rank_4
31/43 Test #31: test_batchnorm_infer_rank_4 ..................***Failed 9.82 sec
Start 32: test_avg_pool3d_bwd
32/43 Test #32: test_avg_pool3d_bwd ..........................***Failed 0.68 sec
Start 33: test_max_pool3d_bwd
33/43 Test #33: test_max_pool3d_bwd .......................... Passed 1.05 sec
Start 34: test_avg_pool3d_fwd
34/43 Test #34: test_avg_pool3d_fwd ..........................***Failed 1.09 sec
Start 35: test_max_pool3d_fwd
35/43 Test #35: test_max_pool3d_fwd ..........................***Failed 2.03 sec
Start 36: test_conv_tensor_rearrange
36/43 Test #36: test_conv_tensor_rearrange ...................***Failed 3.61 sec
Start 37: test_conv_tensor_rearrange_interface
37/43 Test #37: test_conv_tensor_rearrange_interface ......... Passed 0.04 sec
Start 38: test_permute_scale
38/43 Test #38: test_permute_scale ........................... Passed 0.41 sec
Start 39: test_wrapper_layout
39/43 Test #39: test_wrapper_layout .......................... Passed 0.02 sec
Start 40: test_wrapper_tensor
40/43 Test #40: test_wrapper_tensor .......................... Passed 0.30 sec
Start 41: test_wrapper_copy
41/43 Test #41: test_wrapper_copy ............................ Passed 0.27 sec
Start 42: test_wrapper_partition
42/43 Test #42: test_wrapper_partition ....................... Passed 0.02 sec
Start 43: test_position_embedding
43/43 Test #43: test_position_embedding ...................... Passed 0.02 sec
65% tests passed, 15 tests failed out of 43
Total Test time (real) = 314.26 sec
The following tests FAILED:
9 - test_reduce_no_index (Failed)
10 - test_reduce_with_index (Failed)
13 - test_softmax_rank3 (Failed)
14 - test_softmax_rank4 (Failed)
18 - test_layernorm2d_fwd_fp16 (Failed)
19 - test_layernorm4d_fwd_fp16 (Failed)
20 - test_groupnorm_fwd_fp16 (Failed)
28 - test_elementwise_layernorm_fp16 (Failed)
29 - test_batchnorm_fwd_rank_4 (Failed)
30 - test_batchnorm_bwd_rank_4 (Failed)
31 - test_batchnorm_infer_rank_4 (Failed)
32 - test_avg_pool3d_bwd (Failed)
34 - test_avg_pool3d_fwd (Failed)
35 - test_max_pool3d_fwd (Failed)
36 - test_conv_tensor_rearrange (Failed)
Errors while running CTest
Output from these tests are in: /root/rocm-build/composable-kernel/test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
If you want more information about this, or to see the output of these tests, ask away.
UPDATE: The failed test results from CK above were compiled with HIPCC using GCC 12 headers on the Debian 12 Bookworm Docker image. Recompiling with GCC 14 headers on the most recent Debian Sid image, only two tests failed. It's a major improvement, but likely still unusable for production.
Test project /root/rocm-build/ck/test
Start 1: test_magic_number_division
1/43 Test #1: test_magic_number_division ................... Passed 3.95 sec
Start 2: test_space_filling_curve
2/43 Test #2: test_space_filling_curve ..................... Passed 0.02 sec
Start 3: test_conv_util
3/43 Test #3: test_conv_util ............................... Passed 0.02 sec
Start 4: test_reference_conv_fwd
4/43 Test #4: test_reference_conv_fwd ...................... Passed 0.02 sec
Start 5: test_gemm_fp32
5/43 Test #5: test_gemm_fp32 ............................... Passed 0.03 sec
Start 6: test_gemm_fp16
6/43 Test #6: test_gemm_fp16 ............................... Passed 0.04 sec
Start 7: test_gemm_bf16
7/43 Test #7: test_gemm_bf16 ............................... Passed 0.04 sec
Start 8: test_gemm_int8
8/43 Test #8: test_gemm_int8 ............................... Passed 0.04 sec
Start 9: test_reduce_no_index
9/43 Test #9: test_reduce_no_index ......................... Passed 3.51 sec
Start 10: test_reduce_with_index
10/43 Test #10: test_reduce_with_index ....................... Passed 6.44 sec
Start 11: test_grouped_convnd_fwd_multi_ab_interface
11/43 Test #11: test_grouped_convnd_fwd_multi_ab_interface ... Passed 0.04 sec
Start 12: test_block_to_ctile_map
12/43 Test #12: test_block_to_ctile_map ...................... Passed 0.02 sec
Start 13: test_softmax_rank3
13/43 Test #13: test_softmax_rank3 ........................... Passed 49.40 sec
Start 14: test_softmax_rank4
14/43 Test #14: test_softmax_rank4 ........................... Passed 105.24 sec
Start 15: test_softmax_interface
15/43 Test #15: test_softmax_interface ....................... Passed 0.04 sec
Start 16: test_layernorm2d_fwd_fp32
16/43 Test #16: test_layernorm2d_fwd_fp32 .................... Passed 0.78 sec
Start 17: test_groupnorm_fwd_fp32
17/43 Test #17: test_groupnorm_fwd_fp32 ...................... Passed 0.48 sec
Start 18: test_layernorm2d_fwd_fp16
18/43 Test #18: test_layernorm2d_fwd_fp16 .................... Passed 2.68 sec
Start 19: test_layernorm4d_fwd_fp16
19/43 Test #19: test_layernorm4d_fwd_fp16 .................... Passed 0.43 sec
Start 20: test_groupnorm_fwd_fp16
20/43 Test #20: test_groupnorm_fwd_fp16 ...................... Passed 2.24 sec
Start 21: test_layernorm2d_bwd_data_fp32
21/43 Test #21: test_layernorm2d_bwd_data_fp32 ............... Passed 0.48 sec
Start 22: test_groupnorm_bwd_data_fp32
22/43 Test #22: test_groupnorm_bwd_data_fp32 ................. Passed 0.42 sec
Start 23: test_layernorm2d_bwd_gamma_beta_fp32
23/43 Test #23: test_layernorm2d_bwd_gamma_beta_fp32 ......... Passed 0.39 sec
Start 24: test_groupnorm_bwd_gamma_beta_fp32
24/43 Test #24: test_groupnorm_bwd_gamma_beta_fp32 ........... Passed 0.42 sec
Start 25: test_fp8
25/43 Test #25: test_fp8 ..................................... Passed 0.02 sec
Start 26: test_bf8
26/43 Test #26: test_bf8 ..................................... Passed 0.02 sec
Start 27: test_type_convert_const
27/43 Test #27: test_type_convert_const ...................... Passed 0.02 sec
Start 28: test_elementwise_layernorm_fp16
28/43 Test #28: test_elementwise_layernorm_fp16 .............. Passed 12.33 sec
Start 29: test_batchnorm_fwd_rank_4
29/43 Test #29: test_batchnorm_fwd_rank_4 .................... Passed 69.13 sec
Start 30: test_batchnorm_bwd_rank_4
30/43 Test #30: test_batchnorm_bwd_rank_4 .................... Passed 72.18 sec
Start 31: test_batchnorm_infer_rank_4
31/43 Test #31: test_batchnorm_infer_rank_4 ..................***Failed 15.15 sec
Start 32: test_avg_pool3d_bwd
32/43 Test #32: test_avg_pool3d_bwd .......................... Passed 1.03 sec
Start 33: test_max_pool3d_bwd
33/43 Test #33: test_max_pool3d_bwd .......................... Passed 1.09 sec
Start 34: test_avg_pool3d_fwd
34/43 Test #34: test_avg_pool3d_fwd .......................... Passed 1.14 sec
Start 35: test_max_pool3d_fwd
35/43 Test #35: test_max_pool3d_fwd .......................... Passed 2.15 sec
Start 36: test_conv_tensor_rearrange
36/43 Test #36: test_conv_tensor_rearrange ...................***Failed 4.74 sec
Start 37: test_conv_tensor_rearrange_interface
37/43 Test #37: test_conv_tensor_rearrange_interface ......... Passed 0.03 sec
Start 38: test_permute_scale
38/43 Test #38: test_permute_scale ........................... Passed 0.36 sec
Start 39: test_wrapper_layout
39/43 Test #39: test_wrapper_layout .......................... Passed 0.02 sec
Start 40: test_wrapper_tensor
40/43 Test #40: test_wrapper_tensor .......................... Passed 0.26 sec
Start 41: test_wrapper_copy
41/43 Test #41: test_wrapper_copy ............................ Passed 0.26 sec
Start 42: test_wrapper_partition
42/43 Test #42: test_wrapper_partition ....................... Passed 0.02 sec
Start 43: test_position_embedding
43/43 Test #43: test_position_embedding ...................... Passed 0.02 sec
95% tests passed, 2 tests failed out of 43
Total Test time (real) = 357.21 sec
The following tests FAILED:
31 - test_batchnorm_infer_rank_4 (Failed)
36 - test_conv_tensor_rearrange (Failed)
Errors while running CTest
Output from these tests are in: /root/rocm-build/ck/test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
@TheTrustedComputer If you could run the tests with --output-on-failure
and show the output here, that'd be helpful. When I ran the tests myself for a separate issue, I got some false positives because some tests allocated more memory than my card could handle, so I'm wondering whether those two failures come from that.
Which patch did you go for, the gfx900 one or the gfx1030 one?
@GZGavinZhao Here you go.
https://drive.google.com/file/d/1GxGLxHaJagvSSTukjjLzZMZs1UgtcOEE/view
Above is a link to the verbose output of the two specific tests that failed, as it may be too long to post to GitHub. It looks like some of the four elements in these tests are returning zero instead of the expected non-zero output. Regardless, I hope you find it helpful in diagnosing whether it's a false positive or a legitimate failure. If it's the former, perhaps you can modify these tests for cards with less VRAM? To clarify, I'm using the 8 GB 5500 XT model, not the 4 GB one.
@waheedi I went for the gfx1030 patch, adding the __gfx101__-related macros to the definitions of CK_BUFFER_RESOURCE_3RD_DWORD and CK_TILE_BUFFER_RESOURCE_3RD_DWORD. I then compiled it from the ROCm 6.2 source release, targeting the gfx1012 instruction set architecture (GPU_TARGETS=gfx1012), which matches the card I currently have. Compilation succeeded, and initial test results are promising. However, I must say that building it on older Linux distributions is probably not the way to go, as you can see.
@TheTrustedComputer I'm glad you did that :)
Here's my rough attempt at adding composable kernels to the gfx1010 family. It seems I faced the exact same issue as https://github.com/ROCm/MIOpen/issues/1528, but on a gfx1012 instead of the OP's gfx1031. So I followed https://github.com/ROCm/MIOpen/pull/1531/commits/74e496d24687fdcf003d886c4d07cefd3adab47c to hopefully get CK and MIOpen working together on RDNA1. However, it appears I had already built and installed MIOpen without patching it beforehand; the error message is identical as well. Also, I had to endure nearly 17 hours compiling CK for the gfx1012.
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.
MIOpen Error: /root/rocm-6.2/MIOpen/src/hipoc/hipoc_program.cpp:292: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
UPDATE: After patching MIOpenBatchNormFwdTrainSpatial.cl and other related files to include unofficial support for RDNA1, I was finally able to run composable kernels on my 5500 XT without issues.
@TheTrustedComputer Any progress on that? Do you plan on doing a pull request or something like that? I have a Navi 10 GPU (RX 5600 XT), so I'm waiting for this to finally get ML stuff working, and I can also test the patch if needed.
@SzczurekYT Since RDNA1 GPUs are not officially supported, I have no plans to create an upstream pull request.
However, if you wish, I can provide a diff file you can patch automatically with git. This will patch both composable kernels and MIOpen to introduce basic hardware compatibility for this family of cards.
Compile for the architecture of your GPU. In your case, a 5600 XT is a gfx1011. Be prepared to wait several hours for it to complete.
> @SzczurekYT Since RDNA1 GPUs are not officially supported, I have no plans to create an upstream pull request.

I don't think the architecture not being officially supported is a problem. The way I understand it, the officially supported GPUs are guaranteed to work, but that doesn't bar the community from submitting patches for other architectures. For example, here a change for RDNA1 was accepted and cherry-picked into a release: https://github.com/ROCm/Tensile/pull/1916

> However, if you wish, I can provide a diff file you can patch automatically with git. This will patch both composable kernels and MIOpen to introduce basic hardware compatibility for this family of cards.

That would be appreciated.

> Compile for the architecture of your GPU. In your case, a 5600 XT is a gfx1011.

Interesting, as rocminfo says gfx1010. Maybe it's rounded or wrong.

> Be prepared to wait several hours for it to complete.

I can wait.
Hello, I have some trouble compiling composable_kernel for my AMD GPU architecture (gfx1010).
Any ideas about a solution?