ROCm / composable_kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
https://rocm.docs.amd.com/projects/composable_kernel/en/latest/
Other
288 stars 110 forks source link

Compilation error for navi10 (use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD') #775

Open TyraVex opened 1 year ago

TyraVex commented 1 year ago

Hello, I have some trouble to compile composable_kernel for my AMD GPU architecture (gfx1010)

cmake                                                                                             \
-D CMAKE_PREFIX_PATH=/opt/rocm                                                                    \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                                                         \
-D CMAKE_CXX_FLAGS="-O3"                                                                          \
-D CMAKE_BUILD_TYPE=Release                                                                       \
-D GPU_TARGETS="gfx1010"                                                                    \
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is Clang 15.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
CMAKE_CXX_COMPILER_ID: Clang
OpenMP_CXX_LIB_NAMES: libomp;libgomp;libiomp5
OpenMP_gomp_LIBRARY: 
OpenMP_pthread_LIBRARY: 
OpenMP_CXX_FLAGS: -fopenmp=libomp -Wno-unused-command-line-argument
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
-- Build with HIP 5.4.22505
-- Clang tidy not found
-- Clang tidy checks: *,-abseil-*,-android-cloexec-fopen,-cert-msc30-c,-bugprone-exception-escape,-bugprone-macro-parentheses,-cert-env33-c,-cert-msc32-c,-cert-msc50-cpp,-cert-msc51-cpp,-cert-dcl37-c,-cert-dcl51-cpp,-clang-analyzer-alpha.core.CastToStruct,-clang-analyzer-optin.performance.Padding,-clang-diagnostic-deprecated-declarations,-clang-diagnostic-extern-c-compat,-clang-diagnostic-unused-command-line-argument,-cppcoreguidelines-avoid-c-arrays,-cppcoreguidelines-avoid-magic-numbers,-cppcoreguidelines-explicit-virtual-functions,-cppcoreguidelines-init-variables,-cppcoreguidelines-macro-usage,-cppcoreguidelines-non-private-member-variables-in-classes,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-constant-array-index,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-member-init,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-type-union-access,-cppcoreguidelines-pro-type-vararg,-cppcoreguidelines-special-member-functions,-fuchsia-*,-google-explicit-constructor,-google-readability-braces-around-statements,-google-readability-todo,-google-runtime-int,-google-runtime-references,-hicpp-vararg,-hicpp-braces-around-statements,-hicpp-explicit-conversions,-hicpp-named-parameter,-hicpp-no-array-decay,-hicpp-avoid-c-arrays,-hicpp-signed-bitwise,-hicpp-special-member-functions,-hicpp-uppercase-literal-suffix,-hicpp-use-auto,-hicpp-use-equals-default,-hicpp-use-override,-llvm-header-guard,-llvm-include-order,-llvmlibc-restrict-system-libc-headers,-llvmlibc-callee-namespace,-llvmlibc-implementation-in-namespace,-llvm-else-after-return,-llvm-qualified-auto,-misc-misplaced-const,-misc-non-private-member-variables-in-classes,-misc-no-recursion,-modernize-avoid-bind,-modernize-avoid-c-arrays,-modernize-pass-by-value,-modernize-use-auto,-modernize-use-default-member-init,-modernize-use-equals-default,-modernize-use-trailing-return-type,-modernize-use-transparent-functors,-performance-unnecessary-value-param,-readability-braces-around-statements,-readability-else-after-return,-readability-function-cognitive-complexity,-readability-isolate-declaration,-readability-magic-numbers,-readability-named-parameter,-readability-uppercase-literal-suffix,-readability-convert-member-functions-to-static,-readability-qualified-auto,-readability-redundant-string-init,-bugprone-narrowing-conversions,-cppcoreguidelines-narrowing-conversions,-altera-struct-pack-align,-cppcoreguidelines-prefer-member-initializer
CMAKE_CXX_FLAGS: -O3
adding instance device_batched_gemm_instance
adding instance device_batched_gemm_add_relu_gemm_add_instance
adding instance device_batched_gemm_bias_permute_instance
adding instance device_batched_gemm_gemm_instance
adding instance device_batched_gemm_multi_d_instance
adding instance device_batched_gemm_reduce_instance
adding instance device_batched_gemm_softmax_gemm_instance
adding instance device_batched_gemm_softmax_gemm_permute_instance
adding instance device_batchnorm_instance
adding instance device_contraction_bilinear_instance
adding instance device_contraction_scale_instance
adding instance device_conv1d_bwd_data_instance
adding instance device_conv2d_bwd_data_instance
adding instance device_conv2d_fwd_instance
adding instance device_conv2d_fwd_bias_relu_instance
adding instance device_conv2d_fwd_bias_relu_add_instance
adding instance device_conv3d_bwd_data_instance
adding instance device_elementwise_instance
adding instance device_elementwise_normalization_instance
adding instance device_gemm_instance
adding instance device_gemm_add_add_fastgelu_instance
adding instance device_gemm_add_fastgelu_instance
adding instance device_gemm_add_multiply_instance
adding instance device_gemm_add_relu_add_layernorm_instance
adding instance device_gemm_bias_add_reduce_instance
adding instance device_gemm_bilinear_instance
adding instance device_gemm_fastgelu_instance
adding instance device_gemm_reduce_instance
adding instance device_gemm_splitk_instance
adding instance device_grouped_conv1d_bwd_weight_instance
adding instance device_grouped_conv1d_fwd_instance
adding instance device_grouped_conv2d_bwd_data_instance
adding instance device_grouped_conv2d_bwd_weight_instance
adding instance device_grouped_conv2d_fwd_instance
adding instance device_grouped_conv3d_bwd_weight_instance
adding instance device_grouped_conv3d_fwd_instance
adding instance device_grouped_gemm_instance
adding instance device_grouped_gemm_fastgelu_instance
adding instance device_normalization_instance
adding instance device_pool_fwd_instance
adding instance device_quantization_instance
adding instance device_reduce_instance
adding instance device_softmax_instance
adding example example_gemm_dl_fp32
adding example example_gemm_dl_fp16
adding example example_gemm_dl_int8
adding example example_gemm_xdl_fp16
adding example example_gemm_xdl_wavelet_fp16
adding example example_gemm_xdl_bf16
adding example example_gemm_xdl_int8
adding example example_gemm_xdl_skip_b_lds_fp16
adding example example_gemm_xdl_fp64
adding example example_convnd_fwd_dl_fp16
adding example example_convnd_fwd_dl_fp32
adding example example_convnd_fwd_dl_int8
adding example example_reduce_blockwise
adding example example_reduce_multiblock_atomic_add
adding example example_reduce_blockwise_two_call
adding example example_pool2d_fwd_fp16
adding example example_pool2d_fwd_fp32
adding example example_gemm_dl_quantization_int8
adding example example_grouped_gemm_xdl_fp32
adding example example_grouped_gemm_xdl_fp16
adding example example_grouped_gemm_xdl_bfp16
adding example example_grouped_gemm_xdl_int8
adding example example_grouped_gemm_multiple_d_dl_fp16
adding example example_grouped_gemm_xdl_splitk_fp16
adding example example_convnd_bwd_data_dl_fp16
adding example example_broadcast_add_2d_amn_bn
adding example example_broadcast_add_3d_am_bmnk
adding example example_elementwise_add_1d
adding example example_elementwise_add_4d
adding example example_grouped_conv_bwd_weight_dl_fp16
adding example example_cgemm_xdl_bf16
adding example example_cgemm_xdl_fp16
adding example example_cgemm_xdl_fp32
adding example example_cgemm_xdl_int8
adding example example_softmax_blockwise
adding example example_batched_gemm_xdl_fp32
adding example example_batched_gemm_xdl_fp16
adding example example_batched_gemm_xdl_bfp16
adding example example_batched_gemm_xdl_int8
adding example example_gemm_bias_e_permute_g1m3n2k1_xdl_fp16
adding example example_gemm_bias_e_permute_g1m2n3k1_xdl_fp16
adding example example_contraction_bilinear_xdl_fp32
adding example example_contraction_scale_xdl_fp32
adding example example_contraction_bilinear_xdl_fp64
adding example example_contraction_scale_xdl_fp64
adding example example_layernorm_fp16
adding example example_layernorm_splitk_fp16
adding example example_grouped_gemm_bias_e_permute_xdl_fp16
adding example example_batched_gemm_bias_e_permute_xdl_fp16
adding example example_batched_gemm_scale_softmax_gemm_xdl_fp16
adding example example_batched_gemm_scale_softmax_gemm_xdl_bf16
adding example example_batched_gemm_scale_softmax_gemm_permute_xdl_fp16
adding example example_batched_gemm_scale_softmax_gemm_permute_xdl_bf16
adding example example_grouped_gemm_scale_softmax_gemm_permute_xdl_fp16
adding example example_batched_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
adding example example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
adding example example_dual_reduce_multiblock
adding example example_dual_reduce_threadwise
adding example example_batchnorm_forward_training
adding example example_batchnorm_forward_inferring
adding example example_batchnorm_backward
adding example example_sparse_embedding3_forward_layernorm
adding example example_batched_gemm_add_add_relu_gemm_add_xdl_fp16
adding example example_permute_1xHxW_fp16
adding example example_permute_NxHxW_fp16
adding example example_permute_HxWx4_fp16
adding example example_conv2d_fwd_dl_perlayer_quantization_int8
adding example example_conv2d_fwd_dl_perchannel_quantization_int8
adding example example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8
adding example example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8
adding example example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8
adding example example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8
adding example example_groupnorm_sigmoid_mul_fp16
adding example example_groupnorm_splitk_fp16
adding example example_groupnorm_swish_fp16
adding example example_splitk_gemm_bias_e_permute_xdl_fp16
adding example example_splitk_gemm_bias_e_permute_xdl_fp32
adding example example_elementwise_permute_4D_fp16
adding example example_elementwise_permute_4D_fp16_2d
adding example example_elementwise_layernorm_blockwise
adding example example_gemm_add_multiply_dl_fp16
adding example example_gemm_add_multiply_xdl_fp16
adding example example_pool3d_fwd_fp16
adding example example_maxpool2d_bwd_bf16
adding example example_maxpool2d_bwd_fp16
adding example example_maxpool2d_bwd_fp32
adding example example_put_element_fp16
-- Fetching GoogleTest
-- Suppressing googltest warnings with flags: -Wno-undef;-Wno-reserved-identifier;-Wno-global-constructors;-Wno-missing-noreturn;-Wno-disabled-macro-expansion;-Wno-used-but-marked-unused;-Wno-switch-enum;-Wno-zero-as-null-pointer-constant;-Wno-unused-member-function;-Wno-comma;-Wno-old-style-cast;-Wno-deprecated;-Wno-unsafe-buffer-usage
-- Found Python: /usr/local/bin/python3 (found version "3.8.10") found components: Interpreter 
adding test test_magic_number_division
adding test test_space_filling_curve
adding gtest test_conv_util
adding gtest test_reference_conv_fwd
adding test test_gemm_fp32
adding test test_gemm_fp16
adding test test_gemm_bf16
adding test test_gemm_int8
adding test test_gemm_standalone_xdl_fp16
adding test test_gemm_reduce_fp16
adding test test_reduce_no_index
adding test test_reduce_with_index
adding gtest test_grouped_convnd_fwd
adding gtest test_block_to_ctile_map
adding gtest test_softmax_rank3
adding gtest test_softmax_rank4
adding gtest test_softmax_interface
adding gtest test_layernorm2d_fp32
adding gtest test_layernorm2d_fp16
adding gtest test_groupnorm_fp16
adding gtest test_groupnorm_fp32
adding gtest test_fp8
adding gtest test_elementwise_layernorm_fp16
adding gtest test_batchnorm_fwd_rank_4
adding gtest test_batchnorm_bwd_rank_4
adding gtest test_batchnorm_infer_rank_4
adding gtest test_contraction
adding gtest test_avg_pool2d_fwd
adding gtest test_avg_pool3d_fwd
adding gtest test_max_pool2d_fwd
adding gtest test_max_pool3d_fwd
adding gtest test_batched_gemm_multi_d
RPM version 4.14.2.1
-- Configuring done
-- Generating done
-- Build files have been written to: /home/tyra/rocm/composable_kernel/build
Scanning dependencies of target device_softmax_instance
[  0%] Building CXX object library/src/tensor_operation_instance/gpu/softmax/CMakeFiles/device_softmax_instance.dir/device_softmax_f16_f16_instance_rank3_reduce1.cpp.o
In file included from /home/tyra/rocm/composable_kernel/library/src/tensor_operation_instance/gpu/softmax/device_softmax_f16_f16_instance_rank3_reduce1.cpp:9:
In file included from /home/tyra/rocm/composable_kernel/library/include/ck/library/tensor_operation_instance/gpu/softmax/device_softmax_f16_f16_instance_type.hpp:8:
In file included from /home/tyra/rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_softmax_impl.hpp:12:
In file included from /home/tyra/rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_reduce_common.hpp:9:
In file included from /home/tyra/rocm/composable_kernel/include/ck/utility/common_header.hpp:36:
/home/tyra/rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
    wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
                                               ^
/home/tyra/rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
    wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
                                               ^
2 errors generated when compiling for gfx1010.
make[2]: *** [library/src/tensor_operation_instance/gpu/softmax/CMakeFiles/device_softmax_instance.dir/build.make:82: library/src/tensor_operation_instance/gpu/softmax/CMakeFiles/device_softmax_instance.dir/device_softmax_f16_f16_instance_rank3_reduce1.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:14807: library/src/tensor_operation_instance/gpu/softmax/CMakeFiles/device_softmax_instance.dir/all] Error 2
make: *** [Makefile:182: all] Error 2

Any ideas about a solution?

TyraVex commented 1 year ago

I ended up modifying ./include/ck/ck.hpp I added || defined(__gfx1010__), in order to use the gfx1030 value of CK_BUFFER_RESOURCE_3RD_DWORD with the gfx1010, since it's the closest architecture I can think of. Here is what it looks like:

// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx908__) || \
    defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) ||                          \
    defined(__gfx942__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx1030__) || defined(__gfx1010__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31004000
#endif

Then, the build was successful. I haven't tested it yet.

zjing14 commented 1 year ago

@TyraVex Thanks for reporting this issue. Please let us know if it works or not. We do not have Navi10 GPUs to test it.

TyraVex commented 1 year ago

Sorry for the late response. Back when I was using ubuntu for rocm compilation, it didn't worked, I was getting gpu invalid instructions errors, but cloning the project today and using my modifications led to a successfull build, surprisingly.

Here are the test results

Running tests...
Test project /home/tyra/composable_kernel/build
        Start   1: example_gemm_dl_fp32
  1/119 Test   #1: example_gemm_dl_fp32 ......................................................   Passed    2.95 sec
        Start   2: example_gemm_dl_fp16
  2/119 Test   #2: example_gemm_dl_fp16 ......................................................   Passed    2.84 sec
        Start   3: example_gemm_dl_int8
  3/119 Test   #3: example_gemm_dl_int8 ......................................................   Passed    2.73 sec
        Start   4: example_gemm_xdl_fp16
  4/119 Test   #4: example_gemm_xdl_fp16 .....................................................   Passed    2.76 sec
        Start   5: example_gemm_xdl_wavelet_fp16
  5/119 Test   #5: example_gemm_xdl_wavelet_fp16 .............................................   Passed    2.77 sec
        Start   6: example_gemm_xdl_bf16
  6/119 Test   #6: example_gemm_xdl_bf16 .....................................................   Passed    2.66 sec
        Start   7: example_gemm_xdl_int8
  7/119 Test   #7: example_gemm_xdl_int8 .....................................................   Passed    2.66 sec
        Start   8: example_gemm_xdl_skip_b_lds_fp16
  8/119 Test   #8: example_gemm_xdl_skip_b_lds_fp16 ..........................................   Passed    0.28 sec
        Start   9: example_convnd_fwd_dl_fp16
  9/119 Test   #9: example_convnd_fwd_dl_fp16 ................................................   Passed    5.37 sec
        Start  10: example_convnd_fwd_dl_fp32
 10/119 Test  #10: example_convnd_fwd_dl_fp32 ................................................   Passed    4.92 sec
        Start  11: example_convnd_fwd_dl_int8
 11/119 Test  #11: example_convnd_fwd_dl_int8 ................................................   Passed    4.80 sec
        Start  12: example_reduce_blockwise
 12/119 Test  #12: example_reduce_blockwise ..................................................   Passed    6.04 sec
        Start  13: example_reduce_multiblock_atomic_add
 13/119 Test  #13: example_reduce_multiblock_atomic_add ......................................   Passed    3.71 sec
        Start  14: example_reduce_blockwise_two_call
 14/119 Test  #14: example_reduce_blockwise_two_call .........................................   Passed   25.22 sec
        Start  15: example_pool2d_fwd_fp16
 15/119 Test  #15: example_pool2d_fwd_fp16 ...................................................   Passed    2.93 sec
        Start  16: example_pool2d_fwd_fp32
 16/119 Test  #16: example_pool2d_fwd_fp32 ...................................................   Passed    2.81 sec
        Start  17: example_gemm_dl_quantization_int8
 17/119 Test  #17: example_gemm_dl_quantization_int8 .........................................Subprocess aborted***Exception:   0.64 sec
        Start  18: example_grouped_gemm_xdl_fp32
 18/119 Test  #18: example_grouped_gemm_xdl_fp32 .............................................   Passed    0.01 sec
        Start  19: example_grouped_gemm_xdl_fp16
 19/119 Test  #19: example_grouped_gemm_xdl_fp16 .............................................   Passed    0.01 sec
        Start  20: example_grouped_gemm_xdl_bfp16
 20/119 Test  #20: example_grouped_gemm_xdl_bfp16 ............................................   Passed    0.01 sec
        Start  21: example_grouped_gemm_xdl_int8
 21/119 Test  #21: example_grouped_gemm_xdl_int8 .............................................   Passed    0.01 sec
        Start  22: example_grouped_gemm_multiple_d_dl_fp16
 22/119 Test  #22: example_grouped_gemm_multiple_d_dl_fp16 ...................................   Passed    0.01 sec
        Start  23: example_grouped_gemm_xdl_splitk_fp16
 23/119 Test  #23: example_grouped_gemm_xdl_splitk_fp16 ......................................   Passed    0.01 sec
        Start  24: example_convnd_bwd_data_dl_fp16
 24/119 Test  #24: example_convnd_bwd_data_dl_fp16 ...........................................   Passed    1.65 sec
        Start  25: example_broadcast_add_2d_amn_bn
 25/119 Test  #25: example_broadcast_add_2d_amn_bn ...........................................   Passed    0.28 sec
        Start  26: example_broadcast_add_3d_am_bmnk
 26/119 Test  #26: example_broadcast_add_3d_am_bmnk ..........................................   Passed    0.25 sec
        Start  27: example_elementwise_add_1d
 27/119 Test  #27: example_elementwise_add_1d ................................................   Passed    0.25 sec
        Start  28: example_elementwise_add_4d
 28/119 Test  #28: example_elementwise_add_4d ................................................   Passed    0.25 sec
        Start  29: example_grouped_conv_bwd_weight_dl_fp16
 29/119 Test  #29: example_grouped_conv_bwd_weight_dl_fp16 ...................................   Passed    0.26 sec
        Start  30: example_cgemm_xdl_bf16
 30/119 Test  #30: example_cgemm_xdl_bf16 ....................................................   Passed    0.01 sec
        Start  31: example_cgemm_xdl_fp16
 31/119 Test  #31: example_cgemm_xdl_fp16 ....................................................   Passed    0.01 sec
        Start  32: example_cgemm_xdl_fp32
 32/119 Test  #32: example_cgemm_xdl_fp32 ....................................................   Passed    0.01 sec
        Start  33: example_cgemm_xdl_int8
 33/119 Test  #33: example_cgemm_xdl_int8 ....................................................   Passed    0.01 sec
        Start  34: example_softmax_blockwise
 34/119 Test  #34: example_softmax_blockwise .................................................   Passed    1.53 sec
        Start  35: example_batched_gemm_xdl_fp32
 35/119 Test  #35: example_batched_gemm_xdl_fp32 .............................................   Passed    0.01 sec
        Start  36: example_batched_gemm_xdl_fp16
 36/119 Test  #36: example_batched_gemm_xdl_fp16 .............................................   Passed    0.01 sec
        Start  37: example_batched_gemm_xdl_bfp16
 37/119 Test  #37: example_batched_gemm_xdl_bfp16 ............................................   Passed    0.01 sec
        Start  38: example_batched_gemm_xdl_int8
 38/119 Test  #38: example_batched_gemm_xdl_int8 .............................................   Passed    0.01 sec
        Start  39: example_gemm_bias_e_permute_g1m3n2k1_xdl_fp16
 39/119 Test  #39: example_gemm_bias_e_permute_g1m3n2k1_xdl_fp16 .............................   Passed    1.68 sec
        Start  40: example_gemm_bias_e_permute_g1m2n3k1_xdl_fp16
 40/119 Test  #40: example_gemm_bias_e_permute_g1m2n3k1_xdl_fp16 .............................   Passed    0.34 sec
        Start  41: example_contraction_bilinear_xdl_fp32
 41/119 Test  #41: example_contraction_bilinear_xdl_fp32 .....................................   Passed    0.75 sec
        Start  42: example_contraction_scale_xdl_fp32
 42/119 Test  #42: example_contraction_scale_xdl_fp32 ........................................   Passed    0.56 sec
        Start  43: example_contraction_bilinear_xdl_fp64
 43/119 Test  #43: example_contraction_bilinear_xdl_fp64 .....................................   Passed    0.79 sec
        Start  44: example_contraction_scale_xdl_fp64
 44/119 Test  #44: example_contraction_scale_xdl_fp64 ........................................   Passed    0.59 sec
        Start  45: example_layernorm_fp16
 45/119 Test  #45: example_layernorm_fp16 ....................................................   Passed    0.29 sec
        Start  46: example_layernorm_splitk_fp16
 46/119 Test  #46: example_layernorm_splitk_fp16 .............................................   Passed    0.29 sec
        Start  47: example_grouped_gemm_bias_e_permute_xdl_fp16
 47/119 Test  #47: example_grouped_gemm_bias_e_permute_xdl_fp16 ..............................   Passed    0.01 sec
        Start  48: example_batched_gemm_bias_e_permute_xdl_fp16
 48/119 Test  #48: example_batched_gemm_bias_e_permute_xdl_fp16 ..............................   Passed    0.40 sec
        Start  49: example_batched_gemm_scale_softmax_gemm_xdl_fp16
 49/119 Test  #49: example_batched_gemm_scale_softmax_gemm_xdl_fp16 ..........................   Passed    0.27 sec
        Start  50: example_batched_gemm_scale_softmax_gemm_xdl_bf16
 50/119 Test  #50: example_batched_gemm_scale_softmax_gemm_xdl_bf16 ..........................   Passed    0.27 sec
        Start  51: example_batched_gemm_scale_softmax_gemm_permute_xdl_fp16
 51/119 Test  #51: example_batched_gemm_scale_softmax_gemm_permute_xdl_fp16 ..................   Passed    0.75 sec
        Start  52: example_batched_gemm_scale_softmax_gemm_permute_xdl_bf16
 52/119 Test  #52: example_batched_gemm_scale_softmax_gemm_permute_xdl_bf16 ..................   Passed    0.68 sec
        Start  53: example_grouped_gemm_scale_softmax_gemm_permute_xdl_fp16
 53/119 Test  #53: example_grouped_gemm_scale_softmax_gemm_permute_xdl_fp16 ..................   Passed    0.36 sec
        Start  54: example_batched_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
 54/119 Test  #54: example_batched_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16 ...   Passed    0.75 sec
        Start  55: example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
 55/119 Test  #55: example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16 ...   Passed    0.36 sec
        Start  56: example_dual_reduce_multiblock
 56/119 Test  #56: example_dual_reduce_multiblock ............................................   Passed    3.35 sec
        Start  57: example_dual_reduce_threadwise
 57/119 Test  #57: example_dual_reduce_threadwise ............................................   Passed    0.27 sec
        Start  58: example_batchnorm_forward_training
 58/119 Test  #58: example_batchnorm_forward_training ........................................   Passed    1.12 sec
        Start  59: example_batchnorm_forward_training_obsolete
 59/119 Test  #59: example_batchnorm_forward_training_obsolete ...............................   Passed    1.10 sec
        Start  60: example_batchnorm_forward_inferring
 60/119 Test  #60: example_batchnorm_forward_inferring .......................................   Passed    2.53 sec
        Start  61: example_batchnorm_backward
 61/119 Test  #61: example_batchnorm_backward ................................................   Passed    3.34 sec
        Start  62: example_sparse_embedding3_forward_layernorm
 62/119 Test  #62: example_sparse_embedding3_forward_layernorm ...............................   Passed   47.00 sec
        Start  63: example_batched_gemm_add_add_relu_gemm_add_xdl_fp16
 63/119 Test  #63: example_batched_gemm_add_add_relu_gemm_add_xdl_fp16 .......................   Passed    0.48 sec
        Start  64: example_permute_1xHxW_fp16
 64/119 Test  #64: example_permute_1xHxW_fp16 ................................................   Passed    0.47 sec
        Start  65: example_permute_NxHxW_fp16
 65/119 Test  #65: example_permute_NxHxW_fp16 ................................................   Passed    0.88 sec
        Start  66: example_permute_HxWx4_fp16
 66/119 Test  #66: example_permute_HxWx4_fp16 ................................................   Passed    1.13 sec
        Start  67: example_conv2d_fwd_dl_perlayer_quantization_int8
 67/119 Test  #67: example_conv2d_fwd_dl_perlayer_quantization_int8 ..........................Subprocess aborted***Exception:   0.70 sec
        Start  68: example_conv2d_fwd_dl_perchannel_quantization_int8
 68/119 Test  #68: example_conv2d_fwd_dl_perchannel_quantization_int8 ........................Subprocess aborted***Exception:   0.68 sec
        Start  69: example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8
 69/119 Test  #69: example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8 ................Subprocess aborted***Exception:   0.71 sec
        Start  70: example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8
 70/119 Test  #70: example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8 ..............Subprocess aborted***Exception:   0.68 sec
        Start  71: example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8
 71/119 Test  #71: example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8 ................Subprocess aborted***Exception:   0.71 sec
        Start  72: example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8
 72/119 Test  #72: example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8 ..............Subprocess aborted***Exception:   0.69 sec
        Start  73: example_groupnorm_sigmoid_mul_fp16
 73/119 Test  #73: example_groupnorm_sigmoid_mul_fp16 ........................................   Passed    7.67 sec
        Start  74: example_groupnorm_splitk_fp16
 74/119 Test  #74: example_groupnorm_splitk_fp16 .............................................   Passed    7.51 sec
        Start  75: example_groupnorm_swish_fp16
 75/119 Test  #75: example_groupnorm_swish_fp16 ..............................................   Passed    7.55 sec
        Start  76: example_splitk_gemm_bias_e_permute_xdl_fp16
 76/119 Test  #76: example_splitk_gemm_bias_e_permute_xdl_fp16 ...............................   Passed    0.41 sec
        Start  77: example_splitk_gemm_bias_e_permute_xdl_fp32
 77/119 Test  #77: example_splitk_gemm_bias_e_permute_xdl_fp32 ...............................   Passed    0.40 sec
        Start  78: example_elementwise_permute_4D_fp16
 78/119 Test  #78: example_elementwise_permute_4D_fp16 .......................................   Passed    0.37 sec
        Start  79: example_elementwise_permute_4D_fp16_2d
 79/119 Test  #79: example_elementwise_permute_4D_fp16_2d ....................................   Passed   12.55 sec
        Start  80: example_elementwise_layernorm_blockwise
 80/119 Test  #80: example_elementwise_layernorm_blockwise ...................................   Passed    1.10 sec
        Start  81: example_gemm_add_multiply_dl_fp16
 81/119 Test  #81: example_gemm_add_multiply_dl_fp16 .........................................   Passed    1.50 sec
        Start  82: example_gemm_add_multiply_xdl_fp16
 82/119 Test  #82: example_gemm_add_multiply_xdl_fp16 ........................................   Passed    1.49 sec
        Start  83: example_pool3d_fwd_fp16
 83/119 Test  #83: example_pool3d_fwd_fp16 ...................................................   Passed    0.30 sec
        Start  84: example_maxpool2d_bwd_bf16
 84/119 Test  #84: example_maxpool2d_bwd_bf16 ................................................   Passed    0.26 sec
        Start  85: example_maxpool2d_bwd_fp16
 85/119 Test  #85: example_maxpool2d_bwd_fp16 ................................................   Passed    0.26 sec
        Start  86: example_maxpool2d_bwd_fp32
 86/119 Test  #86: example_maxpool2d_bwd_fp32 ................................................   Passed    0.25 sec
        Start  87: example_put_element_fp16
 87/119 Test  #87: example_put_element_fp16 ..................................................   Passed    0.25 sec
        Start  88: test_magic_number_division
 88/119 Test  #88: test_magic_number_division ................................................   Passed    1.26 sec
        Start  89: test_space_filling_curve
 89/119 Test  #89: test_space_filling_curve ..................................................   Passed    0.01 sec
        Start  90: test_conv_util
 90/119 Test  #90: test_conv_util ............................................................   Passed    0.01 sec
        Start  91: test_reference_conv_fwd
 91/119 Test  #91: test_reference_conv_fwd ...................................................   Passed    0.01 sec
        Start  92: test_gemm_fp32
 92/119 Test  #92: test_gemm_fp32 ............................................................   Passed   46.32 sec
        Start  93: test_gemm_fp16
 93/119 Test  #93: test_gemm_fp16 ............................................................   Passed  378.44 sec
        Start  94: test_gemm_bf16
 94/119 Test  #94: test_gemm_bf16 ............................................................   Passed   18.93 sec
        Start  95: test_gemm_int8
 95/119 Test  #95: test_gemm_int8 ............................................................   Passed   63.21 sec
        Start  96: test_gemm_standalone_xdl_fp16
 96/119 Test  #96: test_gemm_standalone_xdl_fp16 .............................................   Passed  363.97 sec
        Start  97: test_gemm_reduce_fp16
 97/119 Test  #97: test_gemm_reduce_fp16 .....................................................   Passed    0.35 sec
        Start  98: test_reduce_no_index
 98/119 Test  #98: test_reduce_no_index ......................................................   Passed    3.31 sec
        Start  99: test_reduce_with_index
 99/119 Test  #99: test_reduce_with_index ....................................................   Passed    5.96 sec
        Start 100: test_grouped_convnd_fwd
100/119 Test #100: test_grouped_convnd_fwd ...................................................   Passed  905.83 sec
        Start 101: test_block_to_ctile_map
101/119 Test #101: test_block_to_ctile_map ...................................................   Passed    0.01 sec
        Start 102: test_softmax_rank3
102/119 Test #102: test_softmax_rank3 ........................................................   Passed   56.83 sec
        Start 103: test_softmax_rank4
103/119 Test #103: test_softmax_rank4 ........................................................   Passed  123.09 sec
        Start 104: test_softmax_interface
104/119 Test #104: test_softmax_interface ....................................................   Passed    0.01 sec
        Start 105: test_layernorm2d_fp32
105/119 Test #105: test_layernorm2d_fp32 .....................................................   Passed    0.81 sec
        Start 106: test_layernorm2d_fp16
106/119 Test #106: test_layernorm2d_fp16 .....................................................   Passed    1.24 sec
        Start 107: test_groupnorm_fp16
107/119 Test #107: test_groupnorm_fp16 .......................................................   Passed    1.04 sec
        Start 108: test_groupnorm_fp32
108/119 Test #108: test_groupnorm_fp32 .......................................................   Passed    0.48 sec
        Start 109: test_fp8
109/119 Test #109: test_fp8 ..................................................................   Passed    0.01 sec
        Start 110: test_elementwise_layernorm_fp16
110/119 Test #110: test_elementwise_layernorm_fp16 ...........................................   Passed    7.02 sec
        Start 111: test_batchnorm_fwd_rank_4
111/119 Test #111: test_batchnorm_fwd_rank_4 .................................................   Passed   69.16 sec
        Start 112: test_batchnorm_bwd_rank_4
112/119 Test #112: test_batchnorm_bwd_rank_4 .................................................   Passed   98.97 sec
        Start 113: test_batchnorm_infer_rank_4
113/119 Test #113: test_batchnorm_infer_rank_4 ...............................................   Passed   21.39 sec
        Start 114: test_contraction
114/119 Test #114: test_contraction ..........................................................   Passed   14.43 sec
        Start 115: test_avg_pool2d_fwd
115/119 Test #115: test_avg_pool2d_fwd .......................................................   Passed    0.26 sec
        Start 116: test_avg_pool3d_fwd
116/119 Test #116: test_avg_pool3d_fwd .......................................................   Passed    1.00 sec
        Start 117: test_max_pool2d_fwd
117/119 Test #117: test_max_pool2d_fwd .......................................................   Passed    0.27 sec
        Start 118: test_max_pool3d_fwd
118/119 Test #118: test_max_pool3d_fwd .......................................................   Passed    1.83 sec
        Start 119: test_batched_gemm_multi_d
119/119 Test #119: test_batched_gemm_multi_d .................................................   Passed    1.05 sec

94% tests passed, 7 tests failed out of 119

Total Test time (real) = 2384.13 sec

The following tests FAILED:
    17 - example_gemm_dl_quantization_int8 (Subprocess aborted)
    67 - example_conv2d_fwd_dl_perlayer_quantization_int8 (Subprocess aborted)
    68 - example_conv2d_fwd_dl_perchannel_quantization_int8 (Subprocess aborted)
    69 - example_conv2d_fwd_dl_bias_relu_perlayer_quantization_int8 (Subprocess aborted)
    70 - example_conv2d_fwd_dl_bias_relu_perchannel_quantization_int8 (Subprocess aborted)
    71 - example_conv2d_fwd_dl_bias_tanh_perlayer_quantization_int8 (Subprocess aborted)
    72 - example_conv2d_fwd_dl_bias_tanh_perchannel_quantization_int8 (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /home/tyra/composable_kernel/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:91: test] Error 8

LastTest.log

trixirt commented 10 months ago

In the 5.7 branch

There is missing support for gfx941, gfx942, gfx1012 and gfx1030 So the only generally useful set is the gfx11.

Folks will be unhappily surprised when MIOpen and similar users of CK fail.

GZGavinZhao commented 8 months ago

Our experiments at Solus have found the following:

waheedi commented 1 month ago
  • gfx1010 is probably best to be treated as gfx900, because even though gfx1030 is theoretically the closest, there are optimizations and LLVM builtins that gfx1010 cannot use while gfx1030 can, such as __builtin_amdgcn_fdot2.

@GZGavinZhao Its interesting that the __builtin_amdgcn_fdot2 and __builtin_amdgcn_sdot4 features are not supported, maybe we can just add them to the features list (i digged it up here) https://github.com/llvm/llvm-project/blob/8a79dc7e6f765f3f49c5dd9330fc0826d3362858/llvm/lib/Target/AMDGPU/AMDGPU.td#L1496

So if I could just get to understand that we are hitting a hardware limit with gfx1010 (RX 5700 and the likes, Navi10 processor) that it can not do what the gfx1030 (RX6900 and the likes Navi21,23) is doing or its just someones idea for business strategy that gfx1010 is too old to support or whats the case? if its a hardware limitation where did we hit a dead end? why the software is not shipped as well as for gfx1010, im just trying to understand what is going on, whos deciding the stuff?

I know there have been a tremendous amount of efforts all over the place which i was not part of, and I appreciate everyone contribution to where things have gone so far and very thankful to everyone's effort but we need to fix this issue, old GPUS are not going to the trash man (if the plan is to send them to the trash, then too bad of a waste)!

I just read some main instruction set parts for memory shader and some other registers https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna-shader-instruction-set-architecture.pdf both AMD-GCN-GFX10-RDNA2 and AMD-GCN-GFX10-RDNA1 seems to have an identical instructions sets, I understand there are some details i missed, but from a hardware point of view, everything seems to be existing already in gfx1010 so our problem is software, so probably just copying the gfx1030 stuff and use it for gfx1010 and maybe without modifying anything (or with modifiying few things which is a nightmare:)) we would get a better support for that Navi10 (that gddr6 monster can do crazy stuff, so people are using older software by adopting to the gfx900 arch, well its still a toy in comparison to the MI300X)

the problem is that this is not an easy job to do, it should have been considered earlier but we can maybe still do something all together, as the projects involved maybe beyond my capabilities, like:

1-LLVM 2- rocBlas 3- CK 4- MIOpen

waheedi commented 1 month ago

I just hit this error while trying to add the gfx1030 features to gfx1010,

Cached asm caps differ from derived asm caps for (10, 1, 0)

which is expected, as I modified the AMDGPU Target for that arch, so now my question, which one is the cached asm caps and which one derived from where?

I passed through the error by forking the Tensile repo and IgnoringAsmCap mismatch

trixirt commented 1 month ago

Could you raise an issue in Tensile for this ? Fedora will use the upstream Tensile, not a copy.

waheedi commented 1 month ago

Thanks @trixirt , yes, I will also add this issue to it:

below is actually the compileArgs command, and the first element is None, which i guess is the cxx_compiler(it should not be None):

printing our args as we have a None Type error at the moment [None, '-D__HIP_HCC_COMPAT_MODE__=1', '--cuda-device-only', '-x', 'hip', '-O3', '-I', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile', '-Xoffload-linker', '--build-id', '--offload-arch=gfx1010', '--offload-arch=gfx803', '--offload-arch=gfx1030', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile/TensileLibrary_Type_ZZ_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback.cpp', '-c', '-o', '/home/bargo/projects/rocm-setup/rocBLAS/build/library/src/build_tmp/TENSILE/code_object_tmp/TensileLibrary_Type_ZZ_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback.o'] launcher: [] hipFlags: ['-D__HIP_HCC_COMPAT_MODE__=1', '--cuda-device-only', '-x', 'hip', '-O3', '-I', '/home/bargo/projects/rocm-setup/rocBLAS/build/Tensile', '-Xoffload-linker', '--build-id'] archFlags: ['--offload-arch=gfx1010', '--offload-arch=gfx803', '--offload-arch=gfx1030']

GZGavinZhao commented 1 month ago

I agree this issue/discussion should continue in Tensile, but just to add my 2 cents:

I just read some main instruction set parts for memory shader and some other registers https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna-shader-instruction-set-architecture.pdf both AMD-GCN-GFX10-RDNA2 and AMD-GCN-GFX10-RDNA1 seems to have an identical instructions sets

RDNA1 and RDNA2 are not the same instruction set. For example, consult section 6.3 of the RDNA1 and RDNA2 ISA architecture. You will see that RDNA2 has V_DOT* instructions like V_DOT2_F32_F16 that RDNA1 doesn't have. Moreover, aside from instruction differences, there are fundamentally different designs in RDNA1 and RDNA2 hardware that makes code compiled for the latter unable to run on the former. For example, the different number of registers. It's possible you may get lucky and the code you compiled against RDNA2 doesn't use the extra registers at all, so it runs fine on RDNA1, but that's pure luck and there's no way we can guarantee that.

So if I could just get to understand that we are hitting a hardware limit with gfx1010 (RX 5700 and the likes, Navi10 processor) that it can not do what the gfx1030 (RX6900 and the likes Navi21,23) is doing or its just someones idea for business strategy that gfx1010 is too old to support or whats the case? if its a hardware limitation where did we hit a dead end? why the software is not shipped as well as for gfx1010, im just trying to understand what is going on, whos deciding the stuff?

As per my reasoning above, in my understanding it is a hardware limit that gfx1010 cannot do the same as gfx1030. This makes sense anyway, because why would RDNA2 even exist in the first place if it's just more FLOPS and no new instructions/optimizations?

Therefore, any attempt at trying to compile and run code intended for RDNA2 on RDNA1 will eventually fail. The best we can do (and what it's already been done at Solus) is to patch libraries like rocBLAS, Tensile, and CK so that they can compile and run RDNA1 code on a best-effort basis. I believe Solus's support for RDNA1 hardware is complete. I assume you own a RDNA1 hardware, so feel free to grab a Solus ISO, install the pytorch package, give it a try (it should work out of the box, no environment variables or anything needed), and let me know or open an issue at the Solus repo if anything doesn't work.

waheedi commented 1 month ago

Thanks for having a look @GZGavinZhao appreciate all your efforts.

RDNA1 and RDNA2 are not the same instruction set. For example, consult section 6.3 of the RDNA1 and RDNA2 ISA architecture. You will see that RDNA2 has V_DOT* instructions like V_DOT2_F32_F16 that RDNA1 doesn't have. Moreover, aside from instruction differences, there are fundamentally different designs in RDNA1 and RDNA2 hardware that makes code compiled for the latter unable to run on the former. For example, the different number of registers. It's possible you may get lucky and the code you compiled against RDNA2 doesn't use the extra registers at all, so it runs fine on RDNA1, but that's pure luck and there's no way we can guarantee that.

Actually regarding the above I have a feeling that we can build these instructions for the gfx1010, you referred to the VOP3P instruction which can also be built for that target, it could be all what im doing is none sense, but I like trying stuff and let me see where that takes me :) I will keep you posted either way.

Therefore, any attempt at trying to compile and run code intended for RDNA2 on RDNA1 will eventually fail. The best we can do (and what it's already been done at Solus) is to patch libraries like rocBLAS, Tensile, and CK so that they can compile and run RDNA1 code on a best-effort basis. I believe Solus's support for RDNA1 hardware is complete. I assume you own a RDNA1 hardware, so feel free to grab a Solus ISO, install the pytorch package, give it a try (it should work out of the box, no environment variables or anything needed), and let me know or open an issue at the Solus repo if anything doesn't work.

i think Solus is definitely helpful but currently running the gfx1010 on gfx900 is a downgrade.

It would be my last resort to do that nevertheless I really like what you guys are doing at Solus looks a good approach for a better GPU support,

I have plenty of GPUs as I work as a repairman for myself. And thus i ended up with plenty of Graphic cards from all generations. So i need to be fair to myself :)

GZGavinZhao commented 1 month ago

i think Solus is definitely helpful but currently running the gfx1010 on gfx900 is a downgrade.

Note that only for CK, gfx1010 has the same code path as gfx900. For Tensile/rocBLAS, gfx1010 is being compiled normally, so it receives all the optimizations that gfx1010 should get by default.

I do agree that for CK specifically, we may be able to treat gfx1010 as gfx1030 in general and then workaround specific cases where VDOT instructions are used. However, for all other libraries, you shouldn't need to do anything; gfx1010 should just compile and run normally, at least that's the case for Solus with ROCm 5.7, 6.0, and 6.1.

waheedi commented 1 month ago

Note that only for CK, gfx1010 has the same code path as gfx900. For Tensile/rocBLAS, gfx1010 is being compiled normally, so it receives all the optimizations that gfx1010 should get by default.

I do agree that for CK specifically, we may be able to treat gfx1010 as gfx1030 in general and then workaround specific cases where VDOT instructions are used. However, for all other libraries, you shouldn't need to do anything; gfx1010 should just compile and run normally, at least that's the case for Solus with ROCm 5.7, 6.0, and 6.1.

Yes I noticed that, and that what brought me here :) I think there is a room of improvement in general :) the repos are huge and there are a lot of stuff like 20 repos :D

waheedi commented 1 month ago

Right now I'm just reading some weird stuff like this: defm V_DOT2_F32_F16 : VOP3P_Real_gfx10_gfx11_gfx12_Triple<0x13>; this is basically in some llvm file I think this one, llvm-project/llvm/lib/Target/AMDGPU/VOP3PInstructions.td and it shows that this V_DOT2_F32_F16 Which means its already part of that VOP3P Instructions set I dont understand that language very well, very complex for my heart

GZGavinZhao commented 1 month ago

That file is LLVM TableGen. The problem is again that V_DOT2_F32_F16 is not part of the RDNA1 instruction set, so even if you could make LLVM accept that, you'll very likely have problems running that instruction on the hardware.

By saying "then workaround specific cases where VDOT instructions are used", I meant instead of relying on hardware-accelerated instructions, write portable HIP code that are the equivalent of the VDOT operation.

waheedi commented 1 month ago

I think I understand what you mean, but you did not understand what I mean :)

It worked! right now here is my results for building for ck gfx1030;gfx1010;gfx803 with all develop branches using llvm19 and clang19

a fresh make looks like this:

make -j 8
[  0%] Built target device_max_pool_bwd_instance
[  0%] Built target device_transpose_instance
[  0%] Built target device_elementwise_normalization_instance
[  0%] Built target device_normalization_bwd_data_instance
[  0%] Built target device_normalization_bwd_gamma_beta_instance
[  0%] Built target device_avg_pool3d_bwd_instance
[ 14%] Built target device_elementwise_instance
[ 14%] Built target device_column_to_image_instance
[ 14%] Built target device_image_to_column_instance
[ 14%] Built target device_pool3d_fwd_instance
[ 14%] Built target device_normalization_fwd_instance
[ 28%] Built target utility
[ 42%] Built target device_permute_scale_instance
[ 57%] Built target device_softmax_instance
[ 57%] Built target example_gemm_dpp_fp16
[ 57%] Built target example_reduce_blockwise
[100%] Built target example_put_element_fp16
[100%] Built target example_avgpool3d_bwd_bf16
[100%] Built target example_avgpool3d_bwd_fp16
[100%] Built target example_avgpool3d_bwd_fp32
[100%] Built target example_image_to_column_f32
[100%] Built target example_column_to_image_f32
[100%] Built target example_layernorm2d_bwd_fp32
[100%] Built target example_groupnorm_bwd_fp32
[100%] Built target example_layernorm4d_fwd_fp16
[100%] Built target example_layernorm4d_fwd_splitk_fp16
[100%] Built target ckProfiler
#Complete man

I guess now we will fail in some areas which we need to look at more closely ;)

I have no idea how bad is this but it looks good for my eyes

bargo@beta:~/projects/rocm-setup/composable_kernel/build/bin$ ./ckProfiler groupnorm 1 0 2 1 1  --length 1 16 16 32 40
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
warning: xnack 'Off' was requested for a processor that does not support it!
found 39 instances
Perf:  0.0956515 ms, 27.5132 GB/s, DeviceNormalizationFwdImpl<64,Cluster_MK_1_64,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0547736 ms, 48.0465 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0424971 ms, 61.9261 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0379853 ms, 69.2816 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0457218 ms, 57.5585 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0262616 ms, 100.21 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_2,XYSrcVectorDim_1,VectorSize_X2_Gamma2_Beta2_Y2>
Perf:  0.0248361 ms, 105.962 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0246625 ms, 106.708 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0268792 ms, 97.9077 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:   0.032927 ms, 79.9246 GB/s, DeviceNormalizationFwdImpl<128,Cluster_MK_1_128,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0194474 ms, 135.323 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0191178 ms, 137.656 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0225913 ms, 116.491 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:   0.032919 ms, 79.9441 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_2_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0284743 ms, 92.4229 GB/s, DeviceNormalizationFwdImpl<256,Cluster_MK_1_256,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0181155 ms, 145.273 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:   0.018889 ms, 139.323 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0209594 ms, 125.561 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_2_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0203058 ms, 129.603 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0246889 ms, 106.594 GB/s, DeviceNormalizationFwdImpl<1024,Cluster_MK_1_1024,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0954451 ms, 27.5727 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0864766 ms, 30.4323 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0890941 ms, 29.5382 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0908229 ms, 28.976 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_1,XYSrcVectorDim_1,VectorSize_X1_Gamma1_Beta1_Y1>
Perf:  0.0523808 ms, 50.2413 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_2,XYSrcVectorDim_1,VectorSize_X2_Gamma2_Beta2_Y2>
Perf:  0.0371093 ms, 70.917 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0310343 ms, 84.7991 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0274904 ms, 95.7309 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0293671 ms, 89.6132 GB/s, DeviceNormalizationFwdSplitKImpl<128,Cluster_MK_1_128,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0351702 ms, 74.8271 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0300303 ms, 87.6342 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0311927 ms, 84.3685 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0290031 ms, 90.7378 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_2_16,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0326374 ms, 80.6338 GB/s, DeviceNormalizationFwdSplitKImpl<256,Cluster_MK_1_256,Slice_MK_1_32,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0358837 ms, 73.3391 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0326766 ms, 80.5371 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0352086 ms, 74.7455 GB/s, DeviceNormalizationFwdSplitKImpl<512,Cluster_MK_1_512,Slice_MK_2_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0421652 ms, 62.4136 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
Perf:  0.0389533 ms, 67.5599 GB/s, DeviceNormalizationFwdSplitKImpl<1024,Cluster_MK_1_1024,Slice_MK_1_8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
length = 1,16,16,32,40
best perf = 0.0181155 ms, 145.273 GB/s, DeviceNormalizationFwdImpl<512,Cluster_MK_1_512,Slice_MK_1_4,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>

I thank the Lord of Skies and Earth for what he gives and teaches and I also thank everyone for his help and contribution.

we need more documentations, I do not even know how to run the ckPorfiler man.

GZGavinZhao commented 1 month ago

Really nice! I'm curious, what did you change? Did you only update CK, or did you also update the TableGen files in LLVM as well?

waheedi commented 1 month ago

Really nice! I'm curious, what did you change? Did you only update CK, or did you also update the TableGen files in LLVM as well?

Thanks! I actually changed few things in the target code for that, and only made one change in ck, cmakelists to treat gfx1010 as gfx1030, but now I would need to continue the process, I think I will leave it here for now, but at least you know now, that this is not impossible as you thought earlier but rather possible , even it took a lot of my energy, but I need to provide better quality service for my future customers at https://ropotato.com

I think there will be more work to be done generally, but at least we have progress.

GZGavinZhao commented 1 month ago

Interesting. By "impossible", I meant it's impossible to compile CK as gfx1030 (i.e. only --offload-arch=gfx1030) and run the resulting library directly on gfx1010 hardware. CK is already runnable on gfx1010 since its inception with a small amount of patches.

waheedi commented 1 month ago

No man, gfx1010 is not run-able with these changes, do you get it now?

+++ b/include/ck/ck.hpp
@@ -78,7 +78,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
 #define CK_BUFFER_RESOURCE_3RD_DWORD -1
 #elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
-#elif defined(__gfx103__)
+#elif defined(__gfx103__) || defined(__gfx1010__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
 #elif defined(__gfx11__) || defined(__gfx12__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x31004000
@@ -88,7 +88,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
 #ifndef __HIP_DEVICE_COMPILE__                   // for host code, define nothing
 #elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
 #define CK_USE_AMD_V_MAC_F32
-#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
+#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) || defined(__gfx1010__) // for GPU code

unless you modify the stuff like i did, but you keep changing what you said, and i dont like that, I will quit the discussion

GZGavinZhao commented 1 month ago

My apologies and I think there are misunderstandings from the start of this conversation. In my first response to you, I said

Therefore, any attempt at trying to compile and run code intended for RDNA2 on RDNA1 will eventually fail.

This still holds. If you compile code intended for RDNA2 (gfx1030) as gfx1010 by passing --offload-arch=gfx1010 and attempt to run the result on RDNA1 hardware, due to the difference in ISA instructions you cannot guarantee the code always compile or work on RDNA1 hardware. Note that I'm saying always. There has been cases where people on RDNA1 hardware try to run code compiled for RDNA2 through HSA_OVERRIDE_GFX_VERSION=10.3.0; sometimes it works, sometimes it doesn't.

What you seem to be trying to do is to make CK somehow run on RDNA1. This is a solved problem, see the diff below.

As to why the changes you posted don't work, I never said that the changes you posted were the ones needed to make it work. For ROCm 6.0, all you need is a simple sed -i 's/defined(__gfx900__)/defined(__gfx900__) || defined(__gfx1010__)/g' include/ck/ck.hpp to make it work. In practice, the equivalent changes on the current develop branch would be:

diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..32551a8da 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
 // buffer resource
 #ifndef __HIP_DEVICE_COMPILE__ // for host code
 #define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
 #elif defined(__gfx103__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -86,7 +86,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)

 // FMA instruction
 #ifndef __HIP_DEVICE_COMPILE__                   // for host code, define nothing
-#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx101__) // for GPU code
 #define CK_USE_AMD_V_MAC_F32
 #elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
 #define CK_USE_AMD_V_FMAC_F32

If you want to be really precise on using all the optimizations available, it would be this:

diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..4eaeefdae 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
 // buffer resource
 #ifndef __HIP_DEVICE_COMPILE__ // for host code
 #define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
 #elif defined(__gfx103__)
 #define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -88,10 +88,12 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
 #ifndef __HIP_DEVICE_COMPILE__                   // for host code, define nothing
 #elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
 #define CK_USE_AMD_V_MAC_F32
-#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
+#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) || defined(__gfx1011__) || defined(__gfx1012__) // for GPU code
 #define CK_USE_AMD_V_FMAC_F32
 #define CK_USE_AMD_V_DOT2_F32_F16
 #define CK_USE_AMD_V_DOT4_I32_I8
+#elif defined(__gfx101__)
+#define CK_USE_AMD_V_MAC_F32
 #elif defined(__gfx11__) || defined(__gfx12__)
 #define CK_USE_AMD_V_FMAC_F32
 #define CK_USE_AMD_V_DOT2_F32_F16

No changes in LLVM should be necessary. You can do everything in CK. The only concern I have is I don't understand how CK_BUFFER_RESOURCE_3RD_DWORD is calculated, so I'm not sure whether we're assigning the right value to it for gfx1010, but in practice I haven't seen that causing any trouble.

waheedi commented 1 month ago

No problem at all @GZGavinZhao I think I understand your point of view now, I much appreciate all your good work out there man, dont get me wrong and have a pleasant time.

All my best,

waheedi commented 1 month ago

Ok I found out that I messed up few things, also the diff above is missing on additional file: include/ck_tile/core/config.hpp also needs to be adjusted @GZGavinZhao

No changes in LLVM should be necessary. You can do everything in CK. The only concern I have is I don't understand how CK_BUFFER_RESOURCE_3RD_DWORD is calculated,

depending what you want to do, if you want to go a bit deeper than a shallow clone then you would need to have a look and maybe modify few things over there,

so I'm not sure whether we're assigning the right value to it for gfx1010, but in practice I haven't seen that causing any trouble.

The correct CK_BUFFER_RESOURCE_3RD_DWORD for gfx1010 should be equivalent to the gfx1030 according to the architecture docs. So 0x00020000 is not the correct value, but it will not be a disaster :) the correct value seems to be 0x31014000

I hope that helps

GZGavinZhao commented 1 month ago

depending what you want to do, if you want to go a bit deeper than a shallow clone then you would need to have a look and maybe modify few things over there,

If you're targeting the amd-staging branch of the ROCm LLVM, which is unstable and changing as the AMD engineers add more features that may not been more thoroughly-tested, you may get some crashes that are not present in released versions, and consequently you may need to change more code. I didn't want to deal with them so all the changes I made were on top of a tagged and released ref like rocm-6.1.2. If you don't have a particular reason to not do so, I would suggest that you also work on a released ROCm version like 6.0.0 or 6.1.2. At least for ROCm 6.0, I find that I didn't need to adjust for LLVM; just changing CK is able to make everything (including PyTorch) compile.

The correct CK_BUFFER_RESOURCE_3RD_DWORD for gfx1010 should be equivalent to the gfx1030 according to the architecture docs. ... the correct value seems to be 0x31014000

Would you mind telling me where did you find this information? I assume you're referencing the RDNA ISA architectures here? If so, could you please tell me the section number of the Instruction Set Architecture reference guide in which you found this information? I agree that it's highly likely CK_BUFFER_RESOURCE_3RD_DWORD is the same for RDNA1 and RDNA2 since table 37 in section 8.1.8 are equivalent, but I'm not sure if that would qualify as concrete evidence.

waheedi commented 1 month ago

If you're targeting the amd-staging branch of the ROCm LLVM, which is unstable and changing as the AMD engineers add more features that may not been more thoroughly-tested, you may get some crashes that are not present in released versions, and consequently you may need to change more code. I didn't want to deal with them so all the changes I made were on top of a tagged and released ref like rocm-6.1.2. If you don't have a particular reason to not do so, I would suggest that you also work on a released ROCm version like 6.0.0 or 6.1.2. At least for ROCm 6.0, I find that I didn't need to adjust for LLVM; just changing CK is able to make everything (including PyTorch) compile.

Well, its good to find bugs, or issues as early as possible, so I'm trying to help myself and possibly others. So I would go for the amd-staging as I'm also in development mode and I need to change stuff anyway. But I get your point its definitely much easier to deal with a released version because then you may not need to build the whole stack (big difference of time for sure)

Would you mind telling me where did you find this information? I assume you're referencing the RDNA ISA architectures here? If so, could you please tell me the section number of the Instruction Set Architecture reference guide in which you found this information? I agree that it's highly likely CK_BUFFER_RESOURCE_3RD_DWORD is the same for RDNA1 and RDNA2 since table 37 in section 8.1.8 are equivalent, but I'm not sure if that would qualify as concrete evidence.

I got the information from the RDNA1 and RDNA2 architecture you posted above, obviously I'm not guessing, and FYI both have the same section number:
8.1.8, check it out ;)

Ok I obviously did not read the last sentence in your reply, so updating: 8.1.8 clarifies the BUFFER_RESOURCE 128-bit structure which are identical in both architectures and the 3RD_DWORD is derived from that, which are equivalent (of course I assumed few things are also equivalent, and that might not be true but the closest to it).

Adding more explanation:

So, the third DWORD (bits 96-127) contains the following fields:

Num_records (bits 95:64) 
Dst_sel_x, Dst_sel_y, Dst_sel_z, Dst_sel_w (bits 98:96, 101:99, 104:102, 107:105)
Format (bits 114:108)
Index stride (bits 118:117)
Add tid enable (bit 119)
Resource Level (bit 120)
OOB_SELECT (bits 125:124)
Type (bits 127:126)

So I may be totally incorrect in my assumptions but also I maybe correct in all the assumptions except for the Num_records I have a feeling that one needs to be set to 0x7800 instead of 0x14000 (no I think it should be 0x14000) I was comparing wrong values, nvm

Thats it for Today :)

waheedi commented 1 month ago

and here more detailed documentation, this one was hidden :D https://doc-july-17.readthedocs.io/en/latest/GCN_ISA_Manuals/testdocbook.html?highlight=vector-memory-buffer-instructions#vector-memory-buffer-instructions

waheedi commented 1 month ago

Ok I'm partially done, here is my results, now running on all latest develop as of two days ago :) i think the performance is not bad, but I dont know how it compares to the HSA_OVERRIDE to 1030, if anyone have the gfx1010 running on the latest official rocm versions I would like to see your results (7.3 seconds for a standard no-half generation on SD) :

Screenshot_20240730_211900

The torch version is actually main as of yesterday so its 2.5.rc0 I think the 2.3.1 is just the number i choose to let pip3 manage the stuff correctly

i think there is still high disk latency as I'm using a physical hdd from 2009 for this test machine :D

TheTrustedComputer commented 1 month ago

I applied patches to the ROCm 6.2 release to make composable kernels compile for the 5500 XT (gfx1012) as described, as well as unit tests for verification. It seems to work, but there's a lot more fine-tuning to do, since several tests failed the check.

Test project /root/rocm-build/composable-kernel/test
      Start  1: test_magic_number_division
 1/43 Test  #1: test_magic_number_division ...................   Passed    1.79 sec
      Start  2: test_space_filling_curve
 2/43 Test  #2: test_space_filling_curve .....................   Passed    0.02 sec
      Start  3: test_conv_util
 3/43 Test  #3: test_conv_util ...............................   Passed    0.02 sec
      Start  4: test_reference_conv_fwd
 4/43 Test  #4: test_reference_conv_fwd ......................   Passed    0.02 sec
      Start  5: test_gemm_fp32
 5/43 Test  #5: test_gemm_fp32 ...............................   Passed    0.04 sec
      Start  6: test_gemm_fp16
 6/43 Test  #6: test_gemm_fp16 ...............................   Passed    0.04 sec
      Start  7: test_gemm_bf16
 7/43 Test  #7: test_gemm_bf16 ...............................   Passed    0.04 sec
      Start  8: test_gemm_int8
 8/43 Test  #8: test_gemm_int8 ...............................   Passed    0.04 sec
      Start  9: test_reduce_no_index
 9/43 Test  #9: test_reduce_no_index .........................***Failed    1.91 sec
      Start 10: test_reduce_with_index
10/43 Test #10: test_reduce_with_index .......................***Failed    3.82 sec
      Start 11: test_grouped_convnd_fwd_multi_ab_interface
11/43 Test #11: test_grouped_convnd_fwd_multi_ab_interface ...   Passed    0.04 sec
      Start 12: test_block_to_ctile_map
12/43 Test #12: test_block_to_ctile_map ......................   Passed    0.02 sec
      Start 13: test_softmax_rank3
13/43 Test #13: test_softmax_rank3 ...........................***Failed   52.44 sec
      Start 14: test_softmax_rank4
14/43 Test #14: test_softmax_rank4 ...........................***Failed  113.14 sec
      Start 15: test_softmax_interface
15/43 Test #15: test_softmax_interface .......................   Passed    0.06 sec
      Start 16: test_layernorm2d_fwd_fp32
16/43 Test #16: test_layernorm2d_fwd_fp32 ....................   Passed    0.83 sec
      Start 17: test_groupnorm_fwd_fp32
17/43 Test #17: test_groupnorm_fwd_fp32 ......................   Passed    0.51 sec
      Start 18: test_layernorm2d_fwd_fp16
18/43 Test #18: test_layernorm2d_fwd_fp16 ....................***Failed    0.66 sec
      Start 19: test_layernorm4d_fwd_fp16
19/43 Test #19: test_layernorm4d_fwd_fp16 ....................***Failed    0.33 sec
      Start 20: test_groupnorm_fwd_fp16
20/43 Test #20: test_groupnorm_fwd_fp16 ......................***Failed    0.64 sec
      Start 21: test_layernorm2d_bwd_data_fp32
21/43 Test #21: test_layernorm2d_bwd_data_fp32 ...............   Passed    0.52 sec
      Start 22: test_groupnorm_bwd_data_fp32
22/43 Test #22: test_groupnorm_bwd_data_fp32 .................   Passed    0.45 sec
      Start 23: test_layernorm2d_bwd_gamma_beta_fp32
23/43 Test #23: test_layernorm2d_bwd_gamma_beta_fp32 .........   Passed    0.42 sec
      Start 24: test_groupnorm_bwd_gamma_beta_fp32
24/43 Test #24: test_groupnorm_bwd_gamma_beta_fp32 ...........   Passed    0.40 sec
      Start 25: test_fp8
25/43 Test #25: test_fp8 .....................................   Passed    0.02 sec
      Start 26: test_bf8
26/43 Test #26: test_bf8 .....................................   Passed    0.02 sec
      Start 27: test_type_convert_const
27/43 Test #27: test_type_convert_const ......................   Passed    0.02 sec
      Start 28: test_elementwise_layernorm_fp16
28/43 Test #28: test_elementwise_layernorm_fp16 ..............***Failed    3.15 sec
      Start 29: test_batchnorm_fwd_rank_4
29/43 Test #29: test_batchnorm_fwd_rank_4 ....................***Failed   46.27 sec
      Start 30: test_batchnorm_bwd_rank_4
30/43 Test #30: test_batchnorm_bwd_rank_4 ....................***Failed   67.20 sec
      Start 31: test_batchnorm_infer_rank_4
31/43 Test #31: test_batchnorm_infer_rank_4 ..................***Failed    9.82 sec
      Start 32: test_avg_pool3d_bwd
32/43 Test #32: test_avg_pool3d_bwd ..........................***Failed    0.68 sec
      Start 33: test_max_pool3d_bwd
33/43 Test #33: test_max_pool3d_bwd ..........................   Passed    1.05 sec
      Start 34: test_avg_pool3d_fwd
34/43 Test #34: test_avg_pool3d_fwd ..........................***Failed    1.09 sec
      Start 35: test_max_pool3d_fwd
35/43 Test #35: test_max_pool3d_fwd ..........................***Failed    2.03 sec
      Start 36: test_conv_tensor_rearrange
36/43 Test #36: test_conv_tensor_rearrange ...................***Failed    3.61 sec
      Start 37: test_conv_tensor_rearrange_interface
37/43 Test #37: test_conv_tensor_rearrange_interface .........   Passed    0.04 sec
      Start 38: test_permute_scale
38/43 Test #38: test_permute_scale ...........................   Passed    0.41 sec
      Start 39: test_wrapper_layout
39/43 Test #39: test_wrapper_layout ..........................   Passed    0.02 sec
      Start 40: test_wrapper_tensor
40/43 Test #40: test_wrapper_tensor ..........................   Passed    0.30 sec
      Start 41: test_wrapper_copy
41/43 Test #41: test_wrapper_copy ............................   Passed    0.27 sec
      Start 42: test_wrapper_partition
42/43 Test #42: test_wrapper_partition .......................   Passed    0.02 sec
      Start 43: test_position_embedding
43/43 Test #43: test_position_embedding ......................   Passed    0.02 sec

65% tests passed, 15 tests failed out of 43

Total Test time (real) = 314.26 sec

The following tests FAILED:
          9 - test_reduce_no_index (Failed)
         10 - test_reduce_with_index (Failed)
         13 - test_softmax_rank3 (Failed)
         14 - test_softmax_rank4 (Failed)
         18 - test_layernorm2d_fwd_fp16 (Failed)
         19 - test_layernorm4d_fwd_fp16 (Failed)
         20 - test_groupnorm_fwd_fp16 (Failed)
         28 - test_elementwise_layernorm_fp16 (Failed)
         29 - test_batchnorm_fwd_rank_4 (Failed)
         30 - test_batchnorm_bwd_rank_4 (Failed)
         31 - test_batchnorm_infer_rank_4 (Failed)
         32 - test_avg_pool3d_bwd (Failed)
         34 - test_avg_pool3d_fwd (Failed)
         35 - test_max_pool3d_fwd (Failed)
         36 - test_conv_tensor_rearrange (Failed)
Errors while running CTest
Output from these tests are in: /root/rocm-build/composable-kernel/test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

If you want more information about this, or to see the output of these tests, ask away.

TheTrustedComputer commented 1 month ago

UPDATE: The failed test results from CK have been compiled with HIPCC using GCC 12 headers on the Debian 12 Bookworm Docker image. Recompiling with GCC 14 headers on the most recent Debian Sid point image, only two tests failed. It's a major improvement but likely still unusable for production.

Test project /root/rocm-build/ck/test
      Start  1: test_magic_number_division
 1/43 Test  #1: test_magic_number_division ...................   Passed    3.95 sec
      Start  2: test_space_filling_curve
 2/43 Test  #2: test_space_filling_curve .....................   Passed    0.02 sec
      Start  3: test_conv_util
 3/43 Test  #3: test_conv_util ...............................   Passed    0.02 sec
      Start  4: test_reference_conv_fwd
 4/43 Test  #4: test_reference_conv_fwd ......................   Passed    0.02 sec
      Start  5: test_gemm_fp32
 5/43 Test  #5: test_gemm_fp32 ...............................   Passed    0.03 sec
      Start  6: test_gemm_fp16
 6/43 Test  #6: test_gemm_fp16 ...............................   Passed    0.04 sec
      Start  7: test_gemm_bf16
 7/43 Test  #7: test_gemm_bf16 ...............................   Passed    0.04 sec
      Start  8: test_gemm_int8
 8/43 Test  #8: test_gemm_int8 ...............................   Passed    0.04 sec
      Start  9: test_reduce_no_index
 9/43 Test  #9: test_reduce_no_index .........................   Passed    3.51 sec
      Start 10: test_reduce_with_index
10/43 Test #10: test_reduce_with_index .......................   Passed    6.44 sec
      Start 11: test_grouped_convnd_fwd_multi_ab_interface
11/43 Test #11: test_grouped_convnd_fwd_multi_ab_interface ...   Passed    0.04 sec
      Start 12: test_block_to_ctile_map
12/43 Test #12: test_block_to_ctile_map ......................   Passed    0.02 sec
      Start 13: test_softmax_rank3
13/43 Test #13: test_softmax_rank3 ...........................   Passed   49.40 sec
      Start 14: test_softmax_rank4
14/43 Test #14: test_softmax_rank4 ...........................   Passed  105.24 sec
      Start 15: test_softmax_interface
15/43 Test #15: test_softmax_interface .......................   Passed    0.04 sec
      Start 16: test_layernorm2d_fwd_fp32
16/43 Test #16: test_layernorm2d_fwd_fp32 ....................   Passed    0.78 sec
      Start 17: test_groupnorm_fwd_fp32
17/43 Test #17: test_groupnorm_fwd_fp32 ......................   Passed    0.48 sec
      Start 18: test_layernorm2d_fwd_fp16
18/43 Test #18: test_layernorm2d_fwd_fp16 ....................   Passed    2.68 sec
      Start 19: test_layernorm4d_fwd_fp16
19/43 Test #19: test_layernorm4d_fwd_fp16 ....................   Passed    0.43 sec
      Start 20: test_groupnorm_fwd_fp16
20/43 Test #20: test_groupnorm_fwd_fp16 ......................   Passed    2.24 sec
      Start 21: test_layernorm2d_bwd_data_fp32
21/43 Test #21: test_layernorm2d_bwd_data_fp32 ...............   Passed    0.48 sec
      Start 22: test_groupnorm_bwd_data_fp32
22/43 Test #22: test_groupnorm_bwd_data_fp32 .................   Passed    0.42 sec
      Start 23: test_layernorm2d_bwd_gamma_beta_fp32
23/43 Test #23: test_layernorm2d_bwd_gamma_beta_fp32 .........   Passed    0.39 sec
      Start 24: test_groupnorm_bwd_gamma_beta_fp32
24/43 Test #24: test_groupnorm_bwd_gamma_beta_fp32 ...........   Passed    0.42 sec
      Start 25: test_fp8
25/43 Test #25: test_fp8 .....................................   Passed    0.02 sec
      Start 26: test_bf8
26/43 Test #26: test_bf8 .....................................   Passed    0.02 sec
      Start 27: test_type_convert_const
27/43 Test #27: test_type_convert_const ......................   Passed    0.02 sec
      Start 28: test_elementwise_layernorm_fp16
28/43 Test #28: test_elementwise_layernorm_fp16 ..............   Passed   12.33 sec
      Start 29: test_batchnorm_fwd_rank_4
29/43 Test #29: test_batchnorm_fwd_rank_4 ....................   Passed   69.13 sec
      Start 30: test_batchnorm_bwd_rank_4
30/43 Test #30: test_batchnorm_bwd_rank_4 ....................   Passed   72.18 sec
      Start 31: test_batchnorm_infer_rank_4
31/43 Test #31: test_batchnorm_infer_rank_4 ..................***Failed   15.15 sec
      Start 32: test_avg_pool3d_bwd
32/43 Test #32: test_avg_pool3d_bwd ..........................   Passed    1.03 sec
      Start 33: test_max_pool3d_bwd
33/43 Test #33: test_max_pool3d_bwd ..........................   Passed    1.09 sec
      Start 34: test_avg_pool3d_fwd
34/43 Test #34: test_avg_pool3d_fwd ..........................   Passed    1.14 sec
      Start 35: test_max_pool3d_fwd
35/43 Test #35: test_max_pool3d_fwd ..........................   Passed    2.15 sec
      Start 36: test_conv_tensor_rearrange
36/43 Test #36: test_conv_tensor_rearrange ...................***Failed    4.74 sec
      Start 37: test_conv_tensor_rearrange_interface
37/43 Test #37: test_conv_tensor_rearrange_interface .........   Passed    0.03 sec
      Start 38: test_permute_scale
38/43 Test #38: test_permute_scale ...........................   Passed    0.36 sec
      Start 39: test_wrapper_layout
39/43 Test #39: test_wrapper_layout ..........................   Passed    0.02 sec
      Start 40: test_wrapper_tensor
40/43 Test #40: test_wrapper_tensor ..........................   Passed    0.26 sec
      Start 41: test_wrapper_copy
41/43 Test #41: test_wrapper_copy ............................   Passed    0.26 sec
      Start 42: test_wrapper_partition
42/43 Test #42: test_wrapper_partition .......................   Passed    0.02 sec
      Start 43: test_position_embedding
43/43 Test #43: test_position_embedding ......................   Passed    0.02 sec

95% tests passed, 2 tests failed out of 43

Total Test time (real) = 357.21 sec

The following tests FAILED:
         31 - test_batchnorm_infer_rank_4 (Failed)
         36 - test_conv_tensor_rearrange (Failed)
Errors while running CTest
Output from these tests are in: /root/rocm-build/ck/test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
GZGavinZhao commented 1 month ago

@TheTrustedComputer If you could run test with --output-on-failure and show the output here that'd be helpful. When I ran the tests myself for a separate issue, I got some false positives because some tests allocated more memory than what my card can handle, so I'm wonder whether those two failures are from that.

waheedi commented 1 month ago

which patches did you go for the gfx900 one or the gfx1030 one?

TheTrustedComputer commented 1 month ago

@GZGavinZhao Here you go.

https://drive.google.com/file/d/1GxGLxHaJagvSSTukjjLzZMZs1UgtcOEE/view

Above is a link to the verbose output of the two specific tests that failed, as it may be too long to post to GitHub. It looks like some of the four elements in these tests are returning zero instead of expecting non-zero output. Regardless, I hope you find it helpful in diagnosing whether it's a false positive or a legitimate one. If it's the former, perhaps you can modify these tests for cards with less VRAM? To clarify, I'm using the 8GB 5500 XT model and not the 4GB one.

@waheedi I went for the gfx1030 patch to include the __gfx101__ related macros for the definitions ofCK_BUFFER_RESOURCE_3RD_DWORD and CK_TILE_BUFFER_RESOURCE_3RD_DWORD. I then compiled it from the ROCm 6.2 source release, targeting the gfx1012 instruction set architecture (GPU_TARGETS=gfx1012), which is the card I currently have. Compilation succeeded, and initial test results are promising. However, I must say that building it on older Linux distributions is probably not the way to go, as you can see.

waheedi commented 1 month ago

@TheTrustedComputer I'm glad you did that :)

TheTrustedComputer commented 3 weeks ago

Here's my rough attempt at adding composable kernels to the gfx1010 family. It seems I faced the exact issue as https://github.com/ROCm/MIOpen/issues/1528, but on a gfx1012 instead of the OP's gfx1031. So I followed https://github.com/ROCm/MIOpen/pull/1531/commits/74e496d24687fdcf003d886c4d07cefd3adab47c to hopefully get CK and MIOpen working together on RDNA1. However, it appears I already built and installed MIOpen without patching beforehand; the error message is identical as well. Also, I had to endure nearly 17 hours compiling CK for the gfx1012.

<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /root/rocm-6.2/MIOpen/src/hipoc/hipoc_program.cpp:292: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl