NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] cutlass fails during tensorflow assembly #1603

Closed zkbitcoin closed 1 week ago

zkbitcoin commented 2 months ago

cutlass is used to build kernels in tensorflow.

I took a look at cutlass_archive/include/cutlass/matrix.h, and indeed set_slice3x3 is not defined, but set_slice_3x3 is.

I did not want to submit a bug and alarm anyone, as I am new to the project structure. Also, matrix.h appears not to have changed.

When I modified the include file to call set_slice_3x3, compilation succeeded just fine.

It's really strange. In fact, I searched the whole project for set_slice3x3 and it is nowhere to be found.

Let me know if you need more data from me on this. Somehow I feel it's not a bug, hence the question category.

ERROR: /home/a/.cache/bazel/_bazel_a/46cc6e345e09372840ba05860089f3a0/external/local_xla/xla/service/gpu/kernels/BUILD:368:13: Compiling xla/service/gpu/kernels/cutlass_gemm_kernel_bf16xbf16_to_bf16.cu.cc failed: (Exit 1): clang failed: error executing command (from target @local_xla//xla/service/gpu/kernels:cutlass_gemm_kernel_bf16xbf16_to_bf16) /usr/lib/llvm-19/bin/clang -MD -MF bazel-out/k8-opt/bin/external/local_xla/xla/service/gpu/kernels/_objs/cutlass_gemm_kernel_bf16xbf16_to_bf16/cutlass_gemm_kernel_bf16xbf16_to_bf16.cu.pic.d ... (remaining 130 arguments skipped)
In file included from external/local_xla/xla/service/gpu/kernels/cutlass_gemm_kernel_bf16xbf16_to_bf16.cu.cc:16:
In file included from external/cutlass_archive/include/cutlass/gemm/device/gemm_universal.h:43:
In file included from external/cutlass_archive/include/cutlass/gemm/threadblock/threadblock_swizzle.h:45:
In file included from external/cutlass_archive/include/cutlass/gemm/threadblock/threadblock_swizzle_streamk.h:58:
In file included from external/cutlass_archive/include/cutlass/core_io.h:51:
external/cutlass_archive/include/cutlass/matrix.h:7848:7: error: no member named 'set_slice3x3' in 'Matrix<type-parameter-0-0, 3, 3>'; did you mean 'set_slice_3x3'?
 7848 |     m.set_slice3x3({
      |       ^
external/cutlass_archive/include/cutlass/matrix.h:7150:12: note: 'set_slice_3x3' declared here
 7150 |   Matrix & set_slice_3x3(Matrix<Element, 3, 3> const &m, int i = 0, int j = 0) {
      |            ^
external/cutlass_archive/include/cutlass/matrix.h:14008:7: error: no member named 'set_slice3x3' in 'Matrix<type-parameter-0-0, 4, 4>'; did you mean 'set_slice_3x3'?
14008 |     m.set_slice3x3({
      |       ^
external/cutlass_archive/include/cutlass/matrix.h:12862:12: note: 'set_slice_3x3' declared here
12862 |   Matrix & set_slice_3x3(Matrix<Element, 3, 3> const &m, int i = 0, int j = 0) {
      |            ^
external/cutlass_archive/include/cutlass/matrix.h:14028:7: error: no member named 'set_slice3x3' in 'Matrix<type-parameter-0-0, 4, 4>'; did you mean 'set_slice_3x3'?
14028 |     m.set_slice3x3({
      |       ^
external/cutlass_archive/include/cutlass/matrix.h:12862:12: note: 'set_slice_3x3' declared here
12862 |   Matrix & set_slice_3x3(Matrix<Element, 3, 3> const &m, int i = 0, int j = 0) {
      |            ^
3 errors generated when compiling for sm_60.
Target //tensorflow/tools/pip_package:wheel failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 24.995s, Critical Path: 8.74s
INFO: 21 processes: 16 internal, 5 local.
FAILED: Build did NOT complete successfully

zkbitcoin commented 2 months ago

Checking the xla package next. It could be that they misspelled the method name in their project, but how did it pass the standard build checks?

Same error there, as both use cutlass.

One thing I can think of is that I am using clang 19. Still, it would be very strange if this were the cause.

thakkarV commented 2 months ago

This doesn't seem to be a CUTLASS bug at a glance.

zkbitcoin commented 2 months ago

Well, external/cutlass_archive/include/cutlass/matrix.h is where set_slice3x3 is unresolved, while set_slice_3x3 is defined.

xla includes it in external/local_xla/xla/service/gpu/kernels/cutlass_gemm_kernel_bf16xbf16_to_bf16.cu.cc:16

(from stack trace)

one can skip tensorflow and just compile xla (latest from master) using clang 19

clang --version
Ubuntu clang version 19.0.0

using this

python configure.py --backend=CUDA --cuda_compute_capabilities="6.1" --nccl --cuda_compiler=CLANG

bazel build //xla/... --spawn_strategy=sandboxed --test_output=all --copt=-Wno-error=c23-extensions

thakkarV commented 2 months ago

That's a custom kernel written in XLA using CUTLASS, not an off-the-shelf one from our repo. It seems like the slice 3x3 stuff is also an extension and not a CUTLASS method.

zkbitcoin commented 2 months ago

"That's a custom kernel written in XLA using CUTLASS, not an off-the-shelf one from our repo. It seems like the slice 3x3 stuff is also an extension and not a CUTLASS method."

But it's there, in the NVIDIA cutlass repo:

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/matrix.h#L7828

Is it possible that the clang version (19) mangles something?

mnicely commented 1 month ago

"possible clang version mangles something ?"

It's possible we don't support cuda-clang. You may want to reach out to the XLA team.

github-actions[bot] commented 1 week ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

lucifer1004 commented 3 days ago

@zkbitcoin I tried simply replacing all occurrences of set_slice3x3 within cutlass (only 4, actually) with set_slice_3x3 and managed to compile the examples with clang++. Hope this helps.
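
For readers following along, here is a minimal sketch of the shape of the problem and of the rename. It uses a toy class template, not the real cutlass::Matrix: the names ToyMatrix3x3 and scaled_identity are hypothetical and exist only for illustration. The `type-parameter-0-0` in the diagnostic suggests that clang 19 rejected the misspelled call while checking the template definition itself, before any instantiation. A compiler that defers this member lookup until instantiation would likely stay silent as long as the helper is never used. That reading is an assumption and was not confirmed in this thread.

```cpp
#include <cassert>

// Toy stand-in for a 3x3 matrix template. This is NOT the CUTLASS class;
// it only mirrors the naming issue reported in the error log.
template <typename Element>
struct ToyMatrix3x3 {
  Element data[9] = {};  // row-major storage, zero-initialized

  // Same name and parameter list as the method the compiler note points at.
  // The offsets i and j are ignored in this toy version.
  ToyMatrix3x3 &set_slice_3x3(ToyMatrix3x3 const &m, int i = 0, int j = 0) {
    (void)i;
    (void)j;
    for (int k = 0; k < 9; ++k) {
      data[k] = m.data[k];
    }
    return *this;
  }

  // Hypothetical helper that plays the role of the code around matrix.h:7848.
  static ToyMatrix3x3 scaled_identity(Element s) {
    ToyMatrix3x3 src;
    src.data[0] = s;
    src.data[4] = s;
    src.data[8] = s;

    ToyMatrix3x3 m;
    // m.set_slice3x3(src);  // misspelled: no such member exists in this class
    m.set_slice_3x3(src);    // the rename described in the comment above
    return m;
  }
};
```

Calling `ToyMatrix3x3<float>::scaled_identity(2.0f)` instantiates the helper, at which point any compiler has to resolve the member name. Restoring the commented-out line and never calling the helper is one way to check how a given compiler version treats the unused template.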

lucifer1004 commented 2 days ago

This should have been fixed by #1784. @zkbitcoin, could you please check again?