NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[BUG] StaticAssert "No static/dynamic overlap allowed in MixedBits." with NVHPC 23.1 / CUDA 11.8 for A100 #1031

Closed: reuterbal closed 1 year ago

reuterbal commented 1 year ago

Describe the bug

Building cutlass 3.2 (current main) with NVHPC 23.1 / CUDA 11.8 for A100 fails with a static assertion in the tests:

[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o
"/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/swizzle.hpp", line 235: error: static assertion failed with "No static/dynamic overlap allowed in MixedBits."
    static_assert((StaticInt & StaticFlags) == 0, "No static/dynamic overlap allowed in MixedBits.");
    ^
          detected during:
            instantiation of class "cute::MixedBits<StaticInt, DynamicType, StaticFlags> [with StaticInt=1U, DynamicType=uint32_t, StaticFlags=1U]" at line 255
            instantiation of "auto cute::make_mixed_bits(const cute::constant<S, s> &, const DynamicType &, const cute::constant<F, f> &) [with S=uint32_t, s=1U, DynamicType=uint32_t, F=uint32_t, f=1U]" at line 55 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/test/unit/cute/core/mixedbits.cpp"
            instantiation of function "lambda [](auto)->auto [with <auto-1>=cute::C<1U>]" at line 173 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of function "lambda [](auto &&...)->auto [with <auto-1>=<cute::C<0U>, cute::C<1U>, cute::C<2U>, cute::C<3U>, cute::C<4U>, cute::C<5U>, cute::C<6U>, cute::C<7U>>]" at line 89 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of "auto cute::detail::apply(T &&, F &&, cute::seq<I...>) [with T=std::integer_sequence<uint32_t, 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U> &, F=lambda [](auto &&...)->auto, I=<0, 1, 2, 3, 4, 5, 6, 7>]" at line 173 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of "void cute::for_each(T &&, F &&) [with T=std::integer_sequence<uint32_t, 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U>, F=lambda [](auto)->auto]" at line 68 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/test/unit/cute/core/mixedbits.cpp"
            instantiation of function "lambda [](auto)->auto [with <auto-1>=cute::C<1U>]" at line 173 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of function "lambda [](auto &&...)->auto [with <auto-1>=<cute::C<0U>, cute::C<1U>, cute::C<2U>, cute::C<3U>, cute::C<4U>, cute::C<5U>, cute::C<6U>, cute::C<7U>>]" at line 89 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of "auto cute::detail::apply(T &&, F &&, cute::seq<I...>) [with T=std::integer_sequence<uint32_t, 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U> &, F=lambda [](auto &&...)->auto, I=<0, 1, 2, 3, 4, 5, 6, 7>]" at line 173 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/include/cute/algorithm/tuple_algorithms.hpp"
            instantiation of "void cute::for_each(T &&, F &&) [with T=std::integer_sequence<uint32_t, 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U>, F=lambda [](auto)->auto]" at line 69 of "/leonardo_scratch/large/userexternal/<uid>/cutlass/cutlass/test/unit/cute/core/mixedbits.cpp"

Steps/Code to reproduce bug

git clone https://github.com/NVIDIA/cutlass.git
mkdir cutlass/build && cd cutlass/build
cmake .. -DCUTLASS_NVCC_ARCHS=80
make -j 64

Expected behavior

Build succeeds.

Environment details (please complete the following information):

LEONARDO, with the system-provided NVHPC 23.1 and CUDA 11.8 installations.

Additional context

The problem does not seem to occur with release 3.1.0 (that build has not yet completed).

thakkarV commented 1 year ago

@mhoemmen who just encountered this himself. Mark, could we please upstream your solution here as a bugfix before 3.3 release?

thakkarV commented 1 year ago

It is regrettable how much 2.x headers inadvertently include parts of CuTe :(

mhoemmen commented 1 year ago

@thakkarV Clang 14 as the host compiler also manifests this issue. It turns out that GCC 8.3.0 did not like a straightforward fix. GCC 8.3.0 also seems to believe that cute::C<static_cast<uint32_t>(0)>::value_type is not uint32_t. I'm working on a solution that all these compilers will find acceptable.

mhoemmen commented 1 year ago

@reuterbal Just to clarify: when you say "NVHPC 23.1," do you mean the nvc++ compiler, or GCC (g++) as the host compiler plus nvcc as the device compiler? I'm guessing the latter, because even if nvc++ is in the PATH, CUTLASS will prefer the GCC + nvcc option.

reuterbal commented 1 year ago

No, I am indeed using nvc++ as host compiler:

[<uid>@login05 build]$ grep CMAKE_CXX_COMPILER CMakeCache.txt
CMAKE_CXX_COMPILER:FILEPATH=/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/compilers/bin/nvc++
mhoemmen commented 1 year ago

@reuterbal Thanks for responding! : - ) Could you please also list the following variables from your CMakeCache.txt file?

It's surprising to me that CUTLASS detected nvc++ automatically. I'm guessing that g++ and gcc are not in your PATH. I don't know what this system "LEONARDO" is, but I'm also guessing that it uses a modules system and that you have the GCC modules unloaded and the NVHPC modules loaded.

CUTLASS 3 does not currently have official support for nvc++. Do you need to use nvc++ or are you just relying on default settings to build?

reuterbal commented 1 year ago

I did not realize that nvc++ wasn't an officially supported compiler; thanks for clearing that up. You are correct: LEONARDO is a new EuroHPC system with A100 GPUs. CUTLASS is used inside a mixed Fortran/CUDA application that uses OpenACC, and I was using the same compiler for all dependencies. I could certainly try rebuilding with the GNU 11.3.0 available there.

[<uid>@login05 build]$ grep -E 'CMAKE_(C|CUDA)_COMPILER|CMAKE_(C|CXX|CUDA)_FLAGS' CMakeCache.txt
CMAKE_CUDA_COMPILER:FILEPATH=/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/compilers/bin/nvcc
CMAKE_CUDA_FLAGS:STRING=
CMAKE_CUDA_FLAGS_DEBUG:STRING=-g
CMAKE_CUDA_FLAGS_MINSIZEREL:STRING=-O1 -DNDEBUG
CMAKE_CUDA_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_CUDA_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
CMAKE_CXX_FLAGS:STRING=
CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=-O2 -s -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE:STRING=-fast -O3 -DNDEBUG
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -gopt
CMAKE_C_COMPILER:FILEPATH=/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/compilers/bin/nvc
CMAKE_C_FLAGS:STRING=
CMAKE_C_FLAGS_DEBUG:STRING=-g -O0
CMAKE_C_FLAGS_MINSIZEREL:STRING=-O2 -s -DNDEBUG
CMAKE_C_FLAGS_RELEASE:STRING=-fast -O3 -DNDEBUG
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=-O2 -gopt
//ADVANCED property for variable: CMAKE_CUDA_COMPILER
CMAKE_CUDA_COMPILER-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CUDA_FLAGS
CMAKE_CUDA_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CUDA_FLAGS_DEBUG
CMAKE_CUDA_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CUDA_FLAGS_MINSIZEREL
CMAKE_CUDA_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CUDA_FLAGS_RELEASE
CMAKE_CUDA_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CUDA_FLAGS_RELWITHDEBINFO
CMAKE_CUDA_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS
CMAKE_CXX_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_DEBUG
CMAKE_CXX_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_MINSIZEREL
CMAKE_CXX_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_COMPILER
CMAKE_C_COMPILER-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS
CMAKE_C_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_DEBUG
CMAKE_C_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_MINSIZEREL
CMAKE_C_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELEASE
CMAKE_C_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_C_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
mhoemmen commented 1 year ago

@reuterbal Thanks for sending the list of CMake options! : - ) I'll try replicating this build. It's possible that this is an "nvc++ host compiler, nvcc device compiler" situation, which differs from "using nvc++ for everything."

CUTLASS is used inside a mixed Fortran/CUDA application that utilizes OpenACC, and I was using the same compiler for all dependencies. I could certainly try rebuilding with the GNU 11.3.0 available there.

We definitely test with the GCC 11 series. On the other hand, I would need to review nvfortran's documentation to see if there would be ABI compatibility issues with calling host functions generated by g++, or device functions generated by nvcc.

mhoemmen commented 1 year ago

@reuterbal FYI, I was able to build this test successfully with NVHPC 23.7's nvc++ as the host compiler, and nvcc as the device compiler. I didn't even need to use sudo to install the NVHPC SDK in my home directory. I just set CMAKE_CUDA_COMPILER to my nvcc (full path), CMAKE_CXX_COMPILER to my nvc++, and CMAKE_C_COMPILER to my nvc (one c! it lives in the same directory as nvc++).

I can try again with NVHPC 23.1, but if you're able to use a newer compiler that works, I would recommend doing that first. Support for compilers other than GCC + NVCC in CUTLASS 3 is still new. I've been having some trouble porting that MixedBits test specifically.
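The configure step described above can be sketched as follows. The NVHPC install prefix is an assumption (it varies by site and by how the SDK was installed); substitute the paths from your own installation:

```shell
# Point CMake explicitly at the NVHPC compilers and the bundled nvcc.
# NVHPC_BIN below is a placeholder path -- adjust for your system.
NVHPC_BIN=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin

cmake .. \
  -DCMAKE_CXX_COMPILER="${NVHPC_BIN}/nvc++" \
  -DCMAKE_C_COMPILER="${NVHPC_BIN}/nvc" \
  -DCMAKE_CUDA_COMPILER="${NVHPC_BIN}/nvcc" \
  -DCUTLASS_NVCC_ARCHS=80
```

Setting all three compiler variables explicitly avoids relying on whatever compilers happen to be first in the PATH, which is what caused the surprise nvc++ detection discussed above.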

reuterbal commented 1 year ago

Thanks @mhoemmen, that is good to know! Unfortunately, the CUDA-aware MPI library in particular has only been built against that NVHPC release on Leonardo, and I don't trust myself to get a new installation up and running that is tuned to the system in the same way.

But I should be able to ask for a newer compiler release to be installed, including the relevant dependencies.

mhoemmen commented 1 year ago

@reuterbal Thanks for reporting back! I think the NVHPC SDK may have an MPI in it that is (or could be) CUDA-aware, but I totally understand why you would want to use the existing known-to-work CUDA-aware MPI.

mhoemmen commented 1 year ago

@reuterbal I have to issue a correction: I'm able to replicate this build error for both NVHPC 23.1 and 23.7. (I had the wrong local branch checked out!) Thankfully, this means that my local changes actually fix the 23.7 build. I'm testing 23.1 now.

Edit: my local changes also fix the 23.1 build. I'm working on getting them through other compilers....

mhoemmen commented 1 year ago

@reuterbal Successfully got the fixes and new tests through the test gauntlet. Awaiting permission to merge.

mhoemmen commented 1 year ago

FYI, the fix should go into the 3.2.1 release, if not the 3.2 release. Thanks for reporting! : - )

reuterbal commented 1 year ago

Thanks so much, I'll make sure to test the new release.

mnicely commented 1 year ago

Coming to 3.2.1 this week