Open vmarkovtsev opened 1 month ago
This is quite a large tile size for pingpong. @IonThruster do you know why we are stamping this tile size out?
@vmarkovtsev what is your CTK version? Please use 12.3 or newer if you are not already
My CTK is 12.4. Those kernels are generated with python/cutlass_library/generator.py
from the following cmake:
cmake -B build_12.4 -D CUTLASS_NVCC_ARCHS=90;90a -D CMAKE_CUDA_ARCHITECTURES=90;90a -D CUTLASS_ENABLE_TESTS=OFF -D CUTLASS_LIBRARY_OPERATIONS=gemm -D CUTLASS_LIBRARY_KERNELS=sgemm,bf16 -D CUTLASS_ENABLE_EXAMPLES=OFF -D CUTLASS_LIBRARY_IGNORE_KERNELS=planar,complex,spgemm,grouped,mixed,_u8,_s8,_tt,s*bf16gemm,fp8,e4m3,e5m2,void,s16816gemm -D CMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc -D CMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
Full log
+ make -j192 cutlass_library_static [ 0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/all_sm80_sgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int4.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_canonical.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/all_sm50_sgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_32.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_nn_align1.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_nn_align1.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_64.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_nn_align1.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e4m3out.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_nn_align1.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e4m3out.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_nn_align1.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e5m2out.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_nn_align1.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_nn_align1.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e5m2out.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_nn_align1.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_nn_align1.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_nn_align1.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/all_sm90_s64x256x16gemm_bf16_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp16out.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_nn_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_nn_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_nn_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_nn_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_bf16out.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_nn_align1.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_nt_align1.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/all_sm90_bf16_s64x256x16gemm_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nt_align1.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_nt_align1.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_nt_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp32out.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_tn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp32out.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_nt_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_tn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_tn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_other.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_tn_align1.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_nt_align1.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_mixed_input.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_tn_align1.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_nt_align1.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int_mixed_input.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_nt_align1.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/all_sm90_bf16_s64x128x16gemm_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_nt_align1.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/all_sm90_s64x128x16gemm_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_nt_align1.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_nt_align1.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tn_align1.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_tn_align1.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_tn_align1.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_tn_align1.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_tn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_tn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_tn_align1.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_tn_align1.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_tn_align1.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_tn_align1.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_tn_align1.cu.o [ 40%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8.cu.o [ 47%] Built target cutlass_library_gemm_sm50_sgemm_objs [ 47%] Linking CUDA static library libcutlass_gemm_sm50_sgemm.a [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Built target cutlass_library_gemm_sm50_sgemm_static [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 54%] Built target cutlass_library_gemm_sm80_sgemm_objs [ 54%] Linking CUDA static library libcutlass_gemm_sm80_sgemm.a [ 54%] Built target cutlass_library_gemm_sm80_sgemm_static [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 78%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 86%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' [ 91%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 91%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE' [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE' [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' [100%] Built target cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs [100%] Linking CUDA static library libcutlass_gemm_sm90_s64x256x16gemm_bf16.a [100%] Built target cutlass_library_gemm_sm90_s64x256x16gemm_bf16_static ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' [100%] Built target cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs [100%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x256x16gemm_bf16.a [100%] Built target cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_static ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE' [100%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs [100%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_bf16.a [100%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_static [100%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs [100%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a [100%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_static [100%] Built target cutlass_library_objs [100%] Linking CXX static library libcutlass.a [100%] Built target cutlass_library_static
@vmarkovtsev I think you can ignore these if its not coming from majority of the profiler kernels. Sometimes the Cartesian product of configurations can lead to a few outliers that spill.
When I build the latest cutlass library for 90a, I see a lot of warnings like:
Am I doing something wrong, or is that intended?