NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT

Build from source without container errors #439

Open YJHMITWEB opened 1 year ago

YJHMITWEB commented 1 year ago

Branch/Tag/Commit

main

Docker Image Version

none

GPU name

A100

CUDA Driver

525.60.13

Reproduced Steps

PATH=$(getconf PATH)
module purge
module load cuda/11.6
module load gcc/9.4.0
module load cmake
conda activate FasterTransformer # with pytorch 1.12.0 installed

export CUDA_HOME=/opt/cuda/11.6
export CPATH=$CUDA_HOME/include:$CPATH
export CUDNN_LIBRARY=/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib
export CUDNN_LIB_DIR=/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib
export CUDNN_LIBRARY_PATH=/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib/libcudnn.so
export CUDNN_INCLUDE_DIR=/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/include
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..

-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is NVIDIA 11.6.55
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/gcc/9.4.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/11.6/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
CMake Warning (dev) at CMakeLists.txt:17 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /opt/cuda/11.6

  For compatibility, CMake is ignoring the variable.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /opt/cuda/11.6 (found suitable version "11.6", minimum required is "10.2") 
CUDA_VERSION 11.6 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- Found CUDNN: /home/yao.877/parallel_inference/mpi/cudnn/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib/libcudnn.so  
-- Add DBUILD_CUTLASS_MOE, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MIXED_GEMM, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MOE, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MIXED_GEMM, requires CUTLASS. Increases compilation time
-- Running submodule update to fetch cutlass
-- NVTX is enabled.
-- Assign GPU architecture (sm=80)
-- Use WMMA
CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG -Xcompiler -O3 -DCUDA_PTX_FP8_F2FP_ENABLED --use_fast_math
-- COMMON_HEADER_DIRS: /home/yao.877/parallel_inference/projects/FasterTransformer;/opt/cuda/11.6/include;/home/FasterTransformer/3rdparty/cutlass/include;/home/FasterTransformer/src/fastertransformer/cutlass_extensions/include;/home/FasterTransformer/3rdparty/trt_fp8_fmha/src;/home/FasterTransformer/3rdparty/trt_fp8_fmha/generated
-- The C compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/gcc/9.4.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /usr/bin/python3.9 (found version "3.9.7") found components: Interpreter 
-- Configuring done
-- Generating done

Here, the line "Found Python: /usr/bin/python3.9" suggests that CMake is not using the Python interpreter from my conda environment.
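
For reference, a possible workaround (a sketch, not verified here): CMake's FindPython module honors the generic Python_EXECUTABLE hint, so the conda interpreter can be passed explicitly at configure time. Python_EXECUTABLE is a standard CMake variable, not anything specific to FasterTransformer:

# re-run the configure step, pointing CMake at the conda interpreter (sketch)
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON \
      -DPython_EXECUTABLE=$CONDA_PREFIX/bin/python ..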

Then, make -j

Errors

[ 64%] Built target layernorm_kernels
[ 64%] Building CXX object examples/cpp/decoding/CMakeFiles/layernorm_test.dir/layernorm_test.cc.o
[ 64%] Building CUDA object src/fastertransformer/layers/attention_layers_int8/CMakeFiles/WindowAttentionINT8.dir/WindowAttentionINT8.cu.o
[ 64%] Building CXX object src/fastertransformer/models/swin/CMakeFiles/SwinBlock.dir/SwinBlock.cc.o
[ 64%] Linking CXX executable ../../bin/test_gemm
[ 64%] Built target test_gemm
In file included from /home/FasterTransformer/src/fastertransformer/layers/attention_layers_int8/WindowAttentionINT8.h:28,
                 from /home/FasterTransformer/src/fastertransformer/layers/attention_layers_int8/WindowAttentionINT8.cu:17:
/home/FasterTransformer/src/fastertransformer/models/swin_int8/SwinINT8Weight.h:22:10: fatal error: cudnn.h: No such file or directory
   22 | #include <cudnn.h>
      |          ^~~~~~~~~
compilation terminated.
make[2]: *** [src/fastertransformer/layers/attention_layers_int8/CMakeFiles/WindowAttentionINT8.dir/build.make:76: src/fastertransformer/layers/attention_layers_int8/CMakeFiles/WindowAttentionINT8.dir/WindowAttentionINT8.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:4583: src/fastertransformer/layers/attention_layers_int8/CMakeFiles/WindowAttentionINT8.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 64%] Linking CUDA device code CMakeFiles/ParallelGptWeight.dir/cmake_device_link.o
[ 65%] Linking CXX static library ../../../../lib/libParallelGptWeight.a
[ 65%] Built target ParallelGptWeight
[ 65%] Linking CUDA device code CMakeFiles/TopKSamplingLayer.dir/cmake_device_link.o
[ 66%] Linking CXX static library ../../../../lib/libTopKSamplingLayer.a
[ 66%] Built target TopKSamplingLayer
[ 66%] Linking CXX executable ../../../bin/layernorm_test
[ 66%] Built target layernorm_test
[ 66%] Linking CUDA device code CMakeFiles/SwinBlock.dir/cmake_device_link.o
[ 66%] Linking CXX static library ../../../../lib/libSwinBlock.a
[ 66%] Built target SwinBlock
[ 66%] Linking CUDA device code CMakeFiles/decoder_masked_multihead_attention.dir/cmake_device_link.o
[ 66%] Linking CUDA static library ../../../lib/libdecoder_masked_multihead_attention.a
[ 66%] Built target decoder_masked_multihead_attention
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu: In instantiation of ‘void fastertransformer::invokeBatchTopPSampling(void*, size_t&, size_t&, int*, int*, bool*, float*, float*, const T*, const int*, int*, int*, curandState_t*, int, size_t, const int*, float, const float*, cudaStream_t, cudaDeviceProp*, const bool*) [with T = float; size_t = long unsigned int; curandState_t = curandStateXORWOW; cudaStream_t = CUstream_st*]’:
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu:1163:542:   required from here
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu:1053:7: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext’; use assignment or value-initialization instead [-Wclass-memaccess]
 1053 |         memset(&context, 0, sizeof(context));
      |       ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.h:110:8: note: ‘struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext’ declared here
  110 | struct TopKPerSegmentContext {
      |        ^~~~~~~~~~~~~~~~~~~~~
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu: In instantiation of ‘void fastertransformer::invokeBatchTopPSampling(void*, size_t&, size_t&, int*, int*, bool*, float*, float*, const T*, const int*, int*, int*, curandState_t*, int, size_t, const int*, float, const float*, cudaStream_t, cudaDeviceProp*, const bool*) [with T = __half; size_t = long unsigned int; curandState_t = curandStateXORWOW; cudaStream_t = CUstream_st*]’:
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu:1185:541:   required from here
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.cu:1053:7: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext’; use assignment or value-initialization instead [-Wclass-memaccess]
 1053 |         memset(&context, 0, sizeof(context));
      |       ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/FasterTransformer/src/fastertransformer/kernels/sampling_topp_kernels.h:110:8: note: ‘struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext’ declared here
  110 | struct TopKPerSegmentContext {
      |        ^~~~~~~~~~~~~~~~~~~~~
[ 66%] Linking CUDA device code CMakeFiles/beam_search_topk_kernels.dir/cmake_device_link.o
[ 66%] Linking CUDA static library ../../../lib/libbeam_search_topk_kernels.a
[ 66%] Built target beam_search_topk_kernels
[ 66%] Linking CUDA device code CMakeFiles/sampling_topp_kernels.dir/cmake_device_link.o
[ 67%] Linking CUDA static library ../../../lib/libsampling_topp_kernels.a
[ 67%] Built target sampling_topp_kernels
[ 67%] Linking CUDA device code CMakeFiles/th_utils.dir/cmake_device_link.o
[ 67%] Linking CXX static library ../../../lib/libth_utils.a
[ 67%] Built target th_utils
[ 67%] Linking CUDA device code CMakeFiles/moe_gemm_kernels.dir/cmake_device_link.o
[ 67%] Linking CXX static library ../../../../lib/libmoe_gemm_kernels.a
[ 67%] Built target moe_gemm_kernels
[ 67%] Linking CUDA device code CMakeFiles/int8_gemm_test.dir/cmake_device_link.o
[ 67%] Linking CXX executable ../../bin/int8_gemm_test
/usr/bin/ld: cannot find -lmkl_intel_ilp64
/usr/bin/ld: cannot find -lmkl_core
/usr/bin/ld: cannot find -lmkl_intel_thread
collect2: error: ld returned 1 exit status
make[2]: *** [tests/int8_gemm/CMakeFiles/int8_gemm_test.dir/build.make:157: bin/int8_gemm_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:10762: tests/int8_gemm/CMakeFiles/int8_gemm_test.dir/all] Error 2
[ 67%] Linking CUDA device code CMakeFiles/fpA_intB_gemm.dir/cmake_device_link.o
[ 67%] Linking CXX static library ../../../../lib/libfpA_intB_gemm.a
[ 67%] Built target fpA_intB_gemm
[ 67%] Linking CUDA device code CMakeFiles/online_softmax_beamsearch_kernels.dir/cmake_device_link.o
[ 67%] Linking CUDA static library ../../../lib/libonline_softmax_beamsearch_kernels.a
[ 67%] Built target online_softmax_beamsearch_kernels
make: *** [Makefile:136: all] Error 2

I am wondering why the error "fatal error: cudnn.h: No such file or directory" at /home/FasterTransformer/src/fastertransformer/models/swin_int8/SwinINT8Weight.h:22 happens, since I have exported the cuDNN paths above.
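
For reference, the error suggests the cuDNN include directory never reached the compile line: the CUDNN_* variables exported above are hints for CMake, while the compiler itself searches CPATH, which above contains only the CUDA include directory. A sketch of what might make cudnn.h visible to gcc and to nvcc's host compiler (using the archive paths from the exports above, not verified):

# expose the cuDNN headers and libraries to the toolchain (sketch, paths from the exports above)
CUDNN_ROOT=/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive
export CPATH=$CUDNN_ROOT/include:$CPATH
export LIBRARY_PATH=$CUDNN_ROOT/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_ROOT/lib:$LD_LIBRARY_PATH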

Also, I am wondering how to solve the linker errors when building int8_gemm_test:

[ 67%] Linking CXX executable ../../bin/int8_gemm_test
/usr/bin/ld: cannot find -lmkl_intel_ilp64
/usr/bin/ld: cannot find -lmkl_core
/usr/bin/ld: cannot find -lmkl_intel_thread
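
For reference, these are Intel MKL runtime libraries. One hedged option (not verified here) is to install MKL into the conda environment and point the linker at its lib directory:

# sketch: provide MKL from conda and expose it to the linker and loader
conda install -y mkl mkl-include
export LIBRARY_PATH=$CONDA_PREFIX/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH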

Thanks!

byshiue commented 1 year ago

You can try exporting the .so path into LD_LIBRARY_PATH.

For int8_gemm_test, it requires the PyTorch path here: https://github.com/NVIDIA/FasterTransformer/blob/main/tests/int8_gemm/CMakeLists.txt#L24.
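
A sketch of how those two suggestions might look in the shell (the torch lib directory below is derived from the installed package and is an assumption, not something this thread specifies):

# locate the libraries bundled with the conda PyTorch install and export them for the loader (sketch)
TORCH_LIB=$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')
export LD_LIBRARY_PATH=$TORCH_LIB:/home/cudnn-linux-x86_64-8.5.0.96_cuda11-archive/lib:$LD_LIBRARY_PATH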

Siyuan011 commented 1 year ago

Hi, have you fixed the linker error for int8_gemm_test (/usr/bin/ld: cannot find -lmkl_intel_ilp64, -lmkl_core, -lmkl_intel_thread)?