NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Building trtllm is very slow and raises an error #2469

Open · anaivebird opened this issue 2 days ago

anaivebird commented 2 days ago

Who can help?

@byshiue @Superjomn

Reproduction

apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
BUILD_WHEEL_ARGS="--trt_root /usr/local/tensorrt --python_bindings --benchmarks --cuda_architectures 90 -j8" python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}

Expected behavior

The build should complete in less than one hour.

Actual behavior

The build is slow, taking more than 2 hours:


[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorInterleavedTrue.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorInterleavedTrue.cu.o
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/kernels/quantization.cu(280): warning #780-D: reference is to variable "i" (declared at line 260) -- under old for-init scoping rules it would have been variable "i" (declared at line 265)
              smemBuffer[i] = vec;
                         ^
          detected during instantiation of "void tensorrt_llm::kernels::invokePerTokenQuantization(QuantT *, const T *, int64_t, int64_t, const float *, float *, float *, tensorrt_llm::common::QuantMode, cudaStream_t) [with T=float, QuantT=int8_t]" at line 354

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorTrue.cu.o
[100%] Built target context_attention_src
[100%] Linking CUDA device code CMakeFiles/cutlass_src.dir/cmake_device_link.o
[100%] Linking CXX static library libcutlass_src.a
[100%] Built target cutlass_src
[100%] Linking CUDA device code CMakeFiles/gemm_swiglu_sm90_src.dir/cmake_device_link.o
[100%] Linking CUDA static library libgemm_swiglu_sm90_src.a
[100%] Built target gemm_swiglu_sm90_src
[100%] Built target selective_scan_src

Additional notes

During compilation there are many ptxas processes with -arch sm_80, which is unrelated to sm_90, even though I passed --cuda_architectures 90.

ps.txt
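
A rough way to tally which architectures ptxas is targeting while the build runs (a hedged one-liner, not part of the original report; the [p] bracket keeps grep from matching its own process):

# count running ptxas invocations per target SM architecture
ps -ef | grep '[p]txas' | grep -o 'sm_[0-9]*' | sort | uniq -c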

anaivebird commented 2 days ago

It also raised an error:

[  0%] Generating .check_symbol
[  0%] Generating .check_symbol_executor
[  0%] Generating .check_symbol_internal_cutlass_kernels
[  0%] Built target gemm_swiglu_sm90_src
[  0%] Built target fb_gemm_src
[  0%] Built target check_symbol
[  0%] Built target check_symbol_executor
[  0%] Built target check_symbol_internal_cutlass_kernels
[  0%] Built target cutlass_src
[  1%] Built target selective_scan_src
[  2%] Built target common_src
[  2%] Built target layers_src
[  3%] Built target moe_gemm_src
[  4%] Built target fpA_intB_gemm_src
[  4%] Building CXX object tensorrt_llm/runtime/CMakeFiles/runtime_src.dir/tllmRuntime.cpp.o
[  5%] Built target decoder_attention
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp: In function ‘void {anonymous}::setWeightStreaming(nvinfer1::ICudaEngine&, float)’:
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:113:16: error: ‘class nvinfer1::ICudaEngine’ has no member named ‘setWeightStreamingBudgetV2’; did you mean ‘setWeightStreamingBudget’?
  113 |         engine.setWeightStreamingBudgetV2(budget);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                setWeightStreamingBudget
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp: In constructor ‘tensorrt_llm::runtime::TllmRuntime::TllmRuntime(const tensorrt_llm::runtime::RawEngine&, nvinfer1::ILogger*, float, bool)’:
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:242:41: error: ‘class nvinfer1::ICudaEngine’ has no member named ‘getDeviceMemorySizeV2’; did you mean ‘getDeviceMemorySize’?
  242 |     auto const devMemorySize = mEngine->getDeviceMemorySizeV2();
      |                                         ^~~~~~~~~~~~~~~~~~~~~
      |                                         getDeviceMemorySize
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp: In member function ‘nvinfer1::IExecutionContext& tensorrt_llm::runtime::TllmRuntime::addContext(int32_t)’:
/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:284:13: error: ‘class nvinfer1::IExecutionContext’ has no member named ‘setDeviceMemoryV2’; did you mean ‘setDeviceMemory’?
  284 |     context.setDeviceMemoryV2(mEngineBuffer->data(), static_cast<int64_t>(mEngineBuffer->getCapacity()));
      |             ^~~~~~~~~~~~~~~~~
      |             setDeviceMemory
gmake[3]: *** [tensorrt_llm/runtime/CMakeFiles/runtime_src.dir/build.make:527: tensorrt_llm/runtime/CMakeFiles/runtime_src.dir/tllmRuntime.cpp.o] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:1935: tensorrt_llm/runtime/CMakeFiles/runtime_src.dir/all] Error 2
gmake[2]: *** Waiting for unfinished jobs....
[ 23%] Built target decoder_attention_src
[ 63%] Built target kernels_src
[ 98%] Built target context_attention_src
gmake[1]: *** [CMakeFiles/Makefile2:1537: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
gmake: *** [Makefile:218: tensorrt_llm] Error 2
Traceback (most recent call last):
  File "/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/scripts/build_wheel.py", line 434, in <module>
    main(**vars(args))
  File "/home/work/xingwuFileSystem/qserve_trtllm/TensorRT-LLM/scripts/build_wheel.py", line 208, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 192 --target tensorrt_llm nvinfer_plugin_tensorrt_llm th_common bindings   executorWorker  ' returned non-zero exit status 2.
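
The missing ...V2 members suggest the headers under --trt_root come from an older TensorRT release than the checked-out sources expect, since these V2 variants were only added in later TensorRT 10.x versions. A quick way to see which version the build is actually compiling against (a sketch; the include path is an assumption based on the --trt_root passed to build_wheel.py above):

# print the TensorRT version macros from the installed headers
grep -E 'define NV_TENSORRT_(MAJOR|MINOR|PATCH)' /usr/local/tensorrt/include/NvInferVersion.h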
byshiue commented 2 days ago

Thank you for reporting the issue. It is a bug at https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/CMakeLists.txt#L25. We should replace SRC_CU with SRC_CPP. We will fix it ASAP.
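
Until the fix lands, the change can be applied locally as a one-token edit (a hedged sketch; inspect the line first, since the line number may shift between revisions):

# show line 25 of the kernels CMakeLists before touching it
sed -n '25p' cpp/tensorrt_llm/kernels/CMakeLists.txt
# swap the variable name on that line, per the fix described above
sed -i '25s/SRC_CU/SRC_CPP/' cpp/tensorrt_llm/kernels/CMakeLists.txt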

anaivebird commented 2 days ago

Which bug will this change fix: the slow compilation or the error?

byshiue commented 2 days ago

It fixes the slow compilation.

For the error, please create another issue if it is not related to the one above.