TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
CUDA runtime error in cublasLtMatmul, CUBLAS_STATUS_EXECUTION_FAILED #700

Open WhiteDoveBuct opened 9 months ago

WhiteDoveBuct commented 9 months ago


python build.py \
--model_dir /AIED-data/xxx/Llama-2-70b-hf/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir /AIED-data/xxx/trt_engines/Llama-2-70b-hf-32/ \
--world_size 8 \
--tp_size 4 \
--pp_size 2 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 3072 \
--parallel_build \
--use_rmsnorm_plugin float16 \
--use_inflight_batching \
--use_fused_mlp
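
The flags above describe 8 ranks split into 4-way tensor parallelism and 2-way pipeline parallelism, and the benchmark is launched with `mpirun -n 8`. A minimal sketch of a pre-flight check for that arithmetic, assuming the build expects `world_size` to equal `tp_size * pp_size`; `check_parallel_layout` is a hypothetical helper, not part of TensorRT-LLM:

```shell
# Hypothetical helper (not a TensorRT-LLM command): verify that the
# requested world size equals tp_size * pp_size before building.
check_parallel_layout() {
    local world_size=$1 tp_size=$2 pp_size=$3
    local expected=$((tp_size * pp_size))
    if [ "$world_size" -ne "$expected" ]; then
        echo "mismatch: world_size=$world_size, tp*pp=$expected"
        return 1
    fi
    echo "ok: $world_size ranks = $tp_size (tp) x $pp_size (pp)"
}

# Values taken from the build command above.
check_parallel_layout 8 4 2
```

The same number should then be passed to `mpirun -n` when running the engine.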


in_out_sizes=("1:1024:3072" "2:1024:3072" "4:1024:3072" "8:1024:3072" "16:1024:3072" "32:1024:3072")
for in_out in "${in_out_sizes[@]}"
do
    batch_size=$(echo "$in_out" | awk -F':' '{ print $1 }')
    # Join input and output lengths with a comma, matching the
    # "ISL/OSL: 1024,3072" line in the log below.
    in_out_dims=$(echo "$in_out" | awk -F':' '{ print $2 "," $3 }')
    echo "BS: $batch_size, ISL/OSL: $in_out_dims"

    mpirun -n 8 --allow-run-as-root --oversubscribe \
        ./cpp/build/benchmarks/gptSessionBenchmark \
        --model llama \
        --engine_dir /AIED-data/xxx/trt_engines/Llama-2-70b-hf-32 \
        --warm_up 1 \
        --batch_size "$batch_size" \
        --duration 0 \
        --num_runs 5 \
        --input_output_len "$in_out_dims"
done


error log

[1702983381.525127] [AI-99-141-release:95101:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
Benchmarking done. Iteration: 5, duration: 694.92 sec.
[BENCHMARK] batch_size 1 input_length 1024 output_length 3072 latency(ms) 138983.22 tokensPerSec
BS: 2, ISL/OSL: 1024,3072
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_EXECUTION_FAILED (/code/tensorrt_llm/cpp/t
1 0x7f5e902009ce /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xac9ce) [0x7f5e902009ce]
2 0x7f5e90254dc6 /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x100dc6) [0x7f5e90254dc6]
3 0x7f5e9025519b /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x10119b) [0x7f5e9025519b]
4 0x7f5e902262d1 /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd22d1) [0x7f5e902262d1]
5 0x7f5e90226bba tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 2
6 0x7f5e46d3cba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f5e46d3cba9]
7 0x7f5e46d126af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f5e46d126af]
8 0x7f5e46d14320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f5e46d14320]
9 0x7f5ed5ee787f tensorrt_llm::runtime::GptSession::executeGenerationStep(int, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, std::vector<bool, std::allocator<bool> >&) + 1903
10 0x7f5ed5ee912e tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3070
11 0x7f5ed5eeb18b tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 7003
12 0x5556de223dff ./cpp/build/benchmarks/gptSessionBenchmark(+0x19dff) [0x5556de223dff]
13 0x7f5e8fcfad90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f5e8fcfad90]
14 0x7f5e8fcfae40 __libc_start_main + 128
15 0x5556de225ef5 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1bef5) [0x5556de225ef5]

byshiue commented 8 months ago

From the error

[1702984696.792585] [AI-99-141-release:13208:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device

it looks like an issue with your device. Could you try another device?
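
One way to narrow down that UCX message: on Linux, `inotify_add_watch` can return "No space left on device" (ENOSPC) when the per-user inotify watch limit is exhausted, not only when the filesystem is full. A minimal sketch that prints both pieces of information, assuming a Linux host with `/proc` mounted; `report_tmp_health` is a hypothetical helper name:

```shell
# Hypothetical helper: show free space for a directory and the kernel's
# inotify limits, the two usual suspects behind an ENOSPC from
# inotify_add_watch.
report_tmp_health() {
    local dir=${1:-/tmp}
    echo "== disk usage for $dir =="
    df -h "$dir" || echo "df failed for $dir"
    echo "== inotify limits =="
    local f
    for f in max_user_watches max_user_instances; do
        if [ -r "/proc/sys/fs/inotify/$f" ]; then
            echo "$f=$(cat "/proc/sys/fs/inotify/$f")"
        else
            echo "$f=unavailable"
        fi
    done
}

report_tmp_health /tmp
```

If `/tmp` has free space but `max_user_watches` is small, the watch limit may be the cause of the UCX line. Whether it is related to the later `CUBLAS_STATUS_EXECUTION_FAILED` is not established by the log.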

WhiteDoveBuct commented 8 months ago
