NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

The following argument types are supported: #863

Open whk6688 opened 10 months ago

whk6688 commented 10 months ago

When I run the build script:

```bash
nohup python build.py --hf_model_dir /code/tensorrt_llm/tmp/Qwen/models--Qwen--Qwen-7B-Chat/snapshots/8d24619bab456ea5abe2823c1d05fc5edec19174/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/ \
    --parallel_build > t.log 2>&1 &
```

I get this error:

```text
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in cublasCreate(handle.get()): CUBLAS_STATUS_ALLOC_FAILED (/src/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:181)
55 0xaaaab7da80f4 PyEval_EvalCode + 116
56 0xaaaab7ddc83c python(+0x21c83c) [0xaaaab7ddc83c]
57 0xaaaab7dd3f48 python(+0x213f48) [0xaaaab7dd3f48]
58 0xaaaab7ddc4ec python(+0x21c4ec) [0xaaaab7ddc4ec]
59 0xaaaab7ddb654 _PyRun_SimpleFileObject + 388
60 0xaaaab7ddb220 _PyRun_AnyFileObject + 80
61 0xaaaab7dcab00 Py_RunMain + 512
62 0xaaaab7d99208 Py_BytesMain + 36
63 0xffff8a2273fc /lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff8a2273fc]
64 0xffff8a2274cc __libc_start_main + 152
65 0xaaaab7d990f0 _start + 48
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/qwen/build.py", line 655, in <module>
    build(0, args)
  File "/code/tensorrt_llm/examples/qwen/build.py", line 625, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/code/tensorrt_llm/examples/qwen/build.py", line 554, in build_rank_engine
    tensorrt_llm_qwen(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 547, in forward
    hidden_states = super().forward(input_ids, position_ids, use_cache,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 428, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 314, in forward
    attention_output = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 179, in forward
    qkv = self.qkv(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 137, in forward
    return self.multiply_gather(x,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 113, in multiply_gather
    x = _gemm_plugin(x,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 59, in _gemm_plugin
    layer = default_trtnet().add_plugin_v2(plug_inputs, gemm_plug)
TypeError: add_plugin_v2(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.INetworkDefinition, inputs: List[tensorrt.tensorrt.ITensor], plugin: tensorrt.tensorrt.IPluginV2) -> tensorrt.tensorrt.IPluginV2Layer

Invoked with: <tensorrt.tensorrt.INetworkDefinition object at 0xfffeb8229ab0>, [<tensorrt.tensorrt.ITensor object at 0xfffeb8751430>, <tensorrt.tensorrt.ITensor object at 0xfffeb3f63b30>], None
```
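For context, the TypeError at the bottom of the trace is a downstream symptom: cublasCreate() fails while the GEMM plugin is being constructed, so no plugin object is ever produced, and the resulting None is what add_plugin_v2() rejects. A minimal sketch of that chain, using the standard TensorRT Python API (the function name and arguments here are hypothetical, not TensorRT-LLM's actual code):

```python
# Hypothetical sketch: if the plugin creator fails -- here because
# cublasCreate() returned CUBLAS_STATUS_ALLOC_FAILED in the plugin's
# constructor -- create_plugin() yields None, and passing None to
# add_plugin_v2() raises exactly the TypeError reported above.
import tensorrt as trt

def add_gemm_plugin_layer(network: trt.INetworkDefinition,
                          plug_inputs: list,
                          creator: trt.IPluginCreator,
                          fields: trt.PluginFieldCollection):
    gemm_plug = creator.create_plugin("gemm", fields)
    if gemm_plug is None:  # plugin construction failed, e.g. cuBLAS init
        raise RuntimeError("GEMM plugin creation failed; "
                           "check the CUDA/cuBLAS environment first")
    return network.add_plugin_v2(plug_inputs, gemm_plug)
```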

byshiue commented 10 months ago

It looks like an environment issue with CUDA. The error happens at

```cpp
TLLM_CUDA_CHECK(cublasCreate(handle.get()));
```

which initializes the cuBLAS handle. How did you build the Docker image? Could you try running other GPU programs?
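Along those lines, a small GPU matmul in PyTorch forces cuBLAS handle creation inside the same container, so it is a quick way to check whether the environment itself is broken (a minimal sketch; it assumes PyTorch is available, as it is in the TensorRT-LLM containers):

```python
# Minimal cuBLAS smoke test: a half-precision matmul on the GPU goes
# through cuBLAS, so it should hit the same CUBLAS_STATUS_ALLOC_FAILED
# if the CUDA environment (driver, container runtime, memory) is at fault.
import torch

assert torch.cuda.is_available(), "CUDA runtime is not visible to PyTorch"
a = torch.randn(64, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 64, device="cuda", dtype=torch.float16)
c = a @ b  # routed through cuBLAS
torch.cuda.synchronize()
print("cuBLAS OK:", tuple(c.shape))
```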

whk6688 commented 10 months ago

I built the Docker image with `make -C docker release_build`. Note: my platform is Orin.
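Since CUBLAS_STATUS_ALLOC_FAILED indicates a failed resource allocation inside cuBLAS, and on Orin the GPU shares DRAM with the CPU, it may also be worth checking how much memory is actually free before the build starts (a minimal sketch, again assuming PyTorch in the container):

```python
# Report free vs. total device memory; on Orin (unified memory) the rest
# of the system eats into the same budget the engine build needs.
import torch

free_b, total_b = torch.cuda.mem_get_info()
print(f"GPU memory free: {free_b / 2**30:.2f} GiB "
      f"of {total_b / 2**30:.2f} GiB")
```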

byshiue commented 10 months ago

@whk6688 Can you please take a look at https://github.com/NVIDIA/TensorRT-LLM/issues/488#issuecomment-1848697981? As of right now, Orin is not formally supported.

nv-guomingz commented 1 day ago

Hi @whk6688, do you still have any further issues or questions? If not, we'll close this issue soon.