NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

CUDA runtime error in cudaDeviceGetDefaultMemPool on [windows + 16GB V100] #1821

Open ljayx opened 3 months ago

ljayx commented 3 months ago

Hi experts,

I'm running a 1.3B model on Windows with a 16GB V100 using the environment below, but I hit an issue I couldn't find any clue about. Could you please help check it?

TensorRT-LLM version: tag v0.10.0
Installation reference: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.10.0/docs/source/installation/windows.md
GPU info:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB         TCC   |   00000000:00:07.0 Off |                    0 |
| N/A   31C    P0             38W /  300W |       9MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The code is from the examples, run without any changes:

cd examples/apps
python chat.py /model_dir /tokenizer_dir

Output:

C:\Program Files\Python310\lib\site-packages\transformers\utils\generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1280
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
Traceback (most recent call last):
  File "D:\ljay\trtllm\TensorRT-LLM\examples\apps\chat.py", line 60, in <module>
    main(args.model_dir, args.tokenizer)
  File "D:\ljay\trtllm\TensorRT-LLM\examples\apps\chat.py", line 49, in main
    with GenerationExecutorWorker(model_dir, tokenizer, 1) as executor:
  File "C:\Program Files\Python310\lib\site-packages\tensorrt_llm\executor.py", line 514, in __init__
    self.engine = tllm.GptManager(engine_dir, executor_type, max_beam_width,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (C:\home\jenkins\agent\workspace\L0_MergeReque---cc29e445\llm\cpp\tensorrt_llm\runtime\bufferManager.cpp:211)
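
For context, cudaDeviceGetDefaultMemPool fails with "operation not supported" on devices or driver configurations that do not support CUDA memory pools (stream-ordered allocation / cudaMallocAsync); support is reported through the cudaDevAttrMemoryPoolsSupported device attribute. A quick way to check what this machine reports is a short script like the one below (a sketch only, assuming the cuda-python package is installed):

```python
# Check whether GPU 0 reports CUDA memory-pool (cudaMallocAsync) support.
# Sketch only; assumes `pip install cuda-python` has been done.
from cuda import cudart

err, supported = cudart.cudaDeviceGetAttribute(
    cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolsSupported, 0)
if err != cudart.cudaError_t.cudaSuccess:
    raise RuntimeError(f"cudaDeviceGetAttribute failed: {err}")

# A value of 0 would be consistent with cudaDeviceGetDefaultMemPool
# returning "operation not supported" on this setup.
print("cudaDevAttrMemoryPoolsSupported:", supported)
```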
hijkzzz commented 3 months ago

Could you try adding the CUDA / TensorRT-LLM library paths to the Windows system PATH environment variable? An easier alternative is to use WSL2.
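
For reference, a minimal sketch of making those directories visible from inside a Python session instead of the system settings (both directory paths below are assumptions for a default CUDA 12.4 / Python 3.10 install; adjust them to your machine):

```python
# Sketch: expose the CUDA / TensorRT-LLM DLL directories to this Python process.
# On Windows, Python 3.8+ no longer uses PATH to resolve extension-module DLL
# dependencies, so os.add_dll_directory is needed in addition to PATH.
import os

cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin"          # assumed location
trtllm_libs = r"C:\Program Files\Python310\Lib\site-packages\tensorrt_llm\libs"     # assumed location

for d in (cuda_bin, trtllm_libs):
    if os.path.isdir(d):
        os.add_dll_directory(d)                                    # current process
        os.environ["PATH"] = d + os.pathsep + os.environ["PATH"]   # child processes

import tensorrt_llm  # import only after the DLL search path is set up
```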

ljayx commented 3 months ago

I added them to PATH, but the issue remains.

By the way, for various reasons I can't use WSL2.

hijkzzz commented 3 months ago

Similar issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/328

ljayx commented 2 months ago

@hijkzzz It looks like that solution is for vGPUs. What should I do on a bare-metal setup like mine?