System Info
CPU architecture (x86_64)
CPU/Host memory size (64GB)
GPU properties
GPU name (NVIDIA RTX 4090)
GPU memory size (24GB)
Libraries
TensorRT-LLM branch or tag (v0.13.0)
Versions of TensorRT, CUDA
Container used (nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3)
NVIDIA driver version (12.6)
OS (Windows 11 Pro)
Who can help?
No response

Information
The official example scripts
My own modified scripts

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
1. docker run --rm -it --net host --shm-size=8g --memory="64g" --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v d:/llm/tensorrtllm_backend:/tensorrtllm_backend -v d:/llm/engines:/engines nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
2. Inside the running container, execute pip show tensorrt_llm:
Name: tensorrt-llm
Version: 0.13.0
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, aenum, build, click, click-option-group, colored, cuda-python, diffusers, evaluate, h5py, lark, mpi4py, mpmath, numpy, nvidia-modelopt, onnx, openai, optimum, pandas, pillow, polygraphy, psutil, pulp, pynvml, sentencepiece, StrEnum, tensorrt, torch, transformers, wheel
Required-by:
3. git clone https://github.com/NVIDIA/TensorRT-LLM.git (branch main), then run the Gemma checkpoint conversion:
root@docker-desktop:/tensorrtllm_backend/src/TensorRT-LLM/examples/gemma# python3 convert_checkpoint.py --model-dir "/tensorrtllm_backend/models/gemma-2-9b-chat" --output-model-dir "/tensorrtllm_backend/trt-model" --dtype float16 --ckpt-type hf --world-size 1 --use-weight-only-with-precision int8 --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
You are using a model of type gemma2 to instantiate a model of type gemma. This is not supported for all configurations of models and can yield errors.
Determined TensorRT-LLM configuration {'architecture': 'Gemma2ForCausalLM', 'dtype': 'float16', 'vocab_size': 256000, 'hidden_size': 3584, 'num_hidden_layers': 42, 'num_attention_heads': 16, 'hidden_act': 'gelu_pytorch_tanh', 'logits_dtype': 'float32', 'norm_epsilon': 1e-06, 'position_embedding_type': 'rope_gpt_neox', 'max_position_embeddings': 8192, 'num_key_value_heads': 8, 'intermediate_size': 14336, 'mapping': {'world_size': 1, 'gpus_per_node': 8, 'cp_size': 1, 'tp_size': 1, 'pp_size': 1, 'moe_tp_size': 1, 'moe_ep_size': 1}, 'quantization': {'quant_algo': <QuantAlgo.W8A16: 'W8A16'>, 'kv_cache_quant_algo': None, 'group_size': 128, 'smoothquant_val': None, 'clamp_val': None, 'has_zero_point': False, 'pre_quant_scale': True, 'exclude_modules': None}, 'use_parallel_embedding': False, 'embedding_sharding_dim': 0, 'share_embedding_table': True, 'head_size': 256, 'qk_layernorm': False, 'rotary_base': 10000.0, 'attn_bias': False, 'mlp_bias': False, 'rotary_scaling': None, 'inter_layernorms': True, 'query_pre_attn_scalar': 224, 'final_logit_softcapping': 30.0, 'attn_logit_softcapping': 50.0}
Loading weights...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.70it/s]
Killed
root@docker-desktop:/tensorrtllm_backend/src/TensorRT-LLM/examples/gemma#
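A minimal diagnostic sketch I plan to use to check whether the process is being OOM-killed rather than failing inside convert_checkpoint.py (assumptions: cgroup v2 paths, and dmesg may have to be read on the Docker Desktop/WSL host instead of inside the container):

# memory actually visible inside the container
free -h
cat /sys/fs/cgroup/memory.max   # cgroup v2 only; shows the container's memory limit

# kernel OOM-killer messages (may need host/WSL access or a privileged container)
dmesg | grep -iE 'out of memory|killed process'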
additional notes
My machine has plenty of free CPU memory, but the process is killed at this point with no further output or error message.
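One assumption worth checking (not verified on my setup): Docker Desktop on Windows runs containers inside a WSL 2 VM, so the memory the container can actually use may be capped by the WSL 2 limit rather than by the 64GB of host RAM or the --memory="64g" flag. A minimal sketch of raising that cap, with illustrative values only:

# %UserProfile%\.wslconfig on the Windows host (example values, not recommendations)
[wsl2]
memory=48GB
swap=16GB

Afterwards, running wsl --shutdown and restarting Docker Desktop is needed for the new limit to take effect.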