TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
When I use convert_checkpoint.py to convert a Gemma checkpoint in HF format, it prints "Killed" #2344

Open imilli opened 1 month ago

imilli commented 1 month ago

System Info

- CPU architecture: x86_64
- CPU/Host memory size: 64GB
- GPU properties
  - GPU name: NVIDIA RTX4090
  - GPU memory size: 24GB
- Libraries
  - TensorRT-LLM branch or tag: v0.13.0
  - Versions of TensorRT, CUDA, container used: nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
- NVIDIA driver version: 12.6
- OS: Windows 11 Pro

1. Start the container:

    docker run --rm -it --net host --shm-size=8g --memory="64g" --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v d:/llm/tensorrtllm_backend:/tensorrtllm_backend -v d:/llm/engines:/engines nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3

2. Inside the container, execute pip show tensorrt_llm:

    Name: tensorrt-llm
    Version: 0.13.0
    Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
    Home-page: https://github.com/NVIDIA/TensorRT-LLM
    Author: NVIDIA Corporation
    Author-email:
    License: Apache License 2.0
    Location: /usr/local/lib/python3.10/dist-packages
    Requires: accelerate, aenum, build, click, click-option-group, colored, cuda-python, diffusers, evaluate, h5py, lark, mpi4py, mpmath, numpy, nvidia-modelopt, onnx, openai, optimum, pandas, pillow, polygraphy, psutil, pulp, pynvml, sentencepiece, StrEnum, tensorrt, torch, transformers, wheel
    Required-by:

3. git clone https://github.com/NVIDIA/TensorRT-LLM.git (branch main), then run the conversion:

    root@docker-desktop:/tensorrtllm_backend/src/TensorRT-LLM/examples/gemma# python3 convert_checkpoint.py --model-dir "/tensorrtllm_backend/models/gemma-2-9b-chat" --output-model-dir "/tensorrtllm_backend/trt-model" --dtype float16 --ckpt-type hf --world-size 1 --use-weight-only-with-precision int8 --load_model_on_cpu
    [TensorRT-LLM] TensorRT-LLM version: 0.13.0
    You are using a model of type gemma2 to instantiate a model of type gemma. This is not supported for all configurations of models and can yield errors.
    Determined TensorRT-LLM configuration {'architecture': 'Gemma2ForCausalLM', 'dtype': 'float16', 'vocab_size': 256000, 'hidden_size': 3584, 'num_hidden_layers': 42, 'num_attention_heads': 16, 'hidden_act': 'gelu_pytorch_tanh', 'logits_dtype': 'float32', 'norm_epsilon': 1e-06, 'position_embedding_type': 'rope_gpt_neox', 'max_position_embeddings': 8192, 'num_key_value_heads': 8, 'intermediate_size': 14336, 'mapping': {'world_size': 1, 'gpus_per_node': 8, 'cp_size': 1, 'tp_size': 1, 'pp_size': 1, 'moe_tp_size': 1, 'moe_ep_size': 1}, 'quantization': {'quant_algo': <QuantAlgo.W8A16: 'W8A16'>, 'kv_cache_quant_algo': None, 'group_size': 128, 'smoothquant_val': None, 'clamp_val': None, 'has_zero_point': False, 'pre_quant_scale': True, 'exclude_modules': None}, 'use_parallel_embedding': False, 'embedding_sharding_dim': 0, 'share_embedding_table': True, 'head_size': 256, 'qk_layernorm': False, 'rotary_base': 10000.0, 'attn_bias': False, 'mlp_bias': False, 'rotary_scaling': None, 'inter_layernorms': True, 'query_pre_attn_scalar': 224, 'final_logit_softcapping': 30.0, 'attn_logit_softcapping': 50.0}
    Loading weights...
    Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.70it/s]
    Killed
    root@docker-desktop:/tensorrtllm_backend/src/TensorRT-LLM/examples/gemma#

Additional notes

My computer has plenty of free CPU memory, but the process was killed at the command prompt with no other information.
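
A bare "Killed" with no traceback commonly means the process received SIGKILL, often from the Linux out-of-memory killer. The container may also see less memory than the host's 64GB, for example if Docker Desktop's Linux VM is capped below that. The following is a minimal diagnostic sketch, not part of TensorRT-LLM. It estimates the weight footprint from the configuration printed in the log above and compares it with what the container reports in /proc/meminfo. The helper names (`estimate_params`, `mem_available_gib`) are hypothetical, and the per-layer layout (separate Q/K/V/O projections, three MLP matrices, four norm weights per layer, shared embedding table) is an assumption about the Gemma 2 architecture. Real peak usage during conversion can be higher, depending on the dtype used while loading and on temporary copies made during quantization.

```python
# Values copied from the "Determined TensorRT-LLM configuration" line in the log.
CFG = {
    "vocab_size": 256000,
    "hidden_size": 3584,
    "num_hidden_layers": 42,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_size": 256,
    "intermediate_size": 14336,
}


def estimate_params(cfg):
    """Rough parameter count; the layer layout is an assumption (see lead-in)."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_size"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_size"]
    attn = h * q_dim + 2 * h * kv_dim + q_dim * h  # Q, K, V, O projections
    mlp = 3 * h * cfg["intermediate_size"]         # gate, up, down projections
    norms = 4 * h                                  # assumed four norm weights per layer
    embed = cfg["vocab_size"] * h                  # embedding shared with the LM head
    return embed + cfg["num_hidden_layers"] * (attn + mlp + norms) + h


def weights_gib(n_params, bytes_per_param):
    """Size of the weights alone, in GiB, for a given element size."""
    return n_params * bytes_per_param / 1024**3


def mem_available_gib(path="/proc/meminfo"):
    """MemAvailable as seen inside the container, or None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / 1024**2  # value is in kB
    except OSError:
        pass
    return None


if __name__ == "__main__":
    n = estimate_params(CFG)
    print(f"estimated parameters: {n:,}")
    for name, size in (("float16", 2), ("float32", 4)):
        print(f"weights in {name}: {weights_gib(n, size):.1f} GiB")
    avail = mem_available_gib()
    if avail is None:
        print("MemAvailable could not be read (not Linux, or /proc unavailable)")
    else:
        print(f"MemAvailable in this environment: {avail:.1f} GiB")
```

Running this inside the same container before starting convert_checkpoint.py shows whether the memory visible to the process is in the same range as the estimated weight size. If MemAvailable is well below the host's 64GB, the limit is probably being applied by the VM or container runtime rather than by the `--memory` flag alone.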

