NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

LLAMA 3.1 8B Quantization failed from BF16 to FP8 #2052

Open Ryan-ZL-Lin opened 1 month ago

Ryan-ZL-Lin commented 1 month ago

System Info

GPU: NVIDIA T4 * 4
Driver Version: 550.54.15
CUDA: 12.4
Image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
TensorRT-LLM version: 0.11.0

Who can help?

No response

Reproduction

  1. git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  2. pip install -r /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/requirements.txt
  3. mkdir -p /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
  4. git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
  5. docker run --runtime=nvidia -it --net host --ipc=host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /srv:/srv nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
  6. pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt
  7. pip install -r /srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/requirements.txt
  8. HF_LLAMA3_1_8B_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
  9. UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
  10. ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
  11. QUANTIZATION_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py
  12. python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1

Expected behavior

The BF16 model can be quantized to FP8, and the FP8 checkpoint can then be used to build the model engine.

Actual behavior

Quantization failed with the following error: ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'type': 'llama3'}

Here is the log:

root@7995db4e1845:/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct# python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Initializing model from /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Traceback (most recent call last):
  File "/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 379, in quantize_and_export
    model = get_model(model_dir, dtype, device=device)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 183, in get_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 524, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 989, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 772, in from_dict
    config = cls(**config_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 161, in __init__
    self._rope_scaling_validation()
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 182, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'type': 'llama3'}
root@7995db4e1845:/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct#
nv-guomingz commented 1 month ago

what's your transformers version?

MatthewPeyrard commented 1 month ago

I am hitting the same issue. It seems you need to have transformers 4.42.3, but this is impossible because "optimum" requires transformers to be a version <= 4.40.

I have modified the config.json to use:

  "rope_scaling": {
    "factor": 8.0,
    "type": "dynamic"
  },

And that worked, but I am not yet sure what side effects this might have.
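For anyone hitting the same pin, a minimal sketch of how to surface the conflict locally (an assumption on my part that pip is the package manager inside the container; pip check and importlib.metadata are standard tooling, not commands from this thread):

pip check   # reports conflicting requirements after an incompatible upgrade
python3 -c "from importlib.metadata import requires; print(requires('optimum'))"   # prints optimum's requirement strings, including its transformers pin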

Ryan-ZL-Lin commented 1 month ago

what's your transformers version?

root@7995db4e1845:/srv/tensorrtllm_backend# pip show transformers
Name: transformers
Version: 4.42.4
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: nemo_text_processing, optimum, sentence-transformers, tensorrt-llm, transformers-stream-generator

nv-guomingz commented 1 month ago

what's your transformers version?

(pip show transformers output quoted above: Version 4.42.4)

Could u please update your transformers version to 4.43.dev0+?
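A minimal sketch of one way to get such a build inside the container (the git install is an assumption about how the dev version is obtained; it is the kind of install that yields a 4.4x.0.dev0 version string like the one in the next comment, and it will trip pip's resolver warning about optimum):

pip install --upgrade git+https://github.com/huggingface/transformers.git
# once 4.43 (or newer) is published on PyPI, a plain versioned install also works:
pip install --upgrade "transformers>=4.43"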

Ryan-ZL-Lin commented 1 month ago

Thanks @nv-guomingz. After upgrading the transformers version, I got a different error: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge. Here are the steps to reproduce:

Update transformers version to 4.43.dev0+

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
optimum 1.21.2 requires transformers[sentencepiece]<4.43.0,>=4.26.0, but you have transformers 4.44.0.dev0 which is incompatible.
Successfully installed transformers-4.44.0.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
root@7995db4e1845:/opt/tritonserver# pip show transformers
Name: transformers
Version: 4.44.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: nemo_text_processing, optimum, sentence-transformers, tensorrt-llm, transformers-stream-generator
root@7995db4e1845:/opt/tritonserver#

Run quantization

root@7995db4e1845:/opt/tritonserver# HF_LLAMA3_1_8B_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
root@7995db4e1845:/opt/tritonserver# UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# QUANTIZATION_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py
root@7995db4e1845:/opt/tritonserver# python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Initializing model from /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Unrecognized keys in `rope_scaling` for 'rope_type'='llama3': {'type'}
Loading checkpoint shards:   0%|                                                                                                                                      | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 379, in quantize_and_export
    model = get_model(model_dir, dtype, device=device)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 183, in get_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3931, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4385, in _load_pretrained_model
    state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 549, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
root@7995db4e1845:/opt/tritonserver#

This is the adjusted config under the model repository (screenshot omitted).

nv-guomingz commented 1 month ago

Is it possible to run the cmd pip install --upgrade safetensors and then run quantization again? I can't reproduce your issue on my side with the cmd you provided. My safetensors version is 0.4.2 and modelopt version is 0.15.0.

Ryan-ZL-Lin commented 1 month ago

Is it possible to run the cmd pip install --upgrade safetensors and then run quantization again? I can't reproduce your issue on my side with the cmd you provided. My safetensors version is 0.4.2 and modelopt version is 0.15.0.

Hi @nv-guomingz, I upgraded both libraries to the versions you used. However, I still got the error safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge, along with additional NeMo warning logs. I also tried safetensors==0.4.3, but the error was the same.

Here are the steps to reproduce:

Upgrade safetensors

root@7995db4e1845:/opt/tritonserver# pip show safetensors
Name: safetensors
Version: 0.4.2
Summary:
Home-page: https://github.com/huggingface/safetensors
Author:
Author-email: Nicolas Patry <patry.nicolas@protonmail.com>
License:
Location: /usr/local/lib/python3.10/dist-packages
Requires:
Required-by: accelerate, diffusers, transformers

Upgrade nvidia-modelopt

pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt==0.15.0

root@7995db4e1845:/opt/tritonserver# pip show nvidia-modelopt
Name: nvidia-modelopt
Version: 0.15.0
Summary: Nvidia TensorRT Model Optimizer: a unified model optimization and deployment toolkit.
Home-page: https://github.com/NVIDIA/TensorRT-Model-Optimizer
Author:
Author-email: "Nvidia, Inc." <ammo-support@exchange.nvidia.com>
License: NVIDIA Proprietary Software
Location: /usr/local/lib/python3.10/dist-packages
Requires: cloudpickle, ninja, numpy, packaging, pydantic, rich, scipy, tqdm
Required-by: tensorrt-llm
root@7995db4e1845:/opt/tritonserver#

Run quantization

root@7995db4e1845:/opt/tritonserver# HF_LLAMA3_1_8B_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
root@7995db4e1845:/opt/tritonserver# UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# QUANTIZATION_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py
root@7995db4e1845:/opt/tritonserver# python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[NeMo W 2024-07-31 10:43:34 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/xla.py:18: DeprecationWarning: `ModuleAvailableCache` is a special case of `RequirementCache`. Please use `RequirementCache(module=...)` instead.
      from lightning_fabric.accelerators.tpu import _XLA_AVAILABLE, TPUAccelerator

[NeMo W 2024-07-31 10:43:35 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/wandb/analytics/sentry.py:90: SentryHubDeprecationWarning: `sentry_sdk.Hub` is deprecated and will be removed in a future major release. Please consult our 1.x to 2.x migration guide for details on how to migrate `Hub` usage to the new API: https://docs.sentry.io/platforms/python/migration/1.x-to-2.x
      self.hub = sentry_sdk.Hub(client)

Initializing model from /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Unrecognized keys in `rope_scaling` for 'rope_type'='llama3': {'type'}
Loading checkpoint shards:   0%|                                                                                                                                      | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 379, in quantize_and_export
    model = get_model(model_dir, dtype, device=device)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 183, in get_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3931, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4385, in _load_pretrained_model
    state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 549, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
[NeMo W 2024-07-31 10:43:36 nemo_logging:349] /usr/lib/python3.10/tempfile.py:1008: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpu8u6ggw6'>
      _warnings.warn(warn_message, ResourceWarning)

root@7995db4e1845:/opt/tritonserver#
nv-guomingz commented 1 month ago

It's weird. Could u please double check your ckpt's sanity? Below are the screenshots for llama 3.1 8b fp8 quantization and ckpt's md5sum values on my side.

(screenshots omitted)
Ryan-ZL-Lin commented 1 month ago

It's weird. Could u please double check your ckpt's sanity? Below are the screenshots for llama 3.1 8b fp8 quantization and ckpt's md5sum values on my side.

My checkpoints' md5sums are different from yours...

ubuntu@ip-30-60-90-17:/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct$ ls *.safetensors|xargs md5sum
3896603df44731722a1cfdf617320b70  model-00001-of-00004.safetensors
db5afcbec4c40ca95b00caf053f6e028  model-00002-of-00004.safetensors
768b3498b50f0e24fad652af65e88e3d  model-00003-of-00004.safetensors
3817ec476aa6bfa6b18f29bb8fdbaec9  model-00004-of-00004.safetensors

here is the command I used to clone the checkpoints from HF: git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct

Is it the same as your download approach?

nv-guomingz commented 1 month ago

es/llama/llama-3.1-8b-instruct$ ls *.safetensors|xargs md5sum

It's very possible that the ckpt is corrupted. You may try redownloading it.

Here is the size of my llama 3.1 8b checkpoint.

ls *.safetensors|xargs ls -al
-rw-r--r-- 1 guomingz dip 4976698672 Jul 31 11:36 model-00001-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4999802720 Jul 31 11:37 model-00002-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4915916176 Jul 31 11:37 model-00003-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 1168138808 Jul 31 11:34 model-00004-of-00004.safetensors

Ryan-ZL-Lin commented 1 month ago

es/llama/llama-3.1-8b-instruct$ ls *.safetensors|xargs md5sum

It's very possible that the ckpt is corrupted. You may try redownloading it.

Here is the size of my llama 3.1 8b checkpoint.

ls *.safetensors|xargs ls -al
-rw-r--r-- 1 guomingz dip 4976698672 Jul 31 11:36 model-00001-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4999802720 Jul 31 11:37 model-00002-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4915916176 Jul 31 11:37 model-00003-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 1168138808 Jul 31 11:34 model-00004-of-00004.safetensors

Thanks @nv-guomingz, you're right: the safetensors files in my repo were far too small (see the listing below). After redownloading the model checkpoint files, I can quantize the model to FP8.

-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00001-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00002-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00003-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00004-of-00004.safetensors
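For reference, .safetensors shards of only 135 bytes are typically Git LFS pointer stubs left behind when a Hugging Face repo is cloned without git-lfs, which is exactly what produces the HeaderTooLarge error. A minimal sketch of one way to pull the real weights (the apt-get step is an assumption about the container image):

apt-get update && apt-get install -y git-lfs
git lfs install
cd /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
git lfs pull   # replaces the pointer stubs with the actual multi-GB shards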

However, when running the model engine build command, I got this error:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 468, in main
    rotary_type = rotary_scaling['type']
KeyError: 'type'

Is this related to the config.json in my ${UNIFIED_CKPT_PATH}?
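One quick way to see which rope/rotary scaling keys the generated checkpoint config actually carries (a sketch; the exact field name may differ between TensorRT-LLM versions):

python3 -m json.tool ${UNIFIED_CKPT_PATH}/config.json | grep -i -A 6 scaling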

Here is the error log:

root@7995db4e1845:/opt/tritonserver# trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin fp8
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[08/01/2024-10:53:42] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gemm_plugin to fp8.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set lookup_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set lora_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set moe_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set context_fmha to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set remove_input_padding to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set reduce_fusion to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set multi_block_mode to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set enable_xqa to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set multiple_profiles to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set paged_state to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set streamingllm to False.
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.15.0'}
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[08/01/2024-10:53:42] [TRT-LLM] [I] Compute capability: (7, 5)
[08/01/2024-10:53:42] [TRT-LLM] [I] SM count: 40
[08/01/2024-10:53:42] [TRT-LLM] [I] SM clock: 1590 MHz
[08/01/2024-10:53:42] [TRT-LLM] [I] int4 TFLOPS: 260
[08/01/2024-10:53:42] [TRT-LLM] [I] int8 TFLOPS: 130
[08/01/2024-10:53:42] [TRT-LLM] [I] fp8 TFLOPS: 0
[08/01/2024-10:53:42] [TRT-LLM] [I] float16 TFLOPS: 65
[08/01/2024-10:53:42] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[08/01/2024-10:53:42] [TRT-LLM] [I] float32 TFLOPS: 8
[08/01/2024-10:53:42] [TRT-LLM] [I] Total Memory: 15 GiB
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory clock: 5001 MHz
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory bus width: 256
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe link width: 8
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 468, in main
    rotary_type = rotary_scaling['type']
KeyError: 'type'
root@7995db4e1845:/opt/tritonserver#
nv-guomingz commented 1 month ago

For 0.11 TRT-LLM, we don't claim LLAMA 3 support yet: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.11.0/examples/llama

I suggest you give this wheel a try. Also, your device (T4) doesn't support FP8 acceleration.
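For context on the FP8 point, the compute capability can be checked directly (assuming a driver recent enough to expose the compute_cap query); FP8 GEMMs generally require compute capability 8.9 (Ada) or 9.0 (Hopper), while the build log above already reports (7, 5) for the T4:

nvidia-smi --query-gpu=name,compute_cap --format=csv
# expected on this system: Tesla T4, 7.5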

junam2 commented 1 month ago

@nv-guomingz Hello. I have the same issue when quantizing the Llama 3.1 70B model to FP8.

GPU: H100 * 2
Driver Version: 550.90.07
CUDA: 12.4
Image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
TensorRT-LLM version: 0.11.0

After I ran pip install tensorrt-llm==0.12.0.dev2024073000, the system automatically removed tensorrt-llm v0.11.0. When I try to run quantization, the error below occurs. I think tensorrt-llm v0.12.0 is not compatible with the tritonserver:24.07 image.

root@my-llama-test-687489c5q4-dxw6w:/home# trtllm-build
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 5, in <module>
    from tensorrt_llm.commands.build import main
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 25, in <module>
    import tensorrt as trt
ModuleNotFoundError: No module named 'tensorrt'
github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.