Open Ryan-ZL-Lin opened 1 month ago
what's your transformers version?
I am hitting the same issue. It seems you need to have transformers 4.42.3, but this is impossible because "optimum" requires transformers to be a version <= 4.40.
I modified config.json to use:
"rope_scaling": {
    "factor": 8.0,
    "type": "dynamic"
},
And that worked, but I am not yet sure what side effects this might have.
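A minimal sketch of the workaround described above, assuming a standard Hugging Face `config.json`. The helper name `override_rope_scaling` and the `.bak` backup convention are hypothetical, not part of any library. It replaces the whole `rope_scaling` entry with the two-field form, so the Llama 3.1 specific fields (`low_freq_factor`, `high_freq_factor`, `original_max_position_embeddings`) are dropped; that is one likely source of the side effects the commenter was unsure about.

```python
import json
import shutil
from pathlib import Path


def override_rope_scaling(config_path, factor=8.0, rope_type="dynamic"):
    """Rewrite `rope_scaling` in a config.json, keeping a backup copy.

    Hypothetical helper for illustration only. It discards every other
    key in the original `rope_scaling` entry.
    """
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")

    # Keep the untouched original so the change can be reverted.
    if not backup_path.exists():
        shutil.copy2(config_path, backup_path)

    config = json.loads(config_path.read_text())
    config["rope_scaling"] = {"factor": factor, "type": rope_type}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config["rope_scaling"]
```

Calling `override_rope_scaling("<model_dir>/config.json")` once before loading the model reproduces the manual edit; restoring the `.bak` file undoes it.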
what's your transformers version?
root@7995db4e1845:/srv/tensorrtllm_backend# pip show transformers
Name: transformers
Version: 4.42.4
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: nemo_text_processing, optimum, sentence-transformers, tensorrt-llm, transformers-stream-generator
Could you please update your transformers version to 4.43.dev0 or later?
thanks @nv-guomingz
After upgrading transformers, I got a different error: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Here are the steps to reproduce:
Update transformers version to 4.43.dev0+
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
optimum 1.21.2 requires transformers[sentencepiece]<4.43.0,>=4.26.0, but you have transformers 4.44.0.dev0 which is incompatible.
Successfully installed transformers-4.44.0.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
root@7995db4e1845:/opt/tritonserver# pip show transformers
Name: transformers
Version: 4.44.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: nemo_text_processing, optimum, sentence-transformers, tensorrt-llm, transformers-stream-generator
root@7995db4e1845:/opt/tritonserver#
Run quantization
root@7995db4e1845:/opt/tritonserver# HF_LLAMA3_1_8B_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
root@7995db4e1845:/opt/tritonserver# UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# QUANTIZATION_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py
root@7995db4e1845:/opt/tritonserver# python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Initializing model from /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Unrecognized keys in `rope_scaling` for 'rope_type'='llama3': {'type'}
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
quantize_and_export(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 379, in quantize_and_export
model = get_model(model_dir, dtype, device=device)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 183, in get_model
model = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3931, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4385, in _load_pretrained_model
state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 549, in load_state_dict
with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
root@7995db4e1845:/opt/tritonserver#
this is the adjusted config under model repository
Could you run pip install --upgrade safetensors
and then run quantization again?
I can't reproduce your issue on my side with the command you provided.
My safetensors version is 0.4.2 and my modelopt version is 0.15.0.
Hi @nv-guomingz
I upgraded both libraries to the versions you use. However, I still get safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge,
now with additional NeMo warnings in the log. I also tried safetensors==0.4.3, but the error was the same.
Here are the steps to reproduce:
upgrade safetensors
root@7995db4e1845:/opt/tritonserver# pip show safetensors
Name: safetensors
Version: 0.4.2
Summary:
Home-page: https://github.com/huggingface/safetensors
Author:
Author-email: Nicolas Patry <patry.nicolas@protonmail.com>
License:
Location: /usr/local/lib/python3.10/dist-packages
Requires:
Required-by: accelerate, diffusers, transformers
upgrade nvidia-modelopt
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt==0.15.0
root@7995db4e1845:/opt/tritonserver# pip show nvidia-modelopt
Name: nvidia-modelopt
Version: 0.15.0
Summary: Nvidia TensorRT Model Optimizer: a unified model optimization and deployment toolkit.
Home-page: https://github.com/NVIDIA/TensorRT-Model-Optimizer
Author:
Author-email: "Nvidia, Inc." <ammo-support@exchange.nvidia.com>
License: NVIDIA Proprietary Software
Location: /usr/local/lib/python3.10/dist-packages
Requires: cloudpickle, ninja, numpy, packaging, pydantic, rich, scipy, tqdm
Required-by: tensorrt-llm
root@7995db4e1845:/opt/tritonserver#
Run quantization
root@7995db4e1845:/opt/tritonserver# HF_LLAMA3_1_8B_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
root@7995db4e1845:/opt/tritonserver# UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/llama/llama-3.1-8b-instruct/tp1_pp1_fp8/1-gpu
root@7995db4e1845:/opt/tritonserver# QUANTIZATION_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py
root@7995db4e1845:/opt/tritonserver# python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[NeMo W 2024-07-31 10:43:34 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/xla.py:18: DeprecationWarning: `ModuleAvailableCache` is a special case of `RequirementCache`. Please use `RequirementCache(module=...)` instead.
from lightning_fabric.accelerators.tpu import _XLA_AVAILABLE, TPUAccelerator
[NeMo W 2024-07-31 10:43:35 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/wandb/analytics/sentry.py:90: SentryHubDeprecationWarning: `sentry_sdk.Hub` is deprecated and will be removed in a future major release. Please consult our 1.x to 2.x migration guide for details on how to migrate `Hub` usage to the new API: https://docs.sentry.io/platforms/python/migration/1.x-to-2.x
self.hub = sentry_sdk.Hub(client)
Initializing model from /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Unrecognized keys in `rope_scaling` for 'rope_type'='llama3': {'type'}
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/srv/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
quantize_and_export(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 379, in quantize_and_export
model = get_model(model_dir, dtype, device=device)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 183, in get_model
model = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3931, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4385, in _load_pretrained_model
state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 549, in load_state_dict
with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
[NeMo W 2024-07-31 10:43:36 nemo_logging:349] /usr/lib/python3.10/tempfile.py:1008: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpu8u6ggw6'>
_warnings.warn(warn_message, ResourceWarning)
root@7995db4e1845:/opt/tritonserver#
That's weird. Could you please double-check the integrity of your checkpoint files? Below are the screenshots of the Llama 3.1 8B FP8 quantization and the checkpoint md5sum values on my side.
My checkpoints' md5sums are different from yours:
ubuntu@ip-30-60-90-17:/srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct$ ls *.safetensors|xargs md5sum
3896603df44731722a1cfdf617320b70 model-00001-of-00004.safetensors
db5afcbec4c40ca95b00caf053f6e028 model-00002-of-00004.safetensors
768b3498b50f0e24fad652af65e88e3d model-00003-of-00004.safetensors
3817ec476aa6bfa6b18f29bb8fdbaec9 model-00004-of-00004.safetensors
Here is the command I used to clone the checkpoints from HF:
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct /srv/tensorrtllm_backend/tensorrt_llm/examples/llama/llama-3.1-8b-instruct
Is it the same as your download approach?
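For comparing checkpoints across machines without the shell pipeline, the same md5 check can be sketched in Python. The helper names `md5_of_file` and `checkpoint_md5s` are hypothetical; the only assumption is that the shards match `*.safetensors` in the model directory, as in the listing above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Multi-gigabyte shards should not be loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_md5s(model_dir):
    """Map each *.safetensors file name in model_dir to its md5."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    return {shard.name: md5_of_file(shard) for shard in shards}
```

Two people can then compare the returned dictionaries directly; any differing value points at the shard to download again.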
It's quite possible that the checkpoint files are corrupted. You may want to download them again.
Here are the sizes of my Llama 3.1 8B checkpoint files.
ls *.safetensors|xargs ls -al
-rw-r--r-- 1 guomingz dip 4976698672 Jul 31 11:36 model-00001-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4999802720 Jul 31 11:37 model-00002-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 4915916176 Jul 31 11:37 model-00003-of-00004.safetensors
-rw-r--r-- 1 guomingz dip 1168138808 Jul 31 11:34 model-00004-of-00004.safetensors
Thanks @nv-guomingz. You're right, the safetensors files in my repo were far too small (see the listing below). After redownloading the model checkpoint files, I can quantize the model to FP8.
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00001-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00002-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00003-of-00004.safetensors
-rw------- 1 ubuntu ubuntu 135 Jul 30 03:34 model-00004-of-00004.safetensors
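One plausible explanation for 135-byte shards, not confirmed anywhere in this thread, is that `git clone` ran without Git LFS and left pointer files in place of the weights. A Git LFS pointer is a small text file that starts with `version https://git-lfs.github.com/spec/v1`. A safetensors file starts with an 8-byte little-endian header length, so reading the text `version ` as that integer gives an absurdly large value, which would be consistent with `HeaderTooLarge`. The sketch below checks both conditions; `inspect_safetensors` is a hypothetical helper, not a safetensors API.

```python
import struct
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def inspect_safetensors(path):
    """Report whether a file looks like an LFS pointer or a real shard."""
    path = Path(path)
    size = path.stat().st_size
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))

    # safetensors layout: 8-byte little-endian u64 header length,
    # followed by that many bytes of JSON header, then tensor data.
    header_len = struct.unpack("<Q", head[:8])[0] if len(head) >= 8 else None
    header_fits = header_len is not None and 8 + header_len <= size

    return {
        "size": size,
        "is_lfs_pointer": head.startswith(LFS_POINTER_PREFIX),
        "header_len": header_len,
        "header_fits": header_fits,
    }
```

Running this over each shard before quantization would flag a pointer file (`is_lfs_pointer` true, `header_fits` false) without loading the model.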
However, when running the engine build command, I got this error:
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 468, in main
rotary_type = rotary_scaling['type']
KeyError: 'type'
Is it related to the config.json in my ${UNIFIED_CKPT_PATH}?
Here is the error log:
root@7995db4e1845:/opt/tritonserver# trtllm-build \
--checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${ENGINE_DIR} \
--gemm_plugin fp8
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[08/01/2024-10:53:42] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gemm_plugin to fp8.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set lookup_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set lora_plugin to None.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set moe_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set context_fmha to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set remove_input_padding to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set reduce_fusion to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set multi_block_mode to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set enable_xqa to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set multiple_profiles to False.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set paged_state to True.
[08/01/2024-10:53:42] [TRT-LLM] [I] Set streamingllm to False.
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.15.0'}
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[08/01/2024-10:53:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[08/01/2024-10:53:42] [TRT-LLM] [I] Compute capability: (7, 5)
[08/01/2024-10:53:42] [TRT-LLM] [I] SM count: 40
[08/01/2024-10:53:42] [TRT-LLM] [I] SM clock: 1590 MHz
[08/01/2024-10:53:42] [TRT-LLM] [I] int4 TFLOPS: 260
[08/01/2024-10:53:42] [TRT-LLM] [I] int8 TFLOPS: 130
[08/01/2024-10:53:42] [TRT-LLM] [I] fp8 TFLOPS: 0
[08/01/2024-10:53:42] [TRT-LLM] [I] float16 TFLOPS: 65
[08/01/2024-10:53:42] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[08/01/2024-10:53:42] [TRT-LLM] [I] float32 TFLOPS: 8
[08/01/2024-10:53:42] [TRT-LLM] [I] Total Memory: 15 GiB
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory clock: 5001 MHz
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory bus width: 256
[08/01/2024-10:53:42] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe link width: 8
[08/01/2024-10:53:42] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 468, in main
rotary_type = rotary_scaling['type']
KeyError: 'type'
root@7995db4e1845:/opt/tritonserver#
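The traceback shows `build.py` indexing `rotary_scaling['type']` directly. The earlier quantization warning (`Unrecognized keys in rope_scaling for 'rope_type'='llama3': {'type'}`) suggests, as an assumption, that the exported config carries the scaling kind under `rope_type` rather than `type`. The sketch below shows a lookup that tolerates either key. `get_rotary_type` is a hypothetical helper for illustration, not the fix that TensorRT-LLM ships.

```python
def get_rotary_type(rotary_scaling):
    """Return the scaling kind from a rope-scaling dict, or None.

    Accepts both the older `type` key and the newer `rope_type` key
    instead of raising KeyError when one of them is absent.
    """
    if not rotary_scaling:
        return None
    for key in ("type", "rope_type"):
        if key in rotary_scaling:
            return rotary_scaling[key]
    return None
```

With such a lookup, a config holding only `rope_type` would return `"llama3"` rather than failing, although the caller still has to know how to handle that scaling kind.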
For TRT-LLM 0.11, we don't claim Llama 3 support yet; see https://github.com/NVIDIA/TensorRT-LLM/tree/v0.11.0/examples/llama.
I suggest trying this wheel. Also, your device (T4) doesn't support FP8 acceleration.
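The build log above already hints at this: it reports `Compute capability: (7, 5)` and `fp8 TFLOPS: 0`. FP8 tensor-core support starts with compute capability 8.9 (Ada) and 9.0 (Hopper), so a quick pre-flight check can be sketched as below. `supports_fp8` is a hypothetical helper; the tuple it takes has the same shape as the one printed in the log.

```python
def supports_fp8(compute_capability):
    """Return True if a (major, minor) compute capability has FP8 support.

    Assumes FP8 hardware support begins at compute capability 8.9.
    With PyTorch installed, torch.cuda.get_device_capability() returns
    a tuple of this shape for the current device.
    """
    major, minor = compute_capability
    return (major, minor) >= (8, 9)
```

For the T4 in this thread, `supports_fp8((7, 5))` is false, so an FP8 engine build is not expected to work on that device, whatever library versions are installed.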
@nv-guomingz Hello. I have the same issue when quantizing the Llama 3.1 70B model to FP8.
GPU: H100 * 2
Driver Version: 550.90.07
CUDA: 12.4
Image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
TensorRT-LLM version: 0.11.0
After I ran pip install tensorrt-llm==0.12.0.dev2024073000, pip automatically uninstalled tensorrt-llm v0.11.0. When I then try to quantize, the error below occurs. I suspect tensorrt-llm v0.12.0 is not compatible with tritonserver:24.07.
root@my-llama-test-687489c5q4-dxw6w:/home# trtllm-build
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 5, in <module>
from tensorrt_llm.commands.build import main
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
import tensorrt_llm.functional as functional
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 25, in <module>
import tensorrt as trt
ModuleNotFoundError: No module named 'tensorrt'
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
GPU: NVIDIA T4 * 4
Driver Version: 550.54.15
CUDA: 12.4
Image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
TensorRT-LLM version: 0.11.0
Reproduction
python3 ${QUANTIZATION_SCRIPT} --model_dir ${HF_LLAMA3_1_8B_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 1
Expected behavior
The BF16 model can be quantized to FP8, then use FP8 checkpoints to build model engine
Actual behavior
Quantization failed with the following error: ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'type': 'llama3'}
Here is the log: