NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
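For orientation, a minimal sketch of that Python API. This is hedged: the exact import path varies by release (recent versions expose a top-level LLM class; around v0.10 it lived under tensorrt_llm.hlapi), and the model name here is just a small placeholder checkpoint.

```python
from tensorrt_llm import LLM, SamplingParams  # high-level API in recent releases

# Build (or load a cached) engine from a HuggingFace checkpoint, then run inference.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling = SamplingParams(max_tokens=32)

for output in llm.generate(["What does FP8 quantization change?"], sampling):
    print(output.outputs[0].text)
```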

[Quantization] [mixtral_8x22B] NotImplementedError: Cannot copy out of meta tensor; no data! #1585

Closed · Godlovecui closed this issue 6 months ago

Godlovecui commented 6 months ago

System Info

(System information provided as a screenshot.)

Who can help?

@Tracin

Reproduction

python ../quantization/quantize.py --model_dir /network/model/Mixtral-8x22B-v0.1 \
    --dtype bfloat16 \
    --qformat fp8 \
    --output_dir ./tllm_checkpoint_mixtral_8x22B_8gpu_fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 8 \
    --tp_size 8 \
    --batch_size 8

Expected behavior

Quantization completes successfully and produces an FP8 TensorRT-LLM checkpoint.

actual behavior

(Screenshot of the error; the full log is reproduced in the additional notes below.)

additional notes

When I quantize Mixtral-8x22B-v0.1 into FP8 on RTX 4090 GPUs, it raises the error below. How can I resolve it? Thank you!

Initializing model from /network/model/Mixtral-8x22B-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████| 59/59 [03:34<00:00, 3.64s/it]
[05/09/2024-03:29:18] Some parameters are on the meta device device because they were offloaded to the cpu.
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /network/model/Mixtral-8x22B-v0.1
Loading calibration dataset
Starting quantization...
Inserted 4875 quantizers
Calibrating batch 0
Quantization done. Total time used: 103.36 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to tllm_checkpoint_mixtral_8x22B_8gpu_fp8/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 364, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 220, in torch_to_tensorrt_llm_checkpoint
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1180, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 650, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 606, in build_linear_config
    config.weight = weight.cpu()
NotImplementedError: Cannot copy out of meta tensor; no data!
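The failing call is weight.cpu() on a tensor that lives on PyTorch's meta device (shape and dtype only, no backing storage), which is what accelerate substitutes for offloaded weights. A standalone snippet reproducing just that PyTorch behavior, not the quantize.py code path:

```python
import torch

# A meta tensor carries only metadata (shape, dtype); it has no data to copy.
w = torch.empty(4, 4, device="meta")

try:
    w.cpu()  # same operation as `config.weight = weight.cpu()` in layer_utils.py
except NotImplementedError as e:
    print(e)  # -> Cannot copy out of meta tensor; no data!
```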

Godlovecui commented 6 months ago

The version of TensorRT-LLM is: [TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700

nv-guomingz commented 6 months ago

I think the most likely reason is that ModelOpt requires loading the whole model into GPU memory, and the 8x4090 setup doesn't have enough GPU memory to hold Mixtral-8x22B, so part of it gets offloaded.
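One way to confirm this before rerunning the full script (a sketch, assuming quantize.py loads the checkpoint roughly this way, i.e. via transformers with device_map="auto"): any parameter that accelerate offloaded will report the meta device, and those are exactly the ones the export step cannot copy.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: this mirrors how an auto-sharded load distributes the model.
model = AutoModelForCausalLM.from_pretrained(
    "/network/model/Mixtral-8x22B-v0.1",  # path from the report
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Parameters that did not fit on the GPUs are replaced by meta tensors.
offloaded = [name for name, p in model.named_parameters() if p.device.type == "meta"]
print(f"{len(offloaded)} parameters were offloaded (meta device)")
print(model.hf_device_map)  # shows which modules landed on gpu / cpu / disk
```

If that list is non-empty, the export failure above is expected, and the fix is more GPU memory (or a machine where the full model fits) rather than a change to the quantization flags.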

byshiue commented 6 months ago

Duplicate of https://github.com/NVIDIA/TensorRT-LLM/issues/1440. Closing this one.