NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Quantizing Phi-3 128k Instruct to FP8 fails. #1741

Open kalradivyanshu opened 1 month ago

kalradivyanshu commented 1 month ago

System Info

```
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```


TensorRT-LLM: v0.10.0

### Who can help?

@Tracin 

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Ran: 
```bash
python3 ./examples/quantization/quantize.py \
    --model_dir ../../Phi-3-mini-128k-instruct/ \
    --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_1gpu_fp8 --tp_size 1
```

Error:

```
(phi-3) root@150b096d1444:/workspace/tensorrt/TensorRT-LLM# python3 ./examples/quantization/quantize.py --model_dir ../../Phi-3-mini-128k-instruct/ --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./tllm_checkpoint_1gpu_fp8 --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
Initializing model from ../../Phi-3-mini-128k-instruct/
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.06it/s]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from ../../Phi-3-mini-128k-instruct/
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading calibration dataset
Starting quantization...
Inserted 387 quantizers
Calibrating batch 0
[06/05/2024-16:24:58] You are not running the flash-attention implementation, expect numerical differences.
Calibrating batch 1
Calibrating batch 2
...
Calibrating batch 511
Quantization done. Total time used: 39.34 s.
Unknown model type Phi3ForCausalLM. Continue exporting...
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
current rank: 0, tp rank: 0, pp rank: 0
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to tllm_checkpoint_1gpu_fp8/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: 'unknown:Phi3ForCausalLM'
Traceback (most recent call last):
  File "/workspace/tensorrt/TensorRT-LLM/phi-3/lib/python3.10/site-packages/modelopt/torch/export/model_config_export.py", line 364, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/workspace/tensorrt/TensorRT-LLM/phi-3/lib/python3.10/site-packages/modelopt/torch/export/model_config_export.py", line 312, in torch_to_tensorrt_llm_checkpoint
    tensorrt_llm_config = convert_to_tensorrt_llm_config(model_config, tp_size_overwrite)
  File "/workspace/tensorrt/TensorRT-LLM/phi-3/lib/python3.10/site-packages/modelopt/torch/export/tensorrt_llm_utils.py", line 84, in convert_to_tensorrt_llm_config
    "architecture": MODEL_NAME_TO_HF_ARCH_MAP[decoder_type],
KeyError: 'unknown:Phi3ForCausalLM'

Traceback (most recent call last):
  File "/workspace/tensorrt/TensorRT-LLM/./examples/quantization/quantize.py", line 90, in <module>
    quantize_and_export(
  File "/workspace/tensorrt/TensorRT-LLM/phi-3/lib/python3.10/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 340, in quantize_and_export
    with open(f"{export_path}/config.json", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './tllm_checkpoint_1gpu_fp8/config.json'
```
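To summarize the two stacked failures above: ModelOpt's export table has no architecture mapping for Phi-3, so the lookup raises a `KeyError`; because the export aborts, `config.json` is never written, and `quantize.py` then hits the `FileNotFoundError` while trying to read it back. A minimal sketch of that chain (the dict name comes from the traceback; its entries here are hypothetical, not the actual ModelOpt source):

```python
# Illustrative reconstruction of the failing lookup in
# modelopt/torch/export/tensorrt_llm_utils.py; entries are hypothetical.
MODEL_NAME_TO_HF_ARCH_MAP = {
    "llama": "LlamaForCausalLM",
    "gpt2": "GPT2LMHeadModel",
    # modelopt 0.11 has no entry for Phi-3
}

# The exporter does not recognize Phi3ForCausalLM, so the decoder type
# falls through as "unknown:..." and the lookup raises KeyError.
decoder_type = "unknown:Phi3ForCausalLM"
architecture = MODEL_NAME_TO_HF_ARCH_MAP[decoder_type]
# KeyError: 'unknown:Phi3ForCausalLM' -> export aborts, config.json is never
# written, and quantize.py's later open(f"{export_path}/config.json") fails.
```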



### Expected behavior

The quantized Phi-3 checkpoint (weights plus `config.json`) is saved to `./tllm_checkpoint_1gpu_fp8`.

### Actual behavior

Fails with `Detailed export error: 'unknown:Phi3ForCausalLM'`; no checkpoint is written.

### additional notes

None.
hijkzzz commented 1 month ago

I confirmed that this is a bug in v0.11. We are investigating internally.

hijkzzz commented 1 month ago

Reply from @Tracin: FP8 support for Phi-3 is supposed to land in the next release.

kalradivyanshu commented 1 month ago

Oh okay, thank you for your swift reply! Can I use something else to get an FP8 quant instead? For example, could I convert the Phi-3 Hugging Face weights to FP8 using https://huggingface.co/blog/quanto-introduction and then load them with `convert_checkpoint` in TensorRT-LLM? Would that work?

kalradivyanshu commented 1 month ago

> Reply from @Tracin: FP8 support for Phi-3 is supposed to land in the next release.

Will 0.11.0.dev2024060400 work for Phi-3 quantization, or does "next release" mean 0.12?

nv-guomingz commented 1 month ago

> > Reply from @Tracin: FP8 support for Phi-3 is supposed to land in the next release.
>
> Will 0.11.0.dev2024060400 work for Phi-3 quantization, or does "next release" mean 0.12?

It depends on when the ModelOpt lib adds Phi-3 128k support.

kalradivyanshu commented 1 month ago

> Can I use something else to get an FP8 quant instead? For example, could I convert the Phi-3 Hugging Face weights to FP8 using https://huggingface.co/blog/quanto-introduction and then load them with `convert_checkpoint` in TensorRT-LLM? Would that work?

@nv-guomingz, will this approach work, or does it also rely on ModelOpt?

nv-guomingz commented 1 month ago

> > Can I use something else to get an FP8 quant instead? For example, could I convert the Phi-3 Hugging Face weights to FP8 using https://huggingface.co/blog/quanto-introduction and then load them with `convert_checkpoint` in TensorRT-LLM? Would that work?
>
> @nv-guomingz, will this approach work, or does it also rely on ModelOpt?

So far, TRT-LLM only leverages nvidia-modelopt for quantization. We haven't evaluated https://huggingface.co/blog/quanto-introduction yet. My gut tells me it's possible to use that tool for FP8 scaling-factor generation, but it would take additional effort (glue code development) to apply the results to TRT-LLM checkpoint generation.
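For a sense of what that glue code would involve: a TRT-LLM FP8 checkpoint stores per-tensor scaling factors alongside the weights, so an external quantizer would at minimum have to produce those. A rough sketch under that assumption (the helper name is hypothetical, not a TRT-LLM or quanto API; 448.0 is the largest finite float8_e4m3fn value):

```python
import torch

def fp8_scale(weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: derive a per-tensor FP8 (E4M3) scaling factor
    # from the weight's absolute maximum. 448.0 is the E4M3 max value.
    return weight.abs().max().float() / 448.0

w = torch.randn(4096, 4096, dtype=torch.float16)
scale = fp8_scale(w)
w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)  # requires torch >= 2.1
print(scale.item(), w_fp8.dtype)
```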

We've filed a bug with the ModelOpt team, and I think this issue could be fixed soon (TRT-LLM pushes updates on a weekly basis, so we don't need to wait for another formal release like 0.12).

cjluo-omniml commented 1 month ago

The Phi-3 mini model changed after ModelOpt 0.11 was released. We plan to launch a new version of ModelOpt soon that will fix this issue.

nv-guomingz commented 2 weeks ago

Hi @kalradivyanshu, ModelOpt 0.13.0 has been out for several days now; please give it a try. I've verified `python3 ./examples/quantization/quantize.py --model_dir ../../Phi-3-mini-128k-instruct/ --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./tllm_checkpoint_1gpu_fp8 --tp_size 1` on my side, and there's no crash now.

If the issue no longer occurs, please close this ticket.

willy808 commented 1 week ago

When we use fp8 / int4_awq without enough VRAM, it automatically offloads some parameters. That's really appreciated!

However, there seems to be a bug when saving the weights to safetensors after quantizing: the config is lost and the weights get saved as modelopt_model.0.pth first, which is really confusing. Please help us solve this.

```
Quantization done. Total time used: 11516.87 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to llama3-70b-fp8/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 364, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 220, in torch_to_tensorrt_llm_checkpoint
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1179, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 649, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 605, in build_linear_config
    config.weight = weight.cpu()
NotImplementedError: Cannot copy out of meta tensor; no data!

Traceback (most recent call last):
  File "/media/mnt/sdb/willy/tensorrt_llm/TensorRT-LLM/examples/quantization/llama3-70-quantize.py", line 107, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 451, in quantize_and_export
    with open(f"{export_path}/config.json", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'llama3-70b-fp8/config.json'
```
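For context on the error above: `Cannot copy out of meta tensor; no data!` means some weights were still on PyTorch's `meta` device, which keeps shape and dtype but allocates no storage, so the exporter's `weight.cpu()` has nothing to copy. This is plain PyTorch behavior, reproducible outside TensorRT-LLM:

```python
import torch

# A "meta" tensor carries only shape/dtype metadata, no storage.
w = torch.empty(16, 16, device="meta")
print(w.shape, w.dtype)  # metadata is available

try:
    w.cpu()  # materialization fails: there is no data to copy
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!
```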