NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Quantized model of Mistral 7B using AMMO will not create engine (examples\llama convert_checkpoint.py) #1383

Closed raymondbernard closed 6 months ago

raymondbernard commented 6 months ago

System Info

CPU: x86_64
GPU: RTX 4060
CUDA version: 12.2.0
TensorRT-LLM version: 0.8.0 (on branch rel)
Driver version: 546.12
Python: 3.10
OS: Windows 11

Who can help?

@Tracin @juney-nvidia @byshiue

Information

Tasks

Reproduction

1) First I quantized my fine-tuned Mistral 7B v1 model. I used the following notebook to do so: https://colab.research.google.com/drive/1tHlySYMlGlbDv6B43osIiHN1HQRIYHPJ?usp=sharing. Below is the main line I used to successfully quantize it:

   !python quantize.py --model_dir="RayBernard/test_identity" --device=cuda --qformat=int4_awq --output_dir=quantized_model

2) Downloaded the quantized model to my PC, then put it into the examples\llama folder.

3) My config.json file looks like this:

   {
     "producer": { "name": "ammo", "version": "0.7.4" },
     "architecture": "LlamaForCausalLM",
     "dtype": "float16",
     "num_hidden_layers": 32,
     "num_attention_heads": 32,
     "num_key_value_heads": 8,
     "hidden_size": 4096,
     "norm_epsilon": 1e-05,
     "vocab_size": 32000,
     "max_position_embeddings": 32768,
     "hidden_act": "silu",
     "use_parallel_embedding": true,
     "embedding_sharding_dim": 0,
     "quantization": {
       "quant_algo": "W4A16_AWQ",
       "kv_cache_quant_algo": null,
       "group_size": 128,
       "has_zero_point": false,
       "pre_quant_scale": true,
       "exclude_modules": ["lm_head"]
     },
     "mapping": { "world_size": 1, "tp_size": 1, "pp_size": 1 },
     "head_size": 128,
     "intermediate_size": 14336,
     "position_embedding_type": "rope_gpt_neox",
     "rotary_base": 10000.0
   }

4) Followed the instructions at https://github.com/NVIDIA/TensorRT-LLM/tree/rel/examples/llama:

   python convert_checkpoint.py --model_dir quantized_model --output_dir tllm_checkpoint_1gpu_fp16 --dtype float16
I think these args are wrong based on my quantization.
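
A quick sanity check before the conversion/build step (a sketch, assuming the checkpoint was copied to examples\llama\quantized_model as described in step 2): both convert_checkpoint.py and trtllm-build look for config.json directly under the path they are given, so the checkpoint files should sit at the top level of that directory.

```
REM Sanity check (Windows cmd): confirm the checkpoint files are at the top
REM level of the directory that will be passed as --model_dir / --checkpoint_dir.
cd C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama
dir quantized_model
REM Expected listing (roughly): config.json and rank0.safetensors
```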

Expected behavior

I expected two files, config.json and rank0.safetensors, to be created in the tllm_checkpoint_1gpu_fp16 directory. I did copy my quantized_model into the examples\llama directory.
By the way, I tested converting and creating an engine for the bloom example. It works just fine in my environment as per the documented example.

actual behavior

(.venv) C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama>python convert_checkpoint.py --model_dir quantized_model --output_dir tllm_checkpoint_1gpu_fp16 --dtype float16
[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Traceback (most recent call last):
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama\convert_checkpoint.py", line 1532, in <module>
    main()
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama\convert_checkpoint.py", line 1212, in main
    hf_config = AutoConfig.from_pretrained(args.model_dir,
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\transformers\models\auto\configuration_auto.py", line 1082, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\transformers\configuration_utils.py", line 644, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\transformers\configuration_utils.py", line 699, in _get_config_dict
    resolved_config_file = cached_file(
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\transformers\utils\hub.py", line 360, in cached_file
    raise EnvironmentError(
OSError: quantized_model does not appear to have a file named config.json. Checkout 'https://huggingface.co/quantized_model/None' for available files.

additional notes

The message is strange because convert_checkpoint.py "thinks" I want to point to a Hugging Face repo. I believe there must be additional args I need to pass to convert_checkpoint.py to select the exact quantization I am using, etc. If you could let me know which args I need for my fine-tuned Mistral 7B v1 model, that would be great!
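
That Hub URL in the error appears to come from transformers itself: AutoConfig.from_pretrained raises this OSError when the string passed as --model_dir does not resolve to a local directory containing a config.json directly inside it, so the first thing to rule out is the path rather than missing quantization args. A quick check from the same prompt (the path string below matches the command used above):

```
REM Check whether config.json resolves under the exact string passed as
REM --model_dir, relative to the current directory (prints True or False).
python -c "import os; print(os.path.isfile(os.path.join('quantized_model', 'config.json')))"
```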

byshiue commented 6 months ago

As mentioned in the documentation (https://github.com/NVIDIA/TensorRT-LLM/tree/rel/examples/llama#awq), after quantizing the model you should build the engine from it directly instead of running convert_checkpoint.py again.
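
Concretely, the linked AWQ flow is two steps, sketched below with the flag values already used in this thread (the quantize step ran in Colab here, so only the build step needs to run locally against the downloaded checkpoint directory):

```
REM Step 1 (already done in Colab): quantize straight to a TensorRT-LLM
REM checkpoint (config.json + rank0.safetensors) with AMMO.
python quantize.py --model_dir "RayBernard/test_identity" --device cuda --qformat int4_awq --output_dir quantized_model

REM Step 2: build the engine directly from that checkpoint directory;
REM convert_checkpoint.py is not run on it again.
trtllm-build --checkpoint_dir quantized_model --output_dir trt_engines --gemm_plugin float16
```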

raymondbernard commented 6 months ago

(.venv) C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama>trtllm-build --checkpoint_dir quantized_model --output_dir trt_engines --gemm_plugin float16
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[04/01/2024-09:10:49] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set lookup_plugin to None.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set remove_input_padding to True.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set multi_block_mode to False.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set enable_xqa to True.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/01/2024-09:10:49] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/01/2024-09:10:49] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
  File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\tensorrt_llm\commands\build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\tensorrt_llm\commands\build.py", line 415, in parallel_build
    model_config = PretrainedConfig.from_json_file(
  File "C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\.venv\lib\site-packages\tensorrt_llm\models\modeling_utils.py", line 174, in from_json_file
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'quantized_model\config.json'

@byshiue Here is the message I got.

byshiue commented 5 months ago

It looks like you didn't pass the path correctly. Please check again.
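
In case it helps anyone landing here: since the build fails on 'quantized_model\config.json' not being found, one sketch of passing the path correctly is to give trtllm-build an absolute --checkpoint_dir (the path below is taken from the prompts earlier in this thread and assumes the checkpoint really lives there), which takes the current working directory out of the equation.

```
REM Sketch: use an absolute checkpoint path so config.json is resolved no matter
REM which directory trtllm-build is launched from. Adjust the path if the
REM checkpoint actually lives elsewhere.
trtllm-build --checkpoint_dir C:\Users\RayBe\OneDrive\Documents\nvidiaplayground\TensorRT-LLM\examples\llama\quantized_model --output_dir trt_engines --gemm_plugin float16
```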