NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral-8x7b-instruct-0.1 build fails with TypeError: LoraConfig.from_hf() missing 1 required positional argument: 'trtllm_modules_to_hf_modules' #924


ajamjoom commented 8 months ago

Setup

Machine: AWS SageMaker ml.p4d.24xlarge
Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Used a Docker container image with the latest pre-release build of trt-llm (0.8.0.dev2024011601).

Build arguments

    --world_size 8
    --tp_size 8
    --dtype float16
    --max_input_len 1024
    --max_output_len 512
    --max_batch_size 64
    --max_beam_width 1
    --use_gpt_attention_plugin float16
    --use_gemm_plugin float16
    --enable_context_fmha
    --use_inflight_batching
    --remove_input_padding
    --paged_kv_cache
    --tokens_per_block 128
    --rotary_base 10000.0
    --output_dir /tmp/.djl.ai/trtllm/XXX/XXX/1
    --parallel_build
    --use_custom_all_reduce
    --model_dir /tmp/.djl.ai/download/XXX

Error log

TypeError: LoraConfig.from_hf() missing 1 required positional argument: 'trtllm_modules_to_hf_modules'

More complete logs:


Converting model to TensorRT-LLM artifacts
convert_py: Converting Hugging Face model to TensorRT engine...
convert_py: Running TensorRT-LLM build command:
  python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py
    --world_size 8
    --tp_size 8
    --dtype float16
    --max_input_len 1024
    --max_output_len 512
    --max_batch_size 64
    --max_beam_width 1
    --use_gpt_attention_plugin float16
    --use_gemm_plugin float16
    --enable_context_fmha
    --use_inflight_batching
    --remove_input_padding
    --paged_kv_cache
    --tokens_per_block 128
    --rotary_base 10000.0
    --output_dir /tmp/.djl.ai/trtllm/XXX/XXX/1
    --parallel_build
    --use_custom_all_reduce
    --model_dir /tmp/.djl.ai/download/XXX
convert_py: You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
convert_py: Traceback (most recent call last):
convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py", line 895, in <module>
convert_py: args = parse_arguments()
convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py", line 549, in parse_arguments
convert_py: lora_config = LoraConfig.from_hf(args.hf_lora_dir,
convert_py: TypeError: LoraConfig.from_hf() missing 1 required positional argument: 'trtllm_modules_to_hf_modules'
convert_py: Traceback (most recent call last):
convert_py: File "/opt/djl/partition/trt_llm_partition.py", line 80, in <module>
convert_py: main()
convert_py: File "/opt/djl/partition/trt_llm_partition.py", line 76, in main
convert_py: create_trt_llm_repo(properties, args)
convert_py: File "/opt/djl/partition/trt_llm_partition.py", line 34, in create_trt_llm_repo
convert_py: create_model_repo(model_id_or_path, **kwargs)
convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/__init__.py", line 48, in create_model_repo
convert_py: model.create_model_repo(**kwargs)
convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trt_llm_model.py", line 129, in create_model_repo
convert_py: self._build_trt_engine(**kwargs)
convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/llama_model.py", line 91, in _build_trt_engine
convert_py: subprocess.check_call(cmd, shell=True)
convert_py: File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
convert_py: raise CalledProcessError(retcode, cmd)
convert_py: subprocess.CalledProcessError: Command
  'python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py
  --world_size 8
  --tp_size 8
  --dtype float16
  --max_input_len 1024
  --max_output_len 512
  --max_batch_size 64
  --max_beam_width 1
  --use_gpt_attention_plugin float16
  --use_gemm_plugin float16
  --enable_context_fmha
  --use_inflight_batching
  --remove_input_padding
  --paged_kv_cache
  --tokens_per_block 128
  --rotary_base 10000.0
  --output_dir /tmp/.djl.ai/trtllm/XXX/XXX/1
  --parallel_build
  --use_custom_all_reduce 
  --model_dir /tmp/.djl.ai/download/XXX'
  returned non-zero exit status 1.
    [ERROR] ModelServer - Failed register workflow
    java.util.concurrent.CompletionException: ai.djl.engine.EngineException: Model conversion process failed!
    #011at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?]
    #011at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320) [?:?]
    #011at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770) [?:?]
    #011at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760) [?:?]
    #011at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
    #011at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
    #011at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
    #011at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
    #011at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
    Caused by: ai.djl.engine.EngineException: Model conversion process failed!
    #011at ai.djl.serving.wlm.ModelInfo.buildTrtLlmArtifacts(ModelInfo.java:267) ~[wlm-0.26.0.jar:?]
    #011at ai.djl.serving.wlm.ModelInfo.convertIfNeed(ModelInfo.java:129) ~[wlm-0.26.0.jar:?]
    #011at ai.djl.serving.wlm.ModelInfo.initialize(ModelInfo.java:465) ~[wlm-0.26.0.jar:?]
    #011at ai.djl.serving.models.ModelManager.lambda$registerWorkflow$2(ModelManager.java:99) ~[serving-0.26.0.jar:?]
    #011at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[?:?]
    #011... 6 more
 ModelServer - Stopping model server
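
From the traceback, this looks like a plain signature mismatch: the toolkit's bundled build.py calls LoraConfig.from_hf() with fewer arguments than the installed 0.8.0.dev wheel requires. A minimal sketch to confirm which signature the installed wheel actually exposes (the import path is an assumption on my part, since LoraConfig has moved between modules across releases):

import inspect

# Assumed import path; adjust if it fails on your build, as LoraConfig has
# lived in different modules across releases.
from tensorrt_llm.lora_manager import LoraConfig

# If 'trtllm_modules_to_hf_modules' shows up as a required positional
# parameter, the toolkit's older two-argument call raises the TypeError
# seen in the logs above.
print(inspect.signature(LoraConfig.from_hf))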

CC: @byshiue, @symphonylyh

byshiue commented 8 months ago

Could you share your commit? From your log, the error happens at

convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py", line 549, in parse_arguments
convert_py: lora_config = LoraConfig.from_hf(args.hf_lora_dir,

But in the latest main branch at https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/build.py#L549, the code is different.
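
To compare against what you actually have installed, one quick check is to print the lines around the failing call in the toolkit's bundled copy of the script (a sketch; the path is taken from your traceback):

import itertools

# Path from the traceback; the toolkit bundles its own build.py, which can
# lag behind examples/llama/build.py on main.
path = "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/build.py"
with open(path) as f:
    # Print lines 545-554, around the LoraConfig.from_hf call at line 549.
    for line in itertools.islice(f, 544, 554):
        print(line, end="")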

ajamjoom commented 8 months ago

Dockerfile

# Start from the official AWS DJL 0.26 trt 0.7.1 inference container
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122

# update trt to latest pre-release
RUN pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

RUN pip3 show tensorrt_llm

For reference: the base container comes from the AWS large-model-inference-containers listing.

When I build the Docker image, I see these logs:

Uninstalling tensorrt_llm-0.7.1:
  Successfully uninstalled tensorrt_llm-0.7.1

Successfully installed tensorrt_llm-0.8.0.dev2024011601

Step 4/4 : RUN pip3 show tensorrt_llm
 ---> Running in 0372e69a67ea
Name: tensorrt-llm
Version: 0.8.0.dev2024011601
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, build, colored, cuda-python, diffusers, evaluate, janus, lark, mpi4py, numpy, nvidia-ammo, onnx, optimum, polygraphy, psutil, pynvml, sentencepiece, tensorrt, torch, transformers, wheel
Required-by:
Removing intermediate container 0372e69a67ea

So I should be on tensorrt_llm-0.8.0.dev2024011601, which should be the latest pre-release. I'm unsure how to check the commit, as I don't see this pre-release tagged in the repo. I wonder whether this pre-release points at an older version rather than the latest (I saw this issue report about previously fixed issues resurfacing).
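
As a rough check from inside the container, I can at least confirm the version string from Python (a sketch; the pip wheel doesn't appear to record the git commit it was built from):

import tensorrt_llm

# The top-level package exposes its version; this should print
# 0.8.0.dev2024011601 if the pre-release wheel is the one being loaded.
print(tensorrt_llm.__version__)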

I looked into some of the latest commits, and it seems that I'm using this one, although my Docker logs show that I'm on tensorrt_llm-0.8.0.dev2024011601.

byshiue commented 8 months ago

Could you run git log to show the commit you use?