NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

RuntimeError: Unsupported model architecture: FalconForCausalLM #1116


shekhars-li commented 7 months ago

System Info

Who can help?

@kaiyux @byshiue

Information

Tasks

Reproduction

Convert HF weights:

python convert_checkpoint.py --model_dir falcon-7b-hf --dtype float16 --output_dir tensorrt-llm-falcon-7b-hf

0.7.1
You are using a model of type RefinedWebModel to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
/home/jobuser/.local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Weights loaded. Total time: 00:00:02
Total time of converting checkpoints: 00:01:44
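For context, the converted checkpoint directory contains a config.json whose architecture field is what trtllm-build later looks up in its model registry. A quick way to see what the converter wrote (a minimal sketch; the exact key layout may differ between releases):

import json

# Path is the --output_dir passed to convert_checkpoint.py above.
with open("tensorrt-llm-falcon-7b-hf/config.json") as f:
    ckpt_config = json.load(f)

print(ckpt_config.get("architecture"))  # expected: FalconForCausalLM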

Compile engine:

trtllm-build --checkpoint_dir tensorrt-llm-falcon-7b-hf --use_gemm_plugin float16 --remove_input_padding \
  --use_gpt_attention_plugin float16 --output_dir tensorrt-llm-falcon-7b-hf/engine/

[02/20/2024-18:18:26] [TRT-LLM] [I] Remove Padding Enabled
Traceback (most recent call last):
  File "/home/jobuser/.local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/jobuser/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 217, in main
    build_and_save(source, build_config, args.output_dir, workers,
  File "/home/jobuser/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 154, in build_and_save
    build_and_save_shard(rank, rank % workers, ckpt_dir, build_config,
  File "/home/jobuser/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 130, in build_and_save_shard
    engine = build(build_config,
  File "/home/jobuser/.local/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 609, in build
    raise RuntimeError(
RuntimeError: Unsupported model architecture: FalconForCausalLM

Expected behavior

Engine compiles successfully

Actual behavior

trtllm-build returns

RuntimeError: Unsupported model architecture: FalconForCausalLM

Additional notes

I am following the simple, standard script from the repo. The weights are HF weights, and the build is simple too. I already have pods with all the dependencies installed, and I verified that tensorrt-llm can be imported and used in a Python REPL.
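For reference, that check was just along these lines (the version string is what the installed 0.7.1 wheel reports):

import tensorrt_llm

print(tensorrt_llm.__version__)  # 0.7.1 with the wheel installed here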

shekhars-li commented 7 months ago

Update: I see that the latest release, 0.7.1, does not support FalconForCausalLM in MODEL_MAP yet. I do not have the option to compile from source, since I can only push a precompiled Docker image and cannot run the compilation on the A100 cluster. Can you please cut a new release that includes the changes supporting the FalconForCausalLM architecture?
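For anyone hitting the same error, the missing registration can be confirmed against the installed package; a minimal check, assuming MODEL_MAP is importable from tensorrt_llm.models as in the source tree (the import path may vary between releases):

from tensorrt_llm.models import MODEL_MAP

# On the 0.7.1 wheel this prints False, which is why trtllm-build
# rejects the converted Falcon checkpoint.
print("FalconForCausalLM" in MODEL_MAP)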

shekhars-li commented 7 months ago

As a final attempt, I tried to install the unreleased version myself:

pip install tensorrt-llm==0.9.0.dev2024020600 --extra-index-url https://pypi.nvidia.com

That install also fails:

INFO: pip is looking at multiple versions of tensorrt-llm to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install tensorrt-llm because these package versions have conflicting dependencies.

The conflict is caused by:
    nvidia-ammo 0.7.3 depends on torchprofile>=0.0.4
    nvidia-ammo 0.7.2 depends on torchprofile>=0.0.4
    nvidia-ammo 0.7.1 depends on onnxruntime>=1.16.1
    nvidia-ammo 0.7.0 depends on onnxruntime>=1.16.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict