NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Loaded model not correctly sent to the process (GPTNeoX build.py) #795

Open · ydm-amazon opened this issue 9 months ago

ydm-amazon commented 9 months ago

I am getting an error when using TensorRT-LLM/examples/gptneox/build.py to build the TensorRT engine:

line 314, in build_rank_engine
    assert hf_gpt is not None, f'Could not load weights from hf_gpt model as it is not loaded yet.'
AssertionError: Could not load weights from hf_gpt model as it is not loaded yet.

hf_gpt appears to be loaded correctly in parse_arguments. However, inside the worker process spawned by the build function, hf_gpt is None, so the loaded model is not making it into the worker.
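
This behavior is consistent with how Python multiprocessing isolates worker processes. Below is a minimal, self-contained sketch (not TensorRT-LLM code; all names are stand-ins chosen to mirror build.py) showing how an object loaded only in the parent process shows up as None in a spawned worker:

    # repro.py -- illustrative only; mirrors the suspected failure mode
    import multiprocessing as mp

    hf_gpt = None  # module-level handle, as if assigned later by parse_arguments

    def build_rank_engine(rank):
        # Under the 'spawn' start method the worker re-imports this module in
        # a fresh interpreter, so it sees the module-level None rather than
        # the object the parent assigned under the __main__ guard.
        assert hf_gpt is not None, 'Could not load weights from hf_gpt model as it is not loaded yet.'

    if __name__ == '__main__':
        hf_gpt = object()  # "loaded" only in the parent process
        ctx = mp.get_context('spawn')
        worker = ctx.Process(target=build_rank_engine, args=(0,))
        worker.start()
        worker.join()  # the worker exits with the AssertionError above

If build.py spawns its workers (or drops the model before pickling the arguments it sends them), the worker would see None exactly as in the traceback.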

I am using TensorRT-LLM 0.7.0, and this is the command I am using to build:

python3 build.py \
    --log_level verbose \
    --world_size 4 \
    --model_dir /tmp/input_model_dir/ \
    --dtype float16 \
    --max_input_len 1024 \
    --max_output_len 512 \
    --max_batch_size 32 \
    --max_beam_width 1 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --enable_context_fmha \
    --remove_input_padding \
    --output_dir /tmp/output_model_dir/ \
    --parallel_build
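
For anyone hitting the same assertion: a possible workaround, given as an untested sketch, is to reload the checkpoint inside build_rank_engine whenever the model arrives as None. This assumes the worker still receives args.model_dir and that the checkpoint can be loaded with transformers' AutoModelForCausalLM (the exact loader used in the gptneox example may differ):

    # Hypothetical guard near the top of build_rank_engine in
    # examples/gptneox/build.py -- reload the model if the parent's copy
    # did not survive the trip into the worker process.
    if hf_gpt is None and args.model_dir is not None:
        from transformers import AutoModelForCausalLM
        hf_gpt = AutoModelForCausalLM.from_pretrained(args.model_dir,
                                                      torch_dtype='auto')

Alternatively, dropping --parallel_build keeps the whole build in a single process, which should sidestep the path where the model is lost.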
ydm-amazon commented 8 months ago

It seems that the phi example's build.py (on the main branch) hits this error too.