NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Invalid MIT-MAGIC-COOKIE-1 key #2247

Open sherlcok314159 opened 1 week ago

sherlcok314159 commented 1 week ago

System Info

Who can help?

@byshiue @ncomly-nvidia

Information

Tasks

Reproduction

I ran the following conversion script in a terminal on Ubuntu 20.04 (connected via SSH; the machine has a virtual screen provided by Xorg).

python xxx/TensorRT-LLM/examples/enc_dec/convert_checkpoint.py --model_type bart \
    --model_dir xxx/hub/models--facebook--nougat-small \
    --output_dir nougat-small-trt/bfloat16 \
    --tp_size 1 \
    --pp_size 1 \
    --dtype bfloat16 \
    --nougat

The log output:

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
Invalid MIT-MAGIC-COOKIE-1 key

I did a lot of searching on the web, and it looks like this problem is caused by MPI. But why does converting a checkpoint need a screen?
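For context: "Invalid MIT-MAGIC-COOKIE-1 key" is an X11 authentication warning, not a TensorRT-LLM error. During MPI initialization, topology discovery (commonly via hwloc) may try to connect to the X server named in $DISPLAY, and Xorg prints this warning when the client's xauth cookie does not match. A hedged workaround, assuming the checkpoint conversion itself needs no display, is to hide $DISPLAY for the run:

# Workaround sketch (assumption: the conversion does not need X11).
# With DISPLAY unset, MPI/hwloc never attempts X11 authentication.
unset DISPLAY
python xxx/TensorRT-LLM/examples/enc_dec/convert_checkpoint.py ...  # same arguments as above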

Expected behavior

The conversion completes without errors.

Actual behavior

See the log above.

Additional notes

None.

hweiske commented 1 week ago

I am getting the same error trying to build Mistral for ChatRTX on Linux using:

python build.py --model_dir './model/mistral/mistral7b_hf' \
    --quant_ckpt_path './model/mistral/mistral7b_int4_quant_weights/mistral_tp1_rank0.npz' \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir './model/mistral/mistral7b_int4_engine' \
    --world_size 1 \
    --tp_size 1 \
    --parallel_build \
    --max_input_len 7168 \
    --max_batch_size 1 \
    --max_output_len 1024

According to this.

lfr-0531 commented 5 days ago

I cannot reproduce this issue locally. Can you try the latest main branch? Also, follow the installation doc to make sure TensorRT-LLM is installed correctly.
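For reference, the Linux installation doc describes a pip-based install roughly like the following; treat this as a sketch and check the doc for the exact current command and prerequisites:

# Sketch per the installation doc (verify the exact command there);
# TensorRT-LLM wheels are hosted on NVIDIA's PyPI index.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com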

sherlcok314159 commented 5 days ago

Did you use a local PC or a remote server without a screen? Is there a command to check whether TensorRT-LLM is correctly installed?

lfr-0531 commented 5 days ago

> Did you use a local PC or a remote server without a screen? Is there a command to check whether TensorRT-LLM is correctly installed?

Remote server.

To check the installation:

python3 -c "import tensorrt_llm"
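
A slightly stronger check, assuming a standard install, also prints the version so it can be matched against the branch you installed from:

# Sketch: import the package and report the installed version.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"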