sherlcok314159 opened 1 week ago
I am getting the same error trying to build Mistral for ChatRTX on Linux using:

```shell
python build.py --model_dir './model/mistral/mistral7b_hf' \
    --quant_ckpt_path './model/mistral/mistral7b_int4_quant_weights/mistral_tp1_rank0.npz' \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir './model/mistral/mistral7b_int4_engine' \
    --world_size 1 \
    --tp_size 1 \
    --parallel_build \
    --max_input_len 7168 \
    --max_batch_size 1 \
    --max_output_len 1024
```
According to this.
I cannot reproduce this issue locally. Could you try the latest main branch, and follow the install doc to make sure TensorRT-LLM is installed correctly?
Did you use a local PC or a remote server without a screen? Is there any command to check whether TensorRT-LLM is correctly installed?
> Did you use the local PC or the remote server without screen? Is there any command to check whether the TRT-LLM is correctly installed.
Remote server.
To check the installation:

```shell
python3 -c "import tensorrt_llm"
```
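Beyond the bare import, a slightly more defensive check can use `importlib` so a missing package is reported cleanly instead of raising a traceback. This is a generic sketch of my own, not a TensorRT-LLM utility:

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True when the module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Example: verify tensorrt_llm is importable before running build.py.
if is_installed("tensorrt_llm"):
    print("tensorrt_llm found")
else:
    print("tensorrt_llm missing -- follow the install doc first")
```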
System Info
Who can help?
@byshiue @ncomly-nvidia
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I run the following build script in a terminal on Ubuntu 20.04 (connected via SSH; the machine has a virtual screen provided by Xorg).
And the log:
I searched the web extensively; the problem appears to be caused by MPI. But why would converting a checkpoint require a screen?
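Checkpoint conversion itself should not need an X display, so the screen-related failure is more likely an environment issue. As a quick sanity check (my own generic sketch, not part of TensorRT-LLM), you can confirm whether the SSH session is actually headless, i.e. whether the Xorg virtual screen is exported to the session:

```python
import os

def is_headless() -> bool:
    """A session is considered headless when no X11 DISPLAY is exported."""
    return not os.environ.get("DISPLAY")

if is_headless():
    print("headless session: no DISPLAY variable set")
else:
    print("DISPLAY =", os.environ["DISPLAY"])
```

If `DISPLAY` is unset even though Xorg is running, the virtual screen is not visible to the SSH session, which can explain discrepancies between local and remote runs.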
Expected behavior
The build completes successfully.
Actual behavior
See the log above.
Additional notes
None