ydm-amazon opened 3 months ago
Maybe you have to add --tp_size 4 to your convert_checkpoint.py command?
What command causes the error to occur? If you're running the model, are you using mpirun -n 4 ... or otherwise specifying that the world size should be 4?
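For what it's worth, a quick way to see what world size your launch actually has is to query MPI directly. This is only a minimal sketch using mpi4py (which the TensorRT-LLM Python package already depends on); the file name is hypothetical:

# check_world_size.py (hypothetical name) -- minimal sketch, assumes mpi4py is installed
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Launched as `mpirun -n 4 python3 check_world_size.py`, this should print size=4
# on each rank; the executor expects that size to equal tp_size * pp_size.
print(f"rank={comm.Get_rank()} size={comm.Get_size()}")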
This is the Gemma model, whose convert_checkpoint.py has no --tp_size option (it takes --world-size instead). I am also specifying the correct world size for mpirun.
I am also able to reproduce the issue
Conversion script:
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/gemma/convert_checkpoint.py --dtype float16 --world-size 4 --model-dir google/gemma-7b --output-model-dir /tmp/trtllm_gemma_ckpt/ --ckpt-type hf
After model conversion, I can clearly see 4 ranks of safetensors saved:
# ls /tmp/trtllm_gemma_ckpt/
config.json rank0.safetensors rank1.safetensors rank2.safetensors rank3.safetensors
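As an extra sanity check, one can verify that the converted checkpoint actually records the intended parallelism. A minimal sketch, assuming the usual TensorRT-LLM checkpoint layout where config.json carries a mapping section:

# Sketch only: field names assume the usual TensorRT-LLM checkpoint format.
import json

with open("/tmp/trtllm_gemma_ckpt/config.json") as f:
    cfg = json.load(f)

# For a --world-size 4 conversion one would expect something like
# {"world_size": 4, "tp_size": 4, "pp_size": 1} here.
print(cfg.get("mapping"))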
During the engine build phase:
# trtllm-build --tp_size 4 --checkpoint_dir /tmp/trtllm_gemma_ckpt/ --log_level info --gemm_plugin float16 --output_dir /tmp/.djl.ai/trtllm/4647f76cc28bee0fdd3f41d68a8620656f876497/google-gemma-7b/1 --workers 1 --gpt_attention_plugin float16 --paged_kv_cache enable --context_fmha enable --max_beam_width 1 --remove_input_padding enable --use_custom_all_reduce disable --use_paged_context_fmha enable --use_fp8_context_fmha disable --max_batch_size 16 --max_input_len 1024 --max_seq_len 1024 --use_fused_mlp
It only generates a single-rank engine file:
/tmp/.djl.ai/trtllm/4647f76cc28bee0fdd3f41d68a8620656f876497/google-gemma-7b/1/
config.json rank0.engine
@ydm-amazon @lanking520 @LanceB57 Gemma TP has a problem, but we have fixed it internally and the fix will ship in next Tuesday's weekly update. The issue was that some functions were reading TP information from Mapping, which was not set correctly.
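For anyone still on an affected version, the gist is that the parallelism has to end up in tensorrt_llm.Mapping. A correctly populated mapping for this setup would look roughly like the sketch below (illustrative only, not the internal patch):

# Illustrative sketch, not the internal fix: what a correctly populated
# Mapping should look like for TP=4, PP=1 on rank 0.
from tensorrt_llm import Mapping

mapping = Mapping(world_size=4, rank=0, tp_size=4, pp_size=1)
print(mapping.tp_size, mapping.pp_size, mapping.world_size)  # 4 1 4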
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
AWS g5.12xlarge (4 x NVIDIA A10G GPUs)
CPU: x86_64
TensorRT-LLM v0.11.0
Who can help?
@byshiue
Reproduction
As noted below, I am using TP 4 and PP 1 (hence, world size of 4). The model that fails is Gemma-7B.
Expected behavior
MPI should not error out, since world size is consistent with TP and PP.
Actual behavior
The software fails with the following error:
Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/executor/executorImpl.cpp:435)
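The assertion is essentially a consistency check between the MPI launch and the engine's parallelism; roughly the following (a restatement, not the executor source):

# Rough restatement of the failing check, not the executor source:
# with communicationMode kLEADER and no explicit participantIds,
# the MPI world size must equal tp_size * pp_size.
def check_world_size(mpi_world_size: int, tp_size: int, pp_size: int) -> None:
    if mpi_world_size != tp_size * pp_size:
        raise AssertionError(
            f"MPI worldSize {mpi_world_size} != tp*pp {tp_size * pp_size}"
        )

check_world_size(4, 4, 1)  # the setup described in this issue should pass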
Additional notes
A similar issue is mentioned in https://github.com/NVIDIA/TensorRT-LLM/issues/2021, but that one is marked as 'not a bug', so I created a new issue. The answer in that issue does not apply to this one.