NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why is there no Alltoall function in MoE implementation? #989

Open YJHMITWEB opened 8 months ago

YJHMITWEB commented 8 months ago

Hi, I am running and profiling the code of the Mixtral implementation; however, neither in the code nor in the profiling did I find any Alltoall operations.

I built the TRT engine using the following config:

python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin \
                --world_size 4 \
                --tp_size 4 \
                --output_dir ./trt_engines/mixtral/TP \
                --moe_tp_mode 1 \
                --max_output_len 2048

I tried both --moe_tp_mode 1 and --moe_tp_mode 2, but they seem to end up with the same tensor parallelism, with no expert parallelism enabled. Also, in the Nsight profiling there are only Allreduce and Allgather calls, which seems insufficient for expert parallelism.

djns99 commented 7 months ago

Hi @YJHMITWEB, thanks for reaching out. You are correct in your observation that TRT-LLM only uses an allreduce (see here).

The allreduce step is a convenience so that the data flow is the same as TP. Instead of broadcasting each token to all the other nodes and then doing the scale/bias steps (which would use all-to-all), we do the rescaling locally on each node and then allreduce the results, using zero tensors for uninitialised tokens.
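For intuition, here is a minimal NumPy sketch (not TRT-LLM code) of that pattern: each rank applies only the experts it owns, leaves zeros for tokens routed elsewhere, and a sum-allreduce (simulated below as a plain sum over per-rank buffers) reproduces the same result an all-to-all dispatch/combine would. The expert weights, top-1 routing, and sizes are hypothetical and chosen only for illustration.

# Minimal sketch of allreduce-based expert parallelism (illustrative, not TRT-LLM code).
import numpy as np

num_ranks, num_tokens, hidden = 4, 8, 16
experts_per_rank = 2
num_experts = num_ranks * experts_per_rank

rng = np.random.default_rng(0)
tokens = rng.standard_normal((num_tokens, hidden))
# Hypothetical top-1 routing: each token is assigned to exactly one expert.
routing = rng.integers(0, num_experts, size=num_tokens)
# Hypothetical expert weights: one matrix per expert, for simplicity.
expert_w = rng.standard_normal((num_experts, hidden, hidden))

partial_outputs = []
for rank in range(num_ranks):
    local_experts = range(rank * experts_per_rank, (rank + 1) * experts_per_rank)
    out = np.zeros_like(tokens)  # zero tensor for tokens this rank does not own
    for e in local_experts:
        mask = routing == e
        out[mask] = tokens[mask] @ expert_w[e]  # apply the local expert in place
    partial_outputs.append(out)

# The allreduce (simulated here as a sum over ranks) combines the partial results.
combined = np.sum(partial_outputs, axis=0)

# Reference: apply each token's expert directly, as an all-to-all dispatch/combine would.
reference = np.stack([tokens[i] @ expert_w[routing[i]] for i in range(num_tokens)])
assert np.allclose(combined, reference)

Because the unowned rows are zero, the sum-allreduce yields exactly the same output as routing every token to its expert's rank and gathering the results back, which is why the data flow can mirror plain TP.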

There may be cases where the all-to-all pattern is better and we will continue actively investigating this option.

In general, though, we recommend Tensor Parallelism because of the load-balancing issues that are inherent in Expert Parallelism.