Great work!
I noticed in your blog that multi-node inference is implemented via TP and PP.
While challenging, this can be achieved with two-node inference using a combination of system optimizations: FP8 weights, split-fuse, continuous batching, tensor parallelism within a node, and pipeline parallelism across nodes.
I was wondering: have you tried DP + TP + EP as described in the DeepSpeed-MoE paper? And what's the best practice for scaling such a giant model in a multi-node environment to achieve the best inference efficiency?
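To make the question concrete, here's a minimal sketch of the rank layout I have in mind for DP + TP + EP. This is my own illustration, not DeepSpeed's actual API: the 16-GPU world size, degrees, and function names are all hypothetical, and it follows the DeepSpeed-MoE idea of reusing the data-parallel dimension for expert parallelism.

```python
# Hypothetical sketch: partitioning 16 GPUs into DP x TP x EP groups
# for MoE inference. Numbers and names are illustrative assumptions.
WORLD_SIZE = 16
TP = 4                  # tensor-parallel degree (ranks within a node)
DP = WORLD_SIZE // TP   # data-parallel replicas of the dense layers
EP = DP                 # expert parallelism reuses the DP dimension

def tp_group(rank):
    # ranks that shard the same layer's weights (contiguous, intra-node)
    base = (rank // TP) * TP
    return list(range(base, base + TP))

def dp_group(rank):
    # ranks holding the same dense-layer shard across replicas
    return list(range(rank % TP, WORLD_SIZE, TP))

def ep_group(rank):
    # each rank in a DP group owns a different slice of the experts,
    # so expert all-to-all runs over the same ranks as the DP group
    return dp_group(rank)

print(tp_group(5))  # [4, 5, 6, 7]
print(dp_group(5))  # [1, 5, 9, 13]
```

Is a grouping along these lines what you'd recommend, or does a different mapping work better in practice at this scale?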