Snowflake-Labs / snowflake-arctic

Apache License 2.0

Optimal Multi-node Inference Parallel Settings #15

Open iteratorlee opened 4 months ago

iteratorlee commented 4 months ago

Great work! I noticed in your blog that multi-node inference is implemented via TP (tensor parallelism) and PP (pipeline parallelism):

> While challenging, this can be achieved with two-node inference using a combination of system optimizations such as FP8 weights, split-fuse and continuous batching, tensor parallelism within a node and pipeline parallelism across nodes.
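For concreteness, the layout described in the quote (TP inside a node, PP across nodes) amounts to a simple rank-to-coordinate mapping. The sketch below is purely illustrative: the node count and GPUs-per-node are assumptions on my part, not Arctic's actual deployment configuration.

```python
# Illustrative sketch of the quoted layout: tensor parallelism (TP) within
# each node, pipeline parallelism (PP) across nodes.
# Assumed sizes (2 nodes x 8 GPUs) are hypothetical, not Arctic's real config.

GPUS_PER_NODE = 8        # assumed node size
NUM_NODES = 2            # "two-node inference" from the quote

TP_SIZE = GPUS_PER_NODE  # TP stays inside the fast intra-node (NVLink) domain
PP_SIZE = NUM_NODES      # PP spans the slower inter-node links

def rank_to_coords(rank: int) -> tuple[int, int]:
    """Map a global rank to (pp_stage, tp_rank)."""
    pp_stage = rank // TP_SIZE  # which node / pipeline stage
    tp_rank = rank % TP_SIZE    # position within that node's TP group
    return pp_stage, tp_rank

# Ranks 0-7 form the TP group of pipeline stage 0 (node 0);
# ranks 8-15 form the TP group of stage 1 (node 1).
layout = {r: rank_to_coords(r) for r in range(NUM_NODES * GPUS_PER_NODE)}
```

With this mapping, each all-reduce for TP stays on intra-node links, and only the activations at the pipeline-stage boundary cross the network.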

I was wondering whether you have tried DP + TP + EP as described in the DeepSpeed-MoE paper. And what is the best practice for scaling such a giant model to a multi-node environment to achieve the best inference efficiency?
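For reference, a DeepSpeed-MoE-style DP + TP + EP setup factors the world size differently for dense and expert parameters: dense weights use data parallel × tensor parallel, while experts are sharded over expert-parallel groups with their own replication degree. The sketch below just illustrates that arithmetic; all the degrees chosen are hypothetical examples, not a recommendation from the Arctic team.

```python
# Minimal sketch of a DeepSpeed-MoE-style DP + TP + EP factorization.
# All parallel degrees below are hypothetical examples.

WORLD_SIZE = 16

# Dense (non-expert) parameters: data parallel x tensor parallel.
TP = 4                        # hypothetical tensor-parallel degree
DP = WORLD_SIZE // TP         # data-parallel replicas of the dense weights

# Expert parameters: expert parallel x expert-data parallel.
EP = 8                        # hypothetical: experts sharded 8 ways
EXPERT_DP = WORLD_SIZE // EP  # replicas of each expert shard

# Both factorizations must cover the full world size.
assert DP * TP == WORLD_SIZE
assert EP * EXPERT_DP == WORLD_SIZE

def expert_group(rank: int) -> int:
    """Which expert-parallel group a rank belongs to (contiguous blocks)."""
    return rank // EP
```

The trade-off this exposes is the one the question is really about: EP replaces TP's per-layer all-reduces with all-to-all token routing, which can behave very differently across slow inter-node links than the TP + PP split the blog describes.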