aliyun / aicb


AICB Simulations on limited GPU cluster #3

Open YukariSonz opened 1 month ago

YukariSonz commented 1 month ago

Hi,

Many thanks for this fantastic work! I have a small question regarding the use cases of AICB.

  1. I understand that this work is mainly intended to help researchers study the communication patterns of LLM training systems. I noticed that building the computation-graph dependencies requires a physical cluster, and that generating the communication pattern runs the actual collective communications on PyTorch + NCCL + real GPUs -- but what happens if I have limited compute resources (say, 2 A100 GPUs) and want to simulate the communication pattern of training GPT-3 (175B) on a 100/100/10000-GPU system?
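For example, what I'd like to obtain without GPUs is a rough picture like the sketch below (purely illustrative numbers and assumed parallelism degrees; this is not an AICB API):

```python
# Back-of-the-envelope estimate of per-step gradient all-reduce traffic for
# GPT-3 (175B). Illustrative only: the parallelism degrees and gradient dtype
# are assumptions, and none of these names come from AICB.

PARAMS = 175e9                 # GPT-3 parameter count
BYTES_PER_GRAD = 2             # assume fp16/bf16 gradients
WORLD_SIZE = 10_000            # target cluster size
TP, PP = 8, 8                  # hypothetical tensor-/pipeline-parallel degrees
DP = WORLD_SIZE // (TP * PP)   # resulting data-parallel degree

# Each rank holds 1/(TP*PP) of the gradients; a ring all-reduce moves
# 2 * (DP - 1) / DP of that shard per rank per training step.
shard_bytes = PARAMS * BYTES_PER_GRAD / (TP * PP)
per_rank_bytes = 2 * (DP - 1) / DP * shard_bytes
print(f"DP={DP}, per-rank all-reduce traffic/step ~ {per_rank_bytes / 1e9:.1f} GB")
```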
Huoyuan100861 commented 1 month ago

Thank you for your attention to and support of our work. You've raised an excellent question. We have multiple solutions to address this situation (a lack of GPUs):

  1. Full simulation: We have proposed SimAI, a highly accurate large-model simulator that can parse workloads produced by AICB (this work has been accepted to NSDI '25). It requires as little as one CPU server to accurately replicate the training process of an LLM on any cluster (e.g., GPT-3 175B on 10,000 GPUs). The entire SimAI system will also be open-sourced in the future.
  2. Generating GPU-cluster-like traffic on CPU clusters: We will first open-source MockNCCL, a core component of SimAI. It can parse AICB workloads, remove the GPU dependencies from the collective-communication parts, and convert the collectives into point-to-point flow tables. This makes it possible to generate RDMA traffic on CPU clusters that closely resembles real NCCL traffic (see the sketch after this list).
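For intuition about point 2, here is a minimal sketch of how a single collective (a ring all-reduce) can be decomposed into a point-to-point flow table with no GPUs involved. This only illustrates the idea; MockNCCL is not yet released, and all names below are hypothetical:

```python
# Illustrative sketch of turning one collective into point-to-point flows,
# the kind of transformation described above. Hypothetical code, not MockNCCL.

def ring_allreduce_flows(num_ranks: int, message_bytes: int):
    """Return (step, src, dst, bytes) flows for one ring all-reduce.

    A ring all-reduce takes 2 * (num_ranks - 1) steps: a reduce-scatter
    phase followed by an all-gather phase. In each step every rank sends
    one chunk (message_bytes / num_ranks) to its ring neighbor.
    """
    chunk = message_bytes // num_ranks
    flows = []
    for step in range(2 * (num_ranks - 1)):
        for src in range(num_ranks):
            flows.append((step, src, (src + 1) % num_ranks, chunk))
    return flows


if __name__ == "__main__":
    # e.g. an 8-rank all-reduce of a 1 GiB gradient bucket
    flows = ring_allreduce_flows(num_ranks=8, message_bytes=1 << 30)
    per_rank = sum(b for _, src, _, b in flows if src == 0)
    print(f"{len(flows)} flows, {per_rank / (1 << 30):.2f} GiB sent per rank")
```

Each rank ends up sending 2 * (N - 1) / N times the message size, matching the well-known ring all-reduce cost; a CPU-side generator can then replay such a flow table as RDMA traffic.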