aliyun / aicb


AICB Simulations on limited GPU cluster #3

Closed. YukariSonz closed this issue 2 months ago.

YukariSonz commented 3 months ago

Hi,

Many thanks for this fantastic work! I have a small question regarding the use cases of AICB.

  1. I understand that this work is mainly for researchers to study the communication patterns of LLM training systems. I noticed that building the computation-graph dependencies requires a physical cluster, and generating the communication pattern runs real collective communications via PyTorch + NCCL on real GPUs. But what happens if I have limited compute resources (say, 2 A100 GPUs) and want to simulate the communication pattern of training GPT-3 (175B) on a 100/100/10000-GPU system?
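
To make the question concrete, here is the kind of estimate I am after, which can be computed analytically without any GPUs. This is a rough sketch, not AICB code; the fp16 gradients, the ring all-reduce algorithm, and the data-parallel group size are my own assumptions:

```python
# Illustrative back-of-envelope estimate (not part of AICB): per-rank traffic
# of one gradient all-reduce for a GPT-3-scale model under data parallelism.

def ring_allreduce_bytes_per_rank(payload_bytes: float, ranks: int) -> float:
    """A ring all-reduce moves 2*(n-1)/n of the payload through each rank."""
    return 2 * (ranks - 1) / ranks * payload_bytes

params = 175e9            # GPT-3 175B parameters
bytes_per_grad = 2        # assuming fp16 gradients
dp_ranks = 1024           # hypothetical data-parallel group size

traffic = ring_allreduce_bytes_per_rank(params * bytes_per_grad, dp_ranks)
print(f"~{traffic / 1e9:.1f} GB per rank per iteration")  # ~699.3 GB
```

Analytical formulas like this only give aggregate volumes, though; what I would like to reproduce is the actual pattern (message sizes, ordering, overlap) without a large cluster.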
Huoyuan100861 commented 3 months ago

Thank you for your attention to and support of our work. You've raised an excellent question. We have multiple solutions to address this situation (lack of GPUs):

  1. Full simulation: We have proposed SimAI (which can parse workloads produced by AICB), a highly accurate large-model simulator (this work has been accepted at NSDI'25). It requires as little as a single CPU server to accurately replicate the LLM training process of any cluster (e.g., GPT-3 175B on 10,000 GPUs). The entire SimAI system will also be open-sourced in the future.
  2. Generating GPU-cluster-like traffic on CPU clusters: We will first open-source MockNCCL, a core component of SimAI. It can parse AICB workloads, remove the GPU dependencies from the collective-communication parts, and convert them into point-to-point flow tables. This enables generating RDMA traffic similar to real NCCL traffic on CPU clusters, as sketched below.
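
To give an intuition for the conversion step, here is a simplified sketch (not MockNCCL's actual implementation) of how one collective can be expanded into a point-to-point flow table. The flow-table schema and the choice of a ring algorithm are illustrative assumptions:

```python
# Simplified illustration (not MockNCCL's actual code) of turning one
# collective into a point-to-point flow table: a ring all-reduce over n
# ranks becomes 2*(n-1) steps of neighbor-to-neighbor chunk transfers.

def ring_allreduce_flows(ranks: int, payload_bytes: int) -> list[dict]:
    chunk = payload_bytes / ranks         # payload split into `ranks` chunks
    flows = []
    for step in range(2 * (ranks - 1)):   # reduce-scatter + all-gather phases
        for src in range(ranks):
            flows.append({
                "step": step,
                "src": src,
                "dst": (src + 1) % ranks,  # next neighbor on the ring
                "bytes": chunk,
            })
    return flows

# Example: 4 ranks all-reducing 1 MiB -> 24 point-to-point flows
table = ring_allreduce_flows(4, 1 << 20)
print(len(table), table[0])
```

Once a collective is flattened into such a table, CPU hosts can replay the entries as plain RDMA sends with the same sizes and ordering, which is what makes GPU-free traffic generation possible.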