[Open] GHGmc2 opened this issue 1 year ago
It is still quite necessary in my opinion. Our OPT-175B benchmarks were executed on Nvidia's Selene cluster, which has the NVLink and NVSwitch you mentioned, and pipeline parallelism was still essential for scaling there, arguably even table stakes.
I think the situation you described, pushed to its extreme, is basically a TPU cluster, where the inter-chip connection is not a bottleneck for scaling. In that case, tensor / intra-operator parallelism alone would mostly suffice, with less reliance on inter-operator parallelism.
@jiaodong Thanks for your inputs. I also noticed that inter-op-only parallelism currently contributes more than intra-op-only parallelism to the final performance on your A100 cluster.
But "the latest NVLink Switch system supports up to 256 GPUs (32 nodes) with direct connection" refers to the H100 SuperPOD, not the A100: https://www.servethehome.com/nvidia-nvlink4-nvswitch-at-hot-chips-34/
I agree with your opinion that "pushed to its extreme is basically TPU cluster that inter-chip connection is not a bottleneck of scaling". Is there any plan to handle that extreme case in the future?
@GHGmc2 thanks for the clarification -- this image is much clearer; previously I thought you meant 4th-gen NVLink. I think it will change the optimal hyperparameters for the (data / tensor / pipeline) parallelism degrees in this hardware setting, but I wouldn't frame it as "is the pillar still solid". Inter- and intra-op parallelism are mechanisms to partition a large model, so Alpa is able to freely support all degrees of parallelism that matter for scaling. When the cluster has a slower inter-host connection but a sufficient number of GPUs, we have the luxury to reduce or drop the data / tensor parallelism dimensions for the best scaling performance (those are the configs we used on Selene's A100 cluster). On an H100 SuperPOD we might find a different optimal strategy, but that's just a matter of a simple config change on top of the same model-partitioning mechanisms. We might be able to give that a try sometime in the future as our Nvidia collaboration continues, since no one else really has access to an H100 SuperPOD at that scale yet :)
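The trade-off described here — how the best (tensor, pipeline) split shifts with inter-host bandwidth — can be sketched with a toy cost model. This is not Alpa's actual search or API; every function name, volume, and bandwidth constant below is invented for illustration:

```python
def step_time(tp, pp, gpus_per_node, inter_bw, intra_bw,
              micro_batches=8, compute=1000.0,
              tp_volume=500.0, pp_volume=5.0):
    """Toy per-step time in arbitrary units; all constants are made up."""
    # Tensor-parallel groups are assumed packed inside a node when they fit.
    tp_bw = intra_bw if tp <= gpus_per_node else inter_bw
    bubble = 1 + (pp - 1) / micro_batches              # pipeline bubble factor
    tp_comm = tp_volume / tp_bw if tp > 1 else 0.0     # heavy per-layer all-reduces
    pp_comm = pp_volume / inter_bw if pp > 1 else 0.0  # light stage-boundary sends
    return compute / (tp * pp) * bubble + tp_comm + pp_comm

def best_config(num_gpus, **cluster):
    """Pick the (tp, pp) factorization with the lowest toy step time."""
    configs = [(t, num_gpus // t) for t in range(1, num_gpus + 1)
               if num_gpus % t == 0]
    return min(configs, key=lambda c: step_time(*c, **cluster))

# Slow inter-node links: tensor parallelism stays inside a node and
# pipeline parallelism spans nodes.
print(best_config(16, gpus_per_node=8, inter_bw=25, intra_bw=600))    # (8, 2)
# One big NVLink-Switch-style domain: full tensor parallelism wins.
print(best_config(16, gpus_per_node=16, inter_bw=600, intra_bw=600))  # (16, 1)
```

Under this crude model, only the cluster description changes between the two calls, which is the point made above: the partitioning mechanisms stay the same and the optimum moves with a config change.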
@jiaodong Thanks for your reply.
I suppose we might get a better parallelism strategy from a global search space than from two-level sub search spaces, but I agree that the two-level approach simplifies the problem a lot on existing GPU clusters.
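A back-of-the-envelope count shows why the decomposition simplifies things so much. The numbers below are illustrative only, not Alpa's actual complexity analysis; "sharding choices per layer" is a made-up stand-in for the intra-op options:

```python
from math import comb

L, K, S = 48, 4, 8   # layers, sharding choices per layer, max stages (illustrative)

# Global/joint space: pick the stage boundaries AND one sharding per layer
# at the same time; the coupled space grows exponentially in L.
joint = sum(comb(L - 1, s - 1) for s in range(1, S + 1)) * K ** L

# Two-level decomposition: each contiguous layer range becomes an
# independent intra-op subproblem, and a dynamic program stitches ranges
# into stages; the number of subproblems is only quadratic in L.
subproblems = L * (L + 1) // 2

print(f"joint configurations : {joint:.3e}")    # astronomically large
print(f"two-level subproblems: {subproblems}")  # 1176
```

The price of the decomposition is exactly what you note: any strategy where the best stage split depends jointly on sharding choices across stage boundaries falls outside the two-level space.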
Again, could you open a Discussions section? Currently we can only file a PR or an issue when we have questions. Thanks!
@GHGmc2 Feel free to design a new algorithm that can search over the global space (for the new H100 cluster)!
I wish I could someday...
I believe we do need a library for automatic parallelization of LLMs, and it would be great if Alpa could be the chosen one ^_^
Thanks for your great work anyway.
Alpa's two-level hierarchical parallelism space is based on the observation that inter-node bandwidth (e.g. InfiniBand) is much lower than intra-node bandwidth (e.g. NVLink).
But the latest NVLink Switch system supports up to 256 GPUs (32 nodes) with direct connection: https://www.nvidia.com/en-sg/data-center/nvlink/. Is that pillar still solid, given that NVIDIA may support more and more directly connected GPUs in the future?
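To put rough numbers on that bandwidth gap (ballpark figures, not measurements), here is a minimal sketch using the standard ring all-reduce lower bound, where each GPU moves about 2·(n−1)/n of the buffer across its slowest link:

```python
def allreduce_seconds(size_gb, n_gpus, bw_gbps):
    """Ring all-reduce lower bound: each GPU sends and receives
    roughly 2*(n-1)/n of the buffer at the link bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bw_gbps

grads_gb = 2.0  # illustrative gradient-buffer size
# Ballpark bandwidths: ~600 GB/s NVLink (A100), ~25 GB/s InfiniBand HDR.
print(f"over NVLink:     {allreduce_seconds(grads_gb, 8, 600) * 1e3:.2f} ms")
print(f"over InfiniBand: {allreduce_seconds(grads_gb, 8, 25) * 1e3:.2f} ms")
```

With these assumed numbers the same collective is more than 20x slower over the inter-node fabric, which is exactly what makes the two-level split pay off on today's clusters.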
BTW, could you open a Discussions section? Currently we can only file a PR or an issue when we have questions. Thanks!