alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

[Discussion] On the pillar of Alpa's two-level hierarchical space of parallelism #906

Open GHGmc2 opened 1 year ago

GHGmc2 commented 1 year ago

Alpa's two-level hierarchical space of parallelism is based on the observation that inter-node bandwidth (e.g., InfiniBand) is much lower than intra-node bandwidth (e.g., NVLink).
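
For readers skimming the thread, a minimal sketch of that observation and of how the two levels map onto it. This is illustrative only, not Alpa's API, and the bandwidth figures are rough ballpark numbers rather than measurements.

```python
# Illustrative sketch (not Alpa's API): view the cluster as a 2D device mesh
# whose two axes have very different bandwidths, and map the two parallelism
# levels onto those axes. Bandwidth numbers are rough ballpark figures.

cluster = {
    "num_nodes": 32,         # connected by InfiniBand-class links
    "gpus_per_node": 8,      # connected by NVLink within a node
    "inter_node_GBps": 25,   # ~200 Gb/s HDR InfiniBand per link (rough)
    "intra_node_GBps": 600,  # ~A100 NVLink aggregate per GPU (rough)
}

# Outer level (inter-operator / pipeline): cut the model into stages and place
# them along the slow axis, so cross-node traffic is only point-to-point
# activation transfers at stage boundaries.
pipeline_axis = ("nodes", cluster["num_nodes"], cluster["inter_node_GBps"])

# Inner level (intra-operator): shard each stage's operators along the fast
# axis, where the frequent all-reduce / all-gather collectives are cheap.
intra_op_axis = ("gpus_per_node", cluster["gpus_per_node"], cluster["intra_node_GBps"])

print("map pipeline stages over:", pipeline_axis)
print("shard operators over    :", intra_op_axis)
```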

But the latest NVLink Switch system supports up to 256 GPUs (32 nodes) with direct connections: https://www.nvidia.com/en-sg/data-center/nvlink/. Is this pillar still solid, given that NVIDIA may connect more and more GPUs directly in the future?

BTW, could we open a Discussions section? Currently we can only file a PR or an issue if we have questions. Thanks!

jiaodong commented 1 year ago

It is still quite necessary in my opinion. Our OPT-175B benchmarks were executed on NVIDIA's Selene cluster, which has the NVLink and NVSwitch you mentioned, and pipeline parallelism was still essential for scaling, or even borderline table stakes.

I think the situation you described, pushed to its extreme, is basically a TPU cluster, where the inter-chip connection is not a bottleneck for scaling. In that case, tensor / intra-operator parallelism will mostly suffice, relying less on the inter-operator pass.
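
As a back-of-envelope illustration of that trade-off (all sizes and bandwidths below are made-up ballpark numbers, not measurements): tensor / intra-op parallelism all-reduces activations on every layer, while pipeline / inter-op parallelism only sends activations point-to-point at stage boundaries, which is why a slow inter-node link hurts the former far more than the latter.

```python
# Back-of-envelope sketch (hypothetical sizes/bandwidths, not measured numbers):
# why slow inter-node links favor pipeline parallelism across nodes.

def allreduce_time_s(bytes_per_gpu: float, n: int, bw_gb_s: float) -> float:
    """Ring all-reduce moves ~2*(n-1)/n of the data per GPU over the link."""
    return (2 * (n - 1) / n) * bytes_per_gpu / (bw_gb_s * 1e9)

def p2p_time_s(bytes_total: float, bw_gb_s: float) -> float:
    """Pipeline stage boundary: a single point-to-point activation transfer."""
    return bytes_total / (bw_gb_s * 1e9)

activations = 2e9      # ~2 GB of activations per layer (made-up figure)
gpus_across_nodes = 16
infiniband = 25.0      # GB/s, rough
nvlink = 600.0         # GB/s, rough

# Tensor parallelism spanning nodes: an all-reduce per layer over InfiniBand.
print("TP across nodes :", allreduce_time_s(activations, gpus_across_nodes, infiniband), "s/layer")
# Tensor parallelism kept inside a node: the same all-reduce over NVLink.
print("TP within a node:", allreduce_time_s(activations, 8, nvlink), "s/layer")
# Pipeline across nodes: one activation send per stage boundary, easily overlapped.
print("PP across nodes :", p2p_time_s(activations, infiniband), "s/boundary")
```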

GHGmc2 commented 1 year ago

@jiaodong Thanks for your input. I also noticed that inter-op-only parallelism currently contributes more to the final performance than intra-op-only parallelism on your A100 cluster.

But "the latest NVLink Switch system supports up to 256 GPUs (32 nodes) with direct connection" refers to the H100 SuperPOD, not A100: https://www.servethehome.com/nvidia-nvlink4-nvswitch-at-hot-chips-34/

I agree with your point that "pushed to its extreme, it is basically a TPU cluster where the inter-chip connection is not a bottleneck of scaling". Any plans for that extreme case in the future?

jiaodong commented 1 year ago

@GHGmc2 thanks for the clarification -- this image is much clearer; previously I thought you meant 4th-gen NVLink. I think it will change the optimal hyperparameters of the (data / tensor / pipeline) parallelism degrees in this hardware setting, but I wouldn't frame it as "is the pillar still solid". Inter- and intra-op parallelism are mechanisms for partitioning a large model, so Alpa is able to freely support all the degrees of parallelism that matter for scaling. When the cluster has slower inter-host connections but a sufficient number of GPUs, we have the luxury of reducing or dropping the data / tensor parallelism dimensions for the best scaling performance (those are the configs we used on the Selene A100 cluster). On an H100 SuperPOD we might find a different optimal strategy, but that is just a simple config change on top of the same model-partitioning mechanisms. We might be able to give that a try sometime in the future as our NVIDIA collaboration continues, since no one else really has access to an H100 SuperPOD at that scale yet :)
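
To make the "same mechanisms, different config" point concrete, here is a toy sketch (not Alpa's actual search, which is a DP plus an ILP over a much richer space) that enumerates (data, tensor, pipeline) degrees for 256 GPUs and ranks them with a crude cost model; every size, bandwidth, and time in it is a made-up ballpark figure for a 175B-parameter model. With these toy numbers, the A100-like setting prefers keeping tensor parallelism inside a node and adding pipeline stages, while the H100-SuperPOD-like setting, where the NVLink domain spans all 256 GPUs, drifts toward mostly intra-op parallelism, echoing the "TPU-like extreme" above.

```python
# Toy illustration (NOT Alpa's search): enumerate (dp, tp, pp) degrees for a
# 256-GPU cluster and score them with a crude cost model. All sizes, bandwidths,
# and times are made-up ballpark numbers for a 175B-parameter model.

def factorizations(n_gpus):
    """All (dp, tp, pp) triples with dp * tp * pp == n_gpus."""
    for dp in range(1, n_gpus + 1):
        if n_gpus % dp:
            continue
        rest = n_gpus // dp
        for tp in range(1, rest + 1):
            if rest % tp == 0:
                yield dp, tp, rest // tp

def crude_step_cost(dp, tp, pp, gpus_per_node, inter_bw, intra_bw,
                    layers=96, microbatches=64, act_gb=0.05,
                    grad_gb=350.0, model_state_gb=2800.0,
                    gpu_mem_gb=80.0, compute_s=10.0):
    # Memory feasibility: model states are replicated across dp, sharded by tp * pp.
    if model_state_gb / (tp * pp) > gpu_mem_gb:
        return float("inf")
    # Tensor parallelism: per-layer, per-microbatch all-reduces of activations,
    # over NVLink if the tp group fits inside a node, else over the slow link.
    tp_bw = intra_bw if tp <= gpus_per_node else inter_bw
    tp_cost = 0.0 if tp == 1 else layers * microbatches * act_gb / tp_bw
    # Pipeline parallelism: bubble fraction of the per-step compute time.
    pp_cost = compute_s * (pp - 1) / (microbatches + pp - 1)
    # Data parallelism: all-reduce of this rank's gradient shard across nodes.
    dp_cost = 0.0 if dp == 1 else (grad_gb / (tp * pp)) / inter_bw
    return tp_cost + pp_cost + dp_cost

def best(gpus_per_node, inter_bw, intra_bw, n_gpus=256):
    return min(factorizations(n_gpus),
               key=lambda c: crude_step_cost(*c, gpus_per_node, inter_bw, intra_bw))

# A100-style pod: NVLink inside 8-GPU nodes, slower InfiniBand across nodes.
print("A100-like (dp, tp, pp):", best(gpus_per_node=8, inter_bw=25, intra_bw=600))
# H100 SuperPOD-style: the NVLink Switch domain spans all 256 GPUs.
print("H100-like (dp, tp, pp):", best(gpus_per_node=256, inter_bw=450, intra_bw=450))
```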

GHGmc2 commented 1 year ago

@jiaodong Thanks for your reply.

I suppose we might get a better parallelism strategy from a global search space than from the two-level sub-search-spaces, but I agree that the two-level approach simplifies the problem a lot on existing GPU clusters.
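
For a rough sense of the scale difference (all numbers hypothetical), a naive enumeration of the joint space can be compared against the number of sub-problems the two-level decomposition has to solve, in the spirit of Alpa's inter-op DP over contiguous layer ranges and candidate submesh shapes plus an intra-op solve per pair:

```python
# Back-of-envelope only, with hypothetical sizes: naive joint enumeration vs.
# the sub-problem structure that the two-level decomposition exploits.

L = 48   # coarsened layer blocks in the model (hypothetical)
k = 4    # sharding choices per block within a submesh (hypothetical)
M = 10   # candidate submesh shapes (hypothetical)

# Naive joint enumeration: every way to cut the L blocks into contiguous stages
# (2^(L-1) choices of boundaries) times a sharding choice for every block.
naive_global = 2 ** (L - 1) * k ** L

# Two-level decomposition: one intra-op sub-problem per (contiguous layer range,
# submesh shape) pair, stitched together afterwards by a polynomial-time DP.
two_level_subproblems = (L * (L + 1) // 2) * M

print(f"naive joint enumeration: ~{naive_global:.2e} candidate plans")
print(f"two-level decomposition: {two_level_subproblems} intra-op sub-problems + a DP")
```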

Again, could we open a Discussions section? Currently we can only file a PR or an issue if we have questions. Thanks!

zhisbug commented 1 year ago

@GHGmc2 Feel free to design a new algorithm that can search over the global space (for the new H100 cluster)!

GHGmc2 commented 1 year ago

> @GHGmc2 Feel free to design a new algorithm that can search over the global space (for the new H100 cluster)!

I wish I could, someday...

I believe we do need a library for automatic parallelization of LLMs, and it would be great if Alpa could be the chosen one ^_^ Thanks for your great work anyway.