NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

šŸš€[FEA]: Print communicator layout when using verbose=True in DistributedManager.create_groups_from_config #259

Open azrael417 opened 9 months ago

azrael417 commented 9 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of the problem you would like to solve.

I am looking into the distributed manager. When creating a tree of communicators, it is possible to use a verbose flag to print the layout. The output looks something like this:

Node ID: world
  Children: [Node(tag=model, identifier=model, data=ProcessGroupNode(name=model, size=8, ), Node(tag=data, identifier=data, data=ProcessGroupNode(name=data, size=4, )]
Node ID: model
  Children: [Node(tag=spatial, identifier=spatial, data=ProcessGroupNode(name=spatial, size=4, ), Node(tag=matmul, identifier=matmul, data=ProcessGroupNode(name=matmul, size=2, )]
Node ID: data
  Children: []
Node ID: spatial
  Children: [Node(tag=h, identifier=h, data=ProcessGroupNode(name=h, size=4, ), Node(tag=w, identifier=w, data=ProcessGroupNode(name=w, size=1, )]
Node ID: matmul
  Children: [Node(tag=fin, identifier=fin, data=ProcessGroupNode(name=fin, size=2, ), Node(tag=fout, identifier=fout, data=ProcessGroupNode(name=fout, size=1, )]
Node ID: h
  Children: []
Node ID: w
  Children: []
Node ID: fin
  Children: []
Node ID: fout
  Children: []
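As a minimal sketch of how such a layout could be rendered (the `ProcessGroupNode` class and the breadth-first printer below are illustrative stand-ins, not the actual Modulus implementation), one can model the hierarchy as a tree of named group sizes and emit the `Children:` line only when a node actually has children:

```python
# Illustrative sketch only: ProcessGroupNode and print_layout are hypothetical
# helpers mirroring the output above, not the Modulus API.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProcessGroupNode:
    name: str
    size: int
    children: list = field(default_factory=list)


def print_layout(root):
    """Breadth-first dump of the communicator tree, one node per entry."""
    lines, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        lines.append(f"Node ID: {node.name}")
        if node.children:  # skip the empty "Children: []" line for leaf nodes
            kids = ", ".join(f"{c.name}(size={c.size})" for c in node.children)
            lines.append(f"  Children: [{kids}]")
            queue.extend(node.children)
    return lines


world = ProcessGroupNode("world", 8, [
    ProcessGroupNode("model", 8, [
        ProcessGroupNode("spatial", 4,
                         [ProcessGroupNode("h", 4), ProcessGroupNode("w", 1)]),
        ProcessGroupNode("matmul", 2,
                         [ProcessGroupNode("fin", 2), ProcessGroupNode("fout", 1)]),
    ]),
    ProcessGroupNode("data", 4),
])

print("\n".join(print_layout(world)))
```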

Describe any alternatives you have considered

I think it would be good not to print an empty children list for leaf nodes. This is more or less cosmetic. However, what would be very useful is to print the list of world ranks associated with every node: for example, with 8 ranks split into 2 model-parallel groups and 4 data-parallel groups, it would be good to see something like this:

world: [0, 1, 2, 3, 4, 5, 6, 7]
model: [[0, 1, 2, 3], [4, 5, 6, 7]]
data:  [[0, 4], [1, 5], [2, 6], [3, 7]]

This is very instructive for understanding how ranks are placed and helps with debugging. Especially when you are using, say, alltoall in one comm direction and allreduce in the other, you want to make sure that the alltoall ranks are placed closer together than the allreduce ranks. Printing this topology helps to understand the comm layout better.
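The requested printout can be sketched in a few lines of plain Python. The grouping convention below (model groups take consecutive ranks, data groups take one rank from each model group at the same offset) is an assumption for illustration and not necessarily how Modulus assigns ranks; `rank_groups` is a hypothetical helper:

```python
# Hypothetical sketch of the requested rank-layout printout. The convention
# used here (consecutive ranks per model group, strided ranks per data group)
# is an assumption, not necessarily what DistributedManager actually does.
def rank_groups(world_size: int, model_size: int):
    assert world_size % model_size == 0
    ranks = list(range(world_size))
    # model groups: blocks of consecutive ranks
    model = [ranks[i:i + model_size] for i in range(0, world_size, model_size)]
    # data groups: one rank from each model group, at the same offset
    data = [ranks[i::model_size] for i in range(model_size)]
    return ranks, model, data


world, model, data = rank_groups(8, 4)
print("world:", world)   # [0, 1, 2, 3, 4, 5, 6, 7]
print("model:", model)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print("data: ", data)    # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With this convention the strided data groups span model-group boundaries, so the consecutive model-group ranks are the ones that end up physically closest together, which is exactly the placement property one wants to verify at a glance.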

akshaysubr commented 9 months ago

This is a good point! I actually use this myself while debugging and can expose it through the verbose=True option.