NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

šŸš€[FEA]: Print communicator layout when using verbose=True in DistributedManager.create_groups_from_config #259

Open azrael417 opened 9 months ago

azrael417 commented 9 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of the problem you would like to solve.

I am looking into the distributed manager. When creating a tree of communicators, it is possible to use a verbose flag to print the layout. The output looks something like this:

Node ID: world
  Children: [Node(tag=model, identifier=model, data=ProcessGroupNode(name=model, size=8, ), Node(tag=data, identifier=data, data=ProcessGroupNode(name=data, size=4, )]
Node ID: model
  Children: [Node(tag=spatial, identifier=spatial, data=ProcessGroupNode(name=spatial, size=4, ), Node(tag=matmul, identifier=matmul, data=ProcessGroupNode(name=matmul, size=2, )]
Node ID: data
  Children: []
Node ID: spatial
  Children: [Node(tag=h, identifier=h, data=ProcessGroupNode(name=h, size=4, ), Node(tag=w, identifier=w, data=ProcessGroupNode(name=w, size=1, )]
Node ID: matmul
  Children: [Node(tag=fin, identifier=fin, data=ProcessGroupNode(name=fin, size=2, ), Node(tag=fout, identifier=fout, data=ProcessGroupNode(name=fout, size=1, )]
Node ID: h
  Children: []
Node ID: w
  Children: []
Node ID: fin
  Children: []
Node ID: fout
  Children: []
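As a minimal sketch of how such a layout could be rendered (the `ProcessGroupNode` class and the breadth-first printer below are illustrative stand-ins, not the actual Modulus implementation), one can model the hierarchy as a tree of named group sizes and emit the `Children:` line only when a node actually has children:

```python
# Illustrative sketch only: ProcessGroupNode and print_layout are hypothetical
# helpers mirroring the output above, not the Modulus API.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProcessGroupNode:
    name: str
    size: int
    children: list = field(default_factory=list)


def print_layout(root):
    """Breadth-first dump of the communicator tree, one node per entry."""
    lines, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        lines.append(f"Node ID: {node.name}")
        if node.children:  # skip the empty "Children: []" line for leaf nodes
            kids = ", ".join(f"{c.name}(size={c.size})" for c in node.children)
            lines.append(f"  Children: [{kids}]")
            queue.extend(node.children)
    return lines


world = ProcessGroupNode("world", 8, [
    ProcessGroupNode("model", 8, [
        ProcessGroupNode("spatial", 4,
                         [ProcessGroupNode("h", 4), ProcessGroupNode("w", 1)]),
        ProcessGroupNode("matmul", 2,
                         [ProcessGroupNode("fin", 2), ProcessGroupNode("fout", 1)]),
    ]),
    ProcessGroupNode("data", 4),
])

print("\n".join(print_layout(world)))
```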

Describe any alternatives you have considered

I think it would be good not to print an empty children list for leaf nodes. This is more or less cosmetic. However, what would be very useful is to print the list of world ranks associated with every node: for example, with 8 ranks split into 2 model-parallel groups and 4 data-parallel groups, it would be good to see something like this:

world: [0, 1, 2, 3, 4, 5, 6, 7]
model: [[0, 1, 2, 3], [4, 5, 6, 7]]
data:  [[0, 4], [1, 5], [2, 6], [3, 7]]

This is very instructive for understanding how ranks are placed and helps with debugging. Especially when you are using, say, alltoall in one comm direction and allreduce in the other, you want to make sure that the alltoall ranks are placed closer together than the allreduce ranks. Printing this topology helps to understand the comm layout better.
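The requested printout can be sketched in a few lines of plain Python. The grouping convention below (model groups take consecutive ranks, data groups take one rank from each model group at the same offset) is an assumption for illustration and not necessarily how Modulus assigns ranks; `rank_groups` is a hypothetical helper:

```python
# Hypothetical sketch of the requested rank-layout printout. The convention
# used here (consecutive ranks per model group, strided ranks per data group)
# is an assumption, not necessarily what DistributedManager actually does.
def rank_groups(world_size: int, model_size: int):
    assert world_size % model_size == 0
    ranks = list(range(world_size))
    # model groups: blocks of consecutive ranks
    model = [ranks[i:i + model_size] for i in range(0, world_size, model_size)]
    # data groups: one rank from each model group, at the same offset
    data = [ranks[i::model_size] for i in range(model_size)]
    return ranks, model, data


world, model, data = rank_groups(8, 4)
print("world:", world)   # [0, 1, 2, 3, 4, 5, 6, 7]
print("model:", model)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print("data: ", data)    # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With this convention the strided data groups span model-group boundaries, so the consecutive model-group ranks are the ones that end up physically closest together, which is exactly the placement property one wants to verify at a glance.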

akshaysubr commented 9 months ago

This is a good point! I actually use this myself while debugging and can expose it through the verbose=True option.