NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0
795 stars 172 forks source link

🐛[BUG]: Order of communicators in comm tree is not in the order as they were added #258

Open azrael417 opened 7 months ago

azrael417 commented 7 months ago

Version

0.5.0

On which installation method(s) does this occur?

Pip

Describe the issue

I have the following problem: although I am adding process groups in a specific order to the tree, the printed tree, when converted to dict, has the wrong order. The ordering is important because the user wants to influence of how its communicators are laid out, ideally in a fashion so that ranks in communicators instantiated first are closer together.

Minimum reproducible example

from modulus.distributed.manager import DistributedManager
from modulus.distributed.config import ProcessGroupNode, ProcessGroupConfig

leaf_config = {"h": 2, "w": 2, "fin": 1, "fout": 1}

world = ProcessGroupNode('world')
pconfig = ProcessGroupConfig(world)

pconfig.add_node(ProcessGroupNode("data"), parent="world")

pconfig.add_node(ProcessGroupNode("model"), parent="world")

pconfig.add_node(ProcessGroupNode("spatial"), parent="model")
pconfig.add_node(ProcessGroupNode("matmul"), parent="model")

pconfig.add_node(ProcessGroupNode("h", size=leaf_config["h"]), parent="spatial")
pconfig.add_node(ProcessGroupNode("w", size=leaf_config["w"]), parent="spatial")

pconfig.add_node(ProcessGroupNode("fin", size=leaf_config["fin"]), parent="matmul")
pconfig.add_node(ProcessGroupNode("fout", size=leaf_config["fout"]), parent="matmul")

print(pconfig.tree.to_dict())

Relevant log output

This prints:

>>> print(pconfig.tree.to_dict())
{'world': {'children': ['data', {'model': {'children': [{'matmul': {'children': ['fin', 'fout']}}, {'spatial': {'children': ['h', 'w']}}]}}]}}

So although the group "spatial" was added first to its parent "model", the order in the tree view is swapped. I am not 100% sure if that will be the same order in which those communicators will be instantiated but it would be good to verify that nodes which are added first are guaranteed to be instantiated first. The children are stored in a list which has strict ordering. Therefore, it should be easy to fix this.

Environment details

I think this is not relevant for this issue.
azrael417 commented 7 months ago

According to the documentation, the nodes in treelib should be ordered as they are appended. So if that is the case, I think we should be good.