Open ly-muc opened 1 year ago
@ly-muc Thank you for filing the issue! The problem was in the code. It is fixed in the commit and new release.
The test had only been run on a single node, but it should also work in a multinode setting.
I think it is sufficient to have `dist.init_process_group(backend="cgx", init_method="env://", rank=self.rank)`. The rank is taken from `OMPI_COMM_WORLD_RANK`, which is supposed to be the global rank, not the local one.
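To illustrate the difference between the two ranks, here is a minimal sketch that reads the values Open MPI's `mpirun` exports for each process. The helper name `ompi_ranks` and the example values are hypothetical and not part of the test script; the example assumes two nodes with two processes each.

```python
import os


def ompi_ranks(env=os.environ):
    """Return (global_rank, local_rank, world_size) as exported by Open MPI.

    `ompi_ranks` is a hypothetical helper for illustration only.
    """
    # Unique across all nodes: this is what init_process_group expects as `rank`.
    global_rank = int(env["OMPI_COMM_WORLD_RANK"])
    # Restarts at 0 on every node: useful for picking a GPU, not as `rank`.
    local_rank = int(env["OMPI_COMM_WORLD_LOCAL_RANK"])
    # Total number of processes across all nodes.
    world_size = int(env["OMPI_COMM_WORLD_SIZE"])
    return global_rank, local_rank, world_size


# Assumed example: the second process on the second of two nodes.
example_env = {
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
    "OMPI_COMM_WORLD_SIZE": "4",
}
global_rank, local_rank, world_size = ompi_ranks(example_env)
print(global_rank, local_rank, world_size)
```

In a real run the global rank would go to `dist.init_process_group(..., rank=global_rank)`, while the local rank would only select the device, e.g. via `torch.cuda.set_device(local_rank)`. On a single node the two values coincide, which is why passing the local rank only breaks once a second node is involved.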
I am currently trying to verify the correctness of my installation. In order to handle different nodes, my test script differs from the original script in the following lines.
I execute the test with the following line:
However, the test fails with an assertion error when the result is compared with the expected tensor, and the error message changes between runs. For example, either the following error message occurs:
or this one:
In the two cases shown, the assertion fails at a different step while iterating over the tensor lengths. Do you possibly have an idea what could cause this?
For my understanding: in the readme, the local rank is used when `dist.init_process_group` is called. Does this assume that there is only one node? Thanks!