IST-DASLab / torch_cgx

Pytorch distributed backend extension with compression support

Unable to run Unittest #2

Open ly-muc opened 1 year ago

ly-muc commented 1 year ago

I am currently trying to verify that my installation is correct. To handle multiple nodes, my test script differs from the original script in the following lines.

os.environ['MASTER_ADDR'] = args.masterhost
os.environ['MASTER_PORT'] = '4040'
os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]

dist.init_process_group(backend="cgx",  init_method="env://", rank=self.rank % torch.cuda.device_count())
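Here self.rank is the per-process rank read from Open MPI's environment at the start of the test, roughly like this (a sketch of my setup, not an excerpt of the original test):

import os

# Rank exported by mpirun for each launched process
self.rank = int(os.environ["OMPI_COMM_WORLD_RANK"])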

I execute the test with the following line:

mpirun -np 2 -x PATH --hostfile hostfile --tag-output --allow-run-as-root -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca coll ^hcoll -- python test/test_qmpi.py --masterhost=$MASTER_HOST

However, the test fails with an assertion error when the result is compared against the expected tensor, and the error message changes when I repeat the test. For example, either the following error occurs:

======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2.], device='cuda:0', dtype=torch.float16) != tensor([3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 128

or this one:

======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2.], device='cuda:0', dtype=torch.float16) != tensor([3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 1

In the two cases shown, the assertion fails at different steps while iterating over the tensor lengths. Do you have any idea what could cause this?

Also, for my understanding: in the README, the local rank is passed when dist.init_process_group is called. Does this assume that there is only one node?

Thanks!

ilmarkov commented 1 year ago

@ly-muc Thank you for filing the issue! The problem was in the code; it is fixed in a new commit and release.

The test was only run on a single node, but it should also work in a multi-node setting. I think it is sufficient to have dist.init_process_group(backend="cgx", init_method="env://", rank=self.rank). The rank is taken from OMPI_COMM_WORLD_RANK, which is supposed to be the global rank, not the local one.
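For example, a minimal multi-node initialization along these lines could look like the sketch below (the torch_cgx import name, the MASTER_ADDR placeholder, and the explicit world_size argument are assumptions about the surrounding script, not part of the test):

import os
import torch
import torch.distributed as dist
import torch_cgx  # registers the "cgx" backend (module name assumed per the project README)

# Global rank, local rank, and world size as exported by Open MPI / mpirun.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

os.environ["MASTER_ADDR"] = "<first-node-hostname>"  # placeholder, e.g. args.masterhost
os.environ["MASTER_PORT"] = "4040"
os.environ["WORLD_SIZE"] = str(world_size)

# One process per GPU: pin the device by local rank, but pass the global
# rank to the process group.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="cgx", init_method="env://",
                        rank=rank, world_size=world_size)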