Algebraic-Programming / pytorch-hccl-tests


Add backend-agnostic correctness test, particularly for async ops #10

Open learning-chip opened 1 year ago

learning-chip commented 1 year ago

Asynchronous/non-blocking communications are among the most important optimizations in large model training, but they are error-prone. For example, batch_isend_irecv returned wrong data with the NCCL backend until it was fixed in torch 1.13; the same operation does not even run with the current version of the HCCL backend.
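For context, a minimal sketch of the kind of non-blocking pattern in question -- a ring exchange via batch_isend_irecv (process group and device setup assumed to be done elsewhere):

```python
import torch
import torch.distributed as dist

def ring_exchange(rank: int, world_size: int, device: torch.device) -> torch.Tensor:
    """Each rank sends its data to rank+1 and receives from rank-1."""
    send_tensor = torch.full((4,), float(rank), device=device)
    recv_tensor = torch.empty(4, device=device)
    ops = [
        dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size),
        dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size),
    ]
    # Non-blocking: returns immediately; the transfers only complete
    # after wait() -- exactly the window where backend bugs hide.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    return recv_tensor  # expected: tensor filled with (rank - 1) % world_size
```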

For existing correctness checks, see the assertEqual calls in distributed_test.py in the PyTorch main repo -- many of those tests are NCCL-specific. The HCCL backend, on the other hand, currently has very limited test coverage -- async ops are not tested at all :(

The unified unit test suite should assert the same expected behavior under both the hccl and nccl backends (a sketch below). Identical behavior is the key assumption when porting frameworks like Megatron and DeepSpeed to NPU; wherever the behavior differs, extra adapter logic must be added.
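A sketch of what such a backend-agnostic assertion could look like. The TEST_BACKEND switch is illustrative, and the hccl path assumes the torch_npu plugin is installed:

```python
import os
import torch
import torch.distributed as dist

BACKEND = os.environ.get("TEST_BACKEND", "nccl")  # hypothetical switch: "nccl" or "hccl"

def setup(rank: int, world_size: int) -> torch.device:
    if BACKEND == "hccl":
        import torch_npu  # noqa: F401 -- registers the hccl backend (assumed available)
        device = torch.device(f"npu:{rank}")
        torch.npu.set_device(device)
    else:
        device = torch.device(f"cuda:{rank}")
        torch.cuda.set_device(device)
    dist.init_process_group(BACKEND, rank=rank, world_size=world_size)
    return device

def test_all_reduce_sum(rank: int, world_size: int, device: torch.device) -> None:
    x = torch.full((8,), float(rank + 1), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = torch.full((8,), float(sum(range(1, world_size + 1))))
    # The same assertion runs regardless of backend -- that is the whole point.
    torch.testing.assert_close(x.cpu(), expected)
```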

The correctness test suite should be kept separate from the OSU-style benchmark #8, which focuses on performance numbers -- for example, torch_comm_test.osu_bench and torch_comm_test.unit_test as two separate namespaces, as sketched below.
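One possible layout (package names as suggested above; the individual file names are illustrative):

```
torch_comm_test/
├── osu_bench/        # performance numbers only (issue #8): latency, bandwidth, ...
│   └── ...
└── unit_test/        # correctness: same assertions under nccl and hccl
    ├── test_collectives.py
    └── test_p2p_async.py
```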

learning-chip commented 1 year ago

An example of an inadequate check vs. a proper check: compare test_batch_isend_irecv_nccl in 1.11 vs 1.13:

Only 1.13 performs self.assertEqual(recv_tensors[src], expected_tensors[src]), which is what we need here.
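Roughly, the 1.13-style check boils down to the following pattern (a sketch, not the verbatim upstream test); the same assertion should pass under hccl:

```python
import torch
import torch.distributed as dist

def check_batch_isend_irecv(rank: int, world_size: int, device: torch.device) -> None:
    # Every rank exchanges a tensor with every other rank.
    recv_tensors = {}
    expected_tensors = {}
    ops = []
    for peer in range(world_size):
        if peer == rank:
            continue
        send = torch.full((4,), float(rank), device=device)
        recv_tensors[peer] = torch.empty(4, device=device)
        expected_tensors[peer] = torch.full((4,), float(peer), device=device)
        ops.append(dist.P2POp(dist.isend, send, peer))
        ops.append(dist.P2POp(dist.irecv, recv_tensors[peer], peer))
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    # The inadequate 1.11-style test stopped after wait(); the proper
    # check also verifies the received contents:
    for src in expected_tensors:
        torch.testing.assert_close(recv_tensors[src], expected_tensors[src])
```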