Algebraic-Programming / pytorch-hccl-tests


Add backend-agnostic correctness test, particularly for async ops #10

Open learning-chip opened 1 year ago

learning-chip commented 1 year ago

Asynchronous/non-blocking communications are among the most important optimizations in large model training, but they are error-prone. For example, batch_isend_irecv returned wrong data with the NCCL backend until it was fixed in torch 1.13; the same operation does not even run with the current version of the HCCL backend.
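For context, a minimal sketch of the kind of non-blocking pattern in question -- a ring exchange via batch_isend_irecv (process group and device setup assumed to be done elsewhere):

```python
import torch
import torch.distributed as dist

def ring_exchange(rank: int, world_size: int, device: torch.device) -> torch.Tensor:
    """Each rank sends its data to rank+1 and receives from rank-1."""
    send_tensor = torch.full((4,), float(rank), device=device)
    recv_tensor = torch.empty(4, device=device)
    ops = [
        dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size),
        dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size),
    ]
    # Non-blocking: returns immediately; the transfers only complete
    # after wait() -- exactly the window where backend bugs hide.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    return recv_tensor  # expected: tensor filled with (rank - 1) % world_size
```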

For existing correctness checks, see the assertEqual calls in distributed_test.py in the PyTorch main repo -- many of those tests are NCCL-specific. The HCCL backend, on the other hand, currently has very limited test coverage -- async ops are not tested at all :(

The unified unit test suite should assert the same expected behavior under both the hccl and nccl backends (a sketch below). Identical behavior is the key assumption when porting frameworks like Megatron and DeepSpeed to NPU; wherever the behavior differs, extra adapter logic must be added.
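A sketch of what such a backend-agnostic assertion could look like. The TEST_BACKEND switch is illustrative, and the hccl path assumes the torch_npu plugin is installed:

```python
import os
import torch
import torch.distributed as dist

BACKEND = os.environ.get("TEST_BACKEND", "nccl")  # hypothetical switch: "nccl" or "hccl"

def setup(rank: int, world_size: int) -> torch.device:
    if BACKEND == "hccl":
        import torch_npu  # noqa: F401 -- registers the hccl backend (assumed available)
        device = torch.device(f"npu:{rank}")
        torch.npu.set_device(device)
    else:
        device = torch.device(f"cuda:{rank}")
        torch.cuda.set_device(device)
    dist.init_process_group(BACKEND, rank=rank, world_size=world_size)
    return device

def test_all_reduce_sum(rank: int, world_size: int, device: torch.device) -> None:
    x = torch.full((8,), float(rank + 1), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = torch.full((8,), float(sum(range(1, world_size + 1))))
    # The same assertion runs regardless of backend -- that is the whole point.
    torch.testing.assert_close(x.cpu(), expected)
```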

The correctness test suite should be kept separate from the OSU-style benchmark #8, which focuses on performance numbers -- for example, torch_comm_test.osu_bench and torch_comm_test.unit_test as two separate namespaces, as sketched below.
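One possible layout (package names as suggested above; the individual file names are illustrative):

```
torch_comm_test/
├── osu_bench/        # performance numbers only (issue #8): latency, bandwidth, ...
│   └── ...
└── unit_test/        # correctness: same assertions under nccl and hccl
    ├── test_collectives.py
    └── test_p2p_async.py
```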

learning-chip commented 1 year ago

An example of an inadequate check vs. a proper check: compare test_batch_isend_irecv_nccl in 1.11 vs 1.13:

Only 1.13 performs self.assertEqual(recv_tensors[src], expected_tensors[src]), which is what we need here.
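Roughly, the 1.13-style check boils down to the following pattern (a sketch, not the verbatim upstream test); the same assertion should pass under hccl:

```python
import torch
import torch.distributed as dist

def check_batch_isend_irecv(rank: int, world_size: int, device: torch.device) -> None:
    # Every rank exchanges a tensor with every other rank.
    recv_tensors = {}
    expected_tensors = {}
    ops = []
    for peer in range(world_size):
        if peer == rank:
            continue
        send = torch.full((4,), float(rank), device=device)
        recv_tensors[peer] = torch.empty(4, device=device)
        expected_tensors[peer] = torch.full((4,), float(peer), device=device)
        ops.append(dist.P2POp(dist.isend, send, peer))
        ops.append(dist.P2POp(dist.irecv, recv_tensors[peer], peer))
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    # The inadequate 1.11-style test stopped after wait(); the proper
    # check also verifies the received contents:
    for src in expected_tensors:
        torch.testing.assert_close(recv_tensors[src], expected_tensors[src])
```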