ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Fix SWDEV-459623 #1428

Closed xinyazhang closed 1 month ago

xinyazhang commented 1 month ago

Fixes #SWDEV-459623

Tested on rocm-framework-51. Log:

(py_3.10) xinyazha@12fd1640b6b7:~/rocm-pytorch$ PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_tensor/test_attention.py -k test_ring_attention_compile_attention -v
test_ring_attention_compile_attention_fn0 (__main__.RingAttentionTest) ... ok
test_ring_attention_compile_attention_fn1 (__main__.RingAttentionTest) ... [rank1]:[W530 08:05:10.073449407 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank0]:[W530 08:05:10.123574261 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
skipped 'Test skipped at subprocess level, look at subprocess log for skip reason'

----------------------------------------------------------------------
Ran 2 tests in 11.265s

OK (skipped=1)
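
The two `ProcessGroupNCCL` warnings in the log say the process group was not destroyed before the process exited. They are unrelated to the test result. As a minimal sketch of the cleanup pattern the warning asks for (not part of this PR's change), teardown can be wrapped in `try`/`finally` so it runs even if the body raises. The helper `process_group` and the stand-ins `fake_init` / `fake_destroy` are hypothetical names used only so the example runs without a GPU. In real code the two callables would presumably be `torch.distributed.init_process_group` (with its arguments bound, e.g. via `functools.partial`) and `torch.distributed.destroy_process_group`.

```python
from contextlib import contextmanager


@contextmanager
def process_group(init_fn, destroy_fn):
    """Call init_fn on entry and always call destroy_fn on exit.

    Hypothetical helper: init_fn / destroy_fn stand in for
    torch.distributed.init_process_group and
    torch.distributed.destroy_process_group.
    """
    init_fn()
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises, so the
        # group is torn down before the interpreter exits.
        destroy_fn()


# Stand-ins (assumptions for illustration) that only record call order.
events = []


def fake_init():
    events.append("init")


def fake_destroy():
    events.append("destroy")


with process_group(fake_init, fake_destroy):
    events.append("test body")

print(events)
```

The `finally` clause is the design point: teardown is tied to leaving the block rather than to reaching the end of the script, which is the case the warning describes as "normal program exit".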