ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

SWDEV-469009 - skips for flaky distributed tests #1439

Closed by pragupta 3 months ago

pragupta commented 3 months ago

I see this one failing consistently with a "tensor-likes are not equal" error:

python test/distributed/test_c10d_gloo.py -k test_reduce_stress_cuda
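For illustration only, here is a minimal sketch of how a skip for this test might look. It is an assumption, not the change made in this PR: `ReduceStressSketch` and `ON_ROCM` are hypothetical names, the test body is a placeholder, and the skip relies only on the standard `unittest.skipIf` decorator plus `torch.version.hip`, which is `None` on non-ROCm builds.

```python
import unittest

try:
    import torch
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    ON_ROCM = torch.version.hip is not None
except ImportError:
    # Assumption for this sketch: without PyTorch installed, treat as non-ROCm.
    ON_ROCM = False


class ReduceStressSketch(unittest.TestCase):  # hypothetical stand-in class
    @unittest.skipIf(ON_ROCM, "SWDEV-469009: tensor-likes are not equal on ROCm")
    def test_reduce_stress_cuda(self):
        # Placeholder body; the real test lives in
        # test/distributed/test_c10d_gloo.py.
        self.assertTrue(True)
```

With a decorator like this the test is still collected and reported as skipped on ROCm, so the ticket number in the reason string stays visible in the test log.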

These are flaky. They pass when I run them individually, but they time out when run together as a batch:

test/distributed/fsdp/test_fsdp_clip_grad_norm.py -k test_no_gradients

test/distributed/fsdp/test_fsdp_optim_state.py -k test_optim_state_dict_nested

test/distributed/fsdp/test_fsdp_optim_state.py -k test_scatter_full_optim_state_dict

test/distributed/fsdp/test_fsdp_optim_state.py -k test_rekey_optim_state_dict

test/distributed/fsdp/test_fsdp_optim_state.py -k test_shard_full_optim_state_dict

test/distributed/fsdp/test_fsdp_optim_state.py -k test_full_optim_state

test/distributed/fsdp/test_fsdp_use_orig_params.py -k test_access_params_after_forward
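
The "passes individually" observation could be re-checked with a small script along these lines. This is a rough sketch under assumptions: it invokes each file the same way as the gloo command above (`python <file> -k <name>`), and `build_command`, `run_individually`, and the 600-second timeout are hypothetical choices, not part of the PyTorch test tooling.

```python
import subprocess
import sys

# (file, test filter) pairs copied from the list above.
FLAKY_TESTS = [
    ("test/distributed/fsdp/test_fsdp_clip_grad_norm.py", "test_no_gradients"),
    ("test/distributed/fsdp/test_fsdp_optim_state.py", "test_optim_state_dict_nested"),
    ("test/distributed/fsdp/test_fsdp_optim_state.py", "test_scatter_full_optim_state_dict"),
    ("test/distributed/fsdp/test_fsdp_optim_state.py", "test_rekey_optim_state_dict"),
    ("test/distributed/fsdp/test_fsdp_optim_state.py", "test_shard_full_optim_state_dict"),
    ("test/distributed/fsdp/test_fsdp_optim_state.py", "test_full_optim_state"),
    ("test/distributed/fsdp/test_fsdp_use_orig_params.py", "test_access_params_after_forward"),
]


def build_command(test_file, test_name, python=sys.executable):
    """Build the command line for a single test, one process per test."""
    return [python, test_file, "-k", test_name]


def run_individually(tests, timeout_s=600, runner=subprocess.run):
    """Run each test in its own process; report passed, failed, or timeout."""
    results = {}
    for test_file, test_name in tests:
        cmd = build_command(test_file, test_name)
        try:
            proc = runner(cmd, timeout=timeout_s)
            status = "passed" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            # The timeout value is an assumption; tune it to the CI limit.
            status = "timeout"
        results[(test_file, test_name)] = status
    return results

# Example use from the repository root:
#   for key, status in run_individually(FLAKY_TESTS).items():
#       print(status, *key)
```

Running each test in a fresh process keeps one test's leftover process groups or GPU state from affecting the next, which is one plausible (unconfirmed) reason the batch run times out.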