NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Contrib unit test failure in `openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data` #1802

Open xwang233 opened 6 months ago

xwang233 commented 6 months ago

Describe the Bug

Contrib unit test failure in openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data

Minimal Steps/Code to Reproduce the Bug

root@b4db9ba94176:/opt/pytorch/apex/apex/contrib/test# pytest -vvvs -k test_fused_update_on_random_data
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python3
cachedir: .pytest_cache
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/opt/pytorch/apex/apex/contrib/test/.hypothesis/examples'))
rootdir: /opt/pytorch/apex
configfile: pyproject.toml
plugins: timestamper-0.0.10, xdist-3.6.1, random-order-1.1.1, benchmark-4.0.0, rerunfailures-14.0, anyio-4.3.0, timeout-2.3.1, xdoctest-1.1.0, hypothesis-6.100.0, shard-0.1.2, cov-4.1.0, flakefinder-1.1.0
collected 113 items / 112 deselected / 1 selected
Running 1 items in this shard: apex/contrib/test/openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data

[2024-05-15 17:12:58] openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data FAILED

=================================================================================================== FAILURES ====================================================================================================
_____________________________________________________________________________ FusedAdamSWATestCase.test_fused_update_on_random_data _____________________________________________________________________________

self = <test_fused_adam_swa.FusedAdamSWATestCase testMethod=test_fused_update_on_random_data>

    def setUp(self):
        super().setUp()
        self._seed = 19260817
        random.seed(self._seed)
        torch.manual_seed(self._seed)
>       torch.backends.cudnn.deterministic = True

openfold_triton/test_fused_adam_swa.py:91:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.backends.ContextProp object at 0x7f57fdef1a80>, obj = <module 'torch.backends.cudnn' from '/opt/pytorch/pytorch/torch/backends/cudnn/__init__.py'>, val = True

    def __set__(self, obj, val):
        if not flags_frozen():
            self.setter(val)
        else:
>           raise RuntimeError(
                f"not allowed to set {obj.__name__} flags "
                "after disable_global_flags; please use flags() context manager instead"
            )
E           RuntimeError: not allowed to set torch.backends.cudnn flags after disable_global_flags; please use flags() context manager instead

../../../../pytorch/torch/backends/__init__.py:43: RuntimeError
=============================================================================================== warnings summary ================================================================================================
../../transformer/tensor_parallel/cross_entropy.py:78
  /opt/pytorch/apex/apex/transformer/tensor_parallel/cross_entropy.py:78: DeprecationWarning: invalid escape sequence '\s'
    """

../../transformer/pipeline_parallel/schedules/fwd_bwd_pipelining_with_interleaving.py:49
  /opt/pytorch/apex/apex/transformer/pipeline_parallel/schedules/fwd_bwd_pipelining_with_interleaving.py:49: DeprecationWarning: invalid escape sequence '\_'
    """Run interleaved 1F1B schedule with communication between pipeline stages as needed.

../../transformer/pipeline_parallel/schedules/fwd_bwd_pipelining_without_interleaving.py:261
  /opt/pytorch/apex/apex/transformer/pipeline_parallel/schedules/fwd_bwd_pipelining_without_interleaving.py:261: DeprecationWarning: invalid escape sequence '\_'
    """Run non-interleaved 1F1B schedule, with communication between pipeline stages.

../../../../pytorch/torch/_custom_ops.py:253
  /opt/pytorch/pytorch/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
    return torch.library.impl_abstract(qualname, func, _stacklevel=2)

../../../../vision/torchvision/transforms/_functional_pil.py:242
  /opt/pytorch/vision/torchvision/transforms/_functional_pil.py:242: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
    interpolation: int = Image.BILINEAR,

../../../../vision/torchvision/transforms/_functional_pil.py:288
  /opt/pytorch/vision/torchvision/transforms/_functional_pil.py:288: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
    interpolation: int = Image.NEAREST,

../../../../vision/torchvision/transforms/_functional_pil.py:304
  /opt/pytorch/vision/torchvision/transforms/_functional_pil.py:304: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
    interpolation: int = Image.NEAREST,

../../../../vision/torchvision/transforms/_functional_pil.py:321
  /opt/pytorch/vision/torchvision/transforms/_functional_pil.py:321: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
    interpolation: int = Image.BICUBIC,

../optimizers/distributed_fused_adam.py:273
  /opt/pytorch/apex/apex/contrib/optimizers/distributed_fused_adam.py:273: DeprecationWarning: invalid escape sequence '\:'
    """Adam optimizer with ZeRO algorithm.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================ short test summary info ============================================================================================
FAILED openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data - RuntimeError: not allowed to set torch.backends.cudnn flags after disable_global_flags; please use flags() context manager instead
================================================================================= 1 failed, 112 deselected, 9 warnings in 7.40s =================================================================================

Expected Behavior

The test is expected to pass.

Environment

The test has been failing since 2/13/24, even though https://github.com/NVIDIA/apex/pull/1759 was merged on 12/14/23 and the test has not changed since then.

Before 2/13/24, the test was skipped because of the environment setup in our CI (a missing `einops` module), e.g. on 2/12/24:

openfold_triton/test_fused_adam_swa.py::FusedAdamSWATestCase::test_fused_update_on_random_data SKIPPED (Skip testing FusedAdamSWA: No module named 'einops')

cc @crcrpar @eqy @nWEIdia