NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Intermittent failures in `test_healpix.py` pytest. #608

Open ktangsali opened 4 months ago

ktangsali commented 4 months ago

Version

source - main

On which installation method(s) does this occur?

Docker

Describe the issue

`test_healpix.py` intermittently fails when run as part of the full pytest suite, but passes when it is run on its own. The error resembles the one reported in https://github.com/Unidata/netcdf4-python/issues/1343. The discussion there suggests the problem is tied to a particular netcdf4-python version and that a newer release may fix it.

Downgrading netcdf4-python to version 1.6.5 also helps.
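
To confirm which version a container is actually running before and after the downgrade, a minimal sketch like the following may help. It assumes the package is installed under the distribution name `netCDF4`. The helper names `installed_version` and `matches_pin` are illustrative and not part of Modulus.

```python
from importlib.metadata import PackageNotFoundError, version

# Version reported in this issue as helping; treat as a data point, not a guarantee.
PINNED = "1.6.5"


def installed_version(dist_name="netCDF4"):
    """Return the installed version string, or None if the package is absent.

    The distribution name "netCDF4" is an assumption about the environment.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def matches_pin(found, pinned=PINNED):
    """True only when a version was found and equals the pinned one exactly."""
    return found is not None and found.strip() == pinned


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("netCDF4 is not installed in this environment")
    else:
        state = "matches" if matches_pin(found) else "differs from"
        print(f"netCDF4 {found} {state} the pinned version {PINNED}")
```

Running this inside the Docker image before the full pytest run records the version alongside any failure, which makes it easier to correlate the intermittent error with a specific release.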

Minimum reproducible example

No response

Relevant log output

_______________________ test_open_time_series_on_the_fly _______________________

self = CachingFileManager(<class 'netCDF4._netCDF4.Dataset'>, '/data/nfs/modulus-data/datasets/healpix/merge/z1000.nc', mode=...r': True, 'diskless': False, 'persist': False, 'format': 'NETCDF4'}, manager_id='2509f805-6f7f-4ab1-952a-f9a698464d2c')
needs_lock = True

    def _acquire_with_cache_info(self, needs_lock=True):
        """Acquire a file, returning the file and whether it was cached."""
        with self._optional_lock(needs_lock):
            try:
>               file = self._cache[self._key]

/usr/local/lib/python3.10/dist-packages/xarray/backends/file_manager.py:211: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xarray.backends.lru_cache.LRUCache object at 0x7fdf674382c0>
key = [<class 'netCDF4._netCDF4.Dataset'>, ('/data/nfs/modulus-data/datasets/healpix/merge/z1000.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '2509f805-6f7f-4ab1-952a-f9a698464d2c']

    def __getitem__(self, key: K) -> V:
        # record recent use of the key by moving it to the front of the list
        with self._lock:
>           value = self._cache[key]
E           KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/data/nfs/modulus-data/datasets/healpix/merge/z1000.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '2509f805-6f7f-4ab1-952a-f9a698464d2c']

/usr/local/lib/python3.10/dist-packages/xarray/backends/lru_cache.py:56: KeyError
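
A note on reading this log: the `KeyError` comes from a cache lookup, and in a get-or-open pattern a cache miss is normally followed by an attempt to open the file. The sketch below is a simplified illustration of that pattern, not xarray's actual implementation; `TinyFileManager` and `failing_opener` are hypothetical names. It shows why a traceback of this shape can begin with a `KeyError` even when the underlying failure is raised afterwards by the opener.

```python
import threading


class TinyFileManager:
    """Simplified get-or-open cache, for illustration only."""

    def __init__(self, opener, path):
        self._opener = opener
        self._path = path
        self._key = (opener, path)
        self._cache = {}
        self._lock = threading.RLock()

    def acquire(self):
        """Return (file, was_cached)."""
        with self._lock:
            try:
                file = self._cache[self._key]
            except KeyError:
                # Cache miss: open the file. Any error raised here is chained
                # onto the KeyError, so the KeyError is printed first.
                file = self._opener(self._path)
                self._cache[self._key] = file
                return file, False
            return file, True


def failing_opener(path):
    """Hypothetical opener that always fails, standing in for a real backend."""
    raise OSError(f"could not open {path}")


# Normal case: the first call misses the cache, the second call hits it.
manager = TinyFileManager(lambda p: f"handle:{p}", "example.nc")
first = manager.acquire()
second = manager.acquire()

# Failure case: the opener's error carries the KeyError as its context.
chained = None
broken = TinyFileManager(failing_opener, "example.nc")
try:
    broken.acquire()
except OSError as err:
    chained = err.__context__
```

If the real log follows the same shape, the frames after the `KeyError` (not included above) would be the ones that identify the actual failure.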

Environment details

No response

mnabian commented 1 month ago

@daviddpruitt would you be able to look into this?