Open mx-moth opened 4 months ago
In the test_clip example in your post, @mx-moth, it looks like the netCDF4 Python library is definitely in the mix. It uses the netcdf-c and HDF5 C libraries under the hood, and those can optionally be compiled with support for parallel processing via Open MPI. It might be worth checking that your netCDF4 Python library links against the latest versions of those libraries, and whether compiling them with or without --parallel support makes any difference. (The ereefs/netcdf-base docker image is set up to let you select library versions and compilation options, and may help provide a test environment.)
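One way to check which C libraries a given netCDF4 install is linked against is to ask the package itself. The following is a minimal sketch, not part of emsarray: the helper name `netcdf4_build_info` is made up for illustration, and it assumes the module-level attributes `__netcdf4libversion__`, `__hdf5libversion__`, and `__has_parallel4_support__` that the netCDF4 Python package exposes.

```python
"""Report which C libraries the installed netCDF4 Python package was built against."""


def netcdf4_build_info():
    """Return a dict describing the netCDF4 build, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        import netCDF4
    except ImportError:
        return None
    return {
        "netCDF4 (Python)": netCDF4.__version__,
        "netcdf-c": netCDF4.__netcdf4libversion__,
        "HDF5": netCDF4.__hdf5libversion__,
        # Flag reported by the package itself; None if this build does not expose it.
        "parallel support": getattr(netCDF4, "__has_parallel4_support__", None),
    }


if __name__ == "__main__":
    info = netcdf4_build_info()
    if info is None:
        print("netCDF4 is not installed in this environment")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

Running this in both the PyPI-based and conda-based environments would show whether the two installs differ in their netcdf-c or HDF5 versions.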
Further evidence that this is an issue with the netCDF4 library on PyPI is that the errors go away if I downgrade to netCDF4 < 1.7. Unfortunately netCDF4 ~= 1.6.x is not compatible with numpy 2, so numpy also needs downgrading. Setting up the environment as follows will not segfault:
$ conda env create --name emsarray-tests --no-default-packages --file ./continuous-integration/environment.yaml
$ conda activate emsarray-tests
$ conda install python=3.12 pip
$ pip install -e .[testing] 'netcdf4<1.7' 'numpy<2'
$ pytest -vv
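When comparing environments like this, it can help to confirm which versions actually ended up installed and which tool installed them. The sketch below uses only the standard library. The helper name `describe_distributions` is hypothetical, and it assumes each package ships an `INSTALLER` metadata file (pip writes one, and conda packages typically do too); packages without one are reported as "unknown".

```python
from importlib import metadata


def describe_distributions(names):
    """Map each distribution name to (version, installer), or None if missing.

    Hypothetical helper for illustration only.
    """
    report = {}
    for name in names:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        # INSTALLER is optional metadata, so fall back to "unknown" when absent.
        installer = (dist.read_text("INSTALLER") or "unknown").strip()
        report[name] = (dist.version, installer)
    return report


if __name__ == "__main__":
    packages = ["netCDF4", "numpy", "xarray", "dask"]
    for name, details in describe_distributions(packages).items():
        print(name, details if details else "not installed")
```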
@mx-moth as a point of interest, I just had a go at reproducing this on the ereefs/netcdf-base image using NetCDF libraries compiled with OpenMPI support and could NOT reproduce it!
This environment has:
Preparation steps:
docker pull onaci/ereefs-netcdf-base:python-3.11-slim-bookworm
docker run --rm -i -t onaci/ereefs-netcdf-base:python-3.11-slim-bookworm bash
Then from a shell inside the container:
git clone https://github.com/csiro-coasts/emsarray.git
cd emsarray
git checkout dependency-version-bump
# Ensure the python netcdf4 library compiles its own wheel against the netcdf-c library
# version which is already installed into the base image:
# Note: before running this step, I needed to edit continuous-integration/requirements-3.11.txt
# so that any requirement with extras (like coverage[toml]==7.5.4) no longer had the [] part:
# This is because of https://github.com/pypa/pip/issues/8210 and the newest version of pip!
pip3-netcdf-install continuous-integration/requirements-3.11.txt
# Then edit the requirements-3.11.txt file again to put the extras back...
# And install all the other requirements:
pip3 install -r continuous-integration/requirements-3.11.txt
pip install -e .[testing]
# Run the tests
pytest -vv
All the tests passed without error or segfault.
Installing netCDF4 1.7.1 and numpy 2.0.0 from conda, and installing the rest of the dependencies from pip, also does not segfault. It is looking more and more likely that something in the netCDF4 1.7.1 wheel from PyPI is the cause.
$ conda env create --name emsarray-tests --no-default-packages --file ./continuous-integration/environment.yaml
$ conda activate emsarray-tests
$ conda install python=3.12 'netcdf4=1.7.1' 'numpy=2.0'
$ pip install -e .[testing]
$ pytest -vv
I disabled dask multithreading in the tests in #137. To reenable multithreading when running pytest, run it as pytest --dask-scheduler threads
When all dependencies are updated to their latest versions and installed via PyPI (#137), the test suite regularly, but non-deterministically, segfaults. When the latest dependencies are all installed via conda, no segfault has been observed.
As a temporary workaround, disabling dask multithreading seems to stop the segfaults. This is not an acceptable long-term solution, but it will suffice to unblock other development work.
This ticket tracks the investigation so far.
To stop the segfaults, dask can be set to single threaded mode by running:
This is now enabled by default for test runs. To trigger the failures again, run the tests with
pytest --dask-scheduler=threads
To set up a test environment clone this repository, make a conda environment, and install the dependencies from PyPI as follows:
The tests segfault regularly on two specific tests which subset UGrid datasets; however, other subsetting tests have also failed. Python 3.10, 3.11, and 3.12 all exhibit this issue. These tests previously worked fine. The stack traces printed vary, but some examples follow:
tests/conventions/test_ugrid.py::test_make_and_apply_clip_mask
```
$ pytest -vv --dask-scheduler threads
============ test session starts ============
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0 -- /home/hea211/projects/emsarray/.conda/bin/python3.11
cachedir: .pytest_cache
Matplotlib: 3.9.0
Freetype: 2.6.1
rootdir: /home/hea211/projects/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: mpl-0.17.0, cov-5.0.0
collected 365 items
...
tests/conventions/test_ugrid.py::test_make_and_apply_clip_mask Fatal Python error: Segmentation fault

Thread 0x00007f8c195fa700 (most recent call first):
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 113 in _getitem
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 1014 in explicit_indexing_adapter
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 100 in __getitem__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 650 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 787 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 576 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 573 in __array__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/array/core.py", line 118 in getter
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/core.py", line 127 in _execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 225 in execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 239 in
```

tests/conventions/test_ugrid.py::test_make_and_apply_clip_mask
```
$ pytest -vv --dask-scheduler threads
============ test session starts ============
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0 -- /home/hea211/projects/emsarray/.conda/bin/python3.11
cachedir: .pytest_cache
Matplotlib: 3.9.0
Freetype: 2.6.1
rootdir: /home/hea211/projects/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: mpl-0.17.0, cov-5.0.0
collected 365 items
...
tests/conventions/test_ugrid.py::test_make_and_apply_clip_mask Fatal Python error: Segmentation fault

Thread 0x00007efd0f7fe700 (most recent call first):
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/locks.py", line 64 in __enter__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/locks.py", line 231 in __enter__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 77 in __setitem__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/array/core.py", line 4380 in load_store_chunk
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/array/core.py", line 4398 in store_chunk
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/core.py", line 127 in _execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 225 in execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 239 in
```

tests/cli/commands/test_clip.py::test_clip
```
$ pytest -vv --dask-scheduler threads -- tests/cli/commands/test_clip.py::test_clip
============ test session starts ============
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0 -- /home/hea211/projects/emsarray/.conda/bin/python3.11
cachedir: .pytest_cache
Matplotlib: 3.9.0
Freetype: 2.6.1
rootdir: /home/hea211/projects/emsarray
configfile: pyproject.toml
plugins: mpl-0.17.0, cov-5.0.0
collected 1 item

tests/cli/commands/test_clip.py::test_clip Fatal Python error: Fatal Python error: Segmentation faultSegmentation fault

Current thread 0x00007fb280fa0700 (most recent call first):
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 113 in _getitem
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 1014 in explicit_indexing_adapter
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 100 in __getitem__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 650 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 787 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 576 in get_duck_array
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/xarray/core/indexing.py", line 573 in __array__
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/array/core.py", line 118 in getter
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/core.py", line 127 in _execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 225 in execute_task
  File "/home/hea211/projects/emsarray/.conda/lib/python3.11/site-packages/dask/local.py", line 239 in
```