conda-forge / netcdf4-feedstock

A conda-smithy repository for netcdf4.

New `netCDF4=1.6.1` (most probably) causing a fairly large number of Segmentation Faults #141

Closed valeriupredoi closed 1 year ago

valeriupredoi commented 1 year ago

Heads up guys, we are seeing very frequent segfaults in our CI now that the new, hours-old netCDF4=1.6.1 is in our environment. It's most probably due to it: HDF5 has been at 1.12.2 for a while now (more than a month), with netCDF4=1.6.0 all works fine, and the other packages are staying at the same versions and hashes. Apologies if this turns out to be due to a different package, but better safe than sorry in terms of a forewarning. I can test more if you guys need me to be a third-party testbed. Cheers muchly :beer:

ocefpaf commented 1 year ago

xref: https://github.com/Unidata/netcdf4-python/issues/1192#issuecomment-1252388903

@xylar and @hmaarrfk are you experiencing issues with v1.6.1? I wonder if we should pull the packages.

valeriupredoi commented 1 year ago

Testing a bit more with 1.6.1/1.6.0 from conda-forge (on an Ubuntu machine), it looks like the installation may be responsible for the odd behaviour we're seeing - conda compilers perhaps (just a flip of the coin)? https://github.com/Unidata/netcdf4-python/issues/1192#issuecomment-1252388903

hmaarrfk commented 1 year ago

I'm sorry, I haven't been experiencing issues. Maybe we can run the tests on our conda-forge CIs?

hmaarrfk commented 1 year ago

It may be that some of our constraints prevent our upgrade.

hmaarrfk commented 1 year ago

netcdf4                   1.6.0           nompi_py39h6ced12a_102    conda-forge

I used to have this installed. I updated my dev environment to 1.6.1; I'll see what happens in a few.

xylar commented 1 year ago

I've been using netcdf4 1.6.1 on Linux (both on my Ubuntu laptop and on HPC) without incident so far. I think it might be worth investigating the segfaults that @valeriupredoi is seeing some more, or finding out whether others report them, before pulling the packages.

xylar commented 1 year ago

I see - it looks like other folks are having trouble and reporting it upstream on https://github.com/Unidata/netcdf4-python/issues/1192.

TomDufall commented 1 year ago

I'm using Amazon Linux 2 and getting a combination of 'NetCDF: Not a valid ID' errors and segfaults periodically. I'm installing iris-grib via conda-forge, which will pull in iris and then this package.

Sample segfault with Python faulthandler:

Fatal Python error: Segmentation fault

Current thread 0x00007febd77fe700 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/fileformats/netcdf.py", line 477 in __getitem__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/array/core.py", line 122 in getter
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/utils.py", line 71 in apply
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 149 in get
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/optimization.py", line 990 in __call__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 224 in execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 238 in <listcomp>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 238 in batch_execute_tasks
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 870 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007febef7bd700 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 870 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fec14037740 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/common/mixin.py", line 127 in name
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2424 in <lambda>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2422 in dim_coords
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 1790 in coords
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 102 in _summary_coord_extra
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 146 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 204 in <listcomp>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 203 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 344 in add_vector_section
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 348 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_printout.py", line 165 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2483 in summary
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2491 in __repr__
  File "<string>", line 3 in __repr__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/dataclasses.py", line 368 in wrapper
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 373 in getMessage
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 668 in format
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 929 in format
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1085 in emit
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1187 in emit
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 954 in handle
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1661 in callHandlers
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1599 in handle
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1589 in _log
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1475 in error
  File "/home/ec2-user/environment/<internal package - path removed>", line 57 in compare_cubes
  File "/home/ec2-user/environment/<internal package - path removed>", line 178 in <listcomp>
  File "/home/ec2-user/environment/<internal package - path removed>", line 178 in _main
  File "/home/ec2-user/environment/<internal package - path removed>", line 252 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 610 in invoke
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 1066 in invoke
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 782 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 829 in __call__
  File "/home/ec2-user/environment/<internal package - path removed>", line 277 in <module>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/bdb.py", line 580 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1547 in _runmodule
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1703 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1732 in <module>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/runpy.py", line 87 in _run_code
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/runpy.py", line 194 in _run_module_as_main
Segmentation fault

valeriupredoi commented 1 year ago

I have narrowed down the issue: for us, with 1.6.1, the segfaults and stray HDF errors at file close happen only when dealing with cached netCDF files, see https://github.com/ESMValGroup/ESMValCore/issues/1727#issuecomment-1253838080 - I ran hundreds of tests simply reading files directly off disk and did not see any issue. We're investigating what the caching has to do with it, but I was wondering: @TomDufall, @trexfeathers and others reporting issues, are you doing simple reads off disk or using some sort of caching? BTW, that issue from the cached-data test does not hit us with 1.6.0 (tested properly with it too).

jswhit commented 1 year ago

@valeriupredoi do you have a self-contained example I can run to reproduce the segfault to see if a particular code change in netcdf4-python 1.6.1 is the culprit?

TomDufall commented 1 year ago

We're saving NetCDF files to disk with boto3 and loading them with iris.load_raw([path.as_uri() for path in filepaths]). The only caching we're explicitly doing is reusing files if they're already downloaded. I'm not sure if Iris does any caching under the hood when loading files.
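
Roughly, the pattern looks like this - a minimal sketch where the bucket, keys and local paths are made up, and only the iris.load_raw call is exactly what we use:

from pathlib import Path

import boto3
import iris

def fetch(bucket: str, key: str, dest: Path) -> Path:
    # Our only explicit caching: reuse the file if it was already downloaded.
    if not dest.exists():
        boto3.client("s3").download_file(bucket, key, str(dest))
    return dest

filepaths = [
    fetch("example-bucket", "forecasts/a.nc", Path("/tmp/a.nc")),
    fetch("example-bucket", "forecasts/b.nc", Path("/tmp/b.nc")),
]
cubes = iris.load_raw([path.as_uri() for path in filepaths])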

valeriupredoi commented 1 year ago

@jswhit I do, here - run this as a test with e.g. pytest -n 2 test_this.py:

import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')

@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
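    # NOTE: assumes the pytest cache was already populated with the pickled
    # sample cubes on a previous run; data is None on a cold cache.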
    cubes = pickle.loads(data.encode('latin1'))

    return cubes

def test_io(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    for i, c in enumerate(cubes):
        try:
            c.data
        except RuntimeError as exc:
            print("Crap!", i, c, exc)
            raise

You will need the sample data we are using, from https://github.com/ESMValGroup/ESMValCore/tree/main/tests/sample_data/multimodel_statistics - that's what is being loaded from cache. Let me know if you have issues running the mini test! :beer:

valeriupredoi commented 1 year ago

OK guys, we're zeroing in on our test failures (related, closely or distantly, to 1.6.1) - it turns out that the tests pass fine if one clears the pytest cache before running, see the bottom(ish) comments of https://github.com/ESMValGroup/ESMValCore/issues/1727; I am also running a monitoring PR where I run our tests every hour to determine if there are any other failures from other tests: https://github.com/ESMValGroup/ESMValCore/pull/1730

TomDufall commented 1 year ago

It's not as compact as I'd like, but here's an example that reliably produces an invalid ID error or a segfault for me after a few iterations. The file is nothing special, but I can supply it if needed / find a publicly-accessible example file tomorrow. Removing one of the max diff / value range calls significantly reduces the failure rate. Normally this code would compare two different cubes - one NetCDF and one GRIB - but I cut it down to keep it minimal.

from pathlib import Path
from typing import Tuple

from dask import array, compute
from dask.array.core import Array
import iris
from iris.cube import Cube

iris.FUTURE.datum_support = True  # silence annoying warning

filepath = Path("ukv_20220818T0300Z-PT0000H00M-temperature_at_surface.nc").resolve()

def max_diff(lazy_data_a: Array, lazy_data_b: Array) -> Array:
    """
    Calculate the maximum absolute difference between the data values of two
    lazy arrays. Nan values are ignored.
    """
    diff_array = lazy_data_a - lazy_data_b
    absolute_differences = abs(diff_array)
    return array.nanmax(absolute_differences)

def value_range(lazy_data: Array) -> Array:
    """
    Calculate the range of an array. Specifically, the difference between
    the maximum and minimum values within the array. Nan values are ignored.
    """
    return array.nanmax(lazy_data) - array.nanmin(lazy_data)

def compare_cubes(cube_a: Cube, cube_b: Cube) -> Tuple:
    """
    Compare the data values of two cubes. Return a tuple containing the
    maximum difference between the two cubes, the value range of cube_a
    and the value range of cube_b. The cubes must have the same shape.
    If not, then an error is raised.
    """
    lazy_a = cube_a.lazy_data()
    lazy_b = cube_b.lazy_data()
    mod = cube_a.units.modulus
    return compute(
        max_diff(lazy_a, lazy_b),
        value_range(lazy_a),
        value_range(lazy_b),
    )

def compare():
    cube_a = iris.load(filepath)[0]
    cube_b = iris.load(filepath)[0]
    compare_cubes(cube_a, cube_b)

i = 0
while True:
    i += 1
    print(i)
    compare()

hmaarrfk commented 1 year ago

I'm somewhat worried that you are using multi-threading with dask and this file.

I think some parts of the netcdf library are serial; that could be the issue.

jswhit commented 1 year ago

The fact that all C lib calls - including close - now release the GIL in version 1.6.1 might also be the issue. @TomDufall, if you can post your sample file I will try with and without releasing the GIL for Dataset.close().

dopplershift commented 1 year ago

netCDF is not threadsafe (IIRC even for reading).

jswhit commented 1 year ago

Using the example from @TomDufall (with my own test data) I am able to reproduce the segfault with version 1.6.1. No segfault with 1.6.0. I suspect it has to do with the releasing of the GIL introduced in 1.6.1 and the fact that netcdf-c is not thread safe.

Clearly some netcdf-c calls should not be called without the GIL. Question now is: which ones?

UPDATE: keeping the GIL on nc_open (i.e. for the C-lib calls in Dataset.__init__) fixes the segfaults for @TomDufall's test.

dopplershift commented 1 year ago

IMO, the GIL is for locking the interpreter. That it solves issues with netcdf in a multi-threaded context is incidental. It's on users to manage locking for non-threadsafe resources like netCDF.

TomDufall commented 1 year ago

Weirdly, in my test, explicitly setting the dask scheduler to threaded - even with a high or low worker count (e.g. scheduler="threads", workers=20 or 2 on my 2-core instance) - stops the errors. I've tried setting compute to use single-threaded, threads, and processes (the only options I'm aware of) and all three work; I'm not sure what default it's using that is so much more error-prone.
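
For reference, this is roughly how I'm forcing the scheduler (compare() is the function from my example above; the exact worker count doesn't seem to matter):

import dask

# Any explicit scheduler choice stops the errors for me; only whatever
# default compute() picks up in my environment is this error-prone.
with dask.config.set(scheduler="threads"):  # also tried "single-threaded" and "processes"
    compare()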

Please could someone elaborate on the scenarios in which netCDF isn't thread safe? I only use it through Iris, so I've come up with a few potentially problematic access patterns, listed below with a code sketch after them - I hope they're clear enough. In these examples the cube is using lazy data, so Iris will only call into netCDF to load the data when the access attempt is made. I assume 1 is unsafe and 3 is safe? Is that correct, and what about case 2?

1) Construct a cube from a file. Access the same cube from two threads at once.
2) Construct two cubes from the same file. Access both cubes from two threads at once.
3) Construct two cubes from two different files. Access both cubes from two threads at once.
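
To make those concrete, here is the sketch mentioned above (file names are hypothetical; whether each pattern is actually safe is exactly my question):

import threading

import iris

def read_concurrently(cube_1, cube_2):
    # Touching .data is what makes Iris call into netCDF to realise the lazy array.
    threads = [threading.Thread(target=lambda c: c.data, args=(c,))
               for c in (cube_1, cube_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# 1) the same cube accessed from two threads at once
cube = iris.load_raw("file_a.nc")[0]
read_concurrently(cube, cube)

# 2) two cubes constructed from the same file
read_concurrently(iris.load_raw("file_a.nc")[0], iris.load_raw("file_a.nc")[0])

# 3) two cubes constructed from two different files
read_concurrently(iris.load_raw("file_a.nc")[0], iris.load_raw("file_b.nc")[0])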

Sample file from earlier now attached if anyone still wants it (had to zip to upload) ukv_20220818T0300Z-PT0000H00M-temperature_at_surface.zip

jswhit commented 1 year ago

@TomDufall I'd be curious to know if https://github.com/Unidata/netcdf4-python/pull/1193 stops the segfaults in all the scenarios you mention.

dopplershift commented 1 year ago

I believe even read operations have the opportunity to modify in-memory netCDF data structures, so strictly speaking, none of those 3 are safe IIRC. @WardF can you shed any light on what operations (if any) are threadsafe for netCDF?

WardF commented 1 year ago

Certainly; libnetcdf is not thread safe. Because libhdf5 is historically not thread safe, it would have been a very heavy lift for us to wrap our operations around it in a thread-safe way. As it stands, there is an experimental branch @dennisheimbigner is working on that is looking to offer thread safety, but it is not going to be merged any time soon.

So, to answer the question above, netCDF is not thread safe, full stop, unless thread safety is handled by the software/program making use of libnetcdf.
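
As a minimal sketch of what handling it in the calling software can look like from Python - one process-wide lock serialising every call into libnetcdf (the netCDF4 package is assumed; the file and variable names are hypothetical):

import threading

import netCDF4

NETCDF_LOCK = threading.Lock()  # shared by every thread in the process

def read_variable(path, varname):
    # No two threads are ever inside libnetcdf at the same time,
    # including during open and close.
    with NETCDF_LOCK:
        with netCDF4.Dataset(path) as ds:
            return ds.variables[varname][:]

# Safe to call from multiple threads.
data = read_variable("example.nc", "air_temperature")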

dopplershift commented 1 year ago

And to be clear, there is still value in having the GIL released during API calls without the need for netCDF to be threadsafe.

DennisHeimbigner commented 1 year ago

I believe Ward is correct and that there is no scenario in which accessing netcdf-c with multiple threads is safe. Even just reading data can cause problems because of the use of internal cache data structures that are potentially modified when doing simple data reads.

trexfeathers commented 1 year ago

> I have narrowed down the issue: for us, with 1.6.1, the segfaults and stray HDF errors at file close happen only when dealing with cached netCDF files, see ESMValGroup/ESMValCore#1727 (comment) - I ran hundreds of tests simply reading files directly off disk and did not see any issue. We're investigating what the caching has to do with it, but I was wondering: @TomDufall, @trexfeathers and others reporting issues, are you doing simple reads off disk or using some sort of caching? BTW, that issue from the cached-data test does not hit us with 1.6.0 (tested properly with it too).

One of the places we are seeing errors is running Python scripts in Read the Docs builds (e.g. 18171552), which include a loading step. I am almost certain these files are being loaded straight from disk. (I assume that is what @valeriupredoi means by "caching", rather than anything to do with NetCDF's chunk_cache).

jswhit commented 1 year ago

@trexfeathers are the Read the Docs builds using threads? (and reading the same file from different threads?). I ask because the working hypothesis now is that all of the segfaults people are seeing with 1.6.1 are related to the fact that netcdf-c is not thread safe, and not releasing the GIL when calling the C lib in 1.6.1 has increased the possibility that segfaults will occur when reading from or writing to the same netCDF4.Dataset object from different threads.
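
To be explicit about the access pattern I mean, a schematic (not taken from any report above; the file and variable names are made up):

import threading

import netCDF4

ds = netCDF4.Dataset("example.nc")

def read():
    # Concurrent reads of the same Dataset: with 1.6.0 the GIL held during
    # C-lib calls effectively serialised them; with 1.6.1 the GIL is released,
    # so both threads can be inside the non-thread-safe netcdf-c at once.
    for _ in range(100):
        ds.variables["air_temperature"][:]

threads = [threading.Thread(target=read) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
ds.close()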

trexfeathers commented 1 year ago

> @trexfeathers are the Read the Docs builds using threads?

Honestly, I don't know. This isn't information that RTD publishes in its docs, and the question is quite difficult to research since so much documentation is hosted on RTD.

Sorry, I know this isn't helpful. This topic is beyond my expertise.

ocefpaf commented 1 year ago

Folks, after some investigation: this is not a problem with the conda-forge netcdf4 package, and maybe not even a problem upstream. For those using iris, you can work around it with:

import dask
dask.config.set(scheduler="single-threaded")
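
If you would rather limit it to the failing computation only, dask.config.set also works as a context manager (cube here stands in for whatever lazy object was failing):

import dask

with dask.config.set(scheduler="single-threaded"):
    cube.data  # only this computation runs single-threaded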

I'm closing this for now.

valeriupredoi commented 1 year ago

cheers @ocefpaf - I'm keeping track of this at our end, and so are the iris folk :+1: