Closed: bouweandela closed this issue 1 year ago
This could be because pytest is running the tests in parallel and the netCDF4 library is not thread safe (https://github.com/Unidata/netcdf4-python/issues/844), but I'm not sure if pytest actually does this.
Actually the threads are created by dask, we're not running pytest in parallel.
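(Side note, not from the original thread: if one wanted to check the dask-threads angle locally, a minimal sketch is to force dask's single-threaded "synchronous" scheduler so the netCDF4/HDF5 C libraries are never entered from more than one thread; cube below stands for any hypothetical lazy iris cube.)
import dask

with dask.config.set(scheduler="synchronous"):
    # everything computed in this block runs in the calling thread only,
    # so netCDF4/HDF5 are not called concurrently
    data = cube.data  # hypothetical lazy cube being realised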
yeah, seen this, although it seems to be transitory. I am smelling hdf5, so I pinned it to catch the latest 1.10.6 in #649 - the one I believe the CI is compiling netcdf4 against is 1.10.5, which is more than 7 months old
gah! just looked at the Artifacts and it is 1.10.6 that the CI is installing, bleh
what are these things? https://10300-190192145-gh.circle-artifacts.com/0/logs/versions.txt
nevermind, they're the base system versions - I got confused and thought they were the env package versions
aha! I found the bugger: indeed the final test environment is using hdf5=1.10.5 because the env reverts back to it when installing ncl:
hdf5 1.10.6-mpi_mpich_ha7d0aea_0 --> 1.10.5-mpi_mpich_ha7d0aea_1004
the problem is that pinning it to >=1.10.6 causes conda to blow its pipes and refuse to solve the environment, dumb conda. The good news is I ran multiple test instances (at least 4, on the CI machine) with 1.10.6 and have yet to see the segfault, so now I'm pretty sure it's the hdf5 version that's causing it
conda is a good boi, it's ncl
that is not allowing 1.10.6
Instead of using very recent versions, maybe we could try using some older versions of hdf5 or netcdf? Because the problem didn't occur earlier?
OK I had a closer look at this today - some prelim conclusions:
it doesn't look like it's hdf5 that's causing the issue (tests fail with Segfaults for the core package too, where we have 1.10.6 and no ncl);
pytest-xdist is built on execnet https://pypi.org/project/execnet/ and I'm a bit concerned about their bold warning message right at the top: "execnet currently is in maintenance-only mode, mostly because it is still the backend of the pytest-xdist plugin. Do not use in new projects";
I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo, since we were not seeing these for Core before (where hdf5 was and is indeed at a recent 1.10.6 version)
I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo
Could you attach a stack trace to back that up?
Do we have a list of tests that are affected? This issue is somewhat annoying, because it happens all the time.
Can we mark the failing tests as xfail
until this is resolved?
tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py::test_cl_fix_metadata
tests/integration/cmor/_fixes/test_common.py::test_cl_hybrid_pressure_coord_fix_metadata_with_a
...
Do we have a list of tests that are affected?
That would probably be all files that use iris to load a netcdf file
Can we mark the failing tests as xfail until this is resolved?
I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.
Can we mark the failing tests as xfail until this is resolved?
I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.
I was thinking they still show up in the summary, but they don't slow down PRs. At the moment, every time I make a commit to a PR I have to go into CircleCI and re-run the failing test (and make sure the fail is because of a segmentation error) and wait another 20 minutes for the result. It is not a good experience.
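(For reference, marking a test as xfail is a one-line pytest marker - an illustrative sketch below, with the test body elided; as noted above it would not actually help here, because the whole xdist worker segfaults rather than a single test failing.)
import pytest

@pytest.mark.xfail(reason="intermittent segfault in the netCDF4/hdf5 stack", strict=False)
def test_cl_fix_metadata():
    ...  # the real test body would go here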
OK this bugs me badly so I had a systematic look at one of the tests failing on a regular basis, namely tests/integration/cmor/_fixes/test_common.py. On my machine I ran sets of 20 repeated tests (2 gworkers per run):
No of fails in 20 runs of the test
run1: 3/20
run2: 5/20
run3: 1/20
run4: 4/20
-----------
avg: 3.25/20 = 16%
with each of the fails counted from the gworker's crash due to Segmentation Fault. So pretty bad on my laptop - a ~16% crash rate is nasty, and I noticed worse rates (by eye) on the CI machine.
Now, here's the funny thing - I even found a fix that makes the crash rate drop significantly: encasing the Dataset import in two print statements, e.g.:
print("DATbefore")
from netCDF4 import Dataset
print("DATafter")
makes the thing run with a much lower SegFault rate (I ran the test 160 times and it crashed only once, so a crash rate of <1%).
Another experiment was to run with a single worker: running with a single gworker (pytest -n 1) reduces the fail rate too, to 3/120 < 3%.
I think there is some race condition that fails badly when using netCDF4.Dataset - we have not seen these fail anywhere else but the cmor fixes. @bouweandela has been around the block, e.g. https://github.com/SciTools/iris/issues/2063 and the related netCDF4-python issue - so this looks to me like something related, if not identical. Also, hdf5 and netCDF4 are not thread safe. I am not sure we can ever fix this issue, but maybe we can at least reduce the segfaulting rate?
the only tests that are at risk from this issue are the ones using netCDF4.Dataset:
tests/integration/cmor/_fixes/cmip6/test_cesm2_waccm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_noresm2_lm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cnrm_cm6_1.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_esm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cesm2.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/test_common.py:from netCDF4 import Dataset
I suggest we change these and create a dataset in a different way - an iris cube saved to netCDF, or simply pass the cube, bypassing the load step. Note that using cdo or suchlike would mean building the dataset with netCDF4.Dataset, so we shouldn't do that - what do you guys think @jvegasbsc @stefsmeets @bouweandela?
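(A minimal sketch of that suggestion, assuming a hypothetical fix under test - the cube contents here are stand-ins, not the actual ESMValCore fixtures:)
import numpy as np
import iris
import iris.cube

def sample_cube():
    """Build a tiny in-memory cube instead of reading a file with netCDF4.Dataset."""
    return iris.cube.Cube(np.zeros((2, 3)), var_name="cl", units="%")

def test_fix_metadata_without_netcdf4(tmp_path):
    cube = sample_cube()
    # option 1: pass the cube straight to the fix under test (no file involved)
    # option 2: round-trip through iris if a file on disk is really needed
    path = str(tmp_path / "sample.nc")
    iris.save(cube, path)
    loaded = iris.load_cube(path)
    assert loaded.var_name == "cl"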
Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with iris
does not work unfortunately since we need to set invalid attributes.
Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with iris does not work unfortunately since we need to set invalid attributes.
I think this is a good solution. It also simplifies the tests and reduces the load on CircleCI.
All right, let me open a PR for that.
@schlunma yeah I like the idea! they're not big files anyways
It might also be possible to cache the call with pytest, like we are doing here:
The .pytest_cache
directory itself is already cached in CircleCI too.
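(For illustration, pytest's built-in cache fixture works roughly like the sketch below - the key name and the expensive helper are hypothetical, this is not the linked ESMValCore code:)
import pytest

@pytest.fixture
def expensive_value(cache):
    """Recompute only when the value is not already stored in .pytest_cache."""
    val = cache.get("cmor_fixes/expensive_value", None)
    if val is None:
        val = compute_expensive_value()  # hypothetical expensive setup step
        cache.set("cmor_fixes/expensive_value", val)
    return val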
PR is open: #938
@schlunma great job with the PR, mate! I think it's a very useful one (I will post a minor comment there) but unfortunately it does not solve this particular problem. BUT what it does do is narrow the problem down, and now I actually know what the heck is going on: look at (again) tests/integration/cmor/_fixes/test_common.py, where I added a save straight after the extraction:
# Raw cube
cube = cubes.extract_strict(var_name_constraint(short_name))
iris.save(cube, "shit3.nc")
By virtually realizing the data this way, all goes fine and I get no more SegFaults (0/80 test runs!)
So - there is something awfully wrong happening at the interface of iris with dask, for some reason associated with accessing coordinates - @bouweandela and @bjlittle may know more - I am stumped, but at this point I don't care, because if we realize the data for each of the cubes in these tests (not big data, hence not much memory) we won't get the SegFaults anymore. I forgot, how do you realize the data - cube.co_realize() or something?
EDIT: iris.save(cube, "shit3.nc") may itself fail with a SegFault right at the point where the data needs to be realized, see the comment below, but if that doesn't fail, nothing else fails anymore (no more SegFaults)
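(To answer the "how do you realize the data" question above, a minimal sketch, not the author's exact change: simply touching .data and .points/.bounds makes iris realise the lazy dask arrays up front; the names are those from the snippet above.)
cube = cubes.extract_strict(var_name_constraint(short_name))
if cube.has_lazy_data():
    cube.data  # realises the cube's data payload eagerly
for coord in cube.coords():
    coord.points  # realises lazy coordinate points
    if coord.has_bounds():
        coord.bounds  # ... and bounds, if present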
OK, it is the iris-dask interface that is causing the issue:
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
and this is very alarming because this is the elementary call to compute of dask, nothing fancy. This means the issue may happen not only in our tests but in any call to iris that realizes some data; the reason we haven't seen it frequently is simply that we run these tests so many more times.
So - for our tests - we should work with realized data; for the more general problem - it's beyond me what the solution is :beer:
do you know what? I am going to take iris off the hook, I just got this SegFault due to nothing iris-related:
Thread 0x00007f16c058d740 (most recent call first):
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 279 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/array/core.py", line 1416 in __array__
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83 in asarray
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/_lib/_util.py", line 272 in _asarray_validated
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/interpolate/polyint.py", line 84 in _prepare_x
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/interpolate/polyint.py", line 72 in __call__
File "/home/valeriu/ESMValCore/esmvalcore/cmor/_fixes/shared.py", line 251 in add_plev_from_altitude
File "/home/valeriu/ESMValCore/esmvalcore/cmor/_fixes/common.py", line 44 in fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 147 in hybrid_height_coord_fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 168 in test_cl_hybrid_height_coord_fix_metadata
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
OK, there is nothing inherently wrong with the data realization from iris. I ran a toy test that reproduces one of the conditions that was producing a SegFault in test_common.py:
import iris

cubes = iris.load("sample.nc")


def var_name_constraint(var_name):
    """:mod:`iris.Constraint` using `var_name` of a :mod:`iris.cube.Cube`."""
    return iris.Constraint(cube_func=lambda c: c.var_name == var_name)


def test_realize():
    """Realize data."""
    ap = cubes.extract_strict(var_name_constraint("ps"))
    air_pressure_coord = ap.coord('latitude')
    assert air_pressure_coord.points is not None
    assert air_pressure_coord is not None
I got 0 Segfaults from 80 run instances. Combined with the issue from the previous comment, I can safely say there is no reproducible issue with iris and the realization of data from dask! I am starting to think that by bite-sizing the tests (breaking them into smaller tests that are easier to transfer to threads) we won't see these Segfaults anymore - maybe it's the old enemy, a function too large to be optimally pickled/packed for parallel runs!
OK - so here's a new development - the SegFaults are Poisson-distributed. Using test_common.py again as the benchmark and running equal numbers of runs for the last test only, the first two tests only, and all of the tests (the whole test script as we have it), I get the following rates using @schlunma's #938:
1 test (last one): 2/80 fails
2 tests (first two): 3/80 fails
3 tests (all tests): 5/80 fails
yielding the beautiful additivity of rates you expect from independent, rare (Poisson-like) events:
P((a+b)x) = P(ax) + P(bx)
basically we can estimate the SegFault rate for each test that realizes data at 1/80 = 1.3% (this is of course machine-dependent), and all of these happen because of the realization of data at the iris-dask interface:
Thread 0x00007fc8880d1740 (most recent call first):
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 262 in as_concrete_data
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_data_manager.py", line 216 in data
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/coords.py", line 640 in points
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 78 in hybrid_pressure_coord_fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 101 in test_cl_hybrid_pressure_coord_fix_metadata_with_ap
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
so not using netCDF4.Dataset improves the rate of SegFaults but doesn't eliminate the risk; still, a 10% decrease in rate is pretty good. Shrinking the tests themselves has no effect; one needs to understand the problem that the system encounters between iris and dask, as I yapped about above, but beats me if I know how to debug that. I need a break from doing this stuff since I have finally hit a dead end - @bjlittle have you guys seen this sort of SegFault before?
another piece of information we can infer from the study of these tests and their distribution is the probability of at least one of them happening and screwing up the test run: there are roughly 15 of these tests in the fixes tests, so the probability is 1.3% x 15 ≈ 20%. So with Manuel's fix we will see an improvement in the number of SegFaults, but we should still expect 1 in 5 test runs to be plagued by SegFaults :+1:
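(A quick sanity check of that estimate, with the numbers from the comments above: the linear estimate p x n is the small-p approximation of the exact probability 1 - (1 - p)^n of at least one segfault per run.)
p, n = 0.013, 15  # per-test segfault rate and number of data-realising fixes tests
print(f"linear estimate : {p * n:.1%}")             # ~19.5%
print(f"exact (binomial): {1 - (1 - p) ** n:.1%}")  # ~17.8%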
Hi, I'm not involved in the ESMValGroup projects at all but I work on other libraries that use netcdf4-python, namely cf-python and cfdm, and thought it might be relevant that we have also been seeing numerous segmentation faults across our test suite, both locally and on GitHub Actions, since the Autumn, where previously we had never seen any seg faults that weren't isolated and related to another C library we use.
I'm trying to follow the conclusions you have reached here to get a feel for whether we might be encountering the same underlying problem(s). It is notable that these libraries I work on use neither Iris nor Dask (and we test via unittest rather than pytest), but we do use netCDF4. So if we are encountering similar issues, they will most likely come from netCDF4. Indeed, from what I have deduced so far, the seg faults we see with cf-python and cfdm happen in certain cases of opening netCDF files via netCDF4.Dataset(filename, "r") (I've put an example traceback collected using the faulthandler library below in a collapsed section in case you'd like to get a feel for what we are seeing on our side).
So that seems in line with the seg fault issues you are encountering with ESMValCore
, at least your tentative conclusion:
the only tests that are at risk from this issue are the ones using netCDF4.Dataset:
@valeriupredoi do you think it might be useful to have a discussion about this in case we might both be running into similar issues? We might be able to help each other out. And I'd really love to get to the bottom of the seg faults in cf-python and cfdm, they are a real hindrance to development work!
@sadielbartholomew thanks very much for dropping by and telling us about your issues, caused by what looks to be pretty much the same culprit. We just got rid of using netCDF4.Dataset in our tests in this PR https://github.com/ESMValGroup/ESMValCore/pull/938 and from my local tests I noticed a drop in the SegFault rate of about 10 percentage points, which is good coz that's roughly half of what we had. But the problem is still there. Yes, let's chat about this on a call, cheers for the suggestion! :+1:
Thanks @valeriupredoi, that's useful to know. I'll arrange a chat with you via the NCAS Slack.
@valeriupredoi To find out whether setting the number of threads that pytest uses to run the tests to 1 helps, could you open a pull request that replaces pytest -n 2 with pytest everywhere (in .github/workflows/action-test.yml, package/meta.yaml, .circleci/config.yml) and removes the pytest-xdist dependency from setup.py and package/meta.yaml?
hey guys, I think I have finally tracked down a dominant cause of these segfaults - not what produces the Segmentation Fault, but rather what exacerbates it statistically, i.e. what makes the thing segfault more often than at other times. Whereas I think the Segfault itself is inside something that calls the C library of either hdf5 or netCDF4, running the tests with xdist makes the thing much worse: I ran sets of 44 runs of test_common.py from the cmor fixes tests, the one that fails very often on CI, and did that in three ways: calling pytest as usual (with -n 2), running just the function of one of the tests that kept failing (with python, like a peasant), and disabling xdist with -n 0 (which runs the test in purely serial mode), and I got these numbers of Segfaults:
regular pytest (-n 2) fails: 7/44 5/44 6/44 -> 18/132
simple run func fails: 0/132
pytest -n 0 fails: 1/44 0/44 1/44 -> 2/132
18/132 = 14% segfaults is perfectly in line with my estimate here, and that estimate still holds after we removed the Dataset-building calls in the PR by @schlunma; but as I see it now, it's clear that running these tests without xdist will improve our Segfault rates by a lot. I am going to open a PR with this in :+1:
I declare SegFaults extinct (about bloody time, that is) :grin: Fix in https://github.com/ESMValGroup/ESMValCore/pull/1064
this is now closed - but, if we do see SegFaults creeping up, we need to see how often they creep in and, if it's becoming a problem again, we should mark those tests sequential too. But we should be careful not to take a hit on speed - that's why I left out the many tests in _fixes, since we've not seen those segfault often and I'm afraid that if I marked them all we'd take quite a bit of a hit on test runtimes. Also note that the conda build tests have not been changed, so we might still expect segfaults there. Cheers for all the comments and suggestions on this one, folks!
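(A minimal sketch of the "sequential" marker approach - the marker name matches the pytest -n 0 -m "sequential" command visible in the CI log further down, but the registration and the CI commands here are illustrative, not copied from the repo.)
import pytest

# the marker would be registered in setup.cfg / pytest.ini, e.g.
#   markers =
#       sequential: tests that segfault under pytest-xdist and must run serially

@pytest.mark.sequential
def test_cl_hybrid_height_coord_fix_metadata():
    ...  # the real test body would go here

# CI then runs the two groups separately, roughly:
#   pytest -n 2 -m "not sequential"
#   pytest -n 0 -m "sequential"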
Re-opening this issue, since I just saw another case of this. One of the tests failed even when running them with pytest -n 1
. Running certain tests sequentially certainly reduced the number of times this problem occurs, but it did not solve it.
closing this since we've not seen these in a while (well, apart from when netcdf4 is at 1.6.1 but that's a different kettle of fish :fish: )
Does that mean we can stop running some tests sequentially now @valeriupredoi?
ooohhh ahhh I would be too chicken to do that now :chicken: Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?
You shouldn't have closed this :joy:
The recent test run on CircleCI: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/7862/workflows/ceea1cc3-e590-40d5-8402-5b3e5879679f/jobs/36002
tests/integration/cmor/_fixes/test_common.py .FFatal Python error: Segmentation fault
Current thread 0x00007f740168e700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/fileformats/netcdf.py", line 416 in __getitem__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/array/core.py", line 120 in getter
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/utils.py", line 71 in apply
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 149 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/optimization.py", line 990 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 224 in execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 58 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f7401e8f700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1438 in current_thread
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 23 in _thread_get_id
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 231 in execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 64 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f7402790700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f75482c2740 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 320 in wait
File "/opt/conda/envs/esmvaltool/lib/python3.10/queue.py", line 171 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 137 in queue_get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 500 in get_async
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 89 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/base.py", line 600 in compute
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 215 in _co_realise_lazy_arrays
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 252 in as_concrete_data
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_data_manager.py", line 206 in data
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/coords.py", line 1702 in bounds
File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 140 in hybrid_height_coord_fix_metadata
File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 170 in test_cl_hybrid_height_coord_fix_metadata
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 1789 in runtest
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 324 in _main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 270 in wrap_session
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 167 in main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 190 in console_main
File "/opt/conda/envs/esmvaltool/bin/pytest", line 10 in <module>
Extension modules: mypy, mypy.api, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, yaml._yaml, cytoolz.utils, cytoolz.itertoolz, cytoolz.functoolz, cytoolz.dicttoolz, cytoolz.recipes, xxhash._xxhash, psutil._psutil_linux, psutil._psutil_posix, markupsafe._speedups, lz4._version, lz4.frame._frame, scipy._lib._ccallback_c, scipy.sparse._sparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.fftpack.convolve, cftime._cftime, cf_units._udunits2, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, numpy.linalg.lapack_lite, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, beta_ufunc, scipy.stats._boost.beta_ufunc, binom_ufunc, scipy.stats._boost.binom_ufunc, nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._hypotests_pythran, scipy.stats._statlib, scipy.stats._mvn, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._unuran.unuran_wrapper, shapely.speedups._speedups, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, cartopy.trace, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, netCDF4._netCDF4, PIL._imaging, _cffi_backend, unicodedata2, stratify._conservative, stratify._vinterp, 
pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.signal._sosfilt, scipy.signal._spectral, scipy.signal._peak_finding_utils, shapely.vectorized._vectorized, msgpack._cmsgpack (total: 195)
/bin/bash: line 4: 6828 Segmentation fault (core dumped) pytest -n 0 -m "sequential" --cov-append --junitxml=report-2.xml
Exited with code exit status 139
Manu, that's netCDF4=1.6.2, also reported by @remi-kazeroni in the Tool; we need to tighten the pin from !=1.6.1 to a restrictive <1.6.1. And that's iris' handling (or mis-handling) of threaded netCDF4 with Dask, for which we have https://github.com/ESMValGroup/ESMValCore/issues/1727 and https://github.com/ESMValGroup/ESMValTool/issues/2907, which I will edit to warn users not to use any netCDF4>1.6.0. Iris are debating it in https://github.com/SciTools/iris/issues/5016
These SegFaults reported here were very rare, and we stopped seeing them after netCDF4 ~ 1.5 something. I am not 100% sure they were connected to iris, they remain a mystery. The new ones are more frequent than a new Tory prime minister (this year).
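(Illustrative only - the tightened pin described above, expressed as it might appear in an install_requires list; the actual ESMValCore files are not reproduced here.)
REQUIREMENTS = [
    "netCDF4<1.6.1",  # tightened from "netCDF4!=1.6.1": 1.6.1/1.6.2 segfault with threaded dask
]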
Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?
That would be great, it would really speed up the test runs if we can avoid running some tests sequentially.
The unit tests are failing regularly due to a segmentation fault.
If this happens to your pull request, please go to CircleCI by clicking 'Details' behind the CircleCI 'commit' check below your pull request, and clicking on the failed test job. Once you're on CircleCI, click the 'Rerun' button in the top right corner and choose 'Rerun Workflow from Failed'. Usually, the test will succeed the second time, if not, try again. If you do not see the 'Rerun' button, you need to log in to CircleCI with your GitHub account first.
The problem seems to originate from the netcdf4 library.
Example crashed run: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/2482/workflows/f8e73729-c4cf-408c-bdae-beec24238ac1/jobs/10300/steps
Example stack trace: