Closed: bouweandela closed this issue 1 year ago
This could be because pytest is running the tests in parallel and the netCDF4 library is not thread safe (https://github.com/Unidata/netcdf4-python/issues/844), but I'm not sure if pytest actually does this.
Actually the threads are created by dask, we're not running pytest in parallel.
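(Side note, not from the original thread: if one wanted to check the dask-threads angle locally, a minimal sketch is to force dask's single-threaded "synchronous" scheduler so the netCDF4/HDF5 C libraries are never entered from more than one thread; cube below stands for any hypothetical lazy iris cube.)
import dask

with dask.config.set(scheduler="synchronous"):
    # everything computed in this block runs in the calling thread only,
    # so netCDF4/HDF5 are not called concurrently
    data = cube.data  # hypothetical lazy cube being realised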
yeah, seen this, although it seems to be transitory. I am smelling hdf5, so I pinned it to catch the latest 1.10.6 in #649 - the one I believe the CI is compiling netcdf4 against is 1.10.5, which is more than 7 months old
gah! just looked at the Artifacts and it is 1.10.6 that the CI is installing, bleh
what are these things? https://10300-190192145-gh.circle-artifacts.com/0/logs/versions.txt
nevermind, they're the base system versions - I got confused and thought they were the env package versions
aha! I found the bugger: indeed the final test environment is using hdf5=1.10.5 because the env reverts back to it when installing ncl:
hdf5 1.10.6-mpi_mpich_ha7d0aea_0 --> 1.10.5-mpi_mpich_ha7d0aea_1004
the problem is that pinning it to >=1.10.6 causes conda to blow its pipes and refuse to solve the environment, dumb conda. The good news is I ran multiple test instances (at least 4, on the CI machine) with 1.10.6 and have yet to see the segfault, so now I'm pretty sure it's the hdf5 version that's causing it
conda is a good boi, it's ncl
that is not allowing 1.10.6
Instead of using very recent versions, maybe we could try using some older versions of hdf5 or netcdf? Because the problem didn't occur earlier?
OK I had a closer look at this today - some prelim conclusions:
it doesn't look like it's hdf5 that's causing the issue (tests fail with Segfaults for the core package too, where we have 1.10.6 and no ncl);
pytest-xdist is built on execnet https://pypi.org/project/execnet/ and I'm a bit concerned about their bold warning message right at the top: "execnet currently is in maintenance-only mode, mostly because it is still the backend of the pytest-xdist plugin. Do not use in new projects";
I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo, since we were not seeing these for Core before (where hdf5 was and is indeed at a recent 1.10.6 version)
I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo
Could you attach a stack trace to back that up?
Do we have a list of tests that are affected? This issue is somewhat annoying, because it happens all the time.
Can we mark the failing tests as xfail
until this is resolved?
tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py::test_cl_fix_metadata
tests/integration/cmor/_fixes/test_common.py::test_cl_hybrid_pressure_coord_fix_metadata_with_a
...
Do we have a list of tests that are affected?
That would probably be all files that use iris to load a netcdf file
Can we mark the failing tests as xfail until this is resolved?
I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.
Can we mark the failing tests as xfail until this is resolved?
I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.
I was thinking they still show up in the summary, but they don't slow down PRs. At the moment, every time I make a commit to a PR I have to go into CircleCI and re-run the failing test (and make sure the fail is because of a segmentation error) and wait another 20 minutes for the result. It is not a good experience.
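(For reference, marking a test as xfail is a one-line pytest marker - an illustrative sketch below, with the test body elided; as noted above it would not actually help here, because the whole xdist worker segfaults rather than a single test failing.)
import pytest

@pytest.mark.xfail(reason="intermittent segfault in the netCDF4/hdf5 stack", strict=False)
def test_cl_fix_metadata():
    ...  # the real test body would go here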
OK this bugs me badly so I had a systematic look at one of the tests failing on a regular basis, namely tests/integration/cmor/_fixes/test_common.py. On my machine I ran sets of 20 repeated tests (2 gworkers per run):
No of fails in 20 runs of the test
run1: 3/20
run2: 5/20
run3: 1/20
run4: 4/20
-----------
avg: 3.25/20 = 16%
with each of the fails counted from the gworker's crash due to Segmentation Fault. So pretty bad on my laptop - a ~16% crash rate is nasty, and I noticed worse rates (by eye) on the CI machine.
Now, here's the funny thing - I even found a fix that makes the crash rate drop significantly: encasing the Dataset import in two print statements, e.g.:
print("DATbefore")
from netCDF4 import Dataset
print("DATafter")
makes the thing run with a much lower SegFault rate (I ran the test 160 times and it crashed only once, so a crash rate of <1%).
Another experiment was to run with a single worker: running with a single gworker (pytest -n 1) reduces the fail rate too, to 3/120 < 3%.
I think there is some race condition that fails badly when using netCDF4.Dataset - we have not seen these fail anywhere else but the cmor fixes. @bouweandela has been around the block, e.g. https://github.com/SciTools/iris/issues/2063 and the related netCDF4-python issue - so this looks to me like something related, if not identical. Also, hdf5 and netCDF4 are not thread safe. I am not sure we can ever fix this issue, but maybe we can at least reduce the segfaulting rate?
the only tests that are at risk from this issue are the ones using netCDF4.Dataset:
tests/integration/cmor/_fixes/cmip6/test_cesm2_waccm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_noresm2_lm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cnrm_cm6_1.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_esm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cesm2.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/test_common.py:from netCDF4 import Dataset
I suggest we change these and create a dataset in a different way - an iris cube saved to netCDF, or simply pass the cube, bypassing the load step. Note that using cdo or suchlike would mean building the dataset with netCDF4.Dataset, so we shouldn't do that - what do you guys think @jvegasbsc @stefsmeets @bouweandela?
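(A minimal sketch of that suggestion, assuming a hypothetical fix under test - the cube contents here are stand-ins, not the actual ESMValCore fixtures:)
import numpy as np
import iris
import iris.cube

def sample_cube():
    """Build a tiny in-memory cube instead of reading a file with netCDF4.Dataset."""
    return iris.cube.Cube(np.zeros((2, 3)), var_name="cl", units="%")

def test_fix_metadata_without_netcdf4(tmp_path):
    cube = sample_cube()
    # option 1: pass the cube straight to the fix under test (no file involved)
    # option 2: round-trip through iris if a file on disk is really needed
    path = str(tmp_path / "sample.nc")
    iris.save(cube, path)
    loaded = iris.load_cube(path)
    assert loaded.var_name == "cl"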
Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with iris
does not work unfortunately since we need to set invalid attributes.
Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with iris does not work unfortunately since we need to set invalid attributes.
I think this is a good solution. It also simplifies the tests and reduces the load on CircleCI.
All right, let me open a PR for that.
@schlunma yeah I like the idea! they're not big files anyways
It might also be possible to cache the call with pytest, like we are doing here:
The .pytest_cache
directory itself is already cached in CircleCI too.
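(For illustration, pytest's built-in cache fixture works roughly like the sketch below - the key name and the expensive helper are hypothetical, this is not the linked ESMValCore code:)
import pytest

@pytest.fixture
def expensive_value(cache):
    """Recompute only when the value is not already stored in .pytest_cache."""
    val = cache.get("cmor_fixes/expensive_value", None)
    if val is None:
        val = compute_expensive_value()  # hypothetical expensive setup step
        cache.set("cmor_fixes/expensive_value", val)
    return val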
PR is open: #938
@schlunma great job with the PR, mate! I think it's a very useful one (I will post a minor comment there) but unfortunately it does not solve this particular problem. BUT what it does do is narrow the problem down, and now I actually know what the heck is going on: look at (again) tests/integration/cmor/_fixes/test_common.py, where I added a save straight after the extraction:
# Raw cube
cube = cubes.extract_strict(var_name_constraint(short_name))
iris.save(cube, "shit3.nc")
By virtually realizing the data this way, all goes fine and I get no more SegFaults (0/80 test runs!)
So - there is something awfully wrong happening at the interface of iris with dask, for some reason associated with accessing coordinates - @bouweandela and @bjlittle may know more - I am stumped, but at this point I don't care, because if we realize the data for each of the cubes in these tests (not big data, hence not much memory) we won't get the SegFaults anymore. I forgot, how do you realize the data - cube.co_realize() or something?
EDIT: iris.save(cube, "shit3.nc") may itself fail with a SegFault right at the point where the data needs to be realized, see the comment below, but if that doesn't fail, nothing else fails anymore (no more SegFaults)
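(To answer the "how do you realize the data" question above, a minimal sketch, not the author's exact change: simply touching .data and .points/.bounds makes iris realise the lazy dask arrays up front; the names are those from the snippet above.)
cube = cubes.extract_strict(var_name_constraint(short_name))
if cube.has_lazy_data():
    cube.data  # realises the cube's data payload eagerly
for coord in cube.coords():
    coord.points  # realises lazy coordinate points
    if coord.has_bounds():
        coord.bounds  # ... and bounds, if present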
OK, it is the iris-dask interface that is causing the issue:
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
and this is very alarming because this is the elementary call to compute of dask, nothing fancy. This means the issue may happen not only in our tests but in any call to iris that realizes some data; the reason we haven't seen it frequently is simply that we run these tests so many more times.
So - for our tests - we should work with realized data; for the more general problem - it's beyond me what the solution is :beer:
do you know what? I am going to take iris off the hook, I just got this SegFault due to nothing iris-related:
Thread 0x00007f16c058d740 (most recent call first):
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 279 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/array/core.py", line 1416 in __array__
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83 in asarray
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/_lib/_util.py", line 272 in _asarray_validated
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/interpolate/polyint.py", line 84 in _prepare_x
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/scipy/interpolate/polyint.py", line 72 in __call__
File "/home/valeriu/ESMValCore/esmvalcore/cmor/_fixes/shared.py", line 251 in add_plev_from_altitude
File "/home/valeriu/ESMValCore/esmvalcore/cmor/_fixes/common.py", line 44 in fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 147 in hybrid_height_coord_fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 168 in test_cl_hybrid_height_coord_fix_metadata
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
OK, there is nothing inherently wrong with the data realization from iris. I ran a toy test that reproduces one of the conditions that was producing a SegFault in test_common.py:
import iris

cubes = iris.load("sample.nc")


def var_name_constraint(var_name):
    """:mod:`iris.Constraint` using `var_name` of a :mod:`iris.cube.Cube`."""
    return iris.Constraint(cube_func=lambda c: c.var_name == var_name)


def test_realize():
    """Realize data."""
    ap = cubes.extract_strict(var_name_constraint("ps"))
    air_pressure_coord = ap.coord('latitude')
    assert air_pressure_coord.points is not None
    assert air_pressure_coord is not None
I got 0 Segfaults from 80 run instances. Combined with the issue from the previous comment, I can safely say there is no reproducible issue with iris and the realization of data from dask! I am starting to think that by bite-sizing the tests (breaking them into smaller tests that are easier to transfer to threads) we won't see these Segfaults anymore - maybe it's the old enemy, a function too large to be optimally pickled/packed for parallel runs!
OK - so here's a new development - the SegFaults are Poisson-distributed. Using test_common.py again as the benchmark and running equal numbers of runs for the last test only, the first two tests only, and all of the tests (the whole test script as we have it), I get the following rates using @schlunma's #938:
1 test (last one): 2/80 fails
2 tests (first two): 3/80 fails
3 tests (all tests): 5/80 fails
yielding the beautiful additivity of rates you expect from independent, rare (Poisson-like) events:
P((a+b)x) = P(ax) + P(bx)
basically we can estimate the SegFault rate for each test that realizes data at 1/80 = 1.3% (this is of course machine-dependent), and all of these happen because of the realization of data at the iris-dask interface:
Thread 0x00007fc8880d1740 (most recent call first):
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 262 in as_concrete_data
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_data_manager.py", line 216 in data
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/coords.py", line 640 in points
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 78 in hybrid_pressure_coord_fix_metadata
File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 101 in test_cl_hybrid_pressure_coord_fix_metadata_with_ap
File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
so not using netCDF4.Dataset improves the rate of SegFaults but doesn't eliminate the risk; still, a 10% decrease in rate is pretty good. Shrinking the tests themselves has no effect; one needs to understand the problem that the system encounters between iris and dask, as I yapped about above, but beats me if I know how to debug that. I need a break from doing this stuff since I have finally hit a dead end - @bjlittle have you guys seen this sort of SegFault before?
another piece of information we can infer from the study of these tests and their distribution is the probability of at least one of them happening and screwing up the test run: there are roughly 15 of these tests in the fixes tests, so the probability is 1.3% x 15 ≈ 20%. So with Manuel's fix we will see an improvement in the number of SegFaults, but we should still expect 1 in 5 test runs to be plagued by SegFaults :+1:
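(A quick sanity check of that estimate, with the numbers from the comments above: the linear estimate p x n is the small-p approximation of the exact probability 1 - (1 - p)^n of at least one segfault per run.)
p, n = 0.013, 15  # per-test segfault rate and number of data-realising fixes tests
print(f"linear estimate : {p * n:.1%}")             # ~19.5%
print(f"exact (binomial): {1 - (1 - p) ** n:.1%}")  # ~17.8%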
Hi, I'm not involved in the ESMValGroup projects at all but I work on other libraries that use netcdf4-python, namely cf-python and cfdm, and thought it might be relevant that we have also been seeing numerous segmentation faults across our test suite, both locally and on GitHub Actions, since the Autumn, where previously we had never seen any seg faults that weren't isolated and related to another C library we use.
I'm trying to follow the conclusions you have reached here to get a feel for whether we might be encountering the same underlying problem(s). It is notable that these libraries I work on use neither Iris nor Dask (and we test via unittest rather than pytest), but we do use netCDF4. So if we are encountering similar issues, they will most likely come from netCDF4. Indeed, from what I have deduced so far, the seg faults we see with cf-python and cfdm happen in certain cases of opening netCDF files via netCDF4.Dataset(filename, "r") (I've put an example traceback collected using the faulthandler library below in a collapsed section in case you'd like to get a feel for what we are seeing on our side).
So that seems in line with the seg fault issues you are encountering with ESMValCore
, at least your tentative conclusion:
the only tests that are at risk from this issue are the ones using netCDF4.Dataset:
@valeriupredoi do you think it might be useful to have a discussion about this in case we might both be running into similar issues? We might be able to help each other out. And I'd really love to get to the bottom of the seg faults in cf-python and cfdm, they are a real hindrance to development work!
@sadielbartholomew thanks very much for dropping by and telling us about your issues, caused by what looks to be pretty much the same culprit. We just got rid of using netCDF4.Dataset in our tests in this PR https://github.com/ESMValGroup/ESMValCore/pull/938 and from my local tests I noticed a drop in the SegFault rate of about 10 percentage points, which is good coz that's roughly half of what we had. But the problem is still there. Yes, let's chat about this on a call, cheers for the suggestion! :+1:
Thanks @valeriupredoi, that's useful to know. I'll arrange a chat with you via the NCAS Slack.
@valeriupredoi To find out whether setting the number of threads that pytest uses to run the tests to 1 helps, could you open a pull request that replaces pytest -n 2 with pytest everywhere (in .github/workflows/action-test.yml, package/meta.yaml, .circleci/config.yml) and removes the pytest-xdist dependency from setup.py and package/meta.yaml?
hey guys, I think I have finally tracked down a dominant cause of these segfaults - not what produces the Segmentation Fault, but rather what exacerbates it statistically, i.e. what makes the thing segfault more often than at other times. Whereas I think the Segfault itself is inside something that calls the C library of either hdf5 or netCDF4, running the tests with xdist makes the thing much worse: I ran sets of 44 runs of test_common.py from the cmor fixes tests, the one that fails very often on CI, and did that in three ways: calling pytest as usual (with -n 2), running just the function of one of the tests that kept failing (with python, like a peasant), and disabling xdist with -n 0 (which runs the test in purely serial mode), and I got these numbers of Segfaults:
regular pytest (-n 2) fails: 7/44 5/44 6/44 -> 18/132
simple run func fails: 0/132
pytest -n 0 fails: 1/44 0/44 1/44 -> 2/132
18/132 = 14% segfaults is perfectly in line with my estimate here, and that estimate still holds after we removed the Dataset-building calls in the PR by @schlunma; but as I see it now, it's clear that running these tests without xdist will improve our Segfault rates by a lot. I am going to open a PR with this in :+1:
I declare SegFaults extinct (about bloody time, that is) :grin: Fix in https://github.com/ESMValGroup/ESMValCore/pull/1064
this is now closed - but, if we do see SegFaults creeping up, we need to see how often they creep in and, if it's becoming a problem again, we should mark those tests sequential too. But we should be careful not to take a hit on speed - that's why I left out the many tests in _fixes, since we've not seen those segfault often and I'm afraid that if I marked them all we'd take quite a bit of a hit on test runtimes. Also note that the conda build tests have not been changed, so we might still expect segfaults there. Cheers for all the comments and suggestions on this one, folks!
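(A minimal sketch of the "sequential" marker approach - the marker name matches the pytest -n 0 -m "sequential" command visible in the CI log further down, but the registration and the CI commands here are illustrative, not copied from the repo.)
import pytest

# the marker would be registered in setup.cfg / pytest.ini, e.g.
#   markers =
#       sequential: tests that segfault under pytest-xdist and must run serially

@pytest.mark.sequential
def test_cl_hybrid_height_coord_fix_metadata():
    ...  # the real test body would go here

# CI then runs the two groups separately, roughly:
#   pytest -n 2 -m "not sequential"
#   pytest -n 0 -m "sequential"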
Re-opening this issue, since I just saw another case of this. One of the tests failed even when running them with pytest -n 1
. Running certain tests sequentially certainly reduced the number of times this problem occurs, but it did not solve it.
closing this since we've not seen these in a while (well, apart from when netcdf4 is at 1.6.1 but that's a different kettle of fish :fish: )
Does that mean we can stop running some tests sequentially now @valeriupredoi?
ooohhh ahhh I would be too chicken to do that now :chicken: Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?
You shouldn't have closed this :joy:
The recent test run on CircleCI: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/7862/workflows/ceea1cc3-e590-40d5-8402-5b3e5879679f/jobs/36002
tests/integration/cmor/_fixes/test_common.py .FFatal Python error: Segmentation fault
Current thread 0x00007f740168e700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/fileformats/netcdf.py", line 416 in __getitem__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/array/core.py", line 120 in getter
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/utils.py", line 71 in apply
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 149 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/optimization.py", line 990 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 224 in execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 58 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f7401e8f700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1438 in current_thread
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 23 in _thread_get_id
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 231 in execute_task
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 64 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f7402790700 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f75482c2740 (most recent call first):
File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 320 in wait
File "/opt/conda/envs/esmvaltool/lib/python3.10/queue.py", line 171 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 137 in queue_get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 500 in get_async
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 89 in get
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/base.py", line 600 in compute
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 215 in _co_realise_lazy_arrays
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 252 in as_concrete_data
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_data_manager.py", line 206 in data
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/coords.py", line 1702 in bounds
File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 140 in hybrid_height_coord_fix_metadata
File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 170 in test_cl_hybrid_height_coord_fix_metadata
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 1789 in runtest
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 324 in _main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 270 in wrap_session
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 167 in main
File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 190 in console_main
File "/opt/conda/envs/esmvaltool/bin/pytest", line 10 in <module>
Extension modules: mypy, mypy.api, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, yaml._yaml, cytoolz.utils, cytoolz.itertoolz, cytoolz.functoolz, cytoolz.dicttoolz, cytoolz.recipes, xxhash._xxhash, psutil._psutil_linux, psutil._psutil_posix, markupsafe._speedups, lz4._version, lz4.frame._frame, scipy._lib._ccallback_c, scipy.sparse._sparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.fftpack.convolve, cftime._cftime, cf_units._udunits2, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, numpy.linalg.lapack_lite, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, beta_ufunc, scipy.stats._boost.beta_ufunc, binom_ufunc, scipy.stats._boost.binom_ufunc, nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._hypotests_pythran, scipy.stats._statlib, scipy.stats._mvn, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._unuran.unuran_wrapper, shapely.speedups._speedups, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, cartopy.trace, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, netCDF4._netCDF4, PIL._imaging, _cffi_backend, unicodedata2, stratify._conservative, stratify._vinterp, 
pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.signal._sosfilt, scipy.signal._spectral, scipy.signal._peak_finding_utils, shapely.vectorized._vectorized, msgpack._cmsgpack (total: 195)
/bin/bash: line 4: 6828 Segmentation fault (core dumped) pytest -n 0 -m "sequential" --cov-append --junitxml=report-2.xml
Exited with code exit status 139
Manu, that's netCDF4=1.6.2, also reported by @remi-kazeroni in the Tool; we need to tighten the pin from !=1.6.1 to a restrictive <1.6.1. And that's iris' handling (or mis-handling) of threaded netCDF4 with Dask, for which we have https://github.com/ESMValGroup/ESMValCore/issues/1727 and https://github.com/ESMValGroup/ESMValTool/issues/2907, which I will edit to warn users not to use any netCDF4>1.6.0. Iris are debating it in https://github.com/SciTools/iris/issues/5016
These SegFaults reported here were very rare, and we stopped seeing them after netCDF4 ~ 1.5 something. I am not 100% sure they were connected to iris, they remain a mystery. The new ones are more frequent than a new Tory prime minister (this year).
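(Illustrative only - the tightened pin described above, expressed as it might appear in an install_requires list; the actual ESMValCore files are not reproduced here.)
REQUIREMENTS = [
    "netCDF4<1.6.1",  # tightened from "netCDF4!=1.6.1": 1.6.1/1.6.2 segfault with threaded dask
]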
Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?
That would be great, it would really speed up the test runs if we can avoid running some tests sequentially.
The unit tests are failing regularly due to a segmentation fault.
If this happens to your pull request, please go to CircleCI by clicking 'Details' behind the CircleCI 'commit' check below your pull request, and clicking on the failed test job. Once you're on CircleCI, click the 'Rerun' button in the top right corner and choose 'Rerun Workflow from Failed'. Usually, the test will succeed the second time, if not, try again. If you do not see the 'Rerun' button, you need to log in to CircleCI with your GitHub account first.
The problem seems to originate from the netcdf4 library.
Example crashed run: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/2482/workflows/f8e73729-c4cf-408c-bdae-beec24238ac1/jobs/10300/steps
Example stack trace: