jacobtomlinson opened 1 month ago
In #654 I've been playing around with skipping various tests and enabling them again. It seems that enabling any two of the tests results in the segfault, though enabling more than one still only causes the error to appear once.
I have a local reproducer now. Here are the steps I took to get it set up on my machine.
# Build SGE container
cd ci/sge
cp ../environment.yaml .
docker compose build
# Start SGE stack (based on ci/sge.sh)
./start-sge.sh
docker exec sge_master /bin/bash -c "chmod -R 777 /shared_space"
# Install dask-jobqueue as an editable install
docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd /dask-jobqueue; pip install -e ."
I also installed anyio and used @pytest.mark.anyio instead of @pytest.mark.asyncio, because I find the behaviour a lot more consistent. See #655.
I then created a new test file with a single test that consistently reproduces the segfault.
# dask_jobqueue/tests/test_sge_segfault.py
from dask_jobqueue.sge import SGECluster
from dask.distributed import Client
import pytest
@pytest.mark.anyio
@pytest.mark.env("sge")
async def test_cluster():
    async with SGECluster(1, cores=1, memory="1GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True):
            pass
Then you can run the test via docker exec.
$ docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge"
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': corrupted size vs. prev_size: 0x0000560d54c76aa0 ***
/bin/bash: line 1: 29477 Aborted (core dumped) pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge
ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge` failed. (See above for error)
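When the interpreter aborts like this with no Python traceback, faulthandler can sometimes surface one. Recent pytest versions enable their own faulthandler plugin by default, so this may be redundant here, but for a standalone reproduction script it is a reasonable first step:

```python
import faulthandler

# Dump Python tracebacks to stderr when the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)
faulthandler.enable()

print(faulthandler.is_enabled())  # True
```

If the crash happens in native code after the Python frames have been torn down, though, faulthandler may print nothing useful, which would be consistent with the glibc "corrupted size vs. prev_size" message appearing on its own.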
============================= test session starts ==============================
platform linux -- Python 3.8.19, pytest-8.3.2, pluggy-1.5.0 -- /opt/anaconda/envs/dask-jobqueue/bin/python3.8
cachedir: .pytest_cache
rootdir: /dask-jobqueue
plugins: anyio-4.4.0
collecting ... collected 1 item
../dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py::test_cluster PASSED
============================== 1 passed in 1.07s ===============================
$ echo $?
134
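That exit status lines up with the abort above: shells report death-by-signal as 128 plus the signal number, and SIGABRT ("Aborted (core dumped)") is signal 6. So a "1 passed" report alongside exit code 134 is consistent with the crash happening during interpreter teardown, after pytest has already finished. A quick check:

```python
import signal

# Shells encode termination by a signal as 128 + signum; SIGABRT is 6
print(128 + signal.SIGABRT)  # 134
```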
Since upgrading to Python 3.9 in CI this issue seems to have gone away. It's strange, because I'm still able to reproduce some problems locally, but perhaps there is something cached that I'm not taking into account.
Given that CI is all green and PRs and merges are passing consistently I'm going to close this out.
Looks like a similar error happened when running CI for #660.
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': free(): invalid pointer: 0x0000557fe477a210 ***
/bin/bash: line 1: 588 Aborted (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge
ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)
Perhaps it's not as resolved as I had hoped.
Still seeing this after bumping to Python 3.10.
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.10': double free or corruption (!prev): 0x000055e87b68bf90 ***
/bin/bash: line 1: 591 Aborted (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge
ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)
Opening an issue to triage the segfault that seems to be happening in the SGE tests.
For some time the SGE tests have been failing. When you look at the logs of a recent run on main, it contains the following error. I also opened #652 to bump the minimum Python version here to 3.9, and I see a similar issue happening but with a slightly different error.
Strangely, in both cases pytest reports everything has passed.