dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

SGE Tests segfault in CI #653

Open · jacobtomlinson opened this issue 1 month ago

jacobtomlinson commented 1 month ago

Opening an issue to triage the segfault that seems to be happening in the SGE tests.

For some time the SGE tests have been failing. The logs of a recent run on main contain the following error.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': double free or corruption (!prev): 0x00005626c18fa470 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

I also opened #652 to bump the minimum Python version here to 3.9 and I see a similar issue happening but with a slightly different error.

*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': corrupted size vs. prev_size: 0x00005560404681a0 ***
/bin/bash: line 1:   592 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

Strangely, in both cases pytest reports that everything passed.

9 passed, 270 skipped in 26.69s
jacobtomlinson commented 1 month ago

In #654 I've been playing around with skipping various tests and enabling them again. It seems like enabling any two of the tests results in the segfault, although no matter how many are enabled the error only appears once.
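
The skipping itself was presumably done with pytest's standard skip marker; a minimal sketch of the approach (hypothetical test name, not the exact diff in #654):

# Hypothetical bisection step: skip a test, then re-enable tests in
# pairs to see which combination triggers the abort.
import pytest

@pytest.mark.skip(reason="bisecting SGE segfault, see #653")
def test_example():
    ...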

jacobtomlinson commented 1 month ago

I have a local reproducer now. Here are the steps I took to get it set up on my machine.

# Build SGE container
cd ci/sge
cp ../environment.yaml .
docker compose build

# Start SGE stack (based on ci/sge.sh)
./start-sge.sh
docker exec sge_master /bin/bash -c "chmod -R 777 /shared_space"

# Install dask-jobqueue as an editable install
docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd /dask-jobqueue; pip install -e ."

I also installed anyio and used @pytest.mark.anyio instead of @pytest.mark.asyncio because I find the behaviour a lot more consistent. See #655.
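
For context, anyio's pytest plugin runs tests marked with @pytest.mark.anyio on whichever backend the anyio_backend fixture returns, which defaults to asyncio. A minimal conftest.py override that pins the backend explicitly would look something like this (a sketch of the plugin's usage, not necessarily what #655 does):

# conftest.py (sketch: anyio ships a default anyio_backend fixture
# that returns "asyncio"; overriding it just makes the choice explicit)
import pytest

@pytest.fixture
def anyio_backend():
    return "asyncio"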

I then created a new test file with a single test that consistently reproduces the segfault.

# dask_jobqueue/tests/test_sge_segfault.py
from dask_jobqueue.sge import SGECluster
from dask.distributed import Client

import pytest

@pytest.mark.anyio
@pytest.mark.env("sge")
async def test_cluster():
    async with SGECluster(1, cores=1, memory="1GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True):
            pass

Then you can run the test via docker exec.

$ docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge"
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': corrupted size vs. prev_size: 0x0000560d54c76aa0 ***
/bin/bash: line 1: 29477 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge` failed. (See above for error)
============================= test session starts ==============================
platform linux -- Python 3.8.19, pytest-8.3.2, pluggy-1.5.0 -- /opt/anaconda/envs/dask-jobqueue/bin/python3.8
cachedir: .pytest_cache
rootdir: /dask-jobqueue
plugins: anyio-4.4.0
collecting ... collected 1 item

../dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py::test_cluster PASSED

============================== 1 passed in 1.07s ===============================

$ echo $?
134

(Exit code 134 is 128 + SIGABRT, i.e. signal 6: the interpreter aborts during shutdown, after pytest has already printed its summary, which is why the test still reports as passed.)
jacobtomlinson commented 1 month ago

Since upgrading to Python 3.9 in CI this issue seems to have gone away. It's strange, because I'm still able to reproduce some problems locally, but perhaps there's something cached that I'm not taking into account.

Given that CI is all green and PRs and merges are passing consistently, I'm going to close this out.

jacobtomlinson commented 1 month ago

Looks like a similar error happened when running CI for #660.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': free(): invalid pointer: 0x0000557fe477a210 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)

Perhaps it's not as resolved as I had hoped.

jacobtomlinson commented 3 weeks ago

Still seeing this after bumping to Python 3.10.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.10': double free or corruption (!prev): 0x000055e87b68bf90 ***
/bin/bash: line 1:   591 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)
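
One possible next step (not something tried above): enable Python's faulthandler so the interpreter dumps a Python-level traceback when it receives the SIGABRT that glibc raises on the heap-corruption check. A hypothetical conftest.py addition:

# conftest.py (hypothetical debugging aid, not part of the repo)
import sys
import faulthandler

# Dump every thread's traceback to stderr on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), including the SIGABRT
# raised by glibc's "double free or corruption" abort.
faulthandler.enable(file=sys.stderr, all_threads=True)

Since the abort happens during interpreter shutdown, after pytest has printed its summary, the traceback may be short, but it could at least confirm which teardown path is running when the heap corruption is detected.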