dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Finalizers can cause a deadlock that causes the test suite to time out on teardown #546

Open jacobtomlinson opened 2 years ago

jacobtomlinson commented 2 years ago

Many tests start the kopf controller via the kopf_runner fixture and perform work within the context manager that it provides.

def test_foo(kopf_runner):
    # Start the controller
    with kopf_runner:
        ...  # do work

Any handlers that use @kopf.timer, @kopf.daemon or @kopf.on.shutdown set finalizers on the resources they relate to, which means the controller must be running when those resources are deleted. However, we rely heavily on cascading deletion: the parent resource can be removed and the test will exit, but the child resources take some time to be reaped, which can result in a race condition that ends in a deadlock. If the test has exited, the controller is no longer running, so the reaping hangs.
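
For context, the kind of handler involved looks roughly like this (an illustrative sketch only, not the real controller code; the group/version/plural and handler name are made up):

import kopf


# Registering a timer (or daemon) against a resource kind makes kopf add its
# finalizer to matching objects so the handler can be stopped cleanly on
# deletion. Only a running operator ever removes that finalizer again.
@kopf.timer("kubernetes.dask.org", "v1", "daskautoscalers", interval=5.0)
async def adapt_autoscaler(spec, name, namespace, **kwargs):
    ...  # periodically reconcile the number of workers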

This happens in dask_kubernetes/experimental/tests/test_kubecluster.py::test_adapt.

from dask.distributed import Client
from dask_kubernetes.experimental import KubeCluster


def test_adapt(kopf_runner, docker_image):
    # Start the controller
    with kopf_runner:

        # Create a DaskCluster resource with KubeCluster
        with KubeCluster(name="adaptive", image=docker_image, n_workers=0) as cluster:

            # Create a DaskAutoscaler resource with cluster.adapt
            # This resource uses a @kopf.timer to periodically update the number of workers
            # and therefore has a finalizer to deactivate the timer
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

        # When leaving the KubeCluster context manager the DaskCluster resource is deleted and
        # the DaskAutoscaler will be deleted at some later point via cascade deletion

    # When leaving the kopf_runner context manager the controller is stopped

    # Eventually the garbage collector tries to reap the DaskAutoscaler, but there is
    # no controller left to handle its finalizer, so it never actually gets deleted

The hang happens when the CRDs are deleted by the customresources fixture: if any resources with finalizers are left over, the CRDs can never be deleted and the CI times out.

 ==================================== ERRORS ====================================
_______________________ ERROR at teardown of test_adapt ________________________

k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fbf79595040>

    @pytest.fixture(scope="session", autouse=True)
    def customresources(k8s_cluster):

        temp_dir = tempfile.TemporaryDirectory()
        crd_path = os.path.join(DIR, "operator", "customresources")

        for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
            run_generate(
                os.path.join(crd_path, f"{crd}.yaml"),
                os.path.join(crd_path, f"{crd}.patch.yaml"),
                os.path.join(temp_dir.name, f"{crd}.yaml"),
            )

        k8s_cluster.kubectl("apply", "-f", temp_dir.name)
        yield
>       k8s_cluster.kubectl("delete", "-f", temp_dir.name)

...

E               Failed: Timeout >300.0s

To work around this particular example we can call cluster.scale(0) after testing the adaptive behaviour, which deletes the DaskAutoscaler before we leave the kopf_runner.
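
In test_adapt that looks something like this (a sketch of the workaround; it assumes, as described above, that cluster.scale(0) removes the DaskAutoscaler resource):

            # ...inside the KubeCluster context manager in test_adapt...
            cluster.adapt(minimum=0, maximum=1)

            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

            # Switch back to manual scaling so the DaskAutoscaler is deleted
            # while the controller is still running inside kopf_runner
            cluster.scale(0)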

Ideally we should update the customresources fixture to walk through all custom resources and remove any finalizers before we try to delete the CRDs themselves.
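
Something along these lines in the conftest would do it (a rough sketch only; the resource listing and patch calls are assumptions, and real code would want error handling for resources that disappear mid-loop):

def strip_finalizers(k8s_cluster, plural):
    # List every instance of the custom resource across all namespaces
    out = k8s_cluster.kubectl(
        "get", plural, "--all-namespaces", "--no-headers",
        "-o", "custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name",
    )
    for line in out.splitlines():
        if not line.strip():
            continue
        namespace, name = line.split()
        # Clear metadata.finalizers with a JSON merge patch so deletion can proceed
        k8s_cluster.kubectl(
            "patch", plural, name, "-n", namespace,
            "--type=merge", "-p", '{"metadata": {"finalizers": null}}',
        )


# In the customresources fixture teardown, before deleting the CRDs:
#     for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
#         strip_finalizers(k8s_cluster, f"{crd}s.kubernetes.dask.org")
#     k8s_cluster.kubectl("delete", "-f", temp_dir.name)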

jacobtomlinson commented 1 year ago

This came up in #649, which used @kopf.on.shutdown. The workaround was to set optional=True, which appears to still call the hook but does not block the resource deletion on it.
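
For reference, kopf's documented way of running deletion-time logic without pinning a finalizer on the resource looks roughly like this (a sketch using @kopf.on.delete rather than the @kopf.on.shutdown hook from #649; the resource spec and handler name are made up):

import kopf


# With optional=True kopf does not add its finalizer for this handler, so the
# resource can be deleted even when the operator is not running; the handler
# is still invoked when the operator does observe the deletion.
@kopf.on.delete("kubernetes.dask.org", "v1", "daskautoscalers", optional=True)
async def on_autoscaler_delete(name, namespace, **kwargs):
    ...  # clean-up that must not block deletion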