Many tests start the kopf controller via the `kopf_runner` fixture and perform work within the context manager that it provides.
```python
def test_foo(kopf_runner):
    # Start the controller
    with kopf_runner:
        ...  # do work
```
Any resources that use `@kopf.timer`, `@kopf.daemon`, or `@kopf.on.shutdown` have finalizers set on the resources they relate to, which means the controller must be running at the time those resources are deleted. However, we rely heavily on cascade deletion, so the parent resource can be removed and the test will exit while the child resources take some time to be reaped. This is a race condition that ends in a deadlock: once the test has exited the controller is no longer running, so the reaping hangs on the finalizers.
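For context, here is a minimal illustrative sketch of the kind of timer handler involved (the group, plural, and handler name here are just for illustration, not the actual operator code):

```python
import kopf

# Kopf attaches a finalizer to every matching resource so that it can stop
# this timer cleanly on deletion. If the controller is not running when the
# resource is deleted, the finalizer is never removed and deletion blocks.
@kopf.timer("kubernetes.dask.org", "v1", "daskautoscalers", interval=5.0)
def adapt_workers(spec, name, namespace, **kwargs):
    ...
```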
This happens in `dask_kubernetes/experimental/tests/test_kubecluster.py::test_adapt`:
```python
from distributed import Client
from dask_kubernetes.experimental import KubeCluster


def test_adapt(kopf_runner, docker_image):
    # Start the controller
    with kopf_runner:
        # Create a DaskCluster resource with KubeCluster
        with KubeCluster(name="adaptive", image=docker_image, n_workers=0) as cluster:
            # Create a DaskAutoscaler resource with cluster.adapt.
            # This resource uses a @kopf.timer to periodically update the number
            # of workers and therefore has a finalizer to deactivate the timer.
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

        # When leaving the KubeCluster context manager the DaskCluster resource
        # is deleted and the DaskAutoscaler will be deleted at some later point
        # via cascade deletion.

    # When leaving the kopf_runner context manager the controller is stopped.
    # Eventually the DaskAutoscaler is reaped, but there is no controller to
    # handle the finalizer, so it never gets deleted.
```
The hang surfaces when the CRDs are deleted by the `customresources` fixture: if any resources with finalizers are left over, the CRDs can never be deleted and CI times out.
```
==================================== ERRORS ====================================
_______________________ ERROR at teardown of test_adapt ________________________

k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fbf79595040>

    @pytest.fixture(scope="session", autouse=True)
    def customresources(k8s_cluster):
        temp_dir = tempfile.TemporaryDirectory()
        crd_path = os.path.join(DIR, "operator", "customresources")

        for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
            run_generate(
                os.path.join(crd_path, f"{crd}.yaml"),
                os.path.join(crd_path, f"{crd}.patch.yaml"),
                os.path.join(temp_dir.name, f"{crd}.yaml"),
            )

        k8s_cluster.kubectl("apply", "-f", temp_dir.name)
        yield
>       k8s_cluster.kubectl("delete", "-f", temp_dir.name)
...
E       Failed: Timeout >300.0s
```
To work around this particular example we can call `cluster.scale(0)` after testing the adaptive behaviour, which deletes the DaskAutoscaler before we leave the `kopf_runner`.
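Concretely, the inner section of `test_adapt` would look something like this (a sketch of the workaround, not the final test code):

```python
# Inside the KubeCluster context manager of test_adapt (sketch):
cluster.adapt(minimum=0, maximum=1)

with Client(cluster) as client:
    assert client.submit(lambda x: x + 1, 10).result() == 11

# Workaround: switching back to manual scaling deletes the DaskAutoscaler
# while the controller is still running, so its finalizer is handled before
# we leave kopf_runner.
cluster.scale(0)
```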
Ideally we should update the `customresources` fixture to walk through all custom resources and remove any finalizers before we try to delete the CRDs themselves (sketch below).
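A minimal sketch of that cleanup, assuming the `kubectl()` helper on pytest-kind's `KindCluster` returns command output as a string (the helper name `strip_finalizers` is hypothetical):

```python
def strip_finalizers(k8s_cluster, plural):
    # List "namespace name" pairs for every object of this custom resource.
    out = k8s_cluster.kubectl(
        "get", plural, "--all-namespaces",
        "-o", 'jsonpath={range .items[*]}{.metadata.namespace} {.metadata.name}{"\\n"}{end}',
    )
    for line in out.splitlines():
        if not line.strip():
            continue
        namespace, name = line.split()
        # Clearing metadata.finalizers lets Kubernetes garbage-collect the
        # object even though the controller has already stopped.
        k8s_cluster.kubectl(
            "patch", plural, name, "-n", namespace,
            "--type=merge", "-p", '{"metadata": {"finalizers": []}}',
        )


# In the customresources fixture, before the kubectl("delete", ...) call:
# for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
#     strip_finalizers(k8s_cluster, f"{crd}s.kubernetes.dask.org")
```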
This came up in #649, which used `@kopf.on.shutdown`; the workaround there was to set `optional=True`, which appears to still call the hook but doesn't block the resource deletion on it.
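For reference, kopf's deletion handlers document the same `optional` flag; a minimal illustrative sketch (the resource and handler name here are hypothetical, not the code from #649):

```python
import kopf

# With optional=True kopf does not attach a finalizer: the handler is invoked
# on a best-effort basis, and deletion is never blocked waiting for it.
@kopf.on.delete("kubernetes.dask.org", "v1", "daskclusters", optional=True)
def on_cluster_delete(name, namespace, **kwargs):
    ...
```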