dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Controller loses API connection after token expiry on Azure Kubernetes Service (AKS) 1.30 due to `kopf` bug #913

Open creste opened 2 days ago

creste commented 2 days ago

dask-operator fails to create DaskJob resources on Azure Kubernetes Service (AKS) 1.30.

See this kopf bug report for details.

Minimal Complete Verifiable Example:

  1. Install dask-operator on AKS 1.30.
  2. Wait an hour for the authentication token to expire.
  3. Create a DaskJob resource.

dask-operator will not create the DaskJob because its Kubernetes authentication token has expired and kopf's watchers are no longer connected to the Kubernetes API server. A bug in kopf prevents it from refreshing the authentication token.

This occurs only on AKS 1.30 and later, because 1.30 is the first AKS version that sets `--service-account-extend-token-expiration` to false.
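For anyone trying to confirm the expiry in step 2, here is a rough sketch that reads the `exp` claim from a service account token and reports how long it has left. It assumes the token is a standard JWT mounted at the usual projected-token path inside the operator pod. The function names are my own and are not part of dask-kubernetes or kopf. It does not verify the signature; it only inspects the claims.

```python
import base64
import json
import time
from typing import Optional

# Assumed default mount path for the projected service account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def decode_jwt_claims(token: str) -> dict:
    """Return the (unverified) claims from the payload segment of a JWT."""
    payload = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds until the token's `exp` claim; negative once it has expired."""
    if now is None:
        now = time.time()
    return decode_jwt_claims(token)["exp"] - now


def mounted_token_lifetime(path: str = TOKEN_PATH) -> float:
    """Read the mounted token file and return its remaining lifetime in seconds."""
    with open(path) as f:
        return seconds_until_expiry(f.read().strip())
```

Calling `mounted_token_lifetime()` from a Python shell inside the operator pod should show roughly an hour or less on an affected cluster. Note that the kubelet rotates the file on disk, so a healthy value here does not mean the running operator process has picked up the new token.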

Environment:

jacobtomlinson commented 2 days ago

Thanks for flagging this here. I don't see any immediate solution we can implement in dask-kubernetes to work around this so I expect we will need to wait for a fix in https://github.com/nolar/kopf/issues/980