Closed Barteus closed 2 years ago
Just to note it seems to do some weird stuff with a webhook, so that when you remove the spark-k8s application it then prevents anything from deploying in the namespace.
What happens is that the spark workload creates a MutatingWebhook (`spark-config`) that points back to the spark workload's deployment as the backend for the webhook. When we `juju remove-application spark`, the workload's deployment (i.e. the webhook's backend) is removed, but the webhook itself is not (see this bug about juju not cleaning up resources).
Once in this state, trying to create any pod in the model's namespace (ex: `kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -- /bin/bash`) will get the error: `Error from server (InternalError): Internal error occurred: failed calling webhook "webhook.sparkoperator.k8s.io": Post "https://spark.fix-spark.svc:443/webhook?timeout=30s": service "spark" not found` because the webhook has no backend. This also blocks new `juju deploy ...` calls in the model, leaving any new application stuck "installing agent".
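To make the failure mode concrete, here is a minimal sketch of how one might spot a webhook left without a backend. It assumes the input dict has the same shape as the output of `kubectl get mutatingwebhookconfigurations -o json`, and that the set of existing Services has been collected separately. The function name `find_dangling_webhooks` is hypothetical, and using `spark-config` as the configuration name is an assumption based on the description above.

```python
from typing import Iterable, List, Tuple


def find_dangling_webhooks(
    webhook_configs: dict,
    services: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (configuration, webhook, "namespace/service") for each webhook
    whose backend Service is not in `services`.

    `services` is an iterable of (namespace, name) pairs that exist.
    """
    existing = set(services)
    dangling = []
    for config in webhook_configs.get("items", []):
        config_name = config["metadata"]["name"]
        for webhook in config.get("webhooks", []):
            service = webhook.get("clientConfig", {}).get("service")
            if service is None:
                # URL-based webhooks have no in-cluster Service to check.
                continue
            key = (service["namespace"], service["name"])
            if key not in existing:
                dangling.append(
                    (config_name, webhook["name"], f"{key[0]}/{key[1]}")
                )
    return dangling


# Illustrative data mirroring the error above (names taken from the report).
configs = {
    "items": [
        {
            "metadata": {"name": "spark-config"},
            "webhooks": [
                {
                    "name": "webhook.sparkoperator.k8s.io",
                    "clientConfig": {
                        "service": {
                            "namespace": "fix-spark",
                            "name": "spark",
                            "path": "/webhook",
                        }
                    },
                }
            ],
        }
    ]
}

# The "spark" Service is gone, so the webhook is reported as dangling.
print(find_dangling_webhooks(configs, services=[("fix-spark", "other")]))
```

Any entry this returns is a candidate for manual cleanup by a cluster admin, since the API server keeps calling the webhook even though nothing is listening.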
I tried adding a `remove` hook in the current podspec charm to remove the offending webhook, but even after deploying with `--trust` we do not get permission to do this. I received the following exception:
application-spark-k8s: 14:06:24 INFO unit.spark-k8s/4.juju-log mutatingwebhookconfigurations.admissionregistration.k8s.io "broken-webhook" is forbidden: User "system:serviceaccount:kubeflow:spark-k8s-operator" cannot delete resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
I'm not sure if this is because podspec charms should not be able to use trust and cannot ever have this permission, or if there's just a bug in trusting podspec charms.
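For reference, the exception says the service account has no cluster-scoped rule granting `delete` on `mutatingwebhookconfigurations`. The sketch below shows the shape of the RBAC rule that would be needed and a simplified check against a list of rules. The helper `rules_allow` is hypothetical, it ignores `resourceNames` and non-resource URLs, and it is not how Kubernetes itself evaluates RBAC.

```python
from typing import Dict, List


def rules_allow(
    rules: List[Dict[str, List[str]]],
    api_group: str,
    resource: str,
    verb: str,
) -> bool:
    """Simplified check: does any rule grant `verb` on `resource` in `api_group`?

    Each rule mirrors a ClusterRole rule: apiGroups, resources, verbs.
    A "*" entry matches anything.
    """

    def matches(allowed: List[str], wanted: str) -> bool:
        return "*" in allowed or wanted in allowed

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in rules
    )


# The rule the charm's service account would need (values from the exception).
needed_rule = {
    "apiGroups": ["admissionregistration.k8s.io"],
    "resources": ["mutatingwebhookconfigurations"],
    "verbs": ["delete"],
}

# A hypothetical rule set that only covers namespaced core resources.
namespaced_only = [{"apiGroups": [""], "resources": ["pods"], "verbs": ["*"]}]

print(rules_allow(namespaced_only, "admissionregistration.k8s.io",
                  "mutatingwebhookconfigurations", "delete"))
print(rules_allow(namespaced_only + [needed_rule], "admissionregistration.k8s.io",
                  "mutatingwebhookconfigurations", "delete"))
```

Because `mutatingwebhookconfigurations` is a cluster-scoped resource, a rule like `needed_rule` would have to be bound through a ClusterRoleBinding rather than a namespaced RoleBinding.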
`--trust` is supposed to work for podspec, so this is a bug in Juju that likely will not be addressed as they're focusing efforts on other tasks. The easiest way for us to resolve this is to rewrite the charm using pebble/sidecar.
Spark charm does not remove all resources and this blocks it from being installed again.
To reproduce:
Juju status: