canonical / spark-operator

Spark Operator
Apache License 2.0

Spark does not remove all resources during remove-application #20

Closed Barteus closed 2 years ago

Barteus commented 2 years ago

Spark charm does not remove all resources and this blocks it from being installed again.

To reproduce:

juju deploy spark-k8s spark
juju remove-application spark
juju deploy spark-k8s spark

Juju status:

ubuntu@ip-172-31-31-10:~$ juju status
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  micro       microk8s/localhost  2.9.31   unsupported  12:57:58Z

App    Version  Status   Scale  Charm      Channel  Rev  Address  Exposed  Message
spark           waiting    0/1  spark-k8s  stable     1           no       installing agent

Unit     Workload  Agent       Address  Ports  Message
spark/1  waiting   allocating                  installing agent

grobbie commented 2 years ago

Just to note: it seems to do something odd with a webhook, so that removing the spark-k8s application then prevents anything else from deploying in the namespace.

ca-scribner commented 2 years ago

What happens is that the spark workload creates a MutatingWebhook (spark-config) that points back to the spark workload's deployment as the backend for the webhook. When we run juju remove-application spark, the workload's deployment (i.e. the webhook's backend) is removed, but the webhook itself is not (see this bug about juju not cleaning up resources).

Once in this state, trying to create any pod in the model's namespace (ex: kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -- /bin/bash) fails because the webhook has no backend, with the error: Error from server (InternalError): Internal error occurred: failed calling webhook "webhook.sparkoperator.k8s.io": Post "https://spark.fix-spark.svc:443/webhook?timeout=30s": service "spark" not found. This also blocks new juju deploy ... calls in the model, leaving any new application stuck at "installing agent".
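To make the failure mode concrete, here is a minimal sketch (not code from the charm) of how one might spot a webhook configuration whose backing Service is gone. It assumes the webhook configurations have already been fetched as plain dicts shaped like the admissionregistration.k8s.io/v1 objects. The function name find_dangling_webhooks and the sample data are illustrative, with the names spark-config, fix-spark and spark taken from the description and error above.

```python
from typing import Iterable, List, Set, Tuple


def find_dangling_webhooks(
    webhook_configs: Iterable[dict],
    services: Set[Tuple[str, str]],
) -> List[str]:
    """Return names of webhook configurations whose backing Service is missing.

    webhook_configs: MutatingWebhookConfiguration objects as plain dicts.
    services: (namespace, name) pairs for Services that currently exist.
    """
    dangling = []
    for config in webhook_configs:
        for hook in config.get("webhooks", []):
            svc = hook.get("clientConfig", {}).get("service")
            if svc is None:
                # URL-based webhooks do not reference an in-cluster Service.
                continue
            if (svc["namespace"], svc["name"]) not in services:
                dangling.append(config["metadata"]["name"])
                break  # one missing backend is enough to flag this config
    return dangling


# Illustrative data shaped like the state described above (assumed, not dumped
# from a real cluster).
configs = [
    {
        "metadata": {"name": "spark-config"},
        "webhooks": [
            {
                "name": "webhook.sparkoperator.k8s.io",
                "clientConfig": {
                    "service": {
                        "namespace": "fix-spark",
                        "name": "spark",
                        "path": "/webhook",
                    }
                },
            }
        ],
    }
]

# The "spark" Service was removed together with the application.
existing_services: Set[Tuple[str, str]] = set()

print(find_dangling_webhooks(configs, existing_services))
```

With the Service missing, spark-config is reported as dangling. Once a matching ("fix-spark", "spark") Service exists again, the list comes back empty.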

ca-scribner commented 2 years ago

I tried adding a remove hook in the current podspec charm to remove the offending webhook, but even after deploying with --trust we do not get permission to do this. I received the following exception:

application-spark-k8s: 14:06:24 INFO unit.spark-k8s/4.juju-log mutatingwebhookconfigurations.admissionregistration.k8s.io "broken-webhook" is forbidden: User "system:serviceaccount:kubeflow:spark-k8s-operator" cannot delete resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

I'm not sure if this is because podspec charms should not be able to use trust and cannot ever have this permission, or if there's just a bug in trusting podspec charms.
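For reference, the "forbidden" message above means the charm's service account has no cluster-scoped rule that allows delete on mutatingwebhookconfigurations. The sketch below builds the kind of ClusterRole that would cover it and checks a role against a requested action. This is an assumption about what the missing permission would look like, not something the charm or Juju creates. The helper names and the role name are hypothetical, and the role would still need a ClusterRoleBinding to the service account before it had any effect.

```python
import json
from typing import List


def webhook_cleanup_cluster_role(name: str) -> dict:
    """Build a ClusterRole manifest permitting cleanup of mutating webhooks."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": ["admissionregistration.k8s.io"],
                "resources": ["mutatingwebhookconfigurations"],
                "verbs": ["get", "list", "delete"],
            }
        ],
    }


def _matches(values: List[str], wanted: str) -> bool:
    # RBAC rules accept "*" as a wildcard for groups, resources and verbs.
    return "*" in values or wanted in values


def allows(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule in the role permits the requested action."""
    for rule in role.get("rules", []):
        if (
            _matches(rule.get("apiGroups", []), api_group)
            and _matches(rule.get("resources", []), resource)
            and _matches(rule.get("verbs", []), verb)
        ):
            return True
    return False


# "spark-webhook-cleanup" is a hypothetical role name used for illustration.
role = webhook_cleanup_cluster_role("spark-webhook-cleanup")
print(json.dumps(role, indent=2))
print(
    allows(
        role,
        "admissionregistration.k8s.io",
        "mutatingwebhookconfigurations",
        "delete",
    )
)
```

The JSON printed here is a valid manifest shape, so it could be applied by hand as a stopgap, though that sidesteps rather than explains why --trust did not grant the permission.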

ca-scribner commented 2 years ago

--trust is supposed to work for podspec charms, so this is a bug in Juju, and one that likely will not be addressed as they're focusing efforts on other tasks. The easiest way for us to resolve this is to rewrite the charm using pebble/sidecar.