GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Side-effect: Cannot delete any kubernetes CRDs after installing config-connector v1.11.1 #202

Closed bbhuston closed 4 years ago

bbhuston commented 4 years ago

Describe the bug

After installing Config Connector (v1.11.1) and its CRDs in a GKE cluster, I can no longer delete any Kubernetes CRDs. I would expect CRDs to remain deletable even when Config Connector is installed.

ConfigConnector Version

Run the following command to get the current ConfigConnector version:

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}' 

Config Connector v1.11.1, GKE v1.15.9

To Reproduce

Steps to reproduce the behavior: run the following commands.

kubectl delete ns cnrm-system
kubectl delete crds -l cnrm.cloud.google.com/system=true --wait=true

You should see the following message

Error from server (InternalError): Internal error occurred: failed calling webhook "abandon-on-uninstall.cnrm.cloud.google.com": Post https://abandon-on-uninstall.cnrm-system.svc:443/abandon-on-uninstall?timeout=30s: service "abandon-on-uninstall" not found

My namespace is stuck in a terminating state, so I cannot reinstall config-connector and try to recreate the missing "abandon-on-uninstall" svc. To make matters worse, I get the exact same error when I try to delete any CRD at this point -- even those unrelated to config-connector. Help!

YAML snippets:

N/A

caieo commented 4 years ago

Hi @bbhuston, (edit: accidentally closed this issue, my bad)

I recommend following these uninstall commands for uninstalling ConfigConnector (you'll need to make sure you're on the tab for 'Manual uninstall').

I believe what happened is that you deleted the cnrm-system namespace before deleting your CRDs. That removed the service behind the webhook that normally checks CRD deletions, so every CRD delete now fails with the error you're seeing. You won't need to worry about recreating the cnrm-system namespace; just run through the commands and the cleanup should be thorough despite throwing some errors. Let me know if you run into other issues.

kibbles-n-bytes commented 4 years ago

As the system got into an undefined state due to the namespace deletion, we'll need to clean some stuff up manually first. You'll need to delete our mutating and validating webhooks manually:

kubectl delete validatingwebhookconfiguration abandon-on-uninstall.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete validatingwebhookconfiguration validating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete mutatingwebhookconfiguration mutating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true

Once these are deleted, the CRDs should be unblocked from deletion. Note that if you have any CRs sticking around, you'll either need to reinstall KCC and then run through the uninstall steps so it can finalize resource deletions, or manually edit all the resources with kubectl edit to remove the finalizers.
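For the manual route, here is a minimal sketch of clearing finalizers without opening an editor, as an alternative to kubectl edit. The function name clear_finalizers is hypothetical and not part of Config Connector; it only wraps kubectl patch with a JSON merge patch. Removing finalizers this way skips Config Connector's own cleanup, so treat it as a last resort.

```shell
# Hypothetical helper (not part of Config Connector): clear every finalizer on
# one resource so that a pending deletion can complete.
# Arguments: <kind> <name> <namespace>
clear_finalizers() {
  local kind="$1"
  local name="$2"
  local namespace="$3"
  # A JSON merge patch that sets the field to null removes it from the object.
  kubectl patch "$kind" "$name" --namespace "$namespace" \
    --type=merge -p '{"metadata":{"finalizers":null}}'
}
```

Usage would look like clear_finalizers KIND NAME NAMESPACE, repeated for each resource that is stuck; the three arguments are placeholders for your own resources.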

However, it's unclear why the cnrm-system namespace is stuck in Terminating; the CRDs being unable to delete shouldn't block cnrm-system usually, unless you have orphaned CRs in there as well. Does kubectl get gcp --namespace cnrm-system print any resources? If not, could you see what is keeping the namespace from cleaning up properly with kubectl describe namespace cnrm-system?
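If the describe output is hard to read, a rough sketch like the following prints just the two fields that usually explain a stuck namespace. The function name namespace_blockers is made up for illustration, and the second query assumes the cluster reports status conditions on namespaces, which older versions may not.

```shell
# Hypothetical helper: show what may be keeping a namespace in Terminating.
# Argument: <namespace>
namespace_blockers() {
  local ns="$1"
  # Finalizers still listed on the namespace itself.
  kubectl get namespace "$ns" -o jsonpath='{.spec.finalizers}{"\n"}'
  # Deletion-related conditions, one per line (may be empty on older clusters).
  kubectl get namespace "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```

For this issue it would be called as namespace_blockers cnrm-system.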

bbhuston commented 4 years ago

@caieo and @kibbles-n-bytes Thanks for the tips and sorry for the delayed response! I was finally able to resolve the issue I hit the other day. The chicken and egg problem that I ran into -- i.e., where CRDs couldn't be deleted because the finalizer service they were referencing was already deleted, but this service couldn't be reinstalled because the cnrm-system namespace was stuck in a terminating state -- was fixed with the following steps.

# Remove the finalizers that the `cnrm-system` namespace is using to allow it to finish terminating

# First, create a utility shell script
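cat << 'EOF' > delete-sticky-namespace.sh
#!/usr/bin/env bash
# The heredoc delimiter is quoted so that $1 and $NAMESPACE are written literally
# into the script instead of being expanded while the file is created.
NAMESPACE="$1"  # name of the namespace that is stuck in the Terminating state
kubectl proxy &  # serves the API on 127.0.0.1:8001 by default
PROXY_PID=$!
sleep 2  # give the proxy a moment to start listening
kubectl get ns "$NAMESPACE" -o json > tempfile
# Drop the "kubernetes" finalizer line; writing to a second file works with both GNU and BSD sed
sed '/"kubernetes"/d' tempfile > tempfile.patched
curl --silent -H "Content-Type: application/json" -X PUT --data-binary @tempfile.patched "http://127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize"
rm tempfile tempfile.patched
kill "$PROXY_PID"  # stop only the proxy started above, not every kubectl process
EOF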

# Run the shell script and pass the value "cnrm-system" as its first argument.  This will allow the namespace to completely terminate
/bin/bash  delete-sticky-namespace.sh cnrm-system

# Now recreate the cnrm-system namespace and reinstall config connector.
wget -O  0-cnrm-system.yaml https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-config-connector/master/install-bundles/install-bundle-namespaced/0-cnrm-system.yaml
kubectl apply -f 0-cnrm-system.yaml 

# At this point one can finally delete the config connector CRDs (and then optionally remove the rest of the installation as well)
kubectl delete crds -l cnrm.cloud.google.com/system=true --wait=true
kubectl delete ns cnrm-system
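
To make the sed step in the script above concrete, here is a small sketch of the same filter as a standalone function that can be tried on sample JSON without a cluster. The function name strip_kubernetes_finalizer is hypothetical. It assumes kubectl's pretty-printed JSON, with one finalizer per line and "kubernetes" as the only entry; with additional finalizers, deleting a line could leave a dangling comma and invalid JSON.

```shell
# Hypothetical helper: read namespace JSON on stdin and drop the line that holds
# the "kubernetes" finalizer, leaving an empty finalizers list behind.
strip_kubernetes_finalizer() {
  sed '/"kubernetes"/d'
}

# Print a sample document shaped like `kubectl get ns NAME -o json` output.
sample_namespace_json() {
  printf '%s\n' '{' '  "spec": {' '    "finalizers": [' \
    '      "kubernetes"' '    ]' '  }' '}'
}
```

Running sample_namespace_json | strip_kubernetes_finalizer shows the document that would be sent to the finalize endpoint.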

Thanks again!

bbhuston commented 4 years ago

Closing issue. Thanks!