appuio / seiso

Seiso (清掃). Clean up your Docker image registry and Kubernetes resources.
https://github.com/appuio/seiso/releases
BSD 3-Clause "New" or "Revised" License
4 stars 3 forks source link

ConfigMap in use is being deleted #39

Open bittner opened 3 years ago

bittner commented 3 years ago

ConfigMaps that are still in use should not be deleted (as confirmed by the README).

Unfortunately, the current implementation seems to identify configmaps that are still in use. This results in ConfigMap manifests being deleted for deployments that are running at that very moment, and failing volume mounts for configmaps when pods are restarted.

This should be fixed by constructing a test case that manages to reproduce the behavior.

Related source locations

Example

A somewhat complete example of a Kustomize-based deployment, which is followed by a Seiso run:

...
$ pushd deployment/base
/builds/customer/waf-foo-bar/deployment/base /builds/customer/waf-foo-bar
$ kustomize edit add configmap rules-before-crs --from-file='rules/before-crs/*.conf'
$ kustomize edit add configmap rules-after-crs --from-file='rules/after-crs/*.conf'
$ kustomize edit set namesuffix -- "-${APP_NAME}"
$ kustomize edit add label "environment:${ENVIRONMENT}"
$ popd
/builds/customer/waf-foo-bar
$ oc project "${PROJECT_NAME}-${TARGET}"
Now using project "waf-foo-bar" on server "https://console.cluster.customeron.ch".
$ kustomize build deployment/overlays/${TARGET} | kubeval --strict
PASS - stdin contains a valid ConfigMap (modsecurity-config-foo-bar-hkc68d6ck7)
PASS - stdin contains a valid ConfigMap (netdata-config-foo-bar-ghtm9bddh6)
PASS - stdin contains a valid ConfigMap (rules-after-crs-foo-bar-7f7h4gc58t)
PASS - stdin contains a valid ConfigMap (rules-before-crs-foo-bar-7tc6fdbf65)
PASS - stdin contains a valid Service (waf-foo-bar)
PASS - stdin contains a valid Deployment (waf-foo-bar)
$ kustomize build deployment/overlays/${TARGET} | oc apply -f -
configmap/modsecurity-config-foo-bar-hkc68d6ck7 unchanged
configmap/netdata-config-foo-bar-ghtm9bddh6 unchanged
configmap/rules-after-crs-foo-bar-7f7h4gc58t unchanged
configmap/rules-before-crs-foo-bar-7tc6fdbf65 unchanged
service/waf-foo-bar unchanged
deployment.apps/waf-foo-bar configured
$ seiso configmap --delete --older-than 3w -l app=waf-${APP_NAME} -l component=waf -l environment=${ENVIRONMENT}
level=info msg="Deleting ConfigMap kof-jminvest-development/rules-before-crs-foo-bar-7tc6fdbf65"
level=info msg="Deleting ConfigMap kof-jminvest-development/rules-after-crs-foo-bar-h9f582fccb"
Job succeeded

Note the ConfigMap with the "7tc6fdbf65" hash that is being deleted, even though it's obviously included in the deployment.

ccremer commented 3 years ago

AFAIK the tests should cover this... We should also check the possibility for race conditions. Maybe a simple sleep 5 fixes it, or cleanup before applying (could also be dangerous)

ccremer commented 3 years ago

I noticed that the current oc image is using seiso 0.6.0, which is very outdated. There have been lots of changes since, but no new release tag (maybe it got forgotten). I have now pushed 0.7.0 and I'm going to update the oc image as well.

After that, please try again with Seiso 0.7.0

ccremer commented 3 years ago

There's now a new oc image available with Seiso 0.7.0. Please try again and see if the issue still persists.

ccremer commented 3 years ago

Using the latest version of Seiso in the pipeline has resolved the issue for the customer.

bittner commented 3 years ago

This issue is still present in Seiso 0.7.1. We are using the latest appuio/oc image on GitLab. Since seiso is slow computing objects in use we've started to run it after oc apply to avoid delaying the deployment process.

As a result, a deployment on APPUiO half an hour ago deleted the ConfigMap that is being referred to by the active Deployment object:

...
$ seiso configmaps -l app=lawbrary3 --delete
level=info msg="Seiso 0.7.1, commit fddf46ce9081c232cb98084f22abe254fd889165, date 2020-11-17T09:34:43Z"
level=info msg="Deleted ConfigMap lawb-production/grafana-datasources-86t8876df9"
...

Please, reopen the issue!

bittner commented 3 years ago

Seiso again deleted a ConfigMap in a pipeline run today, even though there was only one ConfigMap left and keep is --keep 3 by default:

...
$ seiso configmaps -l app=lawbrary3 --delete
level=info msg="Seiso 0.7.1, commit fddf46ce9081c232cb98084f22abe254fd889165, date 2020-11-17T09:34:43Z"
level=info msg="Deleted ConfigMap lawb-production/grafana-datasources-86t8876df9"
...

Please try to fix this problem and warn your customers about this bug. It would also help if more information were printed about (default) settings being used and why a candidate was selected for deletion.

The use of seiso configmap is dangerous in its current state.

ccremer commented 3 years ago

Hi Peter

We understand the situation is annoying for you. Seiso is clearly not production ready. Would you mind opening a support ticket at control.vshn.net if this is urgent for you? Our engineering focus is currently not with Seiso.

ghost commented 3 years ago

Hello,

I think it was certainly linked to #49 which fixed the ResourceContains in kubernetes/utils.

Indeed, I did this PR because seiso were deleting currently running images.

ghost commented 3 years ago

I think it was certainly linked to #49 which fixed the ResourceContains in kubernetes/utils.

To explain why I'm almost sure the current issue is fixed:

The method GetUnused from pkg/configmap/configmap.go, lines 64-89 is using ResourceContains to check if the config map name is used by any resource defined in the namespace resources.

The configmap is only checking resources of types defined in openshift.PredefinedResource. I think it should be right, because this list contains Deployment and ReplicaSet.

Before the fix of #49, ResourceContains was doing this check only for the first resource of the namespace. Now, it checks for all resources if the config id is used.

That's explain why it was difficult to reproduce the configmap deletion bug in real environment.

Furthermore, the config map unit test has mocked the ResourceContains, that's why it always passed.