3scale-ops / saas-operator

3scale SaaS Operator - www.3scale.net
Apache License 2.0

bug: controller not removing removed resources randomly #126

Closed slopezz closed 2 years ago

slopezz commented 3 years ago

We have seen that with the current redhat-cop/operator-utils:v1.1.3 (based on operator-sdk v1.3, the same operator-sdk version used by saas-operator), resources such as PDB/HPA that are removed from the CR are randomly not deleted and have to be deleted manually. They are not recreated, because the controller no longer reconciles them.

It seems that the reconcile that should delete the resource fails, and on the next reconcile the controller considers the resource no longer something it needs to watch.

I made some changes to the backend and system controllers, initially deploying a basic CR:

apiVersion: saas.3scale.net/v1alpha1
kind: Backend
metadata:
  name: example
spec:
  image:
    tag: v3.2.0
  config:
    rackEnv: dev
    redisStorageDSN: backend-redis-storage
    redisQueuesDSN: backend-redis-queues
    systemEventsHookURL:
      fromVault:
        key: URL
        path: secret/data/some/path
    systemEventsHookPassword:
      fromVault:
        key: PASSWORD
        path: secret/data/some/path
    internalAPIUser:
      fromVault:
        key: USER
        path: secret/data/some/path
    internalAPIPassword:
      fromVault:
        key: PASSWORD
        path: secret/data/some/path
  listener:
    loadBalancer:
      eipAllocations:
        - eip-123
        - eip-456
    endpoint:
      dns:
        - backend.example.com
    config:
      redisAsync: false
    marin3r:
      ports:
        - name: backend-http
          port: 38080
        - name: http-internal
          port: 38081
        - name: backend-htttps
          port: 38443
        - name: envoy-metrics
          port: 9901
  worker:
    config:
      redisAsync: true

By default, if hpa/pdb are not specified, the operator creates a PDB and an HPA for both the worker and the listener:

$ oc get hpa
NAME               REFERENCE                     TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
backend-listener   Deployment/backend-listener   <unknown>/90%   2         4         2          2m10s
backend-worker     Deployment/backend-worker     <unknown>/90%   2         4         2          2m9s

$ oc get pdb
NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
backend-listener   N/A             1                 0                     2m15s
backend-worker     N/A             1                 0                     2m15s

Then I deployed the same CR, but removing the PDB/HPA of both components (hpa: {} and pdb: {}):

apiVersion: saas.3scale.net/v1alpha1
kind: Backend
metadata:
  name: example
spec:
  image:
    tag: v3.2.0
  config:
    rackEnv: dev
    redisStorageDSN: backend-redis-storage
    redisQueuesDSN: backend-redis-queues
    systemEventsHookURL:
      fromVault:
        key: URL
        path: secret/data/some/path
    systemEventsHookPassword:
      fromVault:
        key: PASSWORD
        path: secret/data/some/path
    internalAPIUser:
      fromVault:
        key: USER
        path: secret/data/some/path
    internalAPIPassword:
      fromVault:
        key: PASSWORD
        path: secret/data/some/path
  listener:
    hpa: {}
    pdb: {}
    loadBalancer:
      eipAllocations:
        - eip-123
        - eip-456
    endpoint:
      dns:
        - backend.example.com
    config:
      redisAsync: false
    marin3r:
      ports:
        - name: backend-http
          port: 38080
        - name: http-internal
          port: 38081
        - name: backend-htttps
          port: 38443
        - name: envoy-metrics
          port: 9901
  worker:
    hpa: {}
    pdb: {}
    config:
      redisAsync: true

Randomly, the PDB and HPA of one component (worker or listener) persist and have to be deleted manually:

$ oc get hpa
NAME             REFERENCE                   TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
backend-worker   Deployment/backend-worker   <unknown>/90%   2         4         2          15s

$ oc get pdb
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
backend-worker   N/A             1                 0                     16s
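
As a stopgap, the leftover objects can be removed by hand. The following is a minimal sketch, not part of saas-operator: the helper name cleanup_leftovers is hypothetical, and it assumes the leftover HPA and PDB share the name shown in the output above (for example backend-worker) and live in the current namespace.

```shell
# Hypothetical helper (not part of saas-operator): delete the HPA and PDB
# that the controller left behind for one component.
# Assumption: both objects are named after the component, e.g. backend-worker.
cleanup_leftovers() {
  local name="$1"
  # --ignore-not-found keeps the command safe to re-run if the object
  # has already been deleted.
  oc delete hpa "$name" --ignore-not-found
  oc delete pdb "$name" --ignore-not-found
}

# Example call: cleanup_leftovers backend-worker
```

Because the CR no longer requests these objects, the controller should not recreate them after the manual deletion.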

After some tests with the newer redhat-cop/operator-utils:v1.1.4 (based on operator-sdk v1.9), the issue seems to be fixed, or at least I have not been able to reproduce it: when the PDB/HPA are removed from the CR spec, they are actually deleted.

slopezz commented 2 years ago

operator-utils was updated to github.com/redhat-cop/operator-utils v1.2.2 at https://github.com/3scale-ops/saas-operator/commit/8618a1afbcbff4007abda5fe80d6363b5f068763

Still pending: rerun the reproduction test documented at https://github.com/3scale-ops/saas-operator/issues/126#issue-1012282475 to verify that the issue is fixed.
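
Since the failure was intermittent, the reproduction may need to be repeated several times before a clean result means much. A small check run after each apply of the CR without PDB/HPA could look like the sketch below. The helper name check_no_leftovers is hypothetical, and it assumes the current namespace holds only the HPA/PDB objects created for this Backend CR.

```shell
# Hypothetical helper (not part of saas-operator): succeed only if no HPA
# or PDB remains in the current namespace.
# Assumption: the namespace contains no unrelated HPA/PDB objects.
check_no_leftovers() {
  local remaining
  # With -o name, oc prints one "<kind>/<name>" line per object on stdout
  # and nothing on stdout when there are no objects.
  remaining="$(oc get hpa,pdb -o name 2>/dev/null)"
  if [ -n "$remaining" ]; then
    echo "leftover resources:"
    echo "$remaining"
    return 1
  fi
  echo "no HPA/PDB left"
}
```

A non-zero exit status on any iteration would indicate that the issue can still be reproduced with the updated operator-utils.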