dapr / kubernetes-operator


BUG: dapr-control-plane OOMKilled when DaprInstance provisioned #135

Closed: ryorke1 closed this issue 6 months ago

ryorke1 commented 7 months ago

Expected Behavior

dapr-control-plane pod should remain stable and have configurable resource limits and requests.

Current Behavior

The dapr-control-plane pod is continuously being OOMKilled as long as a DaprInstance exists. If we remove the DaprInstance, the pod stabilizes. The dapr-control-plane pod does survive long enough to deploy the DaprInstance pods and CRDs, but it takes a few OOMKills to complete. The pod continues to crash afterwards, but this doesn't seem to affect the Dapr components.

Possible Solution

  1. Increase the resource limits to 512Mi (memory) and 1000m (CPU); see the sketch after this list
  2. Make the resource limits and requests configurable
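
For illustration, option 1 would amount to the dapr-control-plane container running with a standard Kubernetes resources block along these lines (a sketch of the desired values only; the operator does not currently expose this, and the request values are illustrative):

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi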

Steps to Reproduce

  1. Uninstall any previous version of Dapr Operator (including cleaning up all CRDs and CRs)
  2. Install Dapr Operator 0.0.8 (at this point the dapr-control-plane will start and is stable)
  3. Create a new DaprInstance with the following configuration (see below)
  4. Monitor the pods and watch the dapr-control-plane pod get OOMKilled
# DaprInstance 
apiVersion: operator.dapr.io/v1alpha1
kind: DaprInstance
metadata:
  name: dapr-instance
  namespace: openshift-operators
spec:
  values:
    dapr_operator:
      livenessProbe:
        initialDelaySeconds: 10
      readinessProbe:
        initialDelaySeconds: 10
    dapr_placement:
      cluster:
        forceInMemoryLog: true
    global:
      imagePullSecrets: dapr-pull-secret
      registry: internal-repo/daprio
  chart:
    version: 1.13.2

Environment

OpenShift: Red Hat OpenShift Container Platform 4.12
Dapr Operator: 0.0.8 with Dapr components 1.13.2

lburgazzoli commented 7 months ago

To change the resource requests and limits, the only option is to tweak the Subscription: https://github.com/dapr-sandbox/dapr-kubernetes-operator/issues/77#issuecomment-1856067695

Unfortunately the memory cannot be made configurable, but I will dig into the memory consumption.

Do you have a way to reproduce it? I have never experienced such behavior.

ryorke1 commented 7 months ago

All we did was execute the steps above, and that reproduced it. I don't think the dapr-control-plane would be affected by any existing pods that have Dapr annotations for sidecar injection, but correct me if I am wrong. We did have a number of pods with those annotations running during the initialization of the DaprInstance.

Do you have an example of how we could use the Subscription to tweak the requests and limits for the dapr-control-plane? Or am I misunderstanding what you mean?

lburgazzoli commented 7 months ago

> All we did was execute the steps above, and that reproduced it. I don't think the dapr-control-plane would be affected by any existing pods that have Dapr annotations for sidecar injection, but correct me if I am wrong. We did have a number of pods with those annotations running during the initialization of the DaprInstance.

It should not, as the components that would be affected are the dapr-operator and the other Dapr resources; the dapr-control-plane only generates the manifests. Maybe the watcher watches too many objects. I'll have a look.

> Do you have an example of how we could use the Subscription to tweak the requests and limits for the dapr-control-plane? Or am I misunderstanding what you mean?

No, I don't, but there are a number of examples in the documentation mentioned in the linked comment.
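
For reference, the mechanism being pointed at is the Subscription's spec.config.resources block, which OLM applies to the operator deployment it manages. A minimal fragment of a Subscription, with illustrative values (ryorke1's full working Subscription appears later in this thread):

spec:
  config:
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi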

lburgazzoli commented 7 months ago

I've tried to reproduce the issue, but I failed. The operator works as expected and does not get OOMKilled:

➜ k get pods -l control-plane=dapr-control-plane -w
NAME                                  READY   STATUS    RESTARTS   AGE
dapr-control-plane-7796c9ff85-htk4g   1/1     Running   0          2m49s
➜ k top pod dapr-control-plane-7796c9ff85-htk4g    
NAME                                  CPU(cores)   MEMORY(bytes)   
dapr-control-plane-7796c9ff85-htk4g   7m           68Mi           

I don't have any Dapr application running, so it is not 100% the same test, but as far as the dapr-kubernetes-operator is concerned, that should not matter.

ryorke1 commented 6 months ago

OK, we are going to look into OLM and see if we can adjust the resources of the dapr-control-plane. While we are doing that, I am curious to know whether the dapr-control-plane being killed will cause any issues. In our case, so far we do see the components in place and the CRDs were deployed (the permission issue still exists, #136), and we are using the Dapr components without issues so far. What are your thoughts on this?

ryorke1 commented 6 months ago

Also, I was finally able to capture a screenshot of this crash (it goes from OOMKilled immediately into CrashLoopBackOff, so it is hard to capture).

[screenshot: dapr-control-plane pod status showing OOMKilled]

ryorke1 commented 6 months ago

Some logs from OpenShift as well

[screenshot: OpenShift logs]

lburgazzoli commented 6 months ago

> OK, we are going to look into OLM and see if we can adjust the resources of the dapr-control-plane. While we are doing that, I am curious to know whether the dapr-control-plane being killed will cause any issues. In our case, so far we do see the components in place and the CRDs were deployed (the permission issue still exists, #136), and we are using the Dapr components without issues so far. What are your thoughts on this?

It should not cause any issue, as the role of the operator is just to set up Dapr and make sure the setup stays in sync with the DaprInstance spec.

lburgazzoli commented 6 months ago

> Some logs from OpenShift as well
>
> [screenshot: OpenShift logs]

Are you able to provide a reproducer? Deploying a DaprInstance similar to yours does not trigger the OOMKill in my environment, so I need something closer to your setup to dig into it further.

ryorke1 commented 6 months ago

Hi @lburgazzoli. Using a Subscription in OLM, we were able to stabilize the dapr-control-plane pod. Here is the Subscription we used, for future reference in case others run into this issue.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/dapr-kubernetes-operator.openshift-operators: ""
  name: dapr-kubernetes-operator
  namespace: openshift-operators
spec:
  channel: alpha
  config:
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: 250m
        memory: 256Mi
  installPlanApproval: Manual
  name: dapr-kubernetes-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: dapr-kubernetes-operator.v0.0.8

As a side note, this did not resolve the permission propagation to the roles: we still need an admin to manually create roles for us before we can use these CRDs.
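
For context, the kind of object an admin has to create manually here is an ordinary RBAC Role (plus a RoleBinding) granting access to the Dapr CRDs, roughly along these lines; the name, namespace, and verb list are illustrative assumptions, and the propagation problem itself is tracked in #136:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dapr-resources-editor      # illustrative name
  namespace: my-app-namespace      # illustrative namespace
rules:
  - apiGroups: ["dapr.io"]
    resources: ["components", "configurations", "subscriptions", "resiliencies", "httpendpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]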

lburgazzoli commented 6 months ago

@ryorke1 I would really love to be able to reproduce it so I can fix the real problem (which may just be a matter of increasing the memory), so if at any point you have some sort of reproducer, please let me know.