application-stacks / runtime-component-operator

A generic Operator capable of deploying any runtime component image with enterprise QoS and bind it to other services
Apache License 2.0
49 stars 23 forks source link

Should this operator requeue reconcile attempts while waiting for cert-manager to issue a certificate? #135

Open greglanthier opened 4 years ago

greglanthier commented 4 years ago

Type of question

I want to make sure I understand how Runtime Component instances should be reconciled in a cluster that is hosting a cluster-wide cert-manager instance.

In clusters where cert-manager takes a moment to issue a certificate I've observed that a RuntimeComponent that includes a certificate: {} stanza like the one described here is sometimes never updated. This results in no deployments or services being created.

Question

What did you do?

I deployed runtime-component-operator using OLM and cert-manager using helm to an IKS cluster.

$ kubectl get csv
NAME                                DISPLAY                      VERSION   REPLACES   PHASE
runtime-component-operator.v0.6.0   Runtime Component Operator   0.6.0                Succeeded
$ helm ls
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
cert-manager    ca-operators    1               2020-06-08 15:05:56.991869 -0400 EDT    deployed        cert-manager-v0.15.1    v0.15.1    
$

Once both operators were running I applied the following YAML to my cluster:

---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: self-signed
spec:
  selfSigned: {}
---
apiVersion: app.stacks/v1beta1
kind: RuntimeComponent
metadata:
  name: ldap
spec:
  applicationImage: osixia/openldap:1.3.0
  service:
    type: ClusterIP
    port: 389
    portName: ldap-port
    certificate: {}

but no deployments or services for the ldap RuntimeComponent were created. I observed the following status updates in the RuntimeComponent:

$ kubectl get runtimecomponents.app.stacks ldap -o yaml
apiVersion: app.stacks/v1beta1
kind: RuntimeComponent
  ...
status:
  conditions:
  - lastTransitionTime: "2020-06-09T00:48:42Z"
    lastUpdateTime: "2020-06-09T00:48:42Z"
    status: "True"
    type: DependenciesSatisfied
  - lastTransitionTime: "2020-06-09T00:48:42Z"
    lastUpdateTime: "2020-06-09T00:48:42Z"
    message: Waiting for service certificate to be generated
    reason: CertificateNotReady
    status: "False"
    type: Reconciled
  imageReference: osixia/openldap:1.3.0
$

What did you expect to see?

I was expecting the runtime-component-operator to continue to attempt to reconcile the RuntimeComponent until the self signed certificate was issued by cert-manager (i.e.: the equivalent of return reconcile.Result{Requeue: true }).

What did you see instead? Under which circumstances?

It looks like runtime-component-operator failed to reconcile the self-signed certificate and gave up trying to create the corresponding deployment & service.

This happened in bigger clusters with many namespaces and more busy workloads.

Environment

I produced this problem on an IBM Kubernetes Service cluster running 1.18.2.

I ran the same test using kind v0.8.1 with a kindest/node:v1.18.2-based cluster and the self-signed certificate was reconciled with the ldap RuntimeComponent. My hunch is the IKS cluster was busier with more namespaces to scan, and so cert-manager-issued certificates take longer to appear than in a simple kind cluster.

Additional context

I briefly dug into the reconcile code in runtimecomponent_controller.go and observed the following lines:

https://github.com/application-stacks/runtime-component-operator/blob/c7eaf97c155eaf0a9cd3be9e0b1fff1fdc1fab16/pkg/controller/runtimecomponent/runtimecomponent_controller.go#L443-L455

Is it possible that line 454 is a typo and this block should read like this:

    result, err = r.ReconcileCertificate(instance)
    if err != nil || result != (reconcile.Result{}) {
        return result, err
    }

Have I misunderstood what runtime-component-operator intends to do in this case?

greglanthier commented 4 years ago

Digging a bit deeper, maybe the following:

https://github.com/application-stacks/runtime-component-operator/blob/c7eaf97c155eaf0a9cd3be9e0b1fff1fdc1fab16/pkg/utils/reconciler.go#L337-L347

should be something like this?

            if !crtReady {
                c := ba.GetStatus().NewCondition()
                c.SetType(common.StatusConditionTypeReconciled)
                c.SetStatus(corev1.ConditionFalse)
                c.SetReason("CertificateNotReady")
                c.SetMessage("Waiting for service certificate to be generated")
                ba.GetStatus().SetCondition(c)
                rtObj := ba.(runtime.Object)
                r.UpdateStatus(rtObj)
                return reconcile.Result{Requeue: true}, nil
            }
arturdzm commented 4 years ago

Once, the certificate is generated and updated, the reconcile loop should be triggered through a watch we set on Certificate CRD.

https://github.com/application-stacks/runtime-component-operator/blob/c7eaf97c155eaf0a9cd3be9e0b1fff1fdc1fab16/pkg/controller/runtimecomponent/runtimecomponent_controller.go#L281

This avoids unnecessary reconciles if certificate is not ready yet or it is invalid. Unfortunately if it is not ready the Certificate CRD does not provide the good message.

You could verify that Certificate is Ready and if not you can check CertificateRequest object.

However, maybe there is condition in which Certificate update event is not caught

greglanthier commented 4 years ago

So in my reproducer the certificate is ready, but there is no follow up reconcile requests being queued in the runtime-component-operator.

i.e.: after applying the yaml in my original question I get a certificate that looks like this:

$ kubectl get certificates.cert-manager.io ldap-svc-crt -o json |jq .status  
{
  "conditions": [
    {
      "lastTransitionTime": "2020-06-09T15:40:56Z",
      "message": "Certificate is up to date and has not expired",
      "reason": "Ready",
      "status": "True",
      "type": "Ready"
    }
  ],
  "notAfter": "2021-06-09T15:40:56Z"
}

Maybe the watch you linked to isn't firing? Maybe there is no owner reference being set on the RuntimeComponent itself at this point?

I'll see if I can enable some extra logging in the operator in my testing cluster.

greglanthier commented 4 years ago

ah - wait a moment...

My certificate is using apiVersion: cert-manager.io/v1alpha3.

apiVersion: cert-manager.io/v1alpha3
kind: Certificate
metadata:
  ...
spec:
  ...
status:
  conditions:
  - lastTransitionTime: "2020-06-09T15:40:56Z"
    message: Certificate is up to date and has not expired
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2021-06-09T15:40:56Z"

I bet that's the problem...

I'm using cert-manager-v0.15.1. Maybe if I downgrade to an earlier cert-manager?

I tried to find a version of cert-manager I could install via OLM but I didn't have any luck :(

arturdzm commented 4 years ago

That could be the case, I have not tested with new version, I think last one I've tried is 0.13 Also wondering if you do trigger additional reconcile (change RC CR) will it get correct status

greglanthier commented 4 years ago

Oh if I reapply the RuntimeComponent yaml a 2nd time everything works - by which I mean the deployment & service is created and the ready certificate is injected into the workload containers.

Sorry - I should have mentioned that in my original question.

I wonder if given this there is some other way to requeue reconcile attempts (on a less aggressive interval) to account for these cases?

I'm thinking of the case where runtime-compoent-operator creates a cert-manager.io/v1alpha2 version of Certificate but that is possibly upgraded silently by an upgrade-webhook in cert-manger (not saying thats happening here, but symptomatically thats similar to what I'm seeing).

greglanthier commented 4 years ago

FYI: Rolling back to cert-manager 0.13.1 corrected the behaviour I originally asked about (thanks for the hints @arturdzm!).

As long as an operator that supports api version cert-manager.io/v1alpha2 resources is in the cluster then runtime-component-operator runs as expected.

This issue can probably be closed, but a new issue should likely be logged to support cert-manager.io/v1alpha3 and/or call out explicitly in the readme that there is a dependency on specific versions of cert-manager.

It was pretty easy to grab the latest cert-manager chart and ignore the apiVersions the CRs in the samples.

arthurdm commented 4 years ago

thanks @greglanthier

@arturdzm - if you can please add the API requirement to the readme, that would be great.

In the near future we can evaluate if we would upgrade the support to the v1alpha3 API or if there's another approach / framework that we should support.