bottlerocket-os / bottlerocket-update-operator

A Kubernetes operator for automated updates to Bottlerocket

Error when `SCHEDULER_CRON_EXPRESSION` is set without `UPDATE_WINDOW_START` & `UPDATE_WINDOW_STOP` #428

Closed prashant-prodigal closed 1 year ago

prashant-prodigal commented 1 year ago

We are installing the Bottlerocket update operator in an EKS cluster with no internet access, but the operator deployment keeps failing with this error:

2023-03-09T13:12:28.369373Z INFO actix_server::builder: starting 2 workers at /src/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-server-2.2.0/src/builder.rs:200

2023-03-09T13:12:28.369436Z ERROR controller: controller exited at controller/src/main.rs:110

I am using the latest image as per the docs. Could you please point me to what could be wrong here?

jpmcb commented 1 year ago

Hi @prashant-prodigal - the error you are seeing at controller/src/main.rs:110 is related to the controller's metric server attempting to bind to the local loopback network and start serving metrics.

> We are installing bottlerocket update operator in an EKS with no internet access.

Note that the Bottlerocket update operator requires network access to updates.bottlerocket.aws: this is how the update operator queries for new OS updates. Read more about it here: https://github.com/bottlerocket-os/bottlerocket-update-operator#why-are-my-bottlerocket-nodes-egressing-to-httpsupdatesbottlerocketaws

Does your node have a network attached? In order for the Prometheus metrics server to come up, it'll at least need to be able to bind on 0.0.0.0 for IPv4 clusters or [::] for IPv6 clusters.

Can you provide the full logs from the failed controller deployment?

 kubectl logs -n brupop-bottlerocket-aws pod/brupop-controller-deployment-{YOUR-DEPLOYMENT}

prashant-prodigal commented 1 year ago

Hello, I have allowed the URL https://updates.bottlerocket.aws, but we are still getting the error below from the command kubectl logs -n brupop-bottlerocket-aws pod/brupop-controller-deployment-{YOUR-DEPLOYMENT}

2023-03-10T04:36:28.369259Z INFO actix_server::builder: starting 2 workers at /src/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-server-2.2.0/src/builder.rs:200

2023-03-10T04:36:28.369368Z ERROR controller: controller exited at controller/src/main.rs:110

jpmcb commented 1 year ago

What's the shape of your network? Are there any other logs from the other update operator components?

blakeromano commented 1 year ago

I am seeing the same thing even though my node has egress access.

brupop-controller-deployment-875956b84-l42nf   0/1     CrashLoopBackOff   7 (2m32s ago)   13m
2023-04-11T20:56:31.570124Z  INFO actix_server::builder: starting 1 workers
at /src/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-server-2.2.0/src/builder.rs:200
2023-04-11T20:56:31.570208Z ERROR controller: controller exited
at controller/src/main.rs:110

With a deployment like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: brupop-controller
    app.kubernetes.io/managed-by: brupop
    app.kubernetes.io/part-of: brupop
    brupop.bottlerocket.aws/component: brupop-controller
  name: brupop-controller-deployment
  namespace: brupop-bottlerocket-aws
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      brupop.bottlerocket.aws/component: brupop-controller
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        brupop.bottlerocket.aws/component: brupop-controller
      namespace: brupop-bottlerocket-aws
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
                - arm64
      containers:
      - command:
        - ./controller
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: SCHEDULER_CRON_EXPRESSION
          value: '* * * * * * *'
        - name: MAX_CONCURRENT_UPDATE
          value: "1"
        image: public.ecr.aws/bottlerocket/bottlerocket-update-operator:v1.1.0
        imagePullPolicy: IfNotPresent
        name: brupop
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 3m
            memory: 8Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: brupop-controller-high-priority
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: brupop-controller-service-account
      serviceAccountName: brupop-controller-service-account
      terminationGracePeriodSeconds: 30

ghost commented 1 year ago

Good afternoon team,

Is there any further information regarding this issue? We're currently facing the same issue in an installation we did this morning using operator version v1.1.0.

$> kubectl logs deployment/brupop-controller-deployment --namespace brupop-bottlerocket-aws
  2023-04-20T09:54:18.670695Z  INFO actix_server::builder: starting 1 workers
    at /src/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-server-2.2.0/src/builder.rs:200

  2023-04-20T09:54:18.670766Z  INFO actix_server::server: Actix runtime found; starting in Actix runtime
    at /src/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-server-2.2.0/src/server.rs:196

   2023-04-20T09:54:18.968337Z ERROR controller: controller exited
    at controller/src/main.rs:110

It is deployed in a regular EKS cluster with no customizations. Services are configured to use IPv4 addresses. The current Bottlerocket version is 1.12.0 and the only workloads currently installed besides the default ones are:

Please let us know if we can help by providing any other information.

tmahalligan commented 1 year ago

For me this error happens when SCHEDULER_CRON_EXPRESSION is set and UPDATE_WINDOW_START & UPDATE_WINDOW_STOP are removed.

If all three are present then the controller runs fine, though I am pretty sure my SCHEDULER_CRON_EXPRESSION is ignored.

In my case I am rolling back to using UPDATE_WINDOW_START & UPDATE_WINDOW_STOP to control the update window.

ghost commented 1 year ago

Thanks for the tip @tmahalligan, we're going to give it a try!!

prashant-prodigal commented 1 year ago

Thanks @tmahalligan. This has solved the problem and I am able to run the controller now. @jpmcb This might be a bug you would like to address? Also, @jpmcb, are UPDATE_WINDOW_START & UPDATE_WINDOW_STOP ignored when SCHEDULER_CRON_EXPRESSION is set?

stmcginnis commented 1 year ago

Updated title to reflect what I think is the root issue here. Please correct me if I'm wrong.

stmcginnis commented 1 year ago

Verified this is expected behavior when both a time window and a cron expression are provided:

https://github.com/bottlerocket-os/bottlerocket-update-operator/blob/57f4f70c800b6f348728cd10f81b885c9813c639/controller/src/scheduler.rs#L113

This could be handled a little more gracefully though...

Edit: Actually... that is the opposite of what is noted above:

> For me this error happens when SCHEDULER_CRON_EXPRESSION is set and UPDATE_WINDOW_START & UPDATE_WINDOW_STOP are removed

> If all three are present then the controller runs fine, though I am pretty sure my SCHEDULER_CRON_EXPRESSION is ignored

More investigation needed then.

gthao313 commented 1 year ago

@tmahalligan Hi, what version of the Bottlerocket update operator container were you using? I think it might be because you were using the latest version of the yaml file with an older Bottlerocket update operator image. The cron scheduler is a new feature that we will introduce in the next release, so the controller errors could be because the system still needs a time window but a cron expression was provided instead. Can you try to use this yaml file? Thanks!

tmahalligan commented 1 year ago

I am using v1.1.0; here is the relevant config, @gthao313:

containers:

gthao313 commented 1 year ago

@tmahalligan yeah, v1.1.0 doesn't have SCHEDULER_CRON_EXPRESSION; we plan to release v1.2.0 later, which will introduce the cron scheduler. For now, can you remove SCHEDULER_CRON_EXPRESSION from the config? Everything should then work. This is the v1.1.0 config. : )
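
For readers following along, here is a minimal sketch of what the controller container's env section might look like on v1.1.0 once SCHEDULER_CRON_EXPRESSION is removed and the time window variables are restored. MY_NODE_NAME and MAX_CONCURRENT_UPDATE are copied from the deployment posted earlier in this thread. The UPDATE_WINDOW_START and UPDATE_WINDOW_STOP values, and their hour:minute:second format, are assumptions for illustration only; the manifest attached to the v1.1.0 release is the authoritative reference.

```yaml
# Sketch only: env section of the brupop controller container for v1.1.0.
env:
  - name: MY_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  # SCHEDULER_CRON_EXPRESSION is intentionally omitted:
  # per this thread, v1.1.0 does not support it.
  - name: UPDATE_WINDOW_START
    value: "0:0:0"   # assumed value and format (hour:minute:second)
  - name: UPDATE_WINDOW_STOP
    value: "0:0:0"   # assumed value and format (hour:minute:second)
  - name: MAX_CONCURRENT_UPDATE
    value: "1"
```

Using the manifest attached to the release that matches the image tag, rather than the one on the default branch, should avoid this kind of mismatch between the config and the controller version.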

tmahalligan commented 1 year ago

I was under the impression from the documentation (https://github.com/bottlerocket-os/bottlerocket-update-operator#set-scheduler) that the released version of the operator supported the cron functionality. I assume others may have made the same mistake; perhaps the docs should be amended. @gthao313

Thanks for the follow-up. I will adjust and wait for the next release.

jpmcb commented 1 year ago

Also note that we attach the relevant configs to the release for each version: https://github.com/bottlerocket-os/bottlerocket-update-operator/releases/tag/v1.1.0