SumoLogic / sumologic-kubernetes-collection

Sumo Logic collection solution for Kubernetes

Deleted `3.0.0-beta.0` version and installed `4.5.1` on top. `sumologic-sumologic-otelcol-logs-collector` is stuck in CrashLoopBackOff. #3587

Closed: saymolet closed this issue 7 months ago

saymolet commented 8 months ago

Previously, version 3.0.0-beta.0 of the sumologic Helm chart was installed on the cluster. That release was deleted with Helm, and a new 4.5.1 release was installed on the same cluster. After the installation, the sumologic-sumologic-otelcol-logs-collector DaemonSet cannot bring up its pods; they are stuck in the CrashLoopBackOff state. This problem did not occur on any other cluster, only on the one where the older version had been installed.
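
Note that deleting a Helm release does not remove everything the chart touched: CRDs stay on the cluster, and any state the collector wrote to hostPath directories stays on the nodes. A quick way to check for such leftovers after the delete (a sketch, not an official procedure):

~ ❯ helm list -A          # confirm the old release is really gone
~ ❯ kubectl get crd | grep -E 'opentelemetry|monitoring.coreos'    # CRDs survive a helm delete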

~ ❯ kubectl get po -n sumologic
NAME                                                           READY   STATUS             RESTARTS      AGE
sumologic-kube-state-metrics-ddc4bd668-zh77s                   1/1     Running            0             22m
sumologic-opentelemetry-operator-7c75546d6b-6rmcs              2/2     Running            0             22m
sumologic-prometheus-node-exporter-2gxvs                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-84qmk                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-8j6xp                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-fj6wg                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-grctk                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-h62rh                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-p5t9s                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-slmtm                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-vs7rt                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-z4kcj                       1/1     Running            0             22m
sumologic-prometheus-node-exporter-zqpsb                       1/1     Running            0             22m
sumologic-sumologic-metrics-collector-0                        1/1     Running            1 (22m ago)   22m
sumologic-sumologic-metrics-targetallocator-665c9864f8-nbr8f   1/1     Running            0             22m
sumologic-sumologic-otelcol-events-0                           1/1     Running            0             22m
sumologic-sumologic-otelcol-instrumentation-0                  1/1     Running            0             22m
sumologic-sumologic-otelcol-instrumentation-1                  1/1     Running            0             22m
sumologic-sumologic-otelcol-instrumentation-2                  1/1     Running            0             22m
sumologic-sumologic-otelcol-logs-0                             1/1     Running            0             22m
sumologic-sumologic-otelcol-logs-1                             1/1     Running            0             22m
sumologic-sumologic-otelcol-logs-2                             1/1     Running            0             22m
sumologic-sumologic-otelcol-logs-collector-4hr6v               0/1     CrashLoopBackOff   9 (71s ago)   22m
sumologic-sumologic-otelcol-logs-collector-4p56f               0/1     CrashLoopBackOff   9 (54s ago)   22m
sumologic-sumologic-otelcol-logs-collector-7c92p               0/1     CrashLoopBackOff   9 (70s ago)   22m
sumologic-sumologic-otelcol-logs-collector-87l4j               0/1     CrashLoopBackOff   9 (51s ago)   22m
sumologic-sumologic-otelcol-logs-collector-gr2z6               0/1     CrashLoopBackOff   9 (64s ago)   22m
sumologic-sumologic-otelcol-logs-collector-jvx4n               1/1     Running            0             22m
sumologic-sumologic-otelcol-logs-collector-p7jqm               0/1     CrashLoopBackOff   9 (58s ago)   22m
sumologic-sumologic-otelcol-logs-collector-q22ld               0/1     CrashLoopBackOff   9 (82s ago)   22m
sumologic-sumologic-otelcol-logs-collector-qh7wk               0/1     CrashLoopBackOff   9 (62s ago)   22m
sumologic-sumologic-otelcol-logs-collector-sbzzx               0/1     CrashLoopBackOff   9 (65s ago)   22m
sumologic-sumologic-otelcol-logs-collector-t249v               0/1     CrashLoopBackOff   9 (66s ago)   22m
sumologic-sumologic-otelcol-metrics-0                          1/1     Running            0             22m
sumologic-sumologic-otelcol-metrics-1                          1/1     Running            0             22m
sumologic-sumologic-otelcol-metrics-2                          1/1     Running            0             22m
sumologic-sumologic-traces-gateway-5ccdd68b9-k9489             1/1     Running            0             22m
sumologic-sumologic-traces-sampler-788bd6b7bc-728g5            1/1     Running            0             22m

Every crashing pod reports some variation of the same error:

~ ❯ kubectl logs -n sumologic sumologic-sumologic-otelcol-logs-collector-t249v
Defaulted container "otelcol" out of: otelcol, changeowner (init)
2024-03-04T15:57:08.520Z        info    service@v0.92.0/telemetry.go:86 Setting up own telemetry...
2024-03-04T15:57:08.520Z        info    service@v0.92.0/telemetry.go:159        Serving metrics {"address": ":8888", "level": "Basic"}
2024-03-04T15:57:08.521Z        info    processor@v0.92.0/processor.go:289      Development component. May change in the future.        {"kind": "processor", "name": "logstransform/systemd", "pipeline": "logs/systemd"}
2024-03-04T15:57:08.522Z        info    service@v0.92.0/service.go:151  Starting otelcol-sumo...        {"Version": "v0.92.0-sumo-0", "NumCPU": 4}
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:34     Starting extensions...
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:37     Extension is starting...        {"kind": "extension", "name": "file_storage"}
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:52     Extension started.      {"kind": "extension", "name": "file_storage"}
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:37     Extension is starting...        {"kind": "extension", "name": "health_check"}
2024-03-04T15:57:08.522Z        info    healthcheckextension@v0.92.0/healthcheckextension.go:35 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"0.0.0.0:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-03-04T15:57:08.522Z        warn    internal@v0.92.0/warning.go:40  Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks        {"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:52     Extension started.      {"kind": "extension", "name": "health_check"}
2024-03-04T15:57:08.522Z        info    extensions/extensions.go:37     Extension is starting...        {"kind": "extension", "name": "pprof"}
2024-03-04T15:57:08.522Z        info    pprofextension@v0.92.0/pprofextension.go:60     Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":"localhost:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2024-03-04T15:57:08.523Z        info    extensions/extensions.go:52     Extension started.      {"kind": "extension", "name": "pprof"}
2024-03-04T15:57:08.523Z        info    adapter/receiver.go:45  Starting stanza receiver        {"kind": "receiver", "name": "journald", "data_type": "logs"}
2024-03-04T15:57:09.524Z        info    adapter/receiver.go:45  Starting stanza receiver        {"kind": "receiver", "name": "filelog/containers", "data_type": "logs"}
2024-03-04T15:57:09.533Z        info    fileconsumer/file.go:64 Resuming from previously known offset(s). 'start_at' setting is not applicable. {"kind": "receiver", "name": "filelog/containers", "data_type": "logs", "component": "fileconsumer"}
2024-03-04T15:57:09.533Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2024-03-04T15:57:09.533Z        info    service@v0.92.0/service.go:177  Everything is ready. Begin running and processing data.
panic: assignment to entry in nil map

goroutine 128 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/reader.(*Factory).NewReaderFromMetadata(0x40028962f0, 0x4002d06210, 0x4002e152f0)
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/internal/reader/factory.go:117 +0x720
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).newReader(0x40028962d0, 0x4002d06210, 0x4002d12a68)
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:263 +0x394
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).makeReaders(0x40028962d0, {0x4002dca600, 0x10, 0x0?})
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:234 +0x11c
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).consume(0x40028962d0, {0x7d1f288?, 0x4002c83040}, {0x4002dca600, 0x10, 0x10})
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:168 +0x134
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).poll(0x40028962d0, {0x7d1f288, 0x4002c83040})
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:150 +0x2f0
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).startPoller.func1()
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:118 +0xb0
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).startPoller in goroutine 1
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.92.0/fileconsumer/file.go:106 +0xa8
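
The panic fires immediately after the "Resuming from previously known offset(s)" line, which suggests the new collector is choking on checkpoint state that the old 3.0.0-beta.0 release persisted on each node via the file_storage extension (the DaemonSet description below shows it lives under the hostPath /var/lib/otc). One possible workaround, an assumption on my part rather than an official fix, is to clear that state on the affected nodes:

# A minimal sketch using an ephemeral debug container per node; kubectl debug
# mounts the node filesystem at /host inside the debug pod. The CrashLooping
# collector pods should pick up the clean state on their next restart.
for node in $(kubectl get nodes -o name); do
  kubectl debug "$node" --image=public.ecr.aws/docker/library/busybox:1.36.0 -- \
    sh -c 'rm -rf /host/var/lib/otc/*'
done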

Here is the DaemonSet description, including its events:

~ ❯ kubectl describe daemonset sumologic-sumologic-otelcol-logs-collector -n sumologic
Name:           sumologic-sumologic-otelcol-logs-collector
Selector:       app.kubernetes.io/name=sumologic-sumologic-otelcol-logs-collector
Node-Selector:  <none>
Labels:         app=sumologic-sumologic-otelcol-logs-collector
                app.kubernetes.io/managed-by=Helm
                chart=sumologic-4.5.1
                heritage=Helm
                release=sumologic
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: sumologic
                meta.helm.sh/release-namespace: sumologic
Desired Number of Nodes Scheduled: 11
Current Number of Nodes Scheduled: 11
Number of Nodes Scheduled with Up-to-date Pods: 11
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status:  11 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/app-name=sumologic-sumologic-otelcol-logs-collector
                    app.kubernetes.io/name=sumologic-sumologic-otelcol-logs-collector
                    chart=sumologic-4.5.1
                    heritage=Helm
                    release=sumologic
  Annotations:      checksum/config: 89d2f067c94e7733a930f1e9b5758d5e093dadc70568c30214141b761f99a63e
  Service Account:  sumologic-sumologic-otelcol-logs-collector
  Init Containers:
   changeowner:
    Image:      public.ecr.aws/docker/library/busybox:1.36.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      chown -R \
        0:0 \
        /var/lib/storage/otc

    Environment:  <none>
    Mounts:
      /var/lib/storage/otc from file-storage (rw)
  Containers:
   otelcol:
    Image:       public.ecr.aws/sumologic/sumologic-otel-collector:0.92.0-sumo-0
    Ports:       1777/TCP, 8888/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/etc/otelcol/config.yaml
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   32Mi
    Liveness:   http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:13133/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LOGS_METADATA_SVC:  <set to the key 'metadataLogs' of config map 'sumologic-configmap'>  Optional: false
      NAMESPACE:           (v1:metadata.namespace)
    Mounts:
      /etc/otelcol from otelcol-config (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/lib/storage/otc from file-storage (rw)
      /var/log/journal from varlogjournal (ro)
      /var/log/pods from varlogpods (ro)
  Volumes:
   otelcol-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      sumologic-sumologic-otelcol-logs-collector
    Optional:  false
   varlogpods:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/pods
    HostPathType:  
   varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
   file-storage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/otc
    HostPathType:  DirectoryOrCreate
   varlogjournal:
    Type:               HostPath (bare host directory volume)
    Path:               /var/log/journal/
    HostPathType:       
  Priority Class Name:  sumologic-sumologic-priorityclass
Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Warning  FailedCreate      15m (x2 over 15m)  daemonset-controller  Error creating: pods "sumologic-sumologic-otelcol-logs-collector-" is forbidden: no PriorityClass with name sumologic-sumologic-priorityclass was found
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-4p56f
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-t249v
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-sbzzx
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-87l4j
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-qh7wk
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-jvx4n
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-gr2z6
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-p7jqm
  Normal   SuccessfulCreate  15m                daemonset-controller  Created pod: sumologic-sumologic-otelcol-logs-collector-4hr6v
  Normal   SuccessfulCreate  15m (x2 over 15m)  daemonset-controller  (combined from similar events): Created pod: sumologic-sumologic-otelcol-logs-collector-7c92p

A PriorityClass with the name sumologic-sumologic-priorityclass does in fact exist (and the SuccessfulCreate events above show the controller succeeded on retry, so the FailedCreate warning looks like a transient race between the chart creating the PriorityClass and the DaemonSet):

~ ❯ kubectl get priorityclass -n sumologic
NAME                                VALUE        GLOBAL-DEFAULT   AGE
sumologic-sumologic-priorityclass   1000000      false            34m
system-cluster-critical             2000000000   false            443d
system-node-critical                2000001000   false            443d

~ ❯ kubectl describe priorityclass sumologic-sumologic-priorityclass -n sumologic
Name:              sumologic-sumologic-priorityclass
Value:             1000000
GlobalDefault:     false
PreemptionPolicy:  PreemptLowerPriority
Description:       This PriorityClass will be used for OTel Distro agents running as Daemonsets
Annotations:       meta.helm.sh/release-name=sumologic,meta.helm.sh/release-namespace=sumologic
Events:            <none>

For reference, here are the CRDs on the cluster:

~ ❯ kubectl get crd
NAME                                         CREATED AT
alertmanagerconfigs.monitoring.coreos.com    2022-12-21T22:16:02Z
alertmanagers.monitoring.coreos.com          2022-12-21T22:16:04Z
certificaterequests.cert-manager.io          2023-11-15T08:15:46Z
certificates.cert-manager.io                 2023-11-15T08:15:46Z
challenges.acme.cert-manager.io              2023-11-15T08:15:46Z
clusterissuers.cert-manager.io               2023-11-15T08:15:46Z
cninodes.vpcresources.k8s.aws                2023-08-15T09:50:43Z
collectorsets.logicmonitor.com               2024-02-22T16:02:56Z
eniconfigs.crd.k8s.amazonaws.com             2022-12-17T11:21:12Z
instrumentations.opentelemetry.io            2024-03-01T14:20:04Z
issuers.cert-manager.io                      2023-11-15T08:15:46Z
opampbridges.opentelemetry.io                2024-03-01T14:20:04Z
opentelemetrycollectors.opentelemetry.io     2024-03-01T14:20:04Z
orders.acme.cert-manager.io                  2023-11-15T08:15:46Z
podmonitors.monitoring.coreos.com            2022-12-21T22:16:05Z
policyendpoints.networking.k8s.aws           2023-09-12T13:36:10Z
probes.monitoring.coreos.com                 2022-12-21T22:16:06Z
prometheuses.monitoring.coreos.com           2022-12-21T22:16:08Z
prometheusrules.monitoring.coreos.com        2022-12-21T22:16:09Z
securitygrouppolicies.vpcresources.k8s.aws   2022-12-17T11:21:14Z
servicemonitors.monitoring.coreos.com        2022-12-21T22:16:09Z
thanosrulers.monitoring.coreos.com           2022-12-21T22:16:11Z

I also referred to this issue: https://github.com/SumoLogic/sumologic-kubernetes-collection/issues/3397. Unfortunately it did not help, as all of those resources had already been deleted by Helm. Multiple reinstalls, deleting the persistent volume claims, and deleting the namespace did not help either.

The only values specified in values.yaml are:

sumologic:
  clusterName: "value"
  collectorName: "value"
  logs:
    container:
      sourceCategoryPrefix: "value"
      sourceCategoryReplaceDash: "-"
    systemd:
      sourceCategoryPrefix: "value"
    kubelet:
      sourceCategoryPrefix: "value"
    defaultFluentd:
      sourceCategoryPrefix: "value"

saymolet commented 7 months ago

https://help.sumologic.com/docs/send-data/kubernetes/v3/how-to-upgrade/

This page helped a lot. As mentioned before, Helm does not upgrade CRDs, so you need to do that manually. After upgrading the CRDs and running the Helm upgrade, the release deployed without any problems. I did have to completely uninstall v4.0.0 before installing v4.5.1, as some resources were left dangling from it, I assume, so not all pods were healthy. Thank you!
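
For anyone who lands here: `helm upgrade` deliberately skips CRDs, so they have to be applied by hand before upgrading the release. A minimal sketch of that manual step (the chart layout below is an assumption on my part; follow the upgrade guide linked above for the authoritative manifests):

~ ❯ helm repo add sumologic https://sumologic.github.io/sumologic-kubernetes-collection
~ ❯ helm pull sumologic/sumologic --version 4.5.1 --untar
~ ❯ # apply any CRD manifests shipped with the chart and its subcharts
~ ❯ find sumologic -type d -name crds -exec kubectl apply --server-side -f {} \;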