SumoLogic / sumologic-otel-collector

Sumo Logic Distribution for OpenTelemetry Collector
Apache License 2.0

Memory Spike and lots of k8s_tagger processor errors #1267

Open vignesh-codes opened 12 months ago

vignesh-codes commented 12 months ago

Hey team,

I saw a sudden rise in memory, and GC is not reducing the memory consumption in my pod. Image: sumologic-otel-collector:0.85.0-sumo-0

% kubectl version        
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.16-eks-2d98532

The pod's memory usage went beyond 90% of the defined limit and did not come back down. After I manually restarted the pod, memory returned to normal.

Upon checking the logs, we found that memory usage went above the hard limit but did not come down after GC. We also found a number of error logs triggered by the k8s_tagger processor:

2023-09-27T03:58:32.023Z    error   kube/client.go:199  object received was not of type api_v1.Pod   {
    "kind": "processor",
    "name": "k8s_tagger",
    "pipeline": "logs/otlp/containers",
    "received": {
        "Key": "adventtransplant/xx",
        "Obj": {
            "metadata": {
                "name": "xx",
                "namespace": "adventtransplant",
                "uid": "1c8dd1b2-79a3-46f2-86da-xx",
                "creationTimestamp": null,
                "labels": {
                    "mycompany.io/bu": "dc",
                    "mycompany.io/consumer": "xx",
                    "mycompany.io/created-by": "fn-xx-operator",
                    "mycompany.io/expiry": "9999-01-01",
                    "mycompany.io/fedramp": "true",
                    "mycompany.io/hipaa": "true",
                    "mycompany.io/irap": "true",
                    "mycompany.io/owner": "xx",
                    "mycompany.io/pci": "true",
                    "mycompany.io/stage": "prod",
                    "app": "xx",
                    "fnresources.mycompany.io/application-id": "4c68910e-f1f0-482f-a352-d6ca578da628",
                    "fnresources.mycompany.io/environment-id": "d9d171ca-c63b-4f1d-b87b-e94b6f826812",
                    "fnresources.mycompany.io/inject-mcrouter": "memcached",
                    "fnresources.mycompany.io/inject-mcrouter-exporter": "true",
                    "fnresources.mycompany.io/subscription-id": "89aff85b-cf48-4109-9d76-62036c332bb7",
                    "rollouts-pod-template-hash": "7747cc9cf8"
                },
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
                    "fnresources.mycompany.io/php-apache-config-hash": "421794d60a294800473424de4e20e8xxx",
                    "kubernetes.io/psp": "eks.privileged",
                    "vpc.amazonaws.com/pod-ips": "99.11.11.111"
                },
                "ownerReferences": [
                    {
                        "apiVersion": "apps/v1",
                        "kind": "ReplicaSet",
                        "name": "xx-7747cc9cf8",
                        "uid": "b3cf6222-911a-48a8-98ca-01b4378dd003",
                        "controller": true,
                        "blockOwnerDeletion": true
                    }
                ]
            },
            "spec": {
                "containers": null,
                "nodeName": "ip-xx-26-152-xxx.ec2.internal"
            },
            "status": {
                "podIP": "99.11.11.111",
                "startTime": "2023-09-27T02:35:32Z"
            }
        }
    }
}

We used the default values:

k8s_tagger:
        extract:
          annotations:
          - key: '*'
            tag_name: pod_annotations_%s
          delimiter: _
          labels:
          - key: '*'
            tag_name: pod_labels_%s
          metadata:
          - daemonSetName
          - deploymentName
          - hostName
          - namespace
          - nodeName
          - podName
          - serviceName
          - statefulSetName
          namespace_labels:
          - key: '*'
            tag_name: namespace_labels_%s
        owner_lookup_enabled: true
        passthrough: false
        pod_association:
        - from: build_hostname
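
For reference, the wildcard extract rules above cache every pod label and annotation in the processor. A narrower extract list is sometimes used to cut that cache down; the following is only a sketch, assuming the same extract syntax as above and keeping just two of the labels visible in the error object (which labels and metadata actually matter is workload-specific):

k8s_tagger:
  extract:
    labels:
    - key: app                        # keep only labels that are queried downstream
      tag_name: pod_labels_app
    - key: mycompany.io/stage
      tag_name: pod_labels_stage
    metadata:                         # trimmed metadata set; illustrative only
    - namespace
    - podName
    - nodeName
    - deploymentName
  owner_lookup_enabled: true
  passthrough: false
  pod_association:
  - from: build_hostname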

memory_limiter:
        check_interval: 5s
        limit_percentage: 75
        spike_limit_percentage: 20
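
For context on the numbers in the log lines further down: with limit_percentage: 75 and spike_limit_percentage: 20, the hard limit is 75% of the container's memory limit, and the soft limit is the hard limit minus 20% of the container's memory limit. Assuming an 8 GiB (8192 MiB) container limit, which is an assumption but is consistent with the ~6,300 MiB readings reported above the hard limit, that works out as sketched below with explicit values:

memory_limiter:
  check_interval: 5s
  # Equivalent explicit limits for an assumed 8192 MiB container limit:
  #   hard limit  = 75% of 8192 MiB = 6144 MiB (above this, a GC is forced)
  #   spike limit = 20% of 8192 MiB = 1638 MiB
  #   soft limit  = 6144 - 1638     = 4506 MiB (above this, new data is refused)
  limit_mib: 6144
  spike_limit_mib: 1638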

In the previous logs of restarted pods across multiple clusters, I also see that memory went above the hard limit but did not come down after GC, along with lots of errors of this type:

2023-09-29T10:58:46.777Z    warn    memorylimiterprocessor@v0.85.0/memorylimiter.go:276 Memory usage is above hard limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6324}
2023-09-29T10:58:58.857Z    info    memorylimiterprocessor@v0.85.0/memorylimiter.go:266 Memory usage after GC.  {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6315}
2023-09-29T10:58:58.857Z    warn    memorylimiterprocessor@v0.85.0/memorylimiter.go:276 Memory usage is above hard limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6315}
2023-09-29T10:59:16.271Z    info    memorylimiterprocessor@v0.85.0/memorylimiter.go:266 Memory usage after GC.  {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6458}
2023-09-29T10:59:16.271Z    warn    memorylimiterprocessor@v0.85.0/memorylimiter.go:276 Memory usage is above hard limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6458}
2023-09-29T10:59:29.058Z    info    memorylimiterprocessor@v0.85.0/memorylimiter.go:266 Memory usage after GC.  {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6339}

Lots of these log lines appear, and memory usage often remains elevated.

We also see an increased error count in the log collector pod, which forwards data to this memory-bound OTel pod.

"2023-09-27T05:50:30.628Z   error   exporterhelper/queued_retry.go:319  Exporting failed. The error is not retryable. Dropping data.    {\"kind\": \"exporter\", \"data_type\": \"logs\", \"name\": \"otlphttp\", \"error\": \"Permanent error: error exporting items, request to http://sumologic-logging-metadata-logs.xx.svc.cluster.local.:0011/v1/logs responded with HTTP Status Code 500, Message=data refused due to high memory usage; data refused due to high memory usage; data refused due to high memory usage, Details=[]\", \"dropped_items\": 546}go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send    

I want to know the following

Expected behavior:

rupeshnemade commented 11 months ago

Another log example when the OTel pod reaches 95% memory utilisation:

2023-10-09T14:41:22.020Z    info    memorylimiterprocessor@v0.85.0/memorylimiter.go:266 Memory usage after GC.  {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6408}
2023-10-09T14:41:22.020Z    warn    memorylimiterprocessor@v0.85.0/memorylimiter.go:276 Memory usage is above hard limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6408}

After a few hours, the pod got OOM killed.

c-kruse commented 11 months ago

Thank you for reporting this @vignesh-codes! We were able to fix a bug in our k8s_tagger processor, and have likely found one in the upstream k8sattributes processor as well.

I am afraid I can't say with any confidence that this bug is the root of your troubles, though. The nature of the bug suggests to me that the collector had been in trouble for some time.

It is a reality that telemetry data can be sent to a collector faster than it can be processed and forwarded on to Sumo Logic (or any other service). Our default pipeline configurations include the memory_limiter processor to attempt to apply back pressure in these situations, refusing to accept new telemetry data. It looks like you've already discovered that this tactic is not sufficient on its own: the collector needs to be scaled.
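
For reference, that back pressure depends on the memory_limiter running as the first processor in the pipeline, so over-limit data is refused before later processors allocate for it. A minimal sketch of such a pipeline, reusing the pipeline name from the logs above; the receiver and exporter names are illustrative, not taken from this issue:

service:
  pipelines:
    logs/otlp/containers:
      receivers:
      - otlp
      processors:
      - memory_limiter   # recommended placement: first, so data is refused early
      - k8s_tagger
      - batch
      exporters:
      - sumologic        # illustrative; whatever exporter is configured goes here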

If you are using our Helm chart, we do have a solution for autoscaling; see the relevant docs on autoscaling for the stable release.
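
The chart's autoscaling values (per the linked docs) are the supported route. For illustration only, the equivalent plain Kubernetes object is a HorizontalPodAutoscaler targeting the metadata StatefulSet; the name below is an assumption based on the service name in the exporter error earlier in this issue, and the namespace, replica counts, and target are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sumologic-logging-metadata-logs     # assumed name, from the service in the 500 error above
  namespace: xx                              # placeholder namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: sumologic-logging-metadata-logs    # assumed StatefulSet name
  minReplicas: 3                              # placeholder
  maxReplicas: 10                             # placeholder
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70                # placeholder target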