SumoLogic / sumologic-kubernetes-collection

Sumo Logic collection solution for Kubernetes

Why is `sending_queue.queue_size` set to 10 by default, which easily leads to the sending queue being full? #3474

Closed txjjjjj closed 9 months ago

txjjjjj commented 10 months ago

EKS 1.28, chart v4.3.1. Sumo is mainly used to store EKS logs.

Why is `sending_queue.queue_size` set to 10 by default, which is easily filled?

exporters:
  otlphttp:
    endpoint: http://${LOGS_METADATA_SVC}.${NAMESPACE}.svc.{{ .Values.sumologic.clusterDNSDomain }}.:4318
    sending_queue:
      queue_size: 10
    # this improves load balancing at the cost of more network traffic
    disable_keep_alives: true

2023-12-15T13:24:28.112Z        warn    batchprocessor@v0.89.0/batch_processor.go:258   Sender failed   {"kind": "processor", "name": "batch", "pipeline": "logs/containers", "error": "sending queue is full"}

I think we should set a reasonable default value to avoid losing logs.

How should I configure it to make sure there is absolutely no loss of logs? I know this can be overridden in the Helm values.

Even though I read the fine-tuning manual, I don't know how to adjust the parameters because there are so many of them.

Do you have any recommendations for out-of-the-box parameters? Thank you. A cluster may contain nodes with widely varying configurations: some nodes produce more logs, some less.

aboguszewski-sumo commented 10 months ago

This exporter is used to send data to another otelcol in the same cluster, so if sending fails, it usually means that the problem lies in the metadata layer and the data could be lost anyway. One such problem might be too much load: please make sure that you have either sumologic.autoscaling.enabled or metadata.logs.autoscaling.enabled set to true.
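
For reference, enabling either of these is a small values override (a sketch based only on the two keys named above; enable whichever fits your setup):

sumologic:
  autoscaling:
    enabled: true

# or, to autoscale only the logs metadata pods:
metadata:
  logs:
    autoscaling:
      enabled: true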

In general, the sending queue is used to avoid losing data in cases such as a temporary failure on the backend side. In the metadata layer, this is set to a higher value: https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/f37afb640af7a22a925812c9c1f3da6df2744350/deploy/helm/sumologic/conf/logs/otelcol/config.yaml#L9-L12

Here you can find more info on how to adjust the parameters for the sending queue. In particular, please take a look at queue_size. The num_seconds variable there answers the question "in case of a backend outage, how many seconds of data do I want to buffer before starting to drop?".
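
As a back-of-the-envelope illustration (my own numbers, not a recommendation from the chart): the queue holds batches, so estimate roughly how many batches per second the collector sends and multiply by the outage window you want to survive. For example, at ~100 batches per second with a 5-minute buffer (num_seconds = 300), you would need queue_size ≈ 100 * 300 = 30000.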

To override the sending queue values for the logs collector, use the otellogs.config.merge option:

otellogs:
  config:
    merge:
      exporters:
        otlphttp:
          sending_queue:
            queue_size: <custom_size>
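
If you also want to raise the queue in the metadata layer, the analogous override should go under metadata.logs.config.merge (a sketch; double-check the exporter name against the config.yaml linked above, which I believe is sumologic):

metadata:
  logs:
    config:
      merge:
        exporters:
          sumologic:
            sending_queue:
              queue_size: <custom_size>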

However, given that we have not had problems with this particular option before, I'd also make sure that everything is fine in your cluster (the network works, the metadata layer pods are not crashlooping, etc.).