envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Can we make OTLP exports failing an optional thing to log? Current istio-proxy logs with Dynatrace are super spammy #35997

Closed: jbilliau-rcd closed this issue 1 month ago

jbilliau-rcd commented 2 months ago

Currently, Envoy hardcodes that a failing OTLP export is written to ENVOY_LOG, and is thus always produced in the log output. We are piloting Istio 1.22 with Dynatrace and are constantly seeing this in the logs:

2024-09-05T18:28:10.787073363Z 2024-09-05T18:28:10.786914Z error envoy tracing external/envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc:86 OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message thread=23

This is being produced by this block of code - https://github.com/envoyproxy/envoy/blob/a65765e00f4dda2bd92aee7b263ffcdbee08775a/source/extensions/tracers/opentelemetry/http_trace_exporter.cc#L81-L92

PROPOSED FIX

Since Istio currently hardcodes a sample rate of 100 when using a custom OTel provider (another issue I need to raise on their GitHub), and Dynatrace rate-limits OTLP exports, this makes parsing Envoy logs very difficult: roughly 70% of the log lines are this 503 message. It would be nice to have a setting or argument to optionally "fail silent", since I don't really care if OTel traces fail when sending to Dynatrace.
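
Concretely, the knob being requested could look something like the sketch below on the OpenTelemetry tracer config. This is a hypothetical illustration only: the log_failed_exports field does not exist in Envoy today, and the service name, endpoint, and cluster values are made-up examples.

# Hypothetical sketch only -- "log_failed_exports" is NOT an existing Envoy field.
name: envoy.tracers.opentelemetry
typed_config:
  "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
  service_name: my-service                      # example value
  http_service:
    http_uri:
      uri: https://otlp.example.com/v1/traces   # example OTLP endpoint
      cluster: otlp_backend                     # example cluster name
      timeout: 5s
  log_failed_exports: false                     # hypothetical flag to silence failed-export logs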

phlax commented 2 months ago

cc @alexanderellis @yanavlasov

jbilliau-rcd commented 1 month ago

Per our conversation with our Dynatrace support rep:

Our devs have identified some notable behavior. As mentioned before, on the cluster/ActiveGate side we were not seeing the 503s reported by the OTel exporter. This was the main source of confusion, but while 503s were not present, we were seeing large spikes in 429 errors. This throttling error lines up far more closely with the Envoy-side spam. The reason the 429s appear as 503s seems to be an Envoy/OTel quirk. Searching online has turned up similar cases where, when the export endpoint stops responding, a 503 appears rather than the expected status. Unfortunately there is not much information on why this is, but it is on the Envoy side and not a large focus.

Would you be able to set up the exporter based on the previously shared documentation? This should help provide more robust artifacts showing where the traffic is coming from. In addition, I would pursue the Envoy/OTel side further via a GitHub issue, as there seems to be problematic behavior with the tracing. https://istio.io/latest/docs/tasks/observability/logs/otel-provider/

Even after removing our ingress object, pod monitors, and readiness probe, thus killing any source of traffic to this test app, we STILL get constant 503s. So what traces is Envoy generating for an app with no ingress/egress traffic? The istio-proxy logs are completely blank except for these nonstop 503s. Are we sure there isn't a problem here?

joaopgrassi commented 1 month ago

To add, here is what I observed and was able to reproduce: I configured Envoy to export traces to a "dummy" OTLP back-end that always returned HTTP 429, and Envoy somehow interprets that as a 503 and seems to retry the export constantly. I sent a single request to Envoy, but the Envoy logs are filled with these error lines non-stop.


joaopgrassi commented 1 month ago

Actually I think I found what it is :D

I ran a bare-bones Envoy and, while debugging, I noticed it was always calling the log function on the exporter: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L189

But the OTLP request is empty: there is just a resource in it, and Envoy sends those non-stop. I pointed Envoy at webhook.site and immediately got 429s.


The problem is that we still export even when there are no spans; it simply exports whenever an exporter is configured: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L187

So the fix here should be to only export when there's an exporter AND there are spans to export. @jbilliau-rcd this is why I think you run into the Dynatrace 429s. I also saw that by default the export "timer" seems to fire every 5ms? Not sure what to make of this line: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L161

I will send a PR to address this.

jbilliau-rcd commented 1 month ago

@joaopgrassi one last question for you: if we were to proceed with the Istio 1.22 rollout in spite of these errors, would we have issues with reporting spans/traces to Dynatrace? Meaning, if Envoy is constantly trying to export spans, even when empty, it results in Dynatrace 429'ing, right? Would that then cause the ACTUAL spans/traces we want from real calls to also get 429s, since DT is already rate-limiting us because of the garbage empty ones?

joaopgrassi commented 1 month ago

results in Dynatrace 429'ing, right? Would that then cause the ACTUAL spans/traces we want from real calls to also get 429s, since DT is already rate-limiting us because of the garbage empty ones?

Yes, exactly. The "empty" OTLP messages are still valid, so they are still accepted by Dynatrace and thus consume your tenant's quota. I don't know off the top of my head, but it may be possible to filter those out if you export to an OTel Collector first and then from the collector to Dynatrace. If that's not an option, then I'm afraid you would need to wait for a release of the updated Istio.
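
For anyone considering the collector route, a minimal OpenTelemetry Collector pipeline sitting between Envoy and Dynatrace could look roughly like the sketch below. The Dynatrace endpoint, token, and batch settings are placeholders assumed for illustration, not values from this thread; the batch processor mainly reduces how often requests actually reach Dynatrace.

# Sketch only -- replace the placeholder endpoint and token with your own values.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # point Envoy's OTLP HTTP exporter here

processors:
  batch:
    send_batch_size: 512         # example batch size
    timeout: 10s

exporters:
  otlphttp:
    endpoint: https://<your-environment>.live.dynatrace.com/api/v2/otlp   # placeholder
    headers:
      Authorization: "Api-Token <your-token>"                             # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]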

What we can do is see if we can backport the fix to the Envoy 1.30/1.31 releases and let Istio pick that up. Then you can use the new Istio patch release.

joaopgrassi commented 1 month ago

I submitted backports for this issue on versions 1.30 and 1.31 https://github.com/envoyproxy/envoy/pull/36405, https://github.com/envoyproxy/envoy/pull/36404.

One thing I saw is that Envoy seems to have a way to configure the export flush interval (tracing.opentelemetry.flush_interval_ms): https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L196

So, in theory you could increase that to, say, 1 minute or more. That can help because exports won't happen as often, which increases the chance of real spans actually being in the request. I think this might help reduce the number of requests and thus Dynatrace throttling them. The problem is that I don't think this can be set via Istio. Maybe via an EnvoyFilter? But that can be dangerous 🤔
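
For reference, on a standalone (non-Istio) Envoy this runtime key can be set through a static layer in the bootstrap. A minimal sketch, with 60000 ms as an arbitrary example value:

layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      tracing.opentelemetry.flush_interval_ms: 60000   # example: flush timer fires once per minute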

joaopgrassi commented 1 month ago

@jbilliau-rcd the backports I opened yesterday are already merged, and a new Istio version will probably be released soon to address this.

I understand the frustration, of course. Lots of folks run the collector in between, where batching works better, and you would be surprised how long organizations take to upgrade software, especially crucial components like Istio/Envoy. This bug has been here since the OTel tracer was introduced, which was years ago by now.

Trying to make things better here, I investigated more and found a temporary workaround you can use to reduce the number of requests going out from Envoy and hopefully avoid the throttling in Dynatrace.

Envoy has the concept of runtime configuration, which can be changed via Istio by deploying a ConfigMap:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap-config
data:
  custom_bootstrap.json: |
    "layered_runtime": {
        "layers": [
          {
            "name": "static_layer_0",
            "static_layer": {
              "tracing.opentelemetry.flush_interval_ms": 60000
              "tracing.opentelemetry.min_flush_spans": 512
            }
          }
        ]
    }
EOF

Then you add an annotation to your pod definitions, which tells Istio to use this bootstrap config when injecting the Envoy proxies:

template:
  metadata:
    labels:
      app: liveness-http
      version: v1
    annotations:
      sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" # ADD THIS ANNOTATION

The important thing is that the value of bootstrapOverride is the name of the ConfigMap above.

This changes the OTel exporter to export every minute OR when Envoy has 512 spans waiting to be exported. This should reduce the number of requests that Envoy makes, and those requests will then contain real spans, as 1 minute should be enough time for real load to generate spans. You can tweak this to your needs as well, for example increasing the export interval even further, to 5 minutes or more; it highly depends on the load you expect and the number of spans that will be generated.
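
For example, stretching the interval to 5 minutes just means different values in the same static layer of the ConfigMap above (the numbers here are purely illustrative):

"static_layer": {
  "tracing.opentelemetry.flush_interval_ms": 300000,
  "tracing.opentelemetry.min_flush_spans": 512
}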

Once the backports are published, you can stop using this. Let me know if it works for you. If you still run into request throttling with this, please reach out to support and we can try increasing your limits as well.

joaopgrassi commented 1 month ago

I will open separate issues to allow configuring both flush_interval_ms and min_flush_spans together with the rest of the OTel config, instead of only via the Envoy runtime config. I believe this is important enough that it should be easy for users to change.

jbilliau-rcd commented 1 month ago

So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans... that's still better than the empty ones.

This is good stuff, thanks for all the deep diving into this; it's been a huge help and the only thing lifting the block we've had on this migration for two months now.

joaopgrassi commented 1 month ago

So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans... that's still better than the empty ones.

Not quite. flush_interval_ms acts like a timer: when the configured time passes, an export happens regardless of the number of spans. If there are 0, because of the bug an empty request is still sent. The default is 5s (I believe), so changing it to, say, 1 or 5 minutes drastically reduces the number of requests made over time, but it won't get rid of the empty exports. It will, however, certainly help with the 429/503s you see, so it should "fix" the issue and you won't get the logs anymore.

min_flush_spans then acts as another improvement. The exporter sends a request when either of two things is true:

  1. The timer has elapsed
  2. The number of buffered spans is higher than min_flush_spans

The default for min_flush_spans is 5, which is very low. So, by increasing it, you also reduce the chance of frequent requests. The 512 here comes from the corresponding default for the Batch Span Processor defined in the OTel specification: https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor

jbilliau-rcd commented 1 month ago

Both those defaults sound really low... they should be much higher, or better yet, configurable (ideally through Istio, so we aren't writing custom ConfigMaps and annotations all over the place). Not trying to give you another PR to do, just thinking out loud. As per my response to you on the other issue, we are going to go down the OTel Collector route first, since deploying annotations and ConfigMaps to thousands of deployments across hundreds of clusters isn't the most feasible option (hence my first sentence).

joaopgrassi commented 1 week ago

@jbilliau-rcd just as a heads-up, Envoy published patch release v1.30.7, which includes the fix (https://github.com/envoyproxy/envoy/pull/36404), and Istio has already picked it up here: https://github.com/istio/proxy/pull/5894/files

I assume there will soon be another Istio patch release for version 1.22, so the issue with empty exports will be solved there. You are free to keep the config that tweaks min_flush_spans and flush_interval_ms, though it will no longer be required.