Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Errors not being reported for kong plugin Opentelemetry #13776

Open pawandhiman10 opened 1 week ago

pawandhiman10 commented 1 week ago

Is there an existing issue for this?

Kong version ($ kong version)

3.7.1

Current Behavior

We have installed the OpenTelemetry plugin for traces in Kong.

We are sending traces to Datadog with this. But the traces are only for requests and errors are not captured at all.

Screenshot 2024-10-20 at 10 50 14 PM

Expected Behavior

Errors should be captured and reported in the UI.

Steps To Reproduce

Install the plugin using Helm. We attached it to a service with the Kong annotation konghq.com/plugins: opentelemetry. The KongPlugin configuration follows the reference on the plugin definition page. The plugin sends data to the OTel endpoint exposed by the Datadog agent on HTTP port 4318.

Anything else?

No response

ProBrian commented 1 week ago

@pawandhiman10 can you attach your config file for the OTEL plugin? @samugi can you help take a look at this issue?

samugi commented 1 week ago

Hello @pawandhiman10, errors in traces are captured and reported by the OpenTelemetry plugin as span events, using the exception event key and the exception.message attribute as defined in the OpenTelemetry specification.

I believe Datadog might be expecting slightly different fields based on this, are you using the OTLP ingestion with the datadog agent? In that case, I would expect the ingestion process of OTLP data to take care of parsing any errors and translating any fields as needed.

I can confirm that errors are displayed correctly by other tools such as Jaeger and Grafana.

Also: could you share an example of an error that is being reported by your system and you are expecting to be visible in the UI?

pawandhiman10 commented 1 week ago

@samugi By errors I mean 5xx status codes. As in the earlier screenshot, only the 2xx requests are displayed, not the 5xx ones. These are not actual errors, but the response status is 5xx, and that is what is missing. The screenshot below shows that 5xx responses occur regularly and sometimes increase. We cannot see these numbers in Datadog, although they are visible in the load balancer logs.

Screenshot 2024-10-21 at 12 36 34 PM

Yes, the ingestion currently uses OTLP with the Datadog agent. Could you point me to some reference docs on how to get the errors (5xx) parsed?

samugi commented 1 week ago

@pawandhiman10 in that case, it might be expected. 500 response status codes are not always reported as errors. Today, errors are reported for failures (exceptions) that occur during the execution of plugin phase handlers, errors returned by the internal HTTP client, and DNS failures.

Could you test with a 500 status code that originates from an exception in a plugin? This can be tested with a plugin that throws an explicit error like: error("fail") from its access phase. The expected behavior is for this to generate a span with name kong.access.plugin.yourplugin with error state and an event that contains the exception message. If Datadog is parsing things correctly, this should be displayed in the UI as well.
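As a rough illustration of what to look for in that test, here is a minimal Python sketch of such a span as a plain dictionary, with a check for the error state and the exception event. The dictionary layout, the message value, and the helper name `is_error_span` are assumptions made for illustration, not Kong's or Datadog's actual data model. Only the span name, the `exception` event name, and the `exception.message` attribute come from the discussion above.

```python
# Hypothetical, simplified view of an exported span; the field layout is illustrative only.
STATUS_ERROR = 2  # OpenTelemetry span status enum: 0 = unset, 1 = ok, 2 = error

span = {
    "name": "kong.access.plugin.yourplugin",
    "status": {"code": STATUS_ERROR},
    "events": [
        # The real message would be whatever the plugin passed to error(...).
        {"name": "exception", "attributes": {"exception.message": "fail"}},
    ],
}


def is_error_span(span):
    """Return True if the span has error status and carries an exception event."""
    has_error_status = span.get("status", {}).get("code") == STATUS_ERROR
    has_exception_event = any(
        event.get("name") == "exception"
        and "exception.message" in event.get("attributes", {})
        for event in span.get("events", [])
    )
    return has_error_status and has_exception_event


print(is_error_span(span))
```

A span that only carries a 5xx status code attribute, with no error status and no exception event, would not pass this check, which matches the behavior described in this thread.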

pawandhiman10 commented 5 days ago

@samugi It seems the error tag is missing for 5xx status codes. Is there any way to add this tag (span.error: true) to requests with a status code >= 500? I am not sure about this, but is there a relation to any of these headers: w3c, b3, jaeger, ot, aws, datadog, gcp? I can see the error code there, but it is reported under the ok status.

Screenshot 2024-10-27 at 2 07 13 AM

samugi commented 4 days ago

@pawandhiman10 what you describe is currently expected. Status codes are reported (whether they are 2xx, 4xx, 5xx, etc.), but spans are not set to the error state when this happens. Errors are currently reserved for actual errors occurring during the execution of plugin code (exceptions) and a few other scenarios that don't include specific response codes.

Would you be interested in setting an error state to your root span in case of 5xx status code from your upstream? If that is the case, such a change should probably be configurable, for example some might expect 4xx response codes from time to time, while others would consider them error conditions.

We are always interested in improving/updating our tracing instrumentation, but I cannot guarantee this change in particular will be made. I would recommend, to achieve this behavior, to write your own code (in a custom or serverless plugin) that applies this simple logic for your trace. You can use the tracing and the response PDK modules to achieve this.
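The configurable rule discussed above can be sketched as follows. This is a language-neutral illustration in Python, not Kong code: a real plugin would be written in Lua against the tracing and response PDK modules, and the names `ErrorPolicy`, `min_error_status`, and `span_status_for` are hypothetical.

```python
from dataclasses import dataclass

# OpenTelemetry span status enum values
STATUS_UNSET = 0
STATUS_ERROR = 2


@dataclass(frozen=True)
class ErrorPolicy:
    """Hypothetical policy: the lowest HTTP status code treated as an error."""
    min_error_status: int = 500  # set to 400 to also treat 4xx responses as errors


def span_status_for(http_status: int, policy: ErrorPolicy = ErrorPolicy()) -> int:
    """Map a response status code to the status the root span should carry."""
    if http_status >= policy.min_error_status:
        return STATUS_ERROR
    return STATUS_UNSET


# Default policy: only 5xx responses mark the span as an error.
print(span_status_for(502), span_status_for(404))
# Stricter policy: 4xx responses are errors too.
print(span_status_for(404, ErrorPolicy(min_error_status=400)))
```

Keeping the threshold in a policy object reflects the point made above: some users expect 4xx responses from time to time, while others consider them error conditions.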

julianocosta89 commented 3 days ago

@pawandhiman10 if you have access to the collector configuration, you can use the transformprocessor to do that:

processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set (status.code, 2) where attributes["http.status_code"] >= 500

Here are the enums available, and the explanation why I've suggested set (status.code, 2): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/contexts/ottlspan#enums
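To make the intent of that statement concrete, here is a small Python simulation of its effect on a batch of spans. It is a sketch only: the span dictionaries and the function name `apply_error_rule` are hypothetical, and leaving spans without the attribute untouched is an assumption of this sketch rather than a statement about the collector's internals.

```python
STATUS_CODE_ERROR = 2  # value used in the statement above; 0 = unset, 1 = ok


def apply_error_rule(spans):
    """Set the status code to error on spans whose http.status_code is >= 500."""
    for span in spans:
        status = span.get("attributes", {}).get("http.status_code")
        # Assumption: spans without a numeric http.status_code are left as they are.
        if isinstance(status, int) and status >= 500:
            span["status_code"] = STATUS_CODE_ERROR
    return spans


# Hypothetical spans, all starting with an unset status.
spans = [
    {"name": "ok-request", "attributes": {"http.status_code": 200}, "status_code": 0},
    {"name": "failed-request", "attributes": {"http.status_code": 503}, "status_code": 0},
    {"name": "internal-span", "attributes": {}, "status_code": 0},
]

result = apply_error_rule(spans)
print([s["status_code"] for s in result])
```

Only the span with the 503 attribute ends up with the error status; the other two keep their original status.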

pawandhiman10 commented 2 days ago

@julianocosta89 This is managed by Datadog agent currently without any collector config.

@samugi Could you also help me with setting up OTEL_TRACES_SAMPLER=always_on with this plugin? Setting it as an environment variable did not work as expected.

As for the error (5xx) reporting in Datadog, I will experiment with the code. Thanks for that.