apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Improve / formalize configuration of telemetry / logging / observability / metrics / tracing #3226

Open BrynCooke opened 1 year ago

BrynCooke commented 1 year ago

Currently we have an experimental_logging section that we should bring out of experimental, and in addition we should support JSON formatting options. There have also been several other users asking for much more control over what is output in both spans and logs, so it may be time to take a more holistic look.

The things that users have asked us for are:

Config

Suggest the following format:

telemetry:

  tracing:
    common: # Renamed from trace_config
      max_attributes_per_event: 128
      max_attributes_per_span: 128
      max_attributes_per_link: 128
      max_events_per_span: 128
      max_links_per_span: 128
      parent_based_sampler: true
      sampler: always_on
      service_name: router
      service_namespace: "default"
      resource:
        d: e

      # Resources are OTel config not otherwise represented in the YAML config

    propagation:
      baggage: false
      jaeger: false
      datadog: false
      request:
        header_name: "X-REQUEST-ID"
      trace_context: false
      zipkin: false

    otlp:
      enabled: true
      endpoint: "http://localhost:4317/v1/traces"

  metrics:
    common:
      service_namespace: "default"
      service_name: router
      buckets:
        - 0.1

      resource:
        test: foo
    prometheus:
      enabled: true
      path: /metrics
    otlp:
      enabled: true

  instruments:
    default_attribute_requirement_level: required
    router:
      http.server.active_requests: true
      my_instrument:
        value: unit
        type: counter
        unit: kb
        description: "my description"
        event: on_error
        attributes:

          http.response.status_code: false
          "my_attribute":
            response_header: "X-MY-HEADER"
            default: "unknown"
            redact: "foo"

    supergraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"
    subgraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"

  events:
    router:
      request: true
      response: false
      error: false
      test:
        message: "foo"
        level: info
        attributes:
          http.response.body.size: false

  spans:
    default_attribute_requirement_level: required
    legacy_request_span: true
    # The request span will be disappearing
    # router is the new root span
    router:
      attributes:
        dd.trace_id: false
        http.request.body.size: false
        http.response.body.size: false
        http.request.method: false
        http.request.method.original: false
        http.response.status_code: false
        network.protocol.name: false
        network.protocol.version: false
        network.transport: false
        error.type: false
        network.type: false
        trace_id: false
        user_agent.original: false
        client.address: false
        client.port: false
        http.route: false
        network.local.address: false
        network.local.port: false
        network.peer.address: false
        network.peer.port: false
        server.address: false
        server.port: false
        url.path: false
        url.query: false
        url.scheme: false
        "x-custom1":
          trace_id: datadog
        "x-custom2":

          response_header: "X-CUSTOM2"

          default: "unknown"
        "x-custom3":
          request_header: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom8":
          env: "ENV_VAR"

      #etc...
    supergraph:
      attributes:
        graphql.document: false
        graphql.operation.name: true
        graphql.operation.type: true

        "x-custom":
          query_variable: "arg1"
          default: "unknown"
          redact: ""
        "x-custom2":
          response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom6":
          operation_name: string
        "x-custom7":
          operation_name: hash
        "x-custom8":
          env: "ENV_VAR"
      #etc...
    subgraph:

      attributes:

        graphql.federation.subgraph.name: false
        graphql.operation.name: false
        graphql.operation.type: true
        "x-custom":
          subgraph_operation_name: string
          default: "unknown"
        "x-custom2":
          subgraph_response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"

Related is: https://github.com/apollographql/router/issues/1840 which also has some good ideas that have been folded into the above.

Path forward

This is a fairly large set of changes and will need to be split into several tickets. Let's get agreement that this is the way forward and try to schedule it in, as it is a blocker to adoption for some users and could have major performance ramifications due to reduced logging requirements.

Related issues

Implementation plan

There are a large number of items that need to be tackled. Once the new config structure has been implemented, we can start farming the tasks out to separate people.

Clean up existing config

New config structure

Spans

Tracing

Testing

Logging

Instruments

Events

bnjjj commented 1 year ago

I suggest using format this way:

format:
    json:
      location: true | false
      filename: true | false
      line_number: true | false
      spans: true | false
    # text:
    #   location: true | false
    #   filename: true | false
    #   line_number: true | false
    #   spans: true | false

It's also related to the discussion at https://github.com/apollographql/router/discussions/1961

abernix commented 1 year ago

Should we close #1840 since its ideas have been folded into the above?

piclemx commented 1 year ago

Does this request include another destination to send logs to?

BrynCooke commented 1 year ago

Not currently, but let's add it, as this could be implemented using https://github.com/tokio-rs/tracing/tree/master/tracing-appender
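As a sketch of what a file output section could look like (the `file` section and all of its keys here are hypothetical, not an implemented API), the config might gain something like:

```yaml
telemetry:
  logging:
    # Hypothetical: a non-blocking rolling file appender, which could be
    # backed by tracing-appender's RollingFileAppender and non_blocking.
    file:
      enabled: true
      directory: /var/log/router
      file_name_prefix: router.log
      rotation: daily   # minutely | hourly | daily | never
```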

kindermax commented 1 year ago

Hi, I may be wrong, but does the new logging configuration support injecting custom fields into the JSON output? It would be very convenient to add a custom field to all logs, such as service: my-apollo-router.

BrynCooke commented 1 year ago

Let's add this. I'll have a think about how to update the config.
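One possible shape for this (purely speculative; neither `static_fields` nor its location in the config is an implemented option) might be:

```yaml
telemetry:
  logging:
    format:
      json:
        # Hypothetical: static key/value pairs merged into every log line
        static_fields:
          service: my-apollo-router
```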

Geal commented 1 year ago

Pointing out now that any initiative around logging configuration carries a huge risk of destroying performance, so this should be built carefully.

hrkfdn commented 1 year ago

It would also be cool if field names could be specified, e.g. to fit a certain log structure, which would be necessary to make logs searchable/indexable. The example below would "rename" message to msg:

format:
    json:
        - type: spans
          name: spans
        - type: message
          name: msg

The wording of type and name is debatable, of course. Also, name could be optional when it does not deviate from type.

Bjohnson131 commented 1 year ago

Hello, my issue #3502 was linked here. I think that if I were to propose a location for this issue, maybe

telemetry -> redaction -> subgraph_errors (boolean)

would be the proper place? That would require you to change redaction from a list to an object.
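A sketch of that proposed shape (illustrative only, not an implemented option):

```yaml
telemetry:
  # Hypothetical: redaction as an object rather than a list
  redaction:
    subgraph_errors: true   # redact subgraph error detail from responses
```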

Geal commented 1 year ago

related issues:

BrynCooke commented 1 year ago

I've spent some time thinking about this and updated the example. I have not put in anything around customization of the JSON format. Can people who have indicated that they want this give some examples of why it is needed and what formats they are targeting? My fear is that even if we do something, it won't be flexible enough to actually be useful. If there are specific well-known formats that you'd like to see, then that would be different.

hrkfdn commented 1 year ago

Our use case is to follow company-wide log formats so that log lines are indexed properly. Being able to select and name fields would be nice (as outlined in https://github.com/apollographql/router/issues/3226#issuecomment-1606915924), as well as populating specific fields with static fields. Though I can see how the latter may be a little too specific.

yanns commented 1 year ago

> Our use case is to follow company-wide log formats so that log lines are indexed properly. Being able to select and name fields would be nice (as outlined in #3226 (comment)), as well as populating specific fields with static fields. Though I can see how the latter may be a little too specific.

Same use cases here (naming fields + adding static fields).

BrynCooke commented 1 year ago

Can you post some examples of the formats that you are seeking to work with?

yanns commented 1 year ago

Example 1

What we get from the router:

        {
            "timestamp": "2023-03-22T16:18:38.164112Z",
            "level": "INFO",
            "fields": {
                "message": "bla bla."
            },
            "target": "apollo_router::executable"
        }

Transform into:

            {
                "timestamp": "2023-03-22T16:18:38.164112Z",
                "severity": "info",
                "message": "bla bla.",
                "type": "router"
            }

Example 2

What we get from the router:

        {
            "timestamp": "2023-03-22T16:20:11.507683Z",
            "level": "INFO",
            "message": "POST /<project-key>/graphql 200",
            "duration": 0.07599999755620956,
            "http.method": "POST",
            "http.status_code": 200,
            "ctp.api_endpoint": "POST /<project-key>/graphql",
            "type": "router-http-access",
            "span": {
                "correlation_id": "6c8e9093-d702-4cb5-b274-14c3bc3c5367",
                "ctp.project_key": "cornichon-0322-172010-279-0",
                "name": "coco_tracing"
            },
            "spans": [
                {
                    "http.flavor": "HTTP/1.1",
                    "http.method": "POST",
                    "http.route": "/cornichon-0322-172010-279-0/graphql",
                    "otel.kind": "SERVER",
                    "name": "request"
                },
                {
                    "correlation_id": "6c8e9093-d702-4cb5-b274-14c3bc3c5367",
                    "ctp.project_key": "cornichon-0322-172010-279-0",
                    "name": "coco_tracing"
                }
            ]
        }

transform it into

            {
                "timestamp": "2023-03-22T16:20:11.507683Z",
                "severity": "info",
                "message": "POST /<project-key>/graphql 200",
                "duration": 0.07599999755620956,
                "http.method": "POST",
                "http.status_code": 200,
                "ctp.api_endpoint": "POST /<project-key>/graphql",
                "correlation_id": "6c8e9093-d702-4cb5-b274-14c3bc3c5367",
                "ctp.project_key": "cornichon-0322-172010-279-0",
                "type": "router-http-access"
            }

Recap

kindermax commented 1 year ago

Same case with adding custom fields.

In our log system each log message must contain two fields:

logformat: logevo and service: my-service. Without them, logs won't even reach Kibana's Elasticsearch.

There is also a third field, txid: "some-uuid", that helps to collect all related logs under one trace_id.

So basically, without those fields, we need a way to transform every log message before sending it to Elasticsearch.

A full example would be:

{
   "message": "Some message, 321",
   "level": "info",
   "service": "myservice",
   "timestamp": 1653582183.778443,
   "txid": "abcdefg-1234-4321-moremoremore"
}

yanns commented 1 year ago

If it helps, this is how we manipulate the logs currently: https://github.com/apollographql/router/issues/1651#issuecomment-1493830314

kindermax commented 1 year ago

> If it helps, this is how we manipulate the logs currently: #1651 (comment)

Yes, the only way to get the format I want is to post-process the logs in some way. In our case we collect logs from the k8s pods' stdout, so running the process piped through jq, or using Filebeat/Fluent Bit, is one way, but having a built-in ability to customize the log format would be the perfect solution for me.
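For illustration, the kind of external post-processing described above can be sketched in a few lines of Python (a standalone script, not router code), reshaping the router's default JSON lines into the flattened format from Example 1 earlier in the thread:

```python
import json

def reshape(line: str) -> str:
    """Flatten one router JSON log line: rename "level" to a lowercase
    "severity", hoist "fields.message" to the top level, and tag the
    record with a static "type" field."""
    record = json.loads(line)
    out = {
        "timestamp": record["timestamp"],
        "severity": record["level"].lower(),
        "message": record.get("fields", {}).get("message", ""),
        "type": "router",
    }
    return json.dumps(out)

# The shape shown in Example 1 earlier in the thread:
raw = ('{"timestamp": "2023-03-22T16:18:38.164112Z", "level": "INFO", '
       '"fields": {"message": "bla bla."}, "target": "apollo_router::executable"}')
print(reshape(raw))
# prints: {"timestamp": "2023-03-22T16:18:38.164112Z", "severity": "info", "message": "bla bla.", "type": "router"}
```

In a pipeline this would run per line (e.g. from stdin), which is essentially what the jq approach does today.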

kindermax commented 1 year ago

I understand that it can be hard to come up with a universal solution using YAML syntax, so for me, having the ability to write a plugin in Rust to transform/reformat logs would be totally fine too.

dmcilwee commented 1 year ago

I'd like to +1 at least getting the trace_id and span_id into logs consistently. We rely on logs being correlated to traces as part of our observability platform (and it's very powerful when it's hooked up).

Geal commented 1 year ago

additional ask for this initiative, coming from #3000

There's a large performance cost in using the fmt layer, for two reasons:

The new log layer should take care of avoiding those performance pitfalls.

Other performance issues to take care of:

Geal commented 1 year ago

I'm closing https://github.com/apollographql/router/pull/3887 since #3968 makes most of it redundant. There's one small thing that we should probably do though: wrap the ForwardValues vecs in Arc to avoid copies in the hot path https://github.com/apollographql/router/pull/3887/files#diff-7737729cbb75d74b8921ae972883e90ba02b94cce0217b662822ad5183e966e7R273-R276

danpayne17 commented 1 year ago

Hi! With the gateway, we add a lot of attributes to Datadog spans and log statements. I wanted to verify that with this ticket we would be able to do the same. We need to be able to add the following attributes:

BrynCooke commented 1 year ago

@danpayne17 I've updated the config in the ticket body with something that we will be working on implementing. I'll need to add environment variable support and operation hash. If there's anything missing, there is a natural place for those attributes to land in the new config structure.

BrynCooke commented 1 year ago

@danpayne17 Note that operation name/ID cannot be added to all log statements, because at the router request stage we have not yet parsed the incoming query. These will be available from supergraph spans onwards, though.

There may be similar issues with Rhai-generated trace IDs: Rhai is executed after all built-in plugins, so there may not be an opportunity to get a custom trace ID into the span early enough, but we will do some experimentation and let you know.

BrynCooke commented 1 year ago

For those looking for custom logging formats, the plan is to support some well-known formats out of the box, and then, if there is still demand, a custom formatter can be added.

piclemx commented 1 year ago

> For those looking for custom logging formats, the plan is to support some well-known formats out of the box, and then if there is still demand a custom formatter can be added.

Will JSON format be possible, or something OpenTelemetry-based?

Bjohnson131 commented 1 year ago

> For those looking for custom logging formats, the plan is to support some well-known formats out of the box, and then if there is still demand a custom formatter can be added.
>
> Will JSON format be possible, or something OpenTelemetry-based?

IIRC OTel is supported today, no?

BrynCooke commented 1 year ago

There are many JSON formats; the one that currently ships with the router is similar to the JSON format that ships with tracing_subscriber.

We'll initially support text and json as we do now. However formatters will be represented by an enum, which can contain custom configuration.

We'll create tickets for formats such as google, gelf, bunyan, and maybe also custom_json.
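To illustrate the enum idea (the exact keys here are illustrative, not a committed API), each formatter variant would carry its own options:

```yaml
telemetry:
  logging:
    stdout:
      # The formatter is an enum; picking the `json` variant
      # brings that variant's configuration keys with it.
      format:
        json:
          display_filename: true
          display_line_number: false
      # Other variants might include: text, google, gelf, bunyan, custom_json
```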

We will either get around to them based on demand, or if someone wants to submit a PR they can make things happen sooner.

There is a lot of work to implement what is now in the ticket description, but at least once the config is agreed upon and in the codebase we can start to divide and conquer.

BrynCooke commented 10 months ago

Just a quick update for people following this ticket.

We've spent considerable effort making telemetry spans customizable via YAML and following the OTel semantic conventions. This will land in the next release.

Users with a commercial license will have fine-grained control over which span attributes are present on spans and will be able to attach arbitrary attributes. Free users can use the less granular requirement levels that align with the OpenTelemetry semantic conventions.

Still TODO are instrument and event customization via YAML. We are hoping to get to these early next year, as they will allow users to set up conditional logging and metrics without having to reach for Rhai or a custom plugin.
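For instance, conditional event customization might end up looking something like this (the `condition` syntax shown here is speculative):

```yaml
telemetry:
  instrumentation:
    events:
      router:
        # Hypothetical: only emit this event for server errors
        server.error:
          message: "request failed"
          level: error
          on: response
          condition:
            eq:
              - response_status: code
              - 500
```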

Custom logging formats are also still on the table, as is exporting logs via an OpenTelemetry bridge for collection via OTLP.

BrynCooke commented 4 months ago

Another update on this ticket: custom instruments and events are now possible in YAML, and users seem to be taking advantage of these new features!

One thing that is still unimplemented is redaction, so if you need it, get in touch.

oskargotte commented 3 months ago

> One thing that is still unimplemented is redaction. So if you need this then get in touch.

@BrynCooke What about https://github.com/apollographql/router/issues/3502? Are there any options to get redacted subgraph errors included in the router logs but not in the GraphQL response? Or are there any plans to support this?

BrynCooke commented 3 months ago

@oskargotte We haven't had many folks asking for redaction; some of them do redaction centrally via their logging software. We'd still need more people to ask for this feature.

Bjohnson131 commented 3 months ago

FWIW, logging redaction isn't really a standard feature for many programs that exist today, so it's understandable that this won't be supported.