elastic / observability-docs

Elastic Observability Documentation

[Request]: Update 'service.name' docs with additional guidance #4102

Open roshan-elastic opened 2 months ago

roshan-elastic commented 2 months ago

Relates to:

Description

We need to add some additional content to the docs which explain how to declare a service in your logs:

https://www.elastic.co/docs/current/serverless/observability/add-logs-service-name

Changes required

  1. Explain that log.level needs to be present in order to display log metrics for a service in the new experience
  2. Document a potentially common use case which a user could come across

Background

In order to filter out unhelpful APM logs unrelated to logging (e.g. APM transaction errors), we are forcing the 'log rate' and 'log error %' metrics to require log.level in order to work.

Services Inventory - New Experience

Services View - New Experience

Specifically:

Log Rate: rate of logs per minute observed for a given service.name.

Formula: count(kql='log.level: *') / [PERIOD_IN_MINUTES]

Log Error %: percentage of logs where an error is detected for a given service.name.

Formula: count(kql='log.level: "error" OR log.level: "ERROR"') / count(kql='log.level: *')
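To make the two formulas concrete, here is a minimal Python sketch of how the counts combine (illustrative only; in the product these metrics are computed by KQL queries in Elasticsearch, not client-side):

```python
# Hypothetical in-memory stand-in for the KQL queries above.
PERIOD_IN_MINUTES = 5

logs = [
    {"log.level": "info", "message": "proxying API request"},
    {"log.level": "ERROR", "message": "upstream timeout"},
    {"message": "no level field"},  # excluded: log.level is missing
]

# kql='log.level: *' -- only documents that carry a log.level count
with_level = [d for d in logs if "log.level" in d]

# kql='log.level: "error" OR log.level: "ERROR"'
errors = [d for d in with_level if d["log.level"].lower() == "error"]

log_rate = len(with_level) / PERIOD_IN_MINUTES   # logs per minute
log_error_pct = len(errors) / len(with_level)    # fraction of leveled logs
```

The third document illustrates the problem described in this issue: without a parsed log.level, it contributes to neither metric.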

log.level isn't always provided automatically by our various ingestion methods (e.g. Beats, Elastic Agent), so we need to provide some guidance that explains this and suggests how to add it.

Additionally, there is a likely common use case where log.level is nested within a message and is therefore not used in our metrics. We would like to provide some guidance around this.

1. Explain that log.level needs to be present in order to display log metrics for a service in the new experience

Note: We will be updating the UI to point towards this documentation.

We should provide some guidance on how to declare log.level in your logs via Elastic Agent. For example, here is some documentation on how to do this for the standalone agent:

https://www.elastic.co/guide/en/fleet/current/elastic-agent-standalone-logging-config.html

Perhaps there is more comprehensive guidance? I'll add our engineers as contacts in case they can help.

2. Document a potentially common use case which a user could come across

One potentially common use case is where users specify a service.name in container or Kubernetes logs (although the issue isn't exclusive to that scenario). We should write a bit about this to give them an example of how to work around it.

Example cluster

In this case, the log relating to the service is nested within the message:

message:{"@timestamp":"2024-07-31T08:26:45Z","log.level":"info","message":"proxying API request to http://opbeans:3000"}

There are methods to decode this, but users need to be careful not to overwrite an existing log.level when they do.

We should make them aware of this use case and give them some guidance on how to pull out the encoded log.level and surface it in the main document (being careful not to overwrite the existing log.level).
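For example, one possible approach is a Beats/Elastic Agent decode_json_fields processor. The sketch below is only an illustration of the idea (decode the JSON message into the root without clobbering existing fields); the exact settings should be validated against the processor documentation:

```yaml
processors:
  - decode_json_fields:
      fields: ["message"]     # the field holding the JSON-encoded log line
      target: ""              # decode keys into the root of the event
      overwrite_keys: false   # keep any log.level the document already has
      add_error_key: true     # flag documents that fail to parse
      max_depth: 1
```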

Resources

Quick demo video:

https://github.com/user-attachments/assets/e3298e51-755b-4746-8991-edf10c859f5b.mp4

Which documentation set does this change impact?

Stateful and Serverless

Feature differences

It's going to be available on both.

What release is this request related to?

8.16

Collaboration model

The documentation team

Point of contact.

Main contact: @roshan-elastic

Stakeholders: @cauemarcondes @kpatticha

roshan-elastic commented 2 months ago

Hey @mdbirnstiehl this is the update I mentioned.

@bmorelli25 wondering if this update could be prioritised? It relates to a problem that we can't work around in the product, so we want to provide user guidance in the docs (and link to it from the product).

We'll be linking to the docs from the product (by next Tuesday), but since it's a short link, we can point to the general docs without this being a blocker.

bmorelli25 commented 2 months ago

We should be able to prioritize this. @mdbirnstiehl is booked up for the near future, but I'll try to find someone else on the team to take this on.

bmorelli25 commented 2 months ago

Note to writer: This document also needs to be ported to stateful

dedemorton commented 1 month ago

@cauemarcondes @kpatticha Can you provide a decode_json_fields processor config example that shows how to pull out the encoded log.level and surface it in the main document (being careful not to overwrite the existing log.level). Maybe a config that would work with the following example:

message:{"@timestamp":"2024-07-31T08:26:45Z","log.level":"info","message":"proxying API request to http://opbeans:3000"}

dedemorton commented 1 month ago

@mdbirnstiehl Do you think this new content belongs in the topic about adding the service name, or in a new topic called something like "Add a log level to logs"? I'm leaning towards a separate topic because it seems like we are mixing things that logically don't belong together. If I add this info to the existing topic, I'll need to completely restructure it. I also wonder if folks who aren't using the new experience might still want to know how to add (or decode) the log level. WDYT?

cauemarcondes commented 1 month ago

Can you provide a decode_json_fields processor config example that shows how to pull out the encoded log.level and surface it in the main document (being careful not to overwrite the existing log.level). Maybe a config that would work with the following example:

Hi @dedemorton, I think it's best to ask the @elastic/obs-ux-logs-team team for an official guide on how to add the JSON processor to both the Auto-detect logs and metrics and the Stream log files getting started guides. @flash1293 This is what we talked about a couple of weeks ago when you helped me parse the log messages.

tonyghiani commented 1 month ago

Can you provide a decode_json_fields processor config example that shows how to pull out the encoded log.level and surface it in the main document (being careful not to overwrite the existing log.level)

@dedemorton you can use the pre-installed logs@json-pipeline pipeline to parse JSON logs. This is installed by ES by default and takes care of all the steps for parsing a JSON-like message.

Please note that the strategy the pipeline follows after parsing is add_to_root_conflict_strategy: merge, which means existing parsed fields will be overwritten.

Here is how you can use it:

{
  "my-pipeline": {
    "processors": [
      {
        "pipeline": {
          "name": "logs@json-pipeline",
          "ignore_missing_pipeline": true
        }
      }
    ]
  }
}
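If you want to check the behavior before wiring this into your data stream, a hypothetical Dev Tools snippet using the ingest simulate API (assuming the my-pipeline definition above has been created, and reusing the sample message from this issue) could look like:

```
POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "{\"@timestamp\":\"2024-07-31T08:26:45Z\",\"log.level\":\"info\",\"message\":\"proxying API request to http://opbeans:3000\"}"
      }
    }
  ]
}
```

The simulated response should show log.level promoted to a top-level field of the document.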

And this is the whole definition of logs@json-pipeline:

{
  "logs@json-pipeline": {
    "processors": [
      {
        "rename": {
          "if": "ctx.message instanceof String && ctx.message.startsWith('{') && ctx.message.endsWith('}')",
          "field": "message",
          "target_field": "_tmp_json_message",
          "ignore_missing": true
        }
      },
      {
        "json": {
          "if": "ctx._tmp_json_message != null",
          "field": "_tmp_json_message",
          "add_to_root": true,
          "add_to_root_conflict_strategy": "merge",
          "allow_duplicate_keys": true,
          "on_failure": [
            {
              "rename": {
                "field": "_tmp_json_message",
                "target_field": "message",
                "ignore_missing": true
              }
            }
          ]
        }
      },
      {
        "dot_expander": {
          "if": "ctx._tmp_json_message != null",
          "field": "*",
          "override": true
        }
      },
      {
        "remove": {
          "field": "_tmp_json_message",
          "ignore_missing": true
        }
      }
    ],
    "_meta": {
      "description": "automatic parsing of JSON log messages",
      "managed": true
    },
    "version": 12,
    "deprecated": false
  }
}

mdbirnstiehl commented 1 month ago

@mdbirnstiehl Do you think this new content belongs in the topic about adding the service name, or in a new topic called something like "Add a log level to logs"? I'm leaning towards a separate topic because it seems like we are mixing things that logically don't belong together. If I add this info to the existing topic, I'll need to completely restructure it. I also wonder if folks who aren't using the new experience might still want to know how to add (or decode) the log level. WDYT?

Yeah, I agree that it makes more sense to me as a separate topic.

dedemorton commented 1 month ago

Spoke with @mdbirnstiehl. He is going to take over this issue because he has time now. Plus, as our logs guy™, he knows more about this subject. Thanks Mike!

mdbirnstiehl commented 1 month ago

Hi @roshan-elastic, I'm not sure I completely understand the scenario of having to declare a log level for logs that don't contain log levels at all.

I understand the case where there is a log.level present but it's not parsed, and we can use the logs@json pipeline described by @tonyghiani to parse the logs.

With the standalone agent link, wouldn't that just apply to the Agent's own logs and not to logs from events that are getting indexed? Are we wanting to create arbitrary log levels for logs that don't contain log levels? I'm not sure that would create meaningful data or graphs.

roshan-elastic commented 1 month ago

Hey @mdbirnstiehl,

Good timing :)

We're actually going to change:

Current

Log Rate: currently requires logs to have log.level to be included
Log Error %: currently requires logs to have log.level to be included

Change to

Log Rate: count() / [PERIOD_IN_MINUTES]
Log Error Rate: count(kql='log.level: "error" OR log.level: "ERROR"') / [PERIOD_IN_MINUTES]

@iblancof might be the best contact for this but we'll be making these changes as part of this epic:

For example, here

mdbirnstiehl commented 3 weeks ago

Hi @roshan-elastic and @iblancof ,

Would it make sense to add a note or section in the new experience page about the Log Rate and Log Error Rate formulas and the need to parse the log.level for the Log Error Rate? We could then point to some examples like the page on extracting log.level from an unstructured or semi-structured log data, or extracting a log level from k8s logs?

I think it might fit better on the new experiences page rather than the service.name page. WDYT?

roshan-elastic commented 3 weeks ago

Hey @mdbirnstiehl - thanks for reminding me about this.

we're actually about to change how we handle the log rate and log error %:

Log Rate

Now: count(kql='log.level: *') / [PERIOD_IN_MINUTES]
Changing to: count() / [PERIOD_IN_MINUTES]

Log Error %

Now: count(kql='log.level: "error" OR log.level: "ERROR"') / [PERIOD_IN_MINUTES]
Changing to (Log Error Rate): count(kql='log.level: "error" OR log.level: "ERROR" OR error.log.level : "error"') / [PERIOD_IN_MINUTES]

This means that the log charts in the service views should be far less likely to be empty.

We could then point to some examples like the page on extracting log.level from an unstructured or semi-structured log data, or extracting a log level from k8s logs?

Makes sense!

I think it might fit better on the new experiences page rather than the service.name page. WDYT?

Yeah, I think it makes sense to show this in the new experiences page, but perhaps it also makes sense to link to 'how' to do this from the service logs doc, so that all of the information about how to declare your services well lives in one place?

In short:

WDYT?

mdbirnstiehl commented 3 weeks ago

@roshan-elastic sounds good, I'll start making the updates!