[META] Make the Elastic Agent easier to debug

cmacknz commented 1 year ago

We have made it easier to tell that the Elastic Agent is not working, mapping errors back to individual inputs and integrations. Sometimes these errors are enough to diagnose a problem, but this is not always the case and this mechanism is not enough to diagnose the most complex failures.

There are several things we can do to improve this, from improving our log messages to making it easier to understand Elastic Agent diagnostics bundles.

Diagnostics Analysis

We can greatly improve how easy it is to debug the Elastic Agent by automating the most common tasks when analyzing diagnostics. We can implement a tool or an integration that does at least the following and generates a summary report of possible problems:

Summarizing component and input errors in the state.yml file.
Finds error level log messages and orders them by how frequently they occur.
Finds previous transitions into the DEGRADED or FAILED state for each unit, summarizing the frequency and reason for each transition.
Parses the periodic metrics summaries from the log files and lists detectable problems. For example detecting and highlighting that the internal queue of a component process is full.
Splits the elastic-agent-*.ndjson files into separate files per component to make it easier to follow the events for a specific input.

- [ ] https://github.com/elastic/beats/issues/33238
- [ ] https://github.com/elastic/elastic-agent/issues/3758
- [ ] https://github.com/elastic/elastic-agent/issues/2940
- [ ] https://github.com/elastic/elastic-agent/issues/3804
- [ ] https://github.com/elastic/ingest-dev/issues/2775
- [ ] https://github.com/elastic/elastic-agent/issues/3826
- [ ] https://github.com/elastic/beats/issues/37929

Improved log messages and per integration logging

Today the utility of the debug log level is limited because of how verbose debug level logs are. We can fix this by allowing users to control the log level on a per input or per integration basis, the foundation for this is already available as described in https://github.com/elastic/ingest-dev/issues/2405. We can additionally expose the internal trace log level the agent uses to hide all per event logging in Beats, which is far too verbose at high data rates but becomes useful again if it can be scoped to a single input.

We can additionally improve the quality of our error level logs, ensuring only actual problems are logged at the error level and that we log enough information to debug critical issues by default.

- [ ] https://github.com/elastic/ingest-dev/issues/2405
- [ ] https://github.com/elastic/elastic-agent/issues/3768
- [ ] https://github.com/elastic/elastic-agent/issues/2212
- [ ] https://github.com/elastic/elastic-agent/issues/3406
- [ ] https://github.com/elastic/elastic-agent/issues/3705
- [ ] https://github.com/elastic/beats/issues/30687
- [ ] https://github.com/elastic/beats/issues/33632
- [ ] https://github.com/elastic/elastic-agent/issues/3363

Always collecting necessary diagnostics

The agent diagnostics command captures heap profiles by default and can capture CPU profiles when instructed to. An obvious gap is the inability to capture CPU profiles when collecting diagnostics from Fleet. For situations where the agent has become unresponsive, we should add a command that creates a compressed archive of the agent installation directory since it usually contains logs or clues about why the agent isn't responding:

- [ ] https://github.com/elastic/kibana/issues/162849
- [ ] https://github.com/elastic/elastic-agent/issues/3491
- [ ] https://github.com/elastic/elastic-agent/issues/2048

Improved error reporting and diagnostics on Kubernetes

There are several problems that are specific to an agent running on Kubernetes that are currently difficult to detect and debug. We should automatically detect inadequate memory allocations and automatically set the GOMAXPROCS variable depending on the configured CPU quota, see https://github.com/golang/go/issues/33803.

- [ ] https://github.com/elastic/elastic-agent/issues/3641
- [ ] https://github.com/elastic/elastic-agent/issues/3642
- [ ] https://github.com/elastic/elastic-agent/issues/2423 
- [ ] https://github.com/elastic/beats/issues/31920

Exposing Elastic Agent Metrics

The Elastic Agent should report report metrics about itself, which should be available on the HTTP interface. We should expose at least the following set of metrics:

Gauges for the number of running components and units.
Counter for the number of successful and failed Fleet check ins.
Counters for the number actions processed for each action type.
Counters specific to each running component and unit: counters for the number of restarts, counters for transitions into each state, counters for the number of checkins received, counters for the number of missed checkins, counters for the number of configuration changes applied when the expected state does not match the observed state, etc.

elasticmachine commented 1 year ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

belimawr commented 1 year ago

Some easy way to control the log rotation would be nice as well, maybe even increasing the limit automatically when we set the log level to anything more verbose than info.

Sometimes it is hard to collect logs from a big enough time window the level is set to debug. That, in part, is a factor of the logs being too verbose, but the number of integrations and process as well as the load also plays a role on it.

cmacknz commented 1 year ago

IMO expecting to be able to inspect every event in a high event system is impractical. We should likely separate event captures from the rest of our logs so that they can't flood our other messages and have standard identifiers to correlate between them.

Tracing solutions solve both the correlation problem via span IDs and the volume problem via sampling and aggregation. The trouble we have is that we can't rely on the presence of a trace aggregator like APM server to be able to do this all the time.

We could at least start writing the events themselves out to a separate file with a separate size and rotation policy. I think someone explored the concept of a tee output previously that would be nice for this.

andrewkroh commented 1 year ago

As a user performing troubleshooting, I would like to be able to correlate the stream ID from the agent policy with both the logs and metrics. IIRC from the last time I was trying to do this in Kibana, there were a few different fields that held this ID value and not all of them were indexed.

I think it would be helpful for Elastic Agent to give guidance on logging schema for its constituent inputs to follow. With that written down somewhere we could then make changes to ensure that it is followed and make sure those fields are mapped consistently in the elastic_agent integration. Here's a quick idea building upon on the component fields I see in the elastic_agent logs.

Field	Description	Examples
component.kind	Kind of component. These are general high-level concepts.	input, processor, output
component.type	Type of component. Type specifies a named implementation of the specific component kind.	winlog, filestream, add_cloud_metadata, elasticsearch
component.id	A unique identifier associated with an instance of of the component type.	winlog-windows.sysmon_operational-9c18c120-174e-11ee-925f-b5a3d2f00fd4
component.binary	Name of the binary without OS specific file extensions.	filebeat, auditbeat, elastic-defend

Example of data from beats:

[
  {
    "@timestamp": "2023-10-26 20:09:25 UTC",
    "component": {
      "id": "winlog-windows.sysmon_operational-9c18c120-174e-11ee-925f-b5a3d2f00fd4",
      "kind": "input",
      "type": "winlog"
    },
    "log": {
      "level": "info"
    },
    "message": "Hello world!"
  },
  {
    "@timestamp": "2023-10-26 20:09:25 UTC",
    "component": {
      "id": "winlog-windows.sysmon_operational-9c18c120-174e-11ee-925f-b5a3d2f00fd4",
      "kind": "input",
      "type": "winlog"
    },
    "event": {
      "kind": "metric"
    },
    "input": {
      "errors_total": 0,
      "received_events_total": 446799
    }
  },
  {
    "@timestamp": "2023-10-26 20:09:25 UTC",
    "component": {
      "id": "add_cloud_metadata-001",
      "kind": "processor",
      "type": "add_cloud_metadata"
    },
    "log": {
      "level": "debug"
    },
    "message": "Detected AWS"
  }
]

The component.x labels coming from the raw Beat logs would take precedence over any component.x labels being added by Elastic Agent as the logs flows through it.

cmacknz commented 1 year ago

Great suggestion, I will add that to the description as something we should do.

To address Tiago's request I think something like a tee output is the nicest way to deal with event captures. We could isolate it from log rotation entirely and build configuration around tee-ing events to file like "tee up to 10 MB or 10s of events to this file path and then stop". Maybe this would make sense as a processor so we could put conditions on the events we log to file to filter them more easily.

@leehinman had the original idea for a tee output as part of a spacetime project, Lee did you make any progress on that? Any reason it wouldn't work? I'd link to the issue but the spacetime projects are all private.

pierrehilbert commented 1 year ago

Adding https://github.com/elastic/elastic-agent/issues/3758 to the list as we need to ensure diagnostics are usable.

AndersonQ commented 1 year ago

Field	Description	Examples
component.kind	Kind of component. These are general high-level concepts.	input, processor, output
component.type	Type of component. Type specifies a named implementation of the specific component kind.	winlog, filestream, add_cloud_metadata, elasticsearch

That's a great idea. But I'd suggest to have either kind or type, not both, I believe the distinction between them isn't straight forward and it'd make working with them confusing. Perhaps type and name would be more clear:

Field	Description	Examples
component.type	Kind of component. These are general high-level concepts.	input, processor, output
component.name	Type of component. Type specifies a named implementation of the specific component kind.	winlog, filestream, add_cloud_metadata, elasticsearch

AndersonQ commented 1 year ago

Some easy way to control the log rotation would be nice as well, maybe even increasing the limit automatically when we set the log level to anything more verbose than info.

Sometimes it is hard to collect logs from a big enough time window the level is set to debug. That, in part, is a factor of the logs being too verbose, but the number of integrations and process as well as the load also plays a role on it.

Perhaps we could have a exclusion selector to disable some loggers or at least have the trace level for stand alone beats as well

lucabelluccini commented 1 year ago

Wanted to share some thoughts/ideas about "Logging" items of this issue.

We should ensure debug logs do not dump any content about an event/file read from the user. The idea is to make sure debug logs can be safely enabled and diagnostics collected without any fear sensible information might be present in the diagnostic bundle. This facilitates troubleshooting friction (imagine a public discuss forum or support cases).
We should offer a tracing log level (opt-in for the final user) which can be enabled at global level (elastic agent only or elastic agent and all components) or per integration/component. The tracing log level would be allowed to dump sensible information as contents of requests/files can be present. Note we have already something similar to this in the HTTP JSON Inputs already https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-httpjson.html#_request_tracer_filename and it is also exposed in the integrations using HTTP JSON Input - and it facilitates a lot the troubleshooting especially with pagination errors or invalid requests.
In some cases, such as when dropping events or in case of rejections from the output, we should report as much metadata as possible about what happened (error codes, the destination of the event, etc...), but we should never dump the whole event being sent and being rejected or dropped in any log level except trace. Error logs or Info logs should be used to report the event was dropped (and to what input/module/integration/datastream belongs - so print out the metadata about it), but not the actual contents, which should be logged only in trace mode. Example: if we drop/reject an event, we should report the data_stream fields, the component which produced it and the reason of the rejection in info or warning (if not expected), but the whole content of the event should be available only in trace level.

IICU, we do not have a real "trace" level:

Beats standalone run "debug" with tracing features
Beats running under Elastic agent run "debug" without tracing features

FYI @bczifra

cmacknz commented 1 year ago

Agreed we need to support something like a generalized event trace file that is separate from our logs, and avoid logging full events to ensure we don't accidentally log sensitive information. The existing Beats debugging log level will log events and does not sanitize them.

cmacknz commented 1 year ago

I forked Andrew's suggestion above in https://github.com/elastic/elastic-agent/issues/3640#issuecomment-1781875889 off into its own issue for tracking https://github.com/elastic/elastic-agent/issues/3768

efd6 commented 7 months ago

For long-running support cases, it would be helpful to be able to correlate user configuration changes to changes in behaviour. This is currently difficult when multiple configuration states are tried under support and developer advice. A way to achieve this would be to write out configuration changes made by the user — probably not in the agent logs, though maybe. If a small set of configuration state change files, each containing the from/to change and timestamped in the filename with rotation, this would enable this without significant filesystem burden. These files would then be included in diagnostic bundles.

This config change logger would need to understand secrets annotation to avoid them being included in the files.

lucabelluccini commented 7 months ago

I think logging Policy ID and the revision could be a start. It is possible to retrieve policies by revision, requesting specific documents from Fleet indices.

I am not aware EA is able to compute a diff of before/after after applying a policy. I think it is only able to tell if a component changed or not. Dumping the whole policy would be possible, eventually, but as @efd6 suggests, we would need the e2e secrets awareness in order to redact the secrets. EA atm doesn't have this info.

The Kibana diagnostic can pull more data out of Fleet APIs when the secrets will be fully implemented by integration developers. But I doubt we will be able to pull the whole history of a policy (it would be a lot of info).

cmacknz commented 7 months ago

In 8.13.0 we added a log line to summarize the configuration units that were added/removed/updated:

{"log.level":"info","@timestamp":"2024-01-17T22:08:16.612Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1382},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"added":["udp-default","journald-default"],"updated":["http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]","log-default: [(log-default-logfile-system-9bf446fc-58d4-4767-b42d-3450815d5d3d: updated)]","system/metrics-default: [(system/metrics-default-system/metrics-system-9bf446fc-58d4-4767-b42d-3450815d5d3d: updated)]","beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]"],"count":7},"outputs":{}},"ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2024-01-17T22:08:18.558Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1382},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":7},"outputs":{}},"ecs.version":"1.6.0"}

This is missing the policy id and revision which we should add in to these logs to hopefully enable querying the policy itself out of Fleet.

elastic / elastic-agent