Elastic Defend does not respect original policy log level #5659

Open AndersonQ opened 3 weeks ago

AndersonQ commented 3 weeks ago

If a new agent is enrolled on a policy that has the log level set to debug and includes the Elastic Defend integration, Defend will start with log level info, ignoring the level set on the policy.

For confirmed bugs, please report:

1st diagnostics, just after enroll
2nd diagnostics, after changing the agent's log level

elasticmachine commented 3 weeks ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

pierrehilbert commented 2 weeks ago

Isn't this an issue on the Defend side rather than the Agent side?

cmacknz commented 2 weeks ago

I see only info logs in the first diagnostics along with endpoint failing to apply the policy, but I can't tell why:

{"@timestamp":"2024-09-27T14:31:54.385200735Z","agent":{"id":"49800f30-88c5-4aeb-b207-acd0732d97ab","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"error","origin":{"file":{"line":226,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:226 Failed to apply new policy from Agent.","process":{"pid":7768,"thread":{"id":7919}}}
{"@timestamp":"2024-09-27T14:31:54.385367197Z","agent":{"id":"49800f30-88c5-4aeb-b207-acd0732d97ab","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"warning","origin":{"file":{"line":485,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:485 Endpoint is setting status to DEGRADED, reason: Policy Application Status","process":{"pid":7768,"thread":{"id":7919}}}

So I'd say this investigation should start on the defend side.

I also see that the agent's components-expected.yaml, which is what we'd send on the control protocol, has the endpoint log level set to debug.

There are endpoint debug logs in the second set of diagnostics.

pierrehilbert commented 2 weeks ago

@nfritts Could we have someone take a look here, please?

gabriellandau commented 5 days ago

This just occurred during an Endpoint SDH.

gabriellandau commented 5 days ago

Defend first sets its file logger to fleet.agent.logging.level as received in the config. By design, this is overridden by inputs[0].policy.OS_TYPE.advanced.logging.file if it is set in the Defend policy. Usually this advanced policy option is not set. I can't put private repo links here, but if you DM me, I can point you at the code; it's straightforward.
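
A minimal Go sketch of that precedence as described above (the config keys are the ones named in this thread; the function and type names are hypothetical, and Endpoint's real implementation is C++):

package main

import "fmt"

// endpointConfig holds only the two log-level sources named in this thread.
// Field names are hypothetical; they stand in for the YAML keys.
type endpointConfig struct {
	FleetAgentLoggingLevel string // fleet.agent.logging.level
	AdvancedLoggingFile    string // inputs[0].policy.<OS_TYPE>.advanced.logging.file ("" if unset)
}

// effectiveFileLogLevel mirrors the precedence described above: start from
// fleet.agent.logging.level, then let the advanced per-OS option override it
// when it is explicitly set.
func effectiveFileLogLevel(cfg endpointConfig) string {
	level := cfg.FleetAgentLoggingLevel
	if cfg.AdvancedLoggingFile != "" {
		level = cfg.AdvancedLoggingFile
	}
	return level
}

func main() {
	// The situation from the first diagnostics: fleet says info, no advanced override.
	fmt.Println(effectiveFileLogLevel(endpointConfig{FleetAgentLoggingLevel: "info"})) // info
}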

1st diagnostics, just after enroll

components\endpoint-default\elastic-endpoint.yaml shows fleet.agent.logging.level as info:

fleet:
  agent:
    logging:
      level: info
revision: 3

Looking at computed-config.yaml in those diagnostics shows:

agent:
    logging:
        level: debug
revision: 3

These should be the same, right?

@brian-mckinney you worked on Endpoint's side of V2. Does Endpoint modify the policy YAML received from Agent? Could Endpoint be changing the fleet.agent.logging.level value received from Agent?

Red Herring?

Here's an observation that's probably a red herring but I think I should share it anyway: components\endpoint-default\elastic-endpoint.yaml has multiple inputs[0].policy.OS_TYPE.logging.file keys, which aren't keys that Defend looks for. Note the lack of advanced - they're not in advanced policy. I don't know what put them there.

brian-mckinney commented 5 days ago

I'm looking at these diagnostics and will have a more detailed response shortly. Some initial thoughts:

When an endpoint is initially installed, but before it gets its agent policy, it is started with an initial empty config:

revision: 0

fleet:
  agent:
    id: 00000000-0000-0000-0000-000000000000
    logging:
      level: info
  host:
    id: 00000000-0000-0000-0000-000000000000

inputs:
  - id: 00000000-0000-0000-0000-000000000000
    name: initial
    revision: 0

This should be overwritten when agent pushes a new policy to the endpoint. We don't make any modifications to the config, other than moving the location of some of the keys to match our legacy config format.

brian-mckinney commented 5 days ago

In the first diagnostics package, in the computed config, agent does indeed have a logging level set to debug:

agent:
    download:
        sourceURI: https://artifacts.elastic.co/downloads/
        timeout: 2h
    features: null
    logging:
        level: debug
        metrics:
            period: 30s

However, in the same file, in the Defend config section, policy.linux.logging.file is set to info (the same for the other operating systems):

      policy:
        linux:
            logging:
                file: info

In the endpoint log, you can see that the log level is being read from the supplied config at the info level:

{"@timestamp":"2024-09-27T14:31:54.383399825Z","agent":{"id":"49800f30-88c5-4aeb-b207-acd0732d97ab","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":89,"name":"PolicyLogging.cpp"}}},"message":"PolicyLogging.cpp:89 Read Logging config:","process":{"pid":7768,"thread":{"id":7919}}}
{"@timestamp":"2024-09-27T14:31:54.383404239Z","agent":{"id":"49800f30-88c5-4aeb-b207-acd0732d97ab","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":92,"name":"PolicyLogging.cpp"}}},"message":"PolicyLogging.cpp:92     file: info","process":{"pid":7768,"thread":{"id":7919}}}

The function that reads the log level from the config first clears the internal log handlers, to ensure that the level actually comes from the config. It reads from two locations: fleet.agent.logging.level and inputs[0].policy.OS_TYPE.advanced.logging.file.

Is it possible to have a log level explicitly set in the defend portion of the policy in Kibana? Because it seems as though the config is being sent down to endpoint with logging at the info level, even though it is set to debug in the agent section.

@cmacknz @AndersonQ

cmacknz commented 5 days ago

Fleet and Agent generally don't look into or modify the integration configuration themselves (or they try very hard not to, with a few exceptions like giving Endpoint the Fleet Server host name). Fleet only knows to set the agent.logging.level setting here.

The Fleet plugin in Kibana doesn't know the policy.linux.logging.file setting exists and probably shouldn't.

brian-mckinney commented 5 days ago

From everything I can tell, it appears that Agent is sending Endpoint a log level of info in that initial diagnostics package. I just looked at the components-expected.yaml in that package and confirmed that fleet.agent.logging.level seems to be set to info:

                    fleet:
                        kind:
                            structvalue:
                                fields:
                                    access_api_key:
                                        kind:
                                            stringvalue: V3dfa001SUItaXVTTGVpY3d5YWk6czFTd0MxYXBSNC1TS285cXRwdEdvdw==
                                    agent:
                                        kind:
                                            structvalue:
                                                fields:
                                                    id:
                                                        kind:
                                                            stringvalue: 49800f30-88c5-4aeb-b207-acd0732d97ab
                                                    logging:
                                                        kind:
                                                            structvalue:
                                                                fields:
                                                                    level:
                                                                        kind:
                                                                            stringvalue: info

So I think we are legitimately being sent an initial config with a log level set to info, when the bug report states it should be debug. Unless I'm missing something, of course. My experience reading these diagnostics packages is not very deep.

cmacknz commented 4 days ago

OK, thanks, I see what is happening: there are two conflicting versions of the agent log level, and Endpoint is respecting the stale one in the Fleet configuration. I did not realize or expect the log level to be in there, and that is not what I was looking at originally.

The API contract for the expected log level received on the control protocol is the UnitLogLevel in the UnitExpected message:

https://github.com/elastic/elastic-agent-client/blob/86f524ddc40d8aae5a01244ae223fb87f8b33265/elastic-agent-client.proto#L199-L201

// Log level for the unit.
enum UnitLogLevel {
    ERROR = 0;
    WARN = 1;
    INFO = 2;
    DEBUG = 3;
    TRACE = 4;
}

message UnitExpected {
    //...
    UnitExpectedConfig config = 5;
    // Log level of the unit.
    UnitLogLevel log_level = 6;
}

In the components-expected.yaml file this is correctly set to debug:

          id: endpoint-default-08ffdcde-0b3a-453a-836f-0362ff4b55e0
          log_level: 3 # DEBUG
          type: 0

However, the log level is also coming in as a sub-key of the fleet configuration, which we inject exclusively into the endpoint configuration because Endpoint has to contact Fleet Server as well, and that log level is stale at info (probably the log level at enrollment time).

This log level injection happens here: https://github.com/elastic/elastic-agent/blob/1c041a29bfde297be5c29e89c7da5df3bd3cc8b2/internal/pkg/agent/application/fleet_server_bootstrap.go#L141-L152
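
As a rough, hedged illustration of the mechanism being described (not the actual code behind that link; the type and function names here are hypothetical), the injection amounts to copying the locally stored agent log level into the fleet section of the endpoint unit config:

package main

import "fmt"

// localAgentInfo stands in for the persisted agent metadata shown in
// agent-info.yaml; in this incident its LogLevel was still "info".
type localAgentInfo struct {
	ID       string
	LogLevel string
}

// injectFleetConfig sketches the effect described above: the
// fleet.agent.logging.level key that Endpoint later reads is filled from the
// local agent info rather than from the policy received from Fleet.
func injectFleetConfig(unitCfg map[string]interface{}, info localAgentInfo) {
	unitCfg["fleet"] = map[string]interface{}{
		"agent": map[string]interface{}{
			"id": info.ID,
			"logging": map[string]interface{}{
				// Stale source: this value was captured at install/enroll time
				// and is not updated when Fleet later changes the policy log level.
				"level": info.LogLevel,
			},
		},
	}
}

func main() {
	cfg := map[string]interface{}{}
	injectFleetConfig(cfg, localAgentInfo{ID: "49800f30-88c5-4aeb-b207-acd0732d97ab", LogLevel: "info"})
	fmt.Println(cfg["fleet"]) // fleet.agent.logging.level ends up as "info"
}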

This log level is out of date because we don't update the log level that was set when the agent was installed. If I look at what is in the agent-info.yaml block, it is also info:

agent_id: 49800f30-88c5-4aeb-b207-acd0732d97ab
headers: {}
log_level: info
snapshot: true
unprivileged: false
version: 8.16.0

This is despite the fact that the current log level received from Fleet was debug, as shown in pre-config.yaml:

agent:
    download:
        sourceURI: https://artifacts.elastic.co/downloads/
        timeout: 2h
    features: null
    logging:
        level: debug

cmacknz commented 4 days ago

The Endpoint-side fix is to respect the log_level field in the UnitExpected message instead of the one in the Fleet configuration. This is what every other component uses, because they don't get the Fleet config injected.
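
Conceptually (Endpoint itself is C++; this is only a sketch in Go, using constants that mirror the UnitLogLevel enum quoted above), the fix means deriving the file log level from the unit's expected log level rather than from fleet.agent.logging.level:

package main

import "fmt"

// Constants mirroring the UnitLogLevel enum from elastic-agent-client.proto
// quoted earlier in this issue.
const (
	UnitLogLevelError = 0
	UnitLogLevelWarn  = 1
	UnitLogLevelInfo  = 2
	UnitLogLevelDebug = 3
	UnitLogLevelTrace = 4
)

// logLevelFromUnitExpected maps the control-protocol value to a logger level
// string; the components-expected.yaml above carries log_level: 3 (DEBUG).
func logLevelFromUnitExpected(level int) string {
	switch level {
	case UnitLogLevelError:
		return "error"
	case UnitLogLevelWarn:
		return "warning"
	case UnitLogLevelInfo:
		return "info"
	case UnitLogLevelDebug:
		return "debug"
	case UnitLogLevelTrace:
		return "trace"
	default:
		return "info"
	}
}

func main() {
	fmt.Println(logLevelFromUnitExpected(3)) // debug, matching components-expected.yaml
}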

The agent side fix would be one or both of:

  1. Making the contents of agent-info.yaml match what was received from Fleet. I remember there were complications around this but I can't remember what specifically.
  2. Having the fleet component modifier populate the log level from the policy received from Fleet instead of from the local agent info, which is not updated (see the sketch below).
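
A minimal sketch of what option 2 could look like (again hedged; the function and parameter names are assumptions, not the actual modifier code):

package main

import "fmt"

// resolveFleetLogLevel sketches option 2: prefer the log level from the
// policy most recently received from Fleet (agent.logging.level) and fall
// back to the locally stored agent-info value only when the policy does not
// set one.
func resolveFleetLogLevel(policyLevel, localAgentInfoLevel string) string {
	if policyLevel != "" {
		return policyLevel
	}
	return localAgentInfoLevel
}

func main() {
	// The situation from the diagnostics: policy says debug, agent-info.yaml says info.
	fmt.Println(resolveFleetLogLevel("debug", "info")) // debug
}
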
nfritts commented 2 days ago

The Endpoint-side fix is to respect the log_level field in the UnitExpected message instead of the one in the Fleet configuration. This is what every other component uses, because they don't get the Fleet config injected.

@brian-mckinney Do you think you could put in a PR for this?