elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

cannot apply a policy that changes output settings with Endpoint Security present #761

Closed gbanasiak closed 6 months ago

gbanasiak commented 2 years ago

Version

8.3.2

Operating System

CentOS 7 (not verified on other OSes)

Description

With Endpoint Security present, it is impossible to apply a policy that changes the output from the default Elasticsearch output to a non-default one. Endpoint collides with Elastic Agent on TCP port 6788.

Policy before:

outputs:
  default:
    type: elasticsearch
    hosts:
      - >-
        https://REDACTED.europe-west1.gcp.cloud.es.io:443

elastic-agent-before.yml.txt

Policy after:

outputs:
  703329b0-ebb2-11ec-816c-e300c95fdbc7:
    type: elasticsearch
    worker: 2
    bulk_max_size: 4096
    hosts:
      - >-
        https://REDACTED.europe-west1.gcp.cloud.es.io:443
  default:
    type: elasticsearch
    hosts:
      - >-
        https://REDACTED.europe-west1.gcp.cloud.es.io:443

elastic-agent-after.yml.txt
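
To make the difference between the two policies explicit, here is a minimal sketch that compares the `outputs` sections shown above. The function name `diff_outputs` and the dict literals are illustrative assumptions (hand-transcribed from the YAML above), not part of Elastic Agent or Fleet.

```python
def diff_outputs(before, after):
    """Compare two policy 'outputs' mappings.

    Returns three sorted lists of output names: added, removed, changed.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return added, removed, changed


# Hand-transcribed from the "Policy before" / "Policy after" snippets above.
HOST = "https://REDACTED.europe-west1.gcp.cloud.es.io:443"

before = {
    "default": {"type": "elasticsearch", "hosts": [HOST]},
}
after = {
    "703329b0-ebb2-11ec-816c-e300c95fdbc7": {
        "type": "elasticsearch",
        "worker": 2,
        "bulk_max_size": 4096,
        "hosts": [HOST],
    },
    "default": {"type": "elasticsearch", "hosts": [HOST]},
}

added, removed, changed = diff_outputs(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

In this transcription the only difference is one added Elasticsearch output; the `default` output itself is unchanged.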


Symptoms

Elastic Agent status:

# elastic-agent status
Status: FAILED <--- HERE
Message: (no message)
Applications:
  * endpoint-security      (HEALTHY)
                           Protecting with policy {604429b8-889c-4517-95cd-b534d6492c41}
  * filebeat_monitoring    (HEALTHY)
                           Running
  * metricbeat_monitoring  (HEALTHY)
                           Running
  * endpoint-security      (FAILED)
                           failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: address already in use <--- HERE

Target policy not applied:

# elastic-agent inspect
[..]
outputs:
  default:
    api_key: REDACTED
    hosts:
    - https://REDACTED.europe-west1.gcp.cloud.es.io:443
    type: elasticsearch

Logs:

{"log.level":"info","@timestamp":"2022-07-21T16:49:55.466+0200","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is kt6zc5pb","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-21T16:49:55.466+0200","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 2 step(s)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-21T16:49:55.641+0200","log.origin":{"file.name":"operation/operator.go","file.line":307},"message":"operation 'operation-install' skipped for endpoint-security.8.3.2","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-21T16:49:55.642+0200","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-07-21T16:49:55+02:00 - message: Application: endpoint-security--8.3.2[31e595cb-6fa0-4d62-b5ba-772ec96e796d]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-21T16:49:55.953+0200","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'error'","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-21T16:49:55.953+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":208},"message":"failed to dispatch actions, error: operator: failed to execute step sc-run, error: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: address already in use: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: address already in use","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-21T16:49:55.953+0200","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2022-07-21T16:49:55+02:00 - message: Application: endpoint-security--8.3.2[31e595cb-6fa0-4d62-b5ba-772ec96e796d]: State changed to FAILED: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: address already in use - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}

What stands out is: {"file.name":"log/reporter.go","file.line":36},"message":"2022-07-21T16:49:55+02:00 - message: Application: endpoint-security--8.3.2[31e595cb-6fa0-4d62-b5ba-772ec96e796d]: State changed to FAILED: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: address already in use - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}
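
For anyone triaging similar logs, a small sketch of how the error-level entries can be pulled out of the agent's JSON log lines. It relies only on the `log.level`, `@timestamp` and `message` fields visible in the lines above; the helper name `error_messages` is an assumption for illustration, and the sample contains two of the lines quoted in this issue.

```python
import json


def error_messages(lines):
    """Return (timestamp, message) pairs for JSON log lines with log.level == "error"."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log line.
            continue
        if record.get("log.level") == "error":
            found.append((record.get("@timestamp"), record.get("message")))
    return found


# Two lines copied from the log excerpt above.
SAMPLE = r'''
{"log.level":"info","@timestamp":"2022-07-21T16:49:55.466+0200","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is kt6zc5pb","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-21T16:49:55.953+0200","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'error'","ecs.version":"1.6.0"}
'''

errors = error_messages(SAMPLE.splitlines())
for timestamp, message in errors:
    print(timestamp, message)
```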

TCP/6788 is used by Elastic Agent:

# netstat -anp | grep 6788
tcp        0      0 127.0.0.1:6788          0.0.0.0:*               LISTEN      989/elastic-agent <--- HERE

# ps auxwwf
[..]
root       982  1.9  2.2 3103352 177676 ?      Ssl  16:24   0:55 /opt/Elastic/Endpoint/elastic-endpoint run
[..]
root       989  6.8  0.6 1279912 53724 ?       Ssl  16:24   3:18 elastic-agent <--- HERE
root      1247  0.0  0.0      0     0 ?        Zs   16:24   0:00  \_ [elastic-agent] <defunct>
root      2619  0.0  0.0      0     0 ?        Zs   16:27   0:00  \_ [elastic-agent] <defunct>
root      3848  0.0  0.0      0     0 ?        Zs   16:46   0:00  \_ [elastic-agent] <defunct>
root      3877  5.5  2.3 1398108 189384 ?      Sl   16:46   1:25  \_ /opt/Elastic/Agent/data/elastic-agent-16c55b/install/filebeat-8.3.2-linux-x86_64/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E logging.level=info -E path.data=/opt/Elastic/Agent/data/elastic-agent-16c55b/run/default/filebeat--8.3.2--36643631373035623733363936343635
root      3900  0.2  1.4 1428076 119660 ?      Sl   16:46   0:03  \_ /opt/Elastic/Agent/data/elastic-agent-16c55b/install/metricbeat-8.3.2-linux-x86_64/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E logging.level=info -E path.data=/opt/Elastic/Agent/data/elastic-agent-16c55b/run/default/metricbeat--8.3.2--36643631373035623733363936343635
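
The `bind: address already in use` failure itself is easy to reproduce outside the agent. The sketch below is not Elastic Agent code; it only illustrates the general mechanism, assuming a second listener is started on a loopback port that an earlier listener still holds. It uses an OS-assigned port instead of 6788 so it does not interfere with a running agent, and `try_listen` is a hypothetical helper name.

```python
import errno
import socket


def try_listen(host, port):
    """Try to bind and listen on host:port.

    Returns (socket, None) on success or (None, OSError) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc


# First listener: port 0 asks the OS for any free port (stand-in for 6788).
first, first_err = try_listen("127.0.0.1", 0)
port = first.getsockname()[1]

# Second listener on the same address while the first is still open.
second, second_err = try_listen("127.0.0.1", port)
collided = second_err is not None and second_err.errno == errno.EADDRINUSE

print("first listener on port", port)
print("second listener failed with EADDRINUSE:", collided)

first.close()
```

The second bind can only succeed once the first socket has been closed, which matches the observation that the listener on 127.0.0.1:6788 is still held by the running elastic-agent process when the new start is attempted.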

Steps to Reproduce

  1. Create empty policy and add Endpoint Security with default settings.
  2. Deploy to 8.3.2 agent running on Linux and confirm agent healthy.
  3. Modify the policy changing integration output settings as shown above, and apply.

That is similar to https://github.com/elastic/elastic-agent/issues/257.

AndersonQ commented 2 years ago

This seems related to https://github.com/elastic/fleet-server/issues/1672. It should not be possible to have 2 Elasticsearch outputs on an agent policy. By related, I mean that both add another ES output, which isn't really supported by fleet-server.

nicpenning commented 2 years ago

Same issue on Windows environment deployment of Elastic Agent and Elastic endpoint security integration.

Step 1. Have an initial Endpoint Security integration on an agent.
Step 2. Apply a different policy that has an Endpoint Security integration and notice that the "listen tcp 127.0.0.1:6788: bind: address already in use" error is logged.

It is almost as if the integration does not fully remove the current Endpoint Security integration and redeploy it from scratch. (The elastic-endpoint.exe process never stops.)

The workaround is to deploy a different policy that removes the Endpoint Security integration, and then deploy the policy you wish to use.

❗However, if you try to deploy an Endpoint Security policy to an agent that already has one, this error leaves the endpoint stuck in a degraded state where it won't take new policies. A reboot (or restarting the agent) is then required for the agent to get back to a working state and pick up the correct policy, which is a significant issue.

nicpenning commented 2 years ago

Update: This is still a problem in 8.4.0.

ferullo commented 2 years ago

@AndersonQ can you confirm that this is an Agent side issue and there's nothing in Endpoint that needs to be done to fix this? Based on your comment I think we're in agreement but I'd like to make sure I'm not misinterpreting you.

This issue seems the same as an old one which was marked as fixed in 8.3.0. Was it possibly fixed and the bug has reappeared?

AndersonQ commented 2 years ago

Hello folks, first let me take a step back, we're dealing with 2 problems here:

nicpenning commented 2 years ago

Sorry if I crossed the streams here! I will note that I was switching from a policy with the Endpoint Security integration and a Logstash output to a nearly identical policy that has an Elasticsearch output.

AndersonQ commented 2 years ago

ok, so to confirm, the steps to reproduce are:

  1. Create an empty policy and add Endpoint Security with default settings and ES as output.
  2. Deploy to 8.3.2 agent running on Linux and confirm agent healthy.
  3. Create a Logstash output.
  4. Create another policy identical to the 1st one, but using Logstash as "Output for integrations".
  5. Swap the agent to the new policy.

Is that correct, @nicpenning?

nicpenning commented 2 years ago

That should do it!

nicpenning commented 2 years ago

I haven't tested in 8.4.1 yet though.

AndersonQ commented 2 years ago

I did some more tests, and even a change of output permissions did not cause the problem. It really seems to be related to a change of the whole output.

nimarezainia commented 1 year ago

@AndersonQ & @pierrehilbert can this issue be closed? The main API key issues have long been addressed. Let me know if there's anything remaining.

@anderson you mention: "the other issue, indeed most likely is on the elastic-agent. Let me reproduce it to double check and be completely sure." -- not quite sure what the other issue here is exactly. Thanks.

AndersonQ commented 1 year ago

I think it was the port collision, I'm not sure anymore. But yes, it seems ok to close it

jlind23 commented 6 months ago

"I think it was the port collision, I'm not sure anymore. But yes, it seems ok to close it"

Closing as per @AndersonQ's comment