elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

A Fleet proxy configured with the --proxy-url option when enrolling is not passed to endpoint security #2304

Closed cmacknz closed 1 year ago

cmacknz commented 1 year ago

This appears to be caused by https://github.com/elastic/elastic-agent/pull/2172; reverting it fixes the problem.

To reproduce this, install the agent with the --proxy-url option and add the Elastic Defend integration to the agent policy.

sudo ./elastic-agent install --url=$FLEET_URL --enrollment-token=$ENROLLMENT_TOKEN --proxy-url=$PROXY_URL

Observe that endpoint security fails to connect to Fleet. In the output of elastic-agent diagnostics, the proxy_url key should appear in the fleet section of the endpoint unit configuration, but it is missing.

Example of the expected output. This can be obtained by reverting https://github.com/elastic/elastic-agent/pull/2172 and installing the agent. With https://github.com/elastic/elastic-agent/pull/2172 applied, the proxy_url key is missing; it seems to be overwritten by the empty proxy_url received from Fleet.

fleet:
    kind:
      structvalue:
        fields:
          access_api_key:
            kind:
              stringvalue: <REDACTED>
          agent:
            kind:
              structvalue:
                fields:
                  id:
                    kind:
                      stringvalue: <REDACTED>
                  logging:
                    kind:
                      structvalue:
                        fields:
                          level:
                            kind:
                              stringvalue: info
          enabled:
            kind:
              boolvalue: true
          host:
            kind:
              structvalue:
                fields:
                  id:
                    kind:
                      stringvalue: <REDACTED>
          hosts:
            kind:
              listvalue:
                values:
                - kind:
                    stringvalue: <REDACTED>
          protocol:
            kind:
              stringvalue: http
          proxy_url:
            kind:
              stringvalue: PROXY_URL
          ssl:
            kind:
              structvalue:
                fields:
                  renegotiation:
                    kind:
                      stringvalue: never
                  verification_mode:
                    kind:
                      stringvalue: full
          timeout:
            kind:
              stringvalue: 10m0s
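
The dump above is nested in a protobuf-Struct-like shape (kind / structvalue / fields), which makes it awkward to check for a single key by eye. A minimal sketch, assuming the diagnostics YAML has already been loaded into a Python dict by some YAML parser; the function name `unwrap` is hypothetical and only the value kinds visible in the dump (structvalue, listvalue, stringvalue, boolvalue) are handled:

```python
from typing import Any


def unwrap(node: Any) -> Any:
    """Collapse a {'kind': {...}} node into plain Python values.

    Only the kinds that appear in the dump above are handled;
    anything else is returned as None.
    """
    if not isinstance(node, dict):
        return node
    kind = node.get("kind", node)
    if not isinstance(kind, dict):
        return None
    if "structvalue" in kind:
        fields = kind["structvalue"].get("fields", {})
        return {name: unwrap(value) for name, value in fields.items()}
    if "listvalue" in kind:
        return [unwrap(value) for value in kind["listvalue"].get("values", [])]
    for scalar in ("stringvalue", "boolvalue"):
        if scalar in kind:
            return kind[scalar]
    return None


# Hand-written stand-in for the "fleet" section (values are placeholders).
fleet = {"kind": {"structvalue": {"fields": {
    "enabled": {"kind": {"boolvalue": True}},
    "hosts": {"kind": {"listvalue": {"values": [
        {"kind": {"stringvalue": "https://fleet.example:443"}},
    ]}}},
    "proxy_url": {"kind": {"stringvalue": "http://proxy.example:3128"}},
}}}}

config = unwrap(fleet)
has_proxy = bool(config.get("proxy_url"))
```

With the bug present, `has_proxy` would be False because the proxy_url field is absent from the endpoint unit configuration.
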
cmacknz commented 1 year ago

This was introduced in v8.6.2 https://github.com/elastic/elastic-agent/commit/ac658893f9f3e0185a89367e809a96d884eb8317

cmacknz commented 1 year ago

We will need a new manual QA regression test case for this that configures the agent to use a Fleet proxy server, installs the Elastic Defend integration, and ensures that the agent remains healthy.

cmacknz commented 1 year ago

This can be added to the proxy test suite that was created in https://github.com/elastic/kibana/issues/140533#issuecomment-1447983586

amolnater-qasource commented 1 year ago

Hi @cmacknz Thank you for the update, we have added the required testcase under Fleet test suite.

Details are shared under https://github.com/elastic/kibana/issues/140533#issuecomment-1450007059

Please let us know if anything else is required from our end.

Thanks!

cmacknz commented 1 year ago

@amolnater-qasource can you test the following scenario:

  1. Set up a proxy for an 8.7.0 Fleet cluster. Install the 8.6.2 agent and enroll it with the --proxy-url command line parameter. Expected Result: The enrollment succeeds and the agent appears healthy.
  2. Install the Elastic Defend integration. Expected Result: The Elastic Defend integration will not be able to reach Fleet server and will be unhealthy. The Endpoint security logs in Fleet should contain a message like:
    [elastic_agent.endpoint_security][error] Http.cpp:327 CURL error 28: Error [Failed to connect to <domain>.fleet.....elastic-cloud.com port 443 after 21272 ms: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected]
  3. Upgrade the agent to 8.7.0. Expected Result: The upgrade should succeed and Elastic Defend should remain unhealthy afterwards.
  4. Configure the proxy URL used in step 1 in the Fleet UI. Expected Result: The Elastic Defend integration should now be healthy.
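
To check step 2 without reading the logs by hand, the endpoint log line can be matched with a regular expression. This is a rough sketch based only on the single example message quoted in step 2; the pattern and the function name `parse_connect_failure` are assumptions, and other Endpoint versions may word the message differently. curl error 28 is its operation-timed-out code, which is what a bypassed proxy would be expected to produce when the direct route is blocked.

```python
import re
from typing import Optional

# Pattern derived from the one sample message in step 2 (an assumption,
# not a documented log format).
CONNECT_FAILURE = re.compile(
    r"CURL error (?P<code>\d+): Error \[Failed to connect to (?P<host>\S+) "
    r"port (?P<port>\d+) after (?P<ms>\d+) ms"
)


def parse_connect_failure(line: str) -> Optional[dict]:
    """Return the curl error details from a log line, or None if it does not match."""
    match = CONNECT_FAILURE.search(line)
    if match is None:
        return None
    return {
        "code": int(match["code"]),
        "host": match["host"],
        "port": int(match["port"]),
        "elapsed_ms": int(match["ms"]),
    }


# Placeholder host name; the rest follows the shape of the quoted message.
sample = (
    "[elastic_agent.endpoint_security][error] Http.cpp:327 CURL error 28: "
    "Error [Failed to connect to fleet.example.com port 443 after 21272 ms: "
    "A connection attempt failed]"
)
failure = parse_connect_failure(sample)
```
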
amolnater-qasource commented 1 year ago

Hi @cmacknz

We have revalidated this on an 8.7.0 BC6 Kibana cloud environment and have the following observations:

Expected Result: The enrolment succeeds and the agent appears healthy.

Expected Result: The Elastic Defend integration will not be able to reach Fleet server and will be unhealthy. The Endpoint security logs in Fleet should contain a message like:

Expected Result: The upgrade should succeed and Elastic Defend should remain unhealthy afterwards.

Configure the proxy URL used in step 1 in the Fleet UI. Expected Result: The Elastic Defend integration should now be healthy.

Screen Recordings: 6

https://user-images.githubusercontent.com/77374876/227182371-d01db095-67b7-4c25-a281-c74410522d5a.mp4

https://user-images.githubusercontent.com/77374876/227182502-ae096800-9a25-4061-bcbf-f2f7e835dbb8.mp4

Build details:

8.7.0-BC6
BUILD: 61051
COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb
Artifact Link: https://staging.elastic.co/8.7.0-a7fb3750/downloads/beats/elastic-agent/elastic-agent-8.7.0-linux-x86_64.tar.gz
Host OS: Linux .tar

Please let us know if we are missing anything here. Thanks

cmacknz commented 1 year ago

Related: https://github.com/elastic/elastic-agent/issues/2390. When the agent is configured to use a proxy, there is no obvious way to tell from the diagnostics.

cmacknz commented 1 year ago

I tried reproducing this again, and I can confirm everything looks healthy with the proxy in use. I believe this is because, with my basic proxy setup, the internet is still reachable without going through the proxy. This means that even if the proxy isn't used, things will still work.

https://github.com/elastic/elastic-agent/issues/2390 means it is hard to confirm whether a proxy is being used at all. I can confirm from some of the agent log messages that the agent is using the proxy; as far as I can tell, endpoint isn't.

cmacknz commented 1 year ago

Configuring a proxy in the UI through the new proxy settings doesn't seem to change this.

cmacknz commented 1 year ago

In my test setup I killed my local proxy instance, and the agent reported itself as unable to connect to Fleet, but Endpoint kept working. This was even after configuring a proxy in Fleet in addition to using the --proxy-url option at installation time. Unfortunately, I don't think the new proxy settings in 8.7.0 will be a workaround for this problem.

AndersonQ commented 1 year ago

Folks, I've been trying to reproduce it but I cannot.

It seems to me the problem occurs when --proxy-url is passed to the agent install/enroll command but the proxy is not defined in the policy. In this case the agent installs and enrolls, but then keeps trying to connect to Fleet forever.

These are my tests and findings:

  1. Setup environment:

    • Run a cloud deployment: 8.7.1-SNAPSHOT
    • Run a Squid proxy on your local machine
    • Run the agent's Vagrant dev box:
      • vagrant up elastic-agent
      • vagrant ssh elastic-agent
  2. Set up a proxy for an 8.7.0 Fleet cluster. Install the 8.6.2 agent and enroll it with the --proxy-url command line parameter.

    • Use IPTABLES to block all connections to Elastic Cloud:
      • iptables -A INPUT -s 35.224.224.0/24 -j DROP (check your cloud IP)
      • double check it's blocked:
        • curl https://[YOUR_CLOUD_FLEET_URL]/api/status should hang
  1. Without the proxy/iptables rule
    • all works
    • check for errors:
    • cat /opt/Elastic/Agent/data/elastic-agent-913c02/logs/elastic-agent-[TIMESTAMP].ndjson | grep error
    • No error found
    • Agent is healthy in Fleet as well as in the status command

2.1 Set up proxy and endpoint security after installation

cmacknz commented 1 year ago

It seems to me the problem is when the --proxy-url is passed to the agent install/enrol but the proxy is not defined in the policy

We spoke about this today. The problem is that the proxy_url received in the agent policy from Fleet always takes precedence over the one that was configured when the agent was installed with the --proxy-url option, even when the proxy_url received from Fleet is empty.

The ability to configure the proxy from Fleet is a new feature in 8.7.0, so nobody has configured a proxy in Fleet from the start. This means upgrading to 8.7.0 will usually result in the proxy set up at installation time being unconditionally ignored.

The path to fix this for 8.7.1 is to use the following precedence rules for the proxy URL:

  1. Prefer the proxy URL from Fleet only if it is a valid IP/host and is not the empty string.
  2. Fall back to the proxy configured when the agent was installed with --proxy-url.
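
The two rules above can be sketched as follows. This is an illustration of the precedence logic only, not the agent's actual code (the agent is written in Go); the function names `is_usable_proxy` and `effective_proxy_url` are hypothetical, and "valid IP/host" is approximated here by checking that the URL parses and yields a non-empty host.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_usable_proxy(url: Optional[str]) -> bool:
    """True if url is non-empty and parses to something with a host part."""
    if not url or not url.strip():
        return False
    candidate = url.strip()
    # urlsplit only fills in hostname when a "//" authority is present,
    # so treat a bare "host:port" as an authority.
    if "//" not in candidate:
        candidate = "//" + candidate
    try:
        return bool(urlsplit(candidate).hostname)
    except ValueError:
        # Raised for malformed input such as an unterminated IPv6 literal.
        return False


def effective_proxy_url(
    fleet_proxy_url: Optional[str],
    install_proxy_url: Optional[str],
) -> Optional[str]:
    """Rule 1: prefer a usable proxy from Fleet. Rule 2: fall back to --proxy-url."""
    if is_usable_proxy(fleet_proxy_url):
        return fleet_proxy_url.strip()
    return install_proxy_url or None


# An empty value from Fleet no longer overrides the install-time proxy.
kept = effective_proxy_url("", "http://install-proxy:3128")
# A proxy configured in Fleet still wins when it is usable.
overridden = effective_proxy_url("http://fleet-proxy:8080", "http://install-proxy:3128")
```

Because an empty value from Fleet always falls through to the install-time value, this logic cannot express "remove the proxy".
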

This will ensure that agents configured with a proxy at installation time continue to work while still allowing the proxy to be changed from Fleet. The caveat is that, for now, a proxy set up at installation time cannot be removed from Fleet.

We will create a follow up issue to allow the agent to distinguish between the proxy URL received from Fleet being empty because it was never configured, and the proxy URL received from Fleet being empty because the user intended to remove all proxy configurations.

cmacknz commented 1 year ago

Filed https://github.com/elastic/kibana/issues/154482 to document that proxies set up when the agent was installed cannot be managed by Fleet.