elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Agent does not apply an invalid proxy for fleet-server, but shows as unhealthy #4472

Open AndersonQ opened 6 months ago

AndersonQ commented 6 months ago

Steps to Reproduce:

- resolve the hosts the agent needs to reach, to find the IPs to block:

```
Non-authoritative answer:
9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com  canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com  canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:    proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223
```

address from cloud UI

```
nslookup my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com  canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com  canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:    proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223
```

address from Fleet UI

```
nslookup artifacts.elastic.co
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
artifacts.elastic.co    canonical name = infra-cdn.elastic.co.
Name:    infra-cdn.elastic.co
Address: 34.120.127.130
Name:    infra-cdn.elastic.co
Address: 2600:1901:0:1d7::
```


- block the IPs:

```
iptables -A INPUT -j DROP -d 34.120.127.130
iptables -A OUTPUT -j DROP -d 34.120.127.130
ip6tables -A OUTPUT -j DROP -d 2600:1901:0:1d7::
ip6tables -A INPUT -j DROP -d 2600:1901:0:1d7::
```


- run a squid proxy (http://10.80.40.162:3128) on another VM with an allow-all config
- add the proxy in the Fleet UI for the ES output, fleet-server, and agent binary download
- install the agent:

```
./elastic-agent-8.13.0-linux-x86_64/elastic-agent install -nf --url=https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443 --enrollment-token=ENROLLMENT_TOKEN --proxy-url=http://10.80.40.162:3128
```

- add an invalid proxy (http://10.40.80.1:8888) in Fleet settings
- add the invalid proxy to the fleet-server
- the agent status shows as failed:

```
Every 2.0s: /opt/Elastic/Agent/elastic-agent stat...                elastic-agent: Wed Mar 20 16:43:42 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │          * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
   │
   ├─ info
   │  ├─ id: 287e45c6-635e-4461-8c85-4d58704172d2
   │  ├─ version: 8.13.0
   │  └─ commit: 533443d148f4cf71e7c3e8efb736eda8275c4f69
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41285'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41322'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41294'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41251'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '41269'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
```



The status does eventually clear if you delete the incorrect proxy.
elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pierrehilbert commented 6 months ago

Are we sure the config is not applied for the fleet-server part?

AndersonQ commented 5 months ago

I could check again, but yes, the agent was not applying the config. A simple test is to reproduce the issue, fix the proxy in the policy, and observe that the agent reports as healthy again.

cmacknz commented 5 months ago

```
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │          * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
```

Why is the Fleet status healthy but the agent status isn't? The reason we use a separate Fleet status in the first place was so we'd stop considering transient Fleet errors a reason why the agent would be unhealthy (and if the agent is offline, it can't report Fleet status anyway).

The error appears to be coming from:

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L192-L197

I think that function might be globally setting the agent status regardless of where it was called from:

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/internal/pkg/remote/client.go#L209-L213
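To make the suspected pattern concrete, here is a minimal, self-contained Go sketch of that shape of problem: a remote client whose request path reports every failure to a shared, agent-wide status reporter, so an error hit while an action handler probes a new (invalid) proxy surfaces as the overall agent status. The types and names below are invented for illustration; this is not the actual elastic-agent code.

```go
// Hypothetical sketch (not the actual elastic-agent code): a remote client
// whose request helper reports failures to a shared agent-wide status
// reporter, regardless of which feature triggered the request.
package main

import (
	"errors"
	"fmt"
)

// StatusReporter is an assumed interface standing in for agent status reporting.
type StatusReporter interface {
	SetFailed(msg string)
	SetHealthy()
}

type stdoutReporter struct{}

func (stdoutReporter) SetFailed(msg string) { fmt.Println("agent status -> FAILED:", msg) }
func (stdoutReporter) SetHealthy()          { fmt.Println("agent status -> HEALTHY") }

// Client is a toy remote client. The important part is that Send reports
// status globally, so a failure inside an action handler (e.g. validating a
// new proxy) becomes the overall agent status.
type Client struct {
	status StatusReporter
	fail   bool // simulate an unreachable host/proxy
}

func (c *Client) Send(path string) error {
	if c.fail {
		err := errors.New("requester 0/1 to host errored: context deadline exceeded")
		// Global side effect: every caller's failure flips the agent-wide status.
		c.status.SetFailed("fail to communicate with Fleet Server API " + path + ": " + err.Error())
		return err
	}
	c.status.SetHealthy()
	return nil
}

func main() {
	c := &Client{status: stdoutReporter{}, fail: true}
	// A policy-change handler testing a new (invalid) proxy would hit this path,
	// even though the currently applied configuration still works.
	_ = c.Send("/api/status")
}
```

If the real client behaves like this, any caller that uses it for a speculative check (such as validating a proxy from a policy change) will flip the global status even though the running configuration is still fine.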

AndersonQ commented 5 months ago

> Why is the Fleet status healthy but the agent status isn't?

I thought it was a global-ish error state for the fleet client, but perhaps it isn't. As you pointed out, the fleet status is healthy, which is correct. Paying more attention to the error, it starts with Actions:, which leads me to believe the error is set because the Policy Change action failed. That is indeed correct, but the way it's presented is confusing.

I had a quick look at the code, and I believe this is where the error is collected and set on the agent status:

https://github.com/elastic/elastic-agent/blob/ad7e1b5113ba07d39b23eb06f7b300f0d8c58d82/internal/pkg/agent/application/coordinator/coordinator_state.go#L201-L203
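For illustration, a hedged sketch of the state-aggregation side of this, with invented names (not the real coordinator code): a pending action error is enough to mark the overall agent FAILED while the Fleet state stays HEALTHY, which reproduces the confusing combination shown in the status output above.

```go
// Illustrative sketch (assumed names): Fleet connectivity and the overall
// agent state are tracked separately, and a stored action error wins when the
// overall state is recomputed.
package main

import "fmt"

type State string

const (
	Healthy State = "HEALTHY"
	Failed  State = "FAILED"
)

type coordinatorState struct {
	fleetState State
	fleetMsg   string
	overall    State
	overallMsg string
	actionErr  error
}

// refresh recomputes the overall state; any pending action error takes priority.
func (s *coordinatorState) refresh() {
	if s.actionErr != nil {
		s.overall = Failed
		s.overallMsg = "Actions: " + s.actionErr.Error()
		return
	}
	s.overall = Healthy
	s.overallMsg = "Running"
}

func main() {
	s := &coordinatorState{fleetState: Healthy, fleetMsg: "Connected"}
	s.actionErr = fmt.Errorf("fail to communicate with Fleet Server API client hosts: all hosts failed")
	s.refresh()
	fmt.Printf("fleet: (%s) %s\n", s.fleetState, s.fleetMsg)
	fmt.Printf("agent: (%s) %s\n", s.overall, s.overallMsg)
	// A later successful action would clear actionErr, and refresh() would
	// report HEALTHY again.
}
```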

cmacknz commented 5 months ago

What clears that error once it is set? Another successful action?

AndersonQ commented 3 months ago

@cmacknz, IIRC, yes, a successful action would clear the error.

@pierrehilbert @cmacknz it's still relevant, right?

nimarezainia commented 3 months ago

I would say this is very relevant. Perhaps even related to this: https://github.com/elastic/ingest-dev/issues/3234. We do want to inform the user if there are proxy issues, ideally before the config is applied.

AndersonQ commented 3 months ago

@nimarezainia, what do you mean by informing the user before the config is applied?

I'm wondering if you mean somehow testing it before sending it to the agents. The only way to be 100% sure the proxy config indeed works is to send it to the agent so the agent can test it. And it is per agent: the same config might be valid for one agent but invalid for another.
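As a rough illustration of what "the agent can test it" could look like, here is a minimal Go sketch that tries to reach a Fleet Server status endpoint through a candidate proxy before it would ever be applied. The function name, URLs, and timeouts are placeholders, not an existing elastic-agent API.

```go
// Minimal sketch of a per-agent proxy reachability check (placeholder names).
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// checkProxy returns nil only if fleetURL answers through proxyURL in time.
func checkProxy(ctx context.Context, proxyURL, fleetURL string) error {
	p, err := url.Parse(proxyURL)
	if err != nil {
		return fmt.Errorf("invalid proxy URL: %w", err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(p)},
		Timeout:   10 * time.Second,
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fleetURL+"/api/status", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("proxy check failed: %w", err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// Placeholder values mirroring the reproduction above.
	err := checkProxy(ctx, "http://10.40.80.1:8888", "https://example.fleet.us-west2.gcp.elastic-cloud.com:443")
	if err != nil {
		fmt.Println("would NOT apply this proxy:", err)
		return
	}
	fmt.Println("proxy reachable, safe to apply")
}
```

The check only proves reachability from that specific agent, which is exactly why it has to run per agent rather than centrally in Fleet.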

amitkanfer commented 3 months ago

I believe Nima is referring to a two-phase commit protocol, which I don't think we want to focus on right now: basically, all agents report back to fleet server that a new config is valid (the "prepare" phase), and only then does the "commit" phase happen, where all agents apply the new config.
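A hedged sketch of that prepare/commit idea, with invented interfaces (nothing here exists in elastic-agent today): every applier validates the candidate config first, and the running config only changes if all validations pass.

```go
// Sketch of a two-phase (prepare, then commit) config rollout on one agent.
// All names and types are invented for illustration.
package main

import (
	"errors"
	"fmt"
)

// Config is a stand-in for a policy/config revision pushed from Fleet.
type Config struct {
	Revision int
	ProxyURL string
}

// Applier can validate a config first and apply it later.
type Applier interface {
	Prepare(cfg Config) error // validate without changing the running config
	Commit(cfg Config) error  // switch to the new config
}

type proxyApplier struct{ current Config }

func (a *proxyApplier) Prepare(cfg Config) error {
	// In a real agent this is where a connectivity check through cfg.ProxyURL
	// (like the sketch above) would run.
	if cfg.ProxyURL == "http://10.40.80.1:8888" { // simulate an unreachable proxy
		return errors.New("fleet server unreachable through proxy")
	}
	return nil
}

func (a *proxyApplier) Commit(cfg Config) error {
	a.current = cfg
	return nil
}

// rollout applies cfg only if every applier's Prepare succeeds; otherwise the
// running config is left untouched and the failure can be reported back to Fleet.
func rollout(cfg Config, appliers ...Applier) error {
	for _, ap := range appliers {
		if err := ap.Prepare(cfg); err != nil {
			return fmt.Errorf("revision %d rejected, keeping previous config: %w", cfg.Revision, err)
		}
	}
	for _, ap := range appliers {
		if err := ap.Commit(cfg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	a := &proxyApplier{current: Config{Revision: 1, ProxyURL: "http://10.80.40.162:3128"}}
	if err := rollout(Config{Revision: 2, ProxyURL: "http://10.40.80.1:8888"}, a); err != nil {
		fmt.Println(err)
	}
	fmt.Println("still running revision", a.current.Revision) // prints 1
}
```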

nimarezainia commented 3 months ago

Yes, a two-phase commit would work. Many of these configs (as @AndersonQ stated) would need to be tested at the agent itself. I am thinking mainly of connectivity-related configurations, like the connection to Fleet Server, Outputs, or the binary download source: before that config is applied, test whether you even have a route to the endpoint, then apply/commit the configuration. If the test fails, don't change the config and flag it.

We don't want a small mistake in the configuration to bring down the whole Fleet.