AndersonQ opened 6 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Are we sure the config is not applied for the fleet-server part?
I could check again, but yes, the agent was not applying the config. A simple test is to reproduce the issue, fix the proxy in the policy, and observe that the agent reports as healthy again:
```
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │    * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
```
Why is the Fleet status healthy but the agent status isn't? The reason we use a separate Fleet status in the first place was so we'd stop considering transient Fleet errors a reason why the agent would be unhealthy (and if the agent is offline, it can't report Fleet status anyway).
The error appears to be coming from:
I think that function might be globally setting the agent status regardless of where it was called from:
> Why is the Fleet status healthy but the agent status isn't?
I thought it was a global-ish error state for the fleet client, but perhaps it isn't. As you pointed out, the fleet status is healthy, which is correct. Paying more attention to the error, it starts with `Actions:`, which leads me to believe this error is set because the Policy Change action failed. That is indeed correct, but the way it's presented is confusing.
I had a quick look at the code, and I believe this is where the error is collected and set on the agent status:
What clears that error once it is set? Another successful action?
@cmacknz, IIRC, yes, a successful action would clear the error.
@pierrehilbert @cmacknz it's still relevant, right?
I would say this is very relevant. Perhaps even related to this: https://github.com/elastic/ingest-dev/issues/3234 We do want to inform the user if there are proxy issues, ideally before the config is applied.
@nimarezainia, what do you mean by informing the user before the config is applied?
I'm wondering if you mean somehow testing it before sending it to the agents. The only way to be 100% sure the proxy config indeed works is to send it to the agent so the agent can test it. And it is per agent: the same config might be valid for one agent but invalid for another.
I believe Nima is referring to a two-phase commit protocol, which I don't think we want to focus on right now. Basically, all agents report back to Fleet Server that a new config is valid (e.g. "prepare"), and only then does the "commit" phase happen, where all agents apply the new config.
Yes, a two-phase commit would work. Many of these configs (as @AndersonQ stated) would need to be tested at the agent itself. I am thinking mainly of connectivity-related configurations, like the connections to Fleet Server, Outputs, or the Download: before that config is applied, test whether you even have a route to the endpoint, then apply/commit the configuration. If the test fails, don't change the config and flag it.
We don't want a small mistake in the configuration to bring down the whole Fleet.
Steps to Reproduce:
```
Non-authoritative answer:
9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:    proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223
```
address from cloud UI
```
nslookup my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com
Server:  127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:    proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223
```
address from Fleet UI
```
nslookup artifacts.elastic.co
Server:  127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
artifacts.elastic.co canonical name = infra-cdn.elastic.co.
Name:    infra-cdn.elastic.co
Address: 34.120.127.130
Name:    infra-cdn.elastic.co
Address: 2600:1901:0:1d7::
```
```
iptables -A INPUT -j DROP -d 34.120.127.130
iptables -A OUTPUT -j DROP -d 34.120.127.130
ip6tables -A OUTPUT -j DROP -d 2600:1901:0:1d7::
ip6tables -A INPUT -j DROP -d 2600:1901:0:1d7::
```
```
./elastic-agent-8.13.0-linux-x86_64/elastic-agent install -nf --url=https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443 --enrollment-token=ENROLLMENT_TOKEN --proxy-url=http://10.80.40.162:3128
```
```
Every 2.0s: /opt/Elastic/Agent/elastic-agent stat...      elastic-agent: Wed Mar 20 16:43:42 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │    * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
   │
   ├─ info
   │  ├─ id: 287e45c6-635e-4461-8c85-4d58704172d2
   │  ├─ version: 8.13.0
   │  └─ commit: 533443d148f4cf71e7c3e8efb736eda8275c4f69
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41285'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41322'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41294'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41251'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '41269'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
```