Upgrading agent via Fleet not working with multiple server URLs

indygriffiths commented 1 year ago

We have two Fleet Server URLs configured:

one that goes through an AWS VPC Endpoint and private DNS name (https://<private>-logs.fleet.vpce.ap-southeast-2.aws.elastic-cloud.com:443)
one that is standard for our Elastic Cloud deployment (https://<private>-logs.fleet.ap-southeast-2.aws.found.io:443)

On an Ubuntu server that uses a proxy server to pass requests to Elastic (configured through environment variables set by systemd), attempting to upgrade the agent from 8.5.3 to 8.6.2 fails during the step that enrolls the new agent with Fleet Server. The first URL fails as expected for this server, but the second endpoint, which is accessible from the server through the proxy, fails with a context canceled error, which causes the agent to roll back to the old version.

{"log.level":"warn","@timestamp":"2023-02-22T00:00:45.932Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 2 errors occurred:\n\t* requester 0/2 to host https://<private>-logs.fleet.vpce.ap-southeast-2.aws.elastic-cloud.com:443/ errored: Post \"https://<private>-logs.fleet.vpce.ap-southeast-2.aws.elastic-cloud.com:443/api/fleet/agents/<hash>/checkin?\": Forbidden\n\t* requester 1/2 to host https://<private>-logs.fleet.ap-southeast-2.aws.found.io:443/ errored: Post \"https://<private>-logs.fleet.ap-southeast-2.aws.found.io:443/api/fleet/agents/<hash>/checkin?\": context canceled\n\n"},"request_duration_ns":58517786785,"failed_checkins":1,"retry_after_ns":74265363532,"ecs.version":"1.6.0"}

I can see traffic in our proxy logs that the request is making it to the proxy so I'm not sure why the second URL is failing. The request duration in the logs also seems off, since it looks like it's timing out after about a minute.

1677023987.416      0 <ip> TCP_DENIED/403 4063 CONNECT <private>-logs.fleet.vpce.ap-southeast-2.aws.elastic-cloud.com:443 - HIER_NONE/- text/html
1677024016.123     48 <ip> TCP_TUNNEL/200 6401 CONNECT <private>-logs.fleet.ap-southeast-2.aws.found.io:443 - HIER_DIRECT/54.79.224.232 -

Since it's Fleet-managed I also haven't figured out how to increase the log verbosity and have it apply to the upgraded version.

indygriffiths commented 1 year ago

For the time being we've worked around this by splitting our one Fleet Server host with two URLs in Kibana into two separate hosts, one for each URL, and assigning this machine to a different agent policy that uses the new Fleet host entry. Not the prettiest solution, but it seems to work.

cmacknz commented 1 year ago

I can see traffic in our proxy logs that the request is making it to the proxy so I'm not sure why the second URL is failing.

This is odd, since it suggests the agent's proxy configuration is working correctly. It is possible the timeout is applying to both requests sequentially instead of to each one individually, but I think the logic here is the same on 8.5.3 as it is on 8.6.2.

3kt commented 1 year ago

Isn't it fixed by https://github.com/elastic/elastic-agent/pull/2239 ?

elastic / elastic-agent

Upgrading agent via Fleet not working with multiple server URLs #2303