elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Upgrade verifiers do not retry if the download fails or times out. #5163

Open cllasyx opened 3 months ago

cllasyx commented 3 months ago

Hello, I deployed Elastic Agent with Fleet Server at version 8.14.2 and tried to upgrade to 8.14.3 a few days later.

While watching the logs through Observability -> Logs -> Stream, I noticed some error messages from the elastic_agent dataset. The logs are provided below, along with a temporary fix.

Steps to reproduce:

Log output:

12:28:50.843 elastic_agent [elastic_agent][info] download from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz completed in 48 seconds @ 7.16MBps
12:28:50.843 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] download from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:51.464 elastic_agent [elastic_agent][info] Default PGP appended
12:29:21.465 elastic_agent [elastic_agent][warn] Skipped remote PGP located at "https://artifacts.elastic.co/GPG-KEY-elastic-agent" because it's unavailable: 2 errors occurred:
    * Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": context deadline exceeded
    * Remote PGP download failed

12:29:21.468 elastic_agent [elastic_agent][warn] Skipped remote PGP located at "https://localhost:8221/api/agents/upgrades/8.14.3/pgp-public-key" because it's unavailable: 2 errors occurred:
    * Get "https://localhost:8221/api/agents/upgrades/8.14.3/pgp-public-key": x509: certificate is valid for myfleet.example.com, not localhost
    * Remote PGP download failed

12:29:21.468 elastic_agent [elastic_agent][info] Using 1 PGP keys
12:29:52.081 elastic_agent [elastic_agent][info] Cleaning up non-matching downloaded versions
12:29:52.114 elastic_agent [elastic_agent][error] upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
    * could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
    * fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

12:29:52.114 elastic_agent [elastic_agent][info] updated upgrade details

Bug fix (manual):

Notes

My Fleet Server host listens on *:8220 and is reachable at https://myfleet.example.com:8220. The host has a second socket open on 127.0.0.1:8221, which is used for internal API operations. My firewall's OUTPUT chain accepts everything, and the INPUT chain accepts all connections to the loopback adapter via the rule iptables -A INPUT -i lo -j ACCEPT.

cmacknz commented 3 months ago
12:29:52.114 elastic_agent [elastic_agent][error] upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
    * could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
    * fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

This context deadline exceeded for https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc is the source of the failure. It appears to be a timeout downloading the .asc file.

Since you could download it manually later, my first thought is this was a transient network error or problem with our artifacts CDN.

Is this still happening to your agents? Were you able to download the file while the agent was failing? This may indicate the problem is actually that our download timeout for this file needs to be longer.

dhanfocus commented 3 months ago

I'm getting the same error when upgrading from 8.14.1 to 8.14.3. All I did was apply the upgrade again through the Fleet UI.

upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
    * could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.1-1348b9/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.1-1348b9/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
    * fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded
cllasyx commented 3 months ago

I could indeed download the .asc file manually while the upgrade was in progress. I tried upgrading twice in a row, right after the first failure, with the same result. My only option was to download the file manually once the upgrade had started, to work around the timeout.

You're most likely right that the timeout period is too short.

As for the agents: I don't have any outdated agent right now on which I could test this again.

ycombinator commented 3 months ago

@cllasyx Would you mind timing your curl command from the same host as before, so we can get a sense of how long it's taking?

time curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc

Thanks.

cmacknz commented 3 months ago

I don't think it matters how long the download takes on this system. I can see in our code that the .asc download does not share a context timeout with the agent package download and does not have retries. https://github.com/elastic/elastic-agent/blob/ca726a219e7289ca1278653003c8dc299d302093/internal/pkg/agent/application/upgrade/step_download.go#L103-L121

In the case of the HTTP verifier, we make a single attempt to get the file with a 30s timeout and no retries, which is definitely wrong. 30s is fine as the timeout for an individual request, but we should keep retrying as long as the overall upgrade download timeout has not expired.

https://github.com/elastic/elastic-agent/blob/ca726a219e7289ca1278653003c8dc299d302093/internal/pkg/agent/application/upgrade/artifact/download/http/verifier.go#L166-L185

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

lucabelluccini commented 2 months ago

Do we agree the problem here was the download of the asc file?

The PGP key download is not mandatory at the moment, since the agent will try anyway to use the key embedded in the binary itself?

cllasyx commented 2 months ago

Yes, the problem was definitely the download of the .asc file used for PGP verification.

lucabelluccini commented 2 months ago

The asc is not the PGP key. What I meant by the question is: the PGP warning is a red herring. Downloading the asc was the problem.

cllasyx commented 2 months ago

I didn't say the .asc file is the PGP key; I said it's used for verification, which is true. My response also states that "the problem was definitely the download of the .asc file", which answers your question.