elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Possible to not ACK upgrade action #760

Closed: blakerouse closed this issue 2 years ago

blakerouse commented 2 years ago

It seems possible for an upgrade action to result in the Elastic Agent upgrading successfully but never fully ACKing the action.

This seems to happen when an Elastic Agent has connectivity issues with Fleet Server at the time of a successful upgrade. I have seen this reported in an SDH and worked around it there, but we need to track down how and why it happens. The upgrade process needs to be foolproof, so this issue needs to be tracked down and fixed.
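
For context on what "fully ACKing" involves, here is a minimal sketch of how an upgrade ACK could be retried until Fleet Server accepts it, so a transient connectivity problem does not leave an otherwise successful upgrade un-ACKed. This is not the agent's actual fleet gateway code; the Acker interface, the AckWithRetry name, and the backoff values are illustrative assumptions.

package fleetack

import (
	"context"
	"fmt"
	"time"
)

// Acker stands in for whatever client the agent uses to send action ACKs to
// Fleet Server.
type Acker interface {
	Ack(ctx context.Context, actionID string) error
}

// AckWithRetry re-sends the ACK with capped exponential backoff until it
// succeeds or the context is cancelled, so a transient connectivity problem
// does not leave an otherwise successful upgrade permanently un-ACKed.
func AckWithRetry(ctx context.Context, acker Acker, actionID string) error {
	backoff := time.Second
	for {
		err := acker.Ack(ctx, actionID)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on ack of action %s: %w (last error: %v)", actionID, ctx.Err(), err)
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}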

For confirmed bugs, please report:

Known causes:

cmacknz commented 1 year ago

I am not sure whether these reports all share the same root cause; without agent logs it is impossible to tell. It is unlikely to be the same bug described in this issue, just a similar symptom.

12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact

This is coming from the code below. Is there an accompanying error.message with the reason why the unpack failed? This error means the agent successfully downloaded the artifact but could not untar or unzip it.

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/agent/application/upgrade/step_unpack.go#L31-L41
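
For readers following along, here is a simplified sketch of what that step does. It is paraphrased from the linked step_unpack.go, and the names, parameters, and signatures here are illustrative rather than the actual ones: the artifact has already been downloaded at this point, and the step only chooses an extractor and should surface why extraction failed.

package upgrade

import (
	"fmt"
	"runtime"
)

// untar and unzip stand in for the real extraction helpers; their bodies are
// omitted here.
var (
	untar func(version, archivePath, dataDir string) (string, error)
	unzip func(version, archivePath, dataDir string) (string, error)
)

// unpack chooses an extractor based on the platform's artifact format and
// wraps any failure so the log entry carries the concrete reason (corrupt or
// truncated archive, permissions, disk space) rather than only the generic
// "Failed to unpack upgrade artifact" message.
func unpack(version, archivePath, dataDir string) (string, error) {
	var hash string
	var err error
	if runtime.GOOS == "windows" {
		hash, err = unzip(version, archivePath, dataDir) // .zip artifact
	} else {
		hash, err = untar(version, archivePath, dataDir) // .tar.gz artifact
	}
	if err != nil {
		return "", fmt.Errorf("failed to unpack upgrade artifact %q: %w", archivePath, err)
	}
	return hash, nil
}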

dmgeurts commented 1 year ago

The error is given after a reboot. I'm not sure whether the agent or the Fleet Server would for some reason try to rerun the upgrade, or whether it fails because it's trying to reinstall the same version.

There are no further accompanying error messages. The rest of the messages show a normal restart of the agent, with everything eventually returning to healthy. Yet Fleet still shows the agent as updating.

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally, and then see how things go with 8.6.2.

cmacknz commented 1 year ago

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally.

Just to clarify, did the agents that show as updating complete the upgrade to the desired version successfully? If so, I suspect this is a synchronization issue in the UI, and it would be fine to leave the agents like this, although we would like to know why it happens. If that is what is going on, I'll need to find someone from the Fleet UI team to think about why this might be the case.

dmgeurts commented 1 year ago

No, they rolled back to 8.6.0 after a while but remained as updating in Fleet. I couldn't retry, as Fleet was stuck on updating for these agents. So I did what at the time I thought was the only thing I could do and ran sudo elastic-agent upgrade 8.6.1 from the agent machine. This showed me that the upgrade itself wasn't the issue, as it completed fine. Fleet would then detect and show the new version, but the status never changed from updating to anything else. I wasn't aware I could force an upgrade from the Elastic console.

TBH, I must have had two different issues across my agents, as some agents came back healthy when I updated the policy applied to them.

What is certain, though, is that it would be nice to have a way to force Fleet to re-check the status of an agent: if the agent is happy and on the right version, then surely the status should be updated in Fleet.

As for the Fleet Server itself showing as updating, I'm at a loss as to how to fix that, other than removing the Fleet Server and reinstalling the agent there.

cmacknz commented 1 year ago

OK, thanks. We have one bug we just discovered where, if an upgrade fails and then rolls back, the local agent downloads directory needs to be restored or the next upgrade will fail: https://github.com/elastic/elastic-agent/pull/2222. The data/elastic-agent-$hash/downloads directory needs to be restored before retrying.

That would not have been the original problem, but it may be affecting you now. If you can get us the logs for these failed upgrades, we would love to look at them to confirm what is happening. If you have failed upgrades, you will have multiple data/elastic-agent-$hash/logs directories at the root of the agent installation directory; grabbing those logs directories is what we would want. If you run the normal diagnostics command, it will only collect the logs for the currently running agent, not the one that failed to upgrade. Obtaining failed-upgrade logs needs to be done manually for now; we are planning to automate this process eventually.
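
As an illustration of that manual collection step, here is a hedged sketch that assumes the default Linux install path shown later in this thread. It only enumerates the .ndjson files across every hash directory so they can be copied off the host; it is not part of the elastic-agent CLI.

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Every attempted upgrade leaves its own data/elastic-agent-<hash>/logs
	// directory; the diagnostics command only captures the running agent's
	// logs, so the others have to be collected by hand.
	pattern := "/opt/Elastic/Agent/data/elastic-agent-*/logs/*.ndjson"
	files, err := filepath.Glob(pattern)
	if err != nil {
		fmt.Println("bad glob pattern:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f) // copy each of these off the host, one set per hash directory
	}
}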

dmgeurts commented 1 year ago

I found that by issuing the upgrade command locally, the upgrade did succeed.

There's only one data/elastic-agent-$hash/logs folder:

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/
total 16
drwxrwx--- 4 root root 4096 Feb  2 23:35 .
drwxrwx--- 4 root root 4096 Feb  3 14:00 ..
-rw------- 1 root root    0 Jan 16 11:55 agent.lock
drwxr-xr-x 6 root root 4096 Feb  3 14:00 elastic-agent-b8553c
drwxr-x--- 2 root root 4096 Feb  3 14:00 tmp

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/elastic-agent-b8553c/logs/
total 22552
drwx------ 2 root root     4096 Feb  3 14:00 .
drwxr-xr-x 6 root root     4096 Feb  3 14:00 ..
-rw------- 1 root root 10485694 Feb  3 07:32 elastic-agent-20230202.ndjson
-rw------- 1 root root  4338164 Feb  3 17:17 elastic-agent-20230203-1.ndjson
-rw------- 1 root root  8242770 Feb  3 14:00 elastic-agent-20230203.ndjson

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

joshdover commented 1 year ago

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

@dmgeurts thanks for reporting this. I agree it's unlikely to be the same root cause. A few questions that may help us narrow down this part of the problem:

dmgeurts commented 1 year ago
  • What exactly is the state in the UI? Is Fleet UI showing the correct version number after the manual upgrade but still showing the agent as "updating"?

Yes, Fleet shows the version number as 8.6.1 and the status as updating.

  • Do you have any agent logs from this time period that you could share? We added an automatic retry on upgrade failures in 8.6.0. It may have emitted some helpful information in the logs.

Aren't the logs deleted when an upgrade succeeds? That's my assumption, as the logs folder is inside the agent folder and I only have one agent folder under /opt/Elastic/Agent/data/, which is for v8.6.1. Or would they be the initial logs of the agent?

  • Do you have a Logstash output configured on the agent policy you're using?

No, I haven't created a Logstash output yet. All agents are still using the default (direct to Elasticsearch).