Elastic agent gets stuck trying to install application

AndersonQ commented 2 years ago

If the elastic agent, managed by fleet, finds the application to be installed on its download folder, it'll unpack and install. However if the install operation fails, it'll retry it indefinitely, preventing any other application to be installed. Others could succeed.

What I did:

create a tar ball adding the endpoint-security to data/elastic-agent-*/download/
the endpoint-security was "broken", the uncompressed folder was different from the .tar.gz file. The uncompressed folder would have the -SNAPSHOT on the name, while the tar.gz file did not.
had a policy including the endpoint-security, system integration and monitoring

For the time I observed, the elastic-agent was stuck in the following loop:

start the install procedure
unpack the endpoint-security
not find the unpacked folder (as the name was different from what the elastic-agent was looking for)
remove the tmp folder and its contents
start the install procedure again

Here are some debug logs showing it stuck on the retry while trying to install the elastic-endpoint:

gist
image:

jlind23 commented 2 years ago

@nimarezainia What should the behaviour to observe here? Should we try only a couple of times until moving to the next step?

cc @ph @blakerouse

ph commented 2 years ago

I wonder if we should attack the problem differently, I believe the startup or software on elastic agent is done serially, maybe we should start them in parallel and have a retry strategy for the different software. Note, I am not sure how hard it would be to move to that startup because we will have to audit code for any races.

blakerouse commented 2 years ago

I agree with @ph we should do them in parallel.

nimarezainia commented 2 years ago

I agree with @ph we should do them in parallel.

also needs to be deterministic. We should have a limit on retry.

AndersonQ commented 2 years ago

I agree the agent should start applications concurrently, however making it concurrent does not solves the problem of the start up process getting stuck. Therefore I believe they should be 2 different tickets, 1 bug and 1 enhancement.

ph commented 2 years ago

@AndersonQ Agree, two differents issues.

nimarezainia commented 2 years ago

@AndersonQ and @ph how do we want to proceed with this issue? as the enhancement i think we start in parallel and provide a time out.

the bug here seemed manufactured but in the future when something fails to install in the interval provided - I agree that is a bug and needs to be resolved.

AndersonQ commented 2 years ago

how do we want to proceed with this issue? as the enhancement i think we start in parallel and provide a time out.

Sounds like a plan

he bug here seemed manufactured but in the future when something fails to install in the interval provided - I agree that is a bug and needs to be resolved.

Sorry I don't follow, what do you mean by bug here seemed manufactured?

Anyway regarding failing to to install/start an application I believe the desired outcome is:

once it fails to install an application, mark the application as "failing", keep the agent as "configuring" on fleet-ui, move to the nest application in the list
once all applications are installed (either running or failing), the agent status is "healthy/unhealthy" depending on if all applications are health or not.

Ideally we won't have a infinite loop retrying to install an application. However I don't know how simple would be to skip it and keep track of it as a running application that is failing/broken. Anyway we'd need it, so we'd need to do it anyway.

how do we want to proceed with this issue?

I would:

make this ticket a bug add it to the planning.
create a enhancement ticket for the concurrent start. This should be done after the bug is fixed

ph commented 2 years ago

I agree with your plan above @AndersonQ.

Is this following assumption correct, we have a counter in place for each application, on failure the counter is increased to a set maximun, when the maximun is reached we move it to unhealth and we stop retry to install the application. If a new agent policy is received the counter is reset to 0 and we retry again.

AndersonQ commented 2 years ago

I'd need to investigate the details. I imagine the counter is only for checkin, not installation attempts.

jlind23 commented 2 years ago

I scoped it for 8.4, we will do the investigation in 8.4 release.

elastic / elastic-agent

Elastic agent gets stuck trying to install application #191