elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic agent gets stuck trying to install application #191

Open AndersonQ opened 2 years ago

AndersonQ commented 2 years ago

If the Elastic Agent, managed by Fleet, finds the application to be installed in its download folder, it unpacks and installs it. However, if the install operation fails, it retries indefinitely, preventing any other application from being installed, even though the others could succeed.

What I did:

For the time I observed, the elastic-agent was stuck in the following loop:

Here are some debug logs showing it stuck on the retry while trying to install the elastic-endpoint:

jlind23 commented 2 years ago

@nimarezainia What should the behaviour be here? Should we retry only a couple of times before moving on to the next step?

cc @ph @blakerouse

ph commented 2 years ago

I wonder if we should attack the problem differently. I believe the startup of software on the Elastic Agent is done serially; maybe we should start them in parallel and have a retry strategy for each piece of software. Note: I am not sure how hard it would be to move to that startup model, because we would have to audit the code for races.

blakerouse commented 2 years ago

I agree with @ph we should do them in parallel.

nimarezainia commented 2 years ago

I agree with @ph we should do them in parallel.

It also needs to be deterministic. We should have a limit on retries.

AndersonQ commented 2 years ago

I agree the agent should start applications concurrently; however, making it concurrent does not solve the problem of the startup process getting stuck. Therefore I believe there should be two different tickets: one bug and one enhancement.

ph commented 2 years ago

@AndersonQ Agreed, two different issues.

nimarezainia commented 2 years ago

@AndersonQ and @ph how do we want to proceed with this issue? As for the enhancement, I think we start in parallel and provide a timeout.

The bug here seemed manufactured, but in the future, when something fails to install in the interval provided, I agree that is a bug and needs to be resolved.

AndersonQ commented 2 years ago

how do we want to proceed with this issue? as the enhancement i think we start in parallel and provide a time out.

Sounds like a plan

the bug here seemed manufactured but in the future when something fails to install in the interval provided - I agree that is a bug and needs to be resolved.

Sorry I don't follow, what do you mean by bug here seemed manufactured?

Anyway, regarding failing to install/start an application, I believe the desired outcome is:

Ideally we won't have an infinite loop retrying to install an application. However, I don't know how simple it would be to skip the application and keep track of it as a running application that is failing/broken. We'd need that either way, so we'd have to do it regardless.

how do we want to proceed with this issue?

I would:

ph commented 2 years ago

I agree with your plan above @AndersonQ.

Is the following assumption correct? We have a counter in place for each application. On failure, the counter is incremented, up to a set maximum. When the maximum is reached, we mark the application as unhealthy and stop retrying the install. If a new agent policy is received, the counter is reset to 0 and we retry again.
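To make the assumption above concrete, here is a minimal sketch of such a per-application counter. It is illustrative only: `installTracker` and its methods are hypothetical names, not existing elastic-agent types, and it deliberately ignores concurrency and persistence.

```go
package main

import "fmt"

// installTracker is a hypothetical helper that counts failed install attempts
// per application. It is not part of elastic-agent.
type installTracker struct {
	maxAttempts int
	failures    map[string]int
}

func newInstallTracker(maxAttempts int) *installTracker {
	return &installTracker{maxAttempts: maxAttempts, failures: make(map[string]int)}
}

// RecordFailure increments the application's counter (capped at maxAttempts)
// and reports whether the application should now be marked unhealthy.
func (t *installTracker) RecordFailure(app string) bool {
	if t.failures[app] < t.maxAttempts {
		t.failures[app]++
	}
	return t.failures[app] >= t.maxAttempts
}

// ShouldRetry reports whether another install attempt is still allowed.
func (t *installTracker) ShouldRetry(app string) bool {
	return t.failures[app] < t.maxAttempts
}

// ResetAll clears every counter, as would happen when a new policy arrives.
func (t *installTracker) ResetAll() {
	t.failures = make(map[string]int)
}

func main() {
	t := newInstallTracker(3)
	for t.ShouldRetry("endpoint") {
		unhealthy := t.RecordFailure("endpoint") // pretend the install failed
		fmt.Printf("install failed, unhealthy=%v\n", unhealthy)
	}
	t.ResetAll() // a new agent policy was received
	fmt.Println("retry allowed after new policy:", t.ShouldRetry("endpoint"))
}
```

Keeping the counter per application means one broken install cannot use up the retry budget of another.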

AndersonQ commented 2 years ago

I'd need to investigate the details. I imagine the counter is only for checkin, not installation attempts.

jlind23 commented 2 years ago

I scoped it for 8.4; we will do the investigation in the 8.4 release.