Open AndersonQ opened 2 years ago
@nimarezainia What should the behaviour to observe here? Should we try only a couple of times until moving to the next step?
cc @ph @blakerouse
I wonder if we should attack the problem differently, I believe the startup or software on elastic agent is done serially, maybe we should start them in parallel and have a retry strategy for the different software. Note, I am not sure how hard it would be to move to that startup because we will have to audit code for any races.
I agree with @ph we should do them in parallel.
I agree with @ph we should do them in parallel.
also needs to be deterministic. We should have a limit on retry.
I agree the agent should start applications concurrently, however making it concurrent does not solves the problem of the start up process getting stuck. Therefore I believe they should be 2 different tickets, 1 bug and 1 enhancement.
@AndersonQ Agree, two differents issues.
@AndersonQ and @ph how do we want to proceed with this issue? as the enhancement i think we start in parallel and provide a time out.
the bug here seemed manufactured but in the future when something fails to install in the interval provided - I agree that is a bug and needs to be resolved.
how do we want to proceed with this issue? as the enhancement i think we start in parallel and provide a time out.
Sounds like a plan
he bug here seemed manufactured but in the future when something fails to install in the interval provided - I agree that is a bug and needs to be resolved.
Sorry I don't follow, what do you mean by bug here seemed manufactured?
Anyway regarding failing to to install/start an application I believe the desired outcome is:
Ideally we won't have a infinite loop retrying to install an application. However I don't know how simple would be to skip it and keep track of it as a running application that is failing/broken. Anyway we'd need it, so we'd need to do it anyway.
how do we want to proceed with this issue?
I would:
I agree with your plan above @AndersonQ.
Is this following assumption correct, we have a counter in place for each application, on failure the counter is increased to a set maximun, when the maximun is reached we move it to unhealth and we stop retry to install the application. If a new agent policy is received the counter is reset to 0 and we retry again.
I'd need to investigate the details. I imagine the counter is only for checkin, not installation attempts.
I scoped it for 8.4, we will do the investigation in 8.4 release.
If the elastic agent, managed by fleet, finds the application to be installed on its
download
folder, it'll unpack and install. However if the install operation fails, it'll retry it indefinitely, preventing any other application to be installed. Others could succeed.What I did:
endpoint-security to data/elastic-agent-*/download/
-SNAPSHOT
on the name, while the tar.gz file did not.For the time I observed, the elastic-agent was stuck in the following loop:
Here are some debug logs showing it stuck on the retry while trying to install the elastic-endpoint: