Open jlind23 opened 10 months ago
Pinging @elastic/fleet (Team:Fleet)
@juliaElastic @kpollich Could you please take a look at this and tell me what you think / how feasible this is?
@jlind23 and I chatted about this some more.
This came up after starting to take a look at the upgrade status telemetry reported to prod so far (which is likely only coming from tests, since 8.12 is not released). The most concerning finding was that no agents appeared in `UPG_SUCCEED` or `UPG_FAILED` status in the telemetry we can see. I think the main thing we need to verify is that those statuses are reported correctly. I also think they should keep being reported as the "last upgrade status" and shouldn't get cleared before that agent starts its next upgrade. That way we can be sure we're capturing an accurate representation of the upgrade health of the entire fleet.
Unfortunately I think the other ideas here are going to be very hard to pull off technically.
> Get, for each Agent, all the statuses reported, and be able to track the lifecycle of a given agent. (We probably need one document to be created every time an agent reports its state.)
This would require that Fleet Server check if the field has changed on every check in, which has perf implications. We'd then also need a way for Fleet Server to push a document for Kibana to pickup for telemetry reporting.
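To make the perf concern concrete, here is a minimal sketch of the per-check-in diffing Fleet Server would need to do. The `UpgradeDetails` shape and the `diffUpgradeState` helper are assumptions for illustration, not the actual Fleet Server code; the point is that every check-in pays a comparison cost even when nothing changed.

```typescript
// Hypothetical shape loosely mirroring the upgrade_details object in
// .fleet-agents; the real mapping may differ.
interface UpgradeDetails {
  state: string;
  target_version: string;
}

// Returns a lifecycle document to index when the reported state changed,
// or null when nothing changed (the common case, which still pays the
// comparison cost on every check-in).
function diffUpgradeState(
  previous: UpgradeDetails | undefined,
  current: UpgradeDetails | undefined
): { event: string; details: UpgradeDetails } | null {
  if (JSON.stringify(previous) === JSON.stringify(current)) return null;
  if (!current) return { event: 'upgrade_details_cleared', details: previous! };
  return { event: 'upgrade_state_changed', details: current };
}
```

Kibana would then need some channel (e.g. a dedicated index) to pick these documents up for telemetry reporting, which is the second cost mentioned above.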
> Know the upgrade success/failure ratio for a given target release. Know when an upgrade is finished and get overall statistics for it.
These two are related - it's hard to know when an upgrade is fully done, and it's unclear how to count agents that were tasked with an upgrade but never checked in again.
> Know when an upgrade was started and how many Agents were triggered.
This one is pretty feasible, but if we don't have the completion event above it's not very useful.
I think for now we can go ahead with what we have and just make sure the success and failure upgrade states are continuously being reported.
The current logic takes the current `upgrade_details` found in `.fleet-agents` at the time of running the telemetry task (hourly). AFAIK `UPG_FAILED` is a final state and is not cleared, so it should appear in telemetry (unless no failures happened). However, there is no such state as `UPG_SUCCEED`, and upgrade details are cleared if the upgrade was successful.
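For reference, the hourly task boils down to something like the aggregation below. This is a hedged sketch of the kind of query it could run over `.fleet-agents`; the exact request Kibana issues may differ.

```typescript
// Illustrative aggregation counting agents by their current upgrade state.
// Field and index names follow the discussion above; treat as an assumption.
const upgradeStateAgg = {
  index: '.fleet-agents',
  size: 0,
  aggs: {
    upgrade_states: {
      terms: { field: 'upgrade_details.state' },
    },
  },
};
```

Because successful upgrades clear `upgrade_details`, those agents simply fall out of this aggregation, which is why no `UPG_SUCCEED`-like bucket shows up.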
> Though there is no such state as UPG_SUCCEED, and upgrade details are cleared if the upgrade was successful.
Can we find a way to infer this state then?
We could probably infer it based on the `upgraded_at` field, which is set when the upgrade completes.
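Putting the two observations together, the inference could look roughly like this. The field names (`upgrade_details`, `upgraded_at`, `upgrade_started_at`) follow the discussion above; the `lastUpgradeOutcome` helper and the `Outcome` labels are hypothetical, not an existing Kibana API.

```typescript
// Subset of a .fleet-agents document relevant to upgrade telemetry.
interface FleetAgent {
  upgrade_details?: { state: string }; // cleared on success
  upgraded_at?: string;                // set when an upgrade completed
  upgrade_started_at?: string;         // set when an upgrade was triggered
}

type Outcome = 'failed' | 'in_progress' | 'succeeded' | 'never_upgraded';

function lastUpgradeOutcome(agent: FleetAgent): Outcome {
  // UPG_FAILED is a final state and is not cleared.
  if (agent.upgrade_details?.state === 'UPG_FAILED') return 'failed';
  if (agent.upgrade_details) return 'in_progress';
  // No details left: since there is no UPG_SUCCEED state and details are
  // cleared on success, infer success from upgraded_at.
  if (agent.upgraded_at) return 'succeeded';
  return agent.upgrade_started_at ? 'in_progress' : 'never_upgraded';
}
```

With something like this, the hourly task could keep reporting a "last upgrade status" for every agent until its next upgrade starts, as suggested earlier in the thread.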
Telemetry for Agent upgrade details has already been implemented here.
The current telemetry is great for highlighting the most common problems when trying to upgrade Agents, but it is not enough to give the full picture. What we want to be able to do on a per-cluster basis is: