fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Fleet servers should know and report Agents that have partial communications #11844

Open sharon-fdm opened 1 year ago

sharon-fdm commented 1 year ago

Goal

As the developer of the Fleet agent, I would like to know whether any of our installed agents have problems communicating on one channel while other channels still work (e.g. osquery communicates well while orbit does not), so that I can identify and fix bugs in this area.

Changes

1 - On the Fleet server, add a new DB table keyed by host ID, with one column for each type of communication (osquery, orbit, config, or other...). Whenever an agent communicates with the server on any channel, the server code handling that request will write a timestamp for that host in the corresponding column.

2 - An additional health metric called "Problematic agents" will be added.

3 - Once every X days/hours the fleet server will go over this table and check for agents that:

QA

Make sure the feature works (possibly by running an agent with a broken channel, or in any other way).

Risk assessment

LOW - Possibly more load on the DB, but the extra writes will be spread out at the same rate as regular agent check-ins.

Manual testing steps

Testing notes

zhumo commented 1 year ago

Hi @sharon-fdm, unfortunately, we were not able to get to this work in our 6-week timeframe. Please bring this back to Feature Fest if it's still desired. Thanks!

sharon-fdm commented 1 year ago

@zhumo Thank you. I will keep track of all those closed items and bring them to Feature Fest if/when possible.