fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Add a cron job to clean up old/done worker jobs #12018

Open mna opened 1 year ago

mna commented 1 year ago

Goal

As we start using the background worker more and more, we will create more jobs stored in the jobs table. Those jobs are never deleted at the moment (this wasn't urgent as the numbers were expected to be pretty low).

Context

The background worker (https://github.com/fleetdm/fleet/tree/main/server/worker) runs as a cron job every so often and processes any pending jobs (jobs in a "Queued" state) from the queue (a MySQL table). Once processed, a job is switched to "Success", back to "Queued" for a retry, or to "Failed" if it failed on every retry (with the latest error message stored in the error field). The jobs are never deleted at the moment.

Potential solutions

Failed jobs can be useful for debugging: a job is unlikely (though not impossible, of course) to end in a Failed state unless something is wrong in the implementation or the data is unexpected. We retry 5 times, waiting a bit longer between each retry, so even when a third party fails (such as an external API), it should have sufficient time to get back online. We don't have a dashboard or UI to help investigate those jobs for now, so at the moment the only way to make use of them is to query the DB directly.

But after some time, even failed jobs probably have little value (e.g. maybe after 3 months or more). And successful jobs have very little value, except maybe as stats/visibility/history. But again, we don't have any UI exposing the worker's processing, so this is only valuable for folks querying the table directly.

I think we could start conservatively and implement a clean-up cron for failed jobs older than a year, and successful jobs older than a few months? Just to make sure we don't let it grow unbounded.

A future improvement could be to build some UI to inspect those worker jobs, and config options for how long to retain such jobs.

(tagging g-mdm just to make sure it ends up on a product board, feel free to rearrange if this is wrong, but this is not strictly related to MDM)

lukeheath commented 1 year ago

@mna @georgekarrv I'm removing this from the board until it is re-spec'd as a user story.