Closed rfairburn closed 4 months ago
From the slack convo:
Ongoing issue that affects a customer 2-3x a day.
Just saw this in my notifications, it may have played a role in increasing redis load (which would have been exacerbated during a live query): https://github.com/fleetdm/fleet/pull/16334.
It's labeled endpoint ops, but I'm happy to take this one if it makes sense for y'all once it's prioritized, @sharon-fdm and @georgekarrv.
@mna @georgekarrv @noahtalerman I have no objection.
My current guess:
`QueriesForHost` will perform more Redis calls than usual until the orphaned live query is cleaned up from Redis. (I lean toward amending the cleanup job to clean up Redis too.)
/cc @mna
UPDATE: Guess confirmed. Manually removing the orphaned live query from Redis solved the issue (CPU went back to normal load). /cc @rfairburn
FYI @lucasmrod I'm working on a PR that will implement those 3 small changes:
a) add logging to the long-lived Redis connection that receives pubsub events, to record when it has been blocked for a long time (which may be a cause of leaked/unavailable connections if many such connections exist)
b) add support to conn_wait_timeout
redis configuration for the standalone mode (currently it is only supported for cluster mode)
c) update the cron job that marks the live queries as "completed" to also clean up any dead live queries from the "active" set in Redis (the fix you suggested in the comments above)
/cc @rfairburn
Thank you! (I can help with review.)
While you are at it, please add the following minor change:
```go
// Setting the status to completed stops the query from being sent to
// targets. If this fails, there is a background job that will clean up
// this campaign.
defer svc.CompleteCampaign(ctx, campaign) //nolint:errcheck
```

to

```go
// Setting the status to completed stops the query from being sent to
// targets. If this fails, there is a background job that will clean up
// this campaign.
defer func() {
	if err := svc.CompleteCampaign(ctx, campaign); err != nil {
		level.Error(logger).Log("msg", "complete campaign", "err", err)
	}
}()
```
So that at least we can search the logs when this happens and know the root cause (was it Redis that failed in `svc.CompleteCampaign`? Was it MySQL? etc.)
(Assigning myself for review.)
We've discussed a different approach, and due to prioritization in the MDM team, the endpoints team will be implementing the changes on top of https://github.com/fleetdm/fleet/pull/16855.
Lucas: How about the cron performs the following operations:
1. Get the active set from Redis (usually a low number of entries).
2. In batches, check the status of the queries returned in (1) (a new datastore method that returns which of the provided query IDs are in `status=complete`).
3. Remove such completed live queries from the Redis active set.

Martin: and we could cap this to e.g. max 1000 campaigns, so if there are more than this it would be done on the next cron run (instead of hanging the Redis server for too long with too many IDs).
@getvictor Assigning this to you because this is related to the PR you've taken over.
@getvictor @lucasmrod Could either of you help me understand if this was part of the fix that went out with 4.46.1? I see an attached PR for the new `frequent_cleanups` job, which was lightly tested prior to sending 4.46.1 to a customer.
The `frequent_cleanups` job was tested to ensure it runs on the 15-minute schedule, can be run manually, and in both cases cleans orphaned campaigns from Redis. The new job was also run simultaneously with other queued jobs, with no issues.
If there is more to this PR I am missing that needs additional validation please let me know!
@xpkoala Yes, these fixes and the `frequent_cleanups` switch (defaulting to off) went out with 4.46.1.
Redis strains under load,
Queries like rain on leaves fall,
Fleet's calm in the storm.
Fleet version: v4.43.0
Web browser and operating system: N/A
💥 Actual behavior
Errors like the following presented themselves during a live query:
The health check for Redis was also starting to fail during the same interval as follows:
Monitoring redis showed a large amount of traffic of this nature:
A one-second sample yielded 1402 requests.
Here are a few screenshots of the Redis load at the time:
🧑‍💻 Steps to reproduce
🕯️ More info (optional)
N/A