fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Redis becoming unavailable during a live query, likely due to excessive rate of requests. #16331

Closed · rfairburn closed this issue 4 months ago

rfairburn commented 5 months ago

Fleet version: v4.43.0

Web browser and operating system: N/A


💥  Actual behavior

Errors like the following presented themselves during a live query:

{"err":"load active queries: dial tcp 10.10.31.9:6379: i/o timeout","level":"error","op":"QueriesForHost","ts":"2024-01-24T16:54:13.780128981Z"}

The health check for Redis was also starting to fail during the same interval as follows:

{"component":"healthz","err":"reading from redis: dial tcp 10.10.31.9:6379: i/o timeout","health-checker":"redis","ts":"2024-01-24T17:07:10.240474858Z"}

Monitoring Redis showed a large volume of traffic of this nature:

1706131306.999634 [0 10.10.3.53:34242] "smembers" "livequery:active"

A one-second sample yielded 1402 requests.
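
The traffic line above is MONITOR output (redis-cli monitor produces lines in exactly that format). Below is a minimal sketch, not Fleet code, of how such a one-second sample could be taken in Go with redigo, counting SMEMBERS calls against the livequery:active key; the address and time window are illustrative.

// Minimal sketch, not Fleet code: count SMEMBERS livequery:active commands
// seen on the MONITOR stream during a one-second window.
package main

import (
  "fmt"
  "strings"
  "time"

  "github.com/gomodule/redigo/redis"
)

func main() {
  // The read timeout makes Receive return instead of blocking forever if the
  // server goes quiet before the window ends.
  conn, err := redis.Dial("tcp", "10.10.31.9:6379", redis.DialReadTimeout(2*time.Second))
  if err != nil {
    panic(err)
  }
  defer conn.Close()

  // MONITOR turns this connection into a stream of every command the server executes.
  if err := conn.Send("MONITOR"); err != nil {
    panic(err)
  }
  if err := conn.Flush(); err != nil {
    panic(err)
  }

  count := 0
  deadline := time.Now().Add(time.Second)
  for time.Now().Before(deadline) {
    line, err := redis.String(conn.Receive())
    if err != nil {
      break
    }
    if strings.Contains(strings.ToLower(line), `"smembers" "livequery:active"`) {
      count++
    }
  }
  fmt.Printf("smembers livequery:active calls in one second: %d\n", count)
}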

Here are a few screenshots of the Redis load at the time:

[Four screenshots of Redis load omitted]

🧑‍💻  Steps to reproduce

  1. ~50k hosts
  2. m6g.xlarge Redis instances
  3. Live query activity

🕯️ More info (optional)

N/A

JoStableford commented 5 months ago

Related to a Slack conversation

mna commented 5 months ago

From the Slack convo:

Ongoing issue that affects a customer 2-3x a day.

mna commented 5 months ago

Just saw this in my notifications; it may have played a role in increasing Redis load (which would have been exacerbated during a live query): https://github.com/fleetdm/fleet/pull/16334.

mna commented 5 months ago

It's labeled endpoint ops, but I'm happy to take this one if it makes sense for y'all once it's prioritized, @sharon-fdm and @georgekarrv.

sharon-fdm commented 5 months ago

@mna @georgekarrv @noahtalerman I have no objection.

lucasmrod commented 4 months ago

My current guess:

(I'm more on the side of amending the cleanup job to clean up Redis too.)

/cc @mna

lucasmrod commented 4 months ago

UPDATE: Guess confirmed. Manually removing the orphaned live query from Redis solved the issue (CPU went back to normal load). /cc @rfairburn
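
For anyone hitting the same symptom, a minimal sketch (not Fleet tooling) of that manual cleanup with redigo is below; the campaignID argument is illustrative, so inspect the SMEMBERS output before removing anything.

// Minimal sketch, not Fleet tooling: inspect the livequery:active set and
// remove one orphaned entry by hand.
package redistools

import (
  "fmt"

  "github.com/gomodule/redigo/redis"
)

func removeOrphanedLiveQuery(addr, campaignID string) error {
  conn, err := redis.Dial("tcp", addr)
  if err != nil {
    return err
  }
  defer conn.Close()

  // List what Redis currently considers "active" so the orphan can be identified.
  members, err := redis.Strings(conn.Do("SMEMBERS", "livequery:active"))
  if err != nil {
    return err
  }
  fmt.Println("active live query entries:", members)

  // SREM is a no-op if the member is already gone, so this is safe to re-run.
  _, err = conn.Do("SREM", "livequery:active", campaignID)
  return err
}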

mna commented 4 months ago

FYI @lucasmrod I'm working on a PR that will implement these three small changes:

a) add logging to the long-lived Redis connection that receives pubsub events, so we record when it has been blocked for a long time (which may be a cause of leaked/unavailable connections if many such connections exist)

b) add support for the conn_wait_timeout Redis configuration option in standalone mode (currently it is only supported in cluster mode)

c) update the cron job that marks the live queries as "completed" to also clean up any dead live queries from the "active" set in Redis (the fix you suggested in the comments above)

/cc @rfairburn
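
Not the actual PR, but a rough illustration of the shape change (a) could take: wrap the blocking pubsub receive, measure how long it blocked, and log when that exceeds some threshold. The package name, threshold, and logger wiring are assumptions.

// Rough illustration of change (a) only, not Fleet's implementation: log when
// a single pubsub receive was blocked for longer than a threshold.
package pubsubwatch

import (
  "time"

  "github.com/go-kit/log"
  "github.com/go-kit/log/level"
  "github.com/gomodule/redigo/redis"
)

// receiveWithBlockLogging wraps PubSubConn.Receive and logs when the call
// blocked longer than threshold before returning.
func receiveWithBlockLogging(psc *redis.PubSubConn, logger log.Logger, threshold time.Duration) interface{} {
  start := time.Now()
  msg := psc.Receive() // blocks until a message, subscription event, or error arrives
  if elapsed := time.Since(start); elapsed > threshold {
    level.Warn(logger).Log(
      "msg", "pubsub receive blocked for a long time",
      "elapsed", elapsed.String(),
    )
  }
  return msg
}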

lucasmrod commented 4 months ago

Thank you! (I can help with review.)

lucasmrod commented 4 months ago

While you are at it, please add the following minor change:

// Setting the status to completed stops the query from being sent to
// targets. If this fails, there is a background job that will clean up
// this campaign.
defer svc.CompleteCampaign(ctx, campaign) //nolint:errcheck

to

// Setting the status to completed stops the query from being sent to
// targets. If this fails, there is a background job that will clean up
// this campaign.
defer func() {
  if err := svc.CompleteCampaign(ctx, campaign); err != nil {
    level.Error(logger).Log("msg", "complete campaign", "err", err)
  }
}()

That way we can at least search the logs when this happens and know the root cause (was it Redis that failed in svc.CompleteCampaign? Was it MySQL? etc.).

lucasmrod commented 4 months ago

(Assigning myself for review.)

lucasmrod commented 4 months ago

We've discussed a different approach, and due to prioritization in the MDM team, the endpoints team will implement the changes on top of https://github.com/fleetdm/fleet/pull/16855.

Lucas: How about the cron performs the following operations (see the sketch after the list):

  1. Get the active set from Redis (usually a low number of entries)
  2. In batches, check the status of the queries returned in (1) (a new datastore method that returns which of the provided query IDs are in status=completed)
  3. Remove those completed live queries from the Redis active set. Martin: we could cap this at e.g. a max of 1000 campaigns, so if there are more than that, the rest would be handled on the next cron run (instead of tying up the Redis server for too long with too many IDs).
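
A rough sketch of a cron job with that shape, in Go with redigo. CompletedCampaignIDs is a hypothetical stand-in for the new datastore method from step 2, and the cap mirrors Martin's suggestion; the real implementation is the one built on top of the PR linked above.

// Rough sketch of the proposed cron steps, not Fleet's actual implementation.
package cleanup

import (
  "context"

  "github.com/gomodule/redigo/redis"
)

const maxCampaignsPerRun = 1000 // cap suggested above; the rest waits for the next run

// Datastore is a stand-in for the new datastore method described in step 2.
type Datastore interface {
  // CompletedCampaignIDs returns the subset of ids whose campaigns are in
  // status "completed" (hypothetical signature).
  CompletedCampaignIDs(ctx context.Context, ids []string) ([]string, error)
}

func cleanupCompletedLiveQueries(ctx context.Context, conn redis.Conn, ds Datastore) error {
  // 1. Get the active set from Redis (usually a small number of entries).
  active, err := redis.Strings(conn.Do("SMEMBERS", "livequery:active"))
  if err != nil {
    return err
  }
  if len(active) > maxCampaignsPerRun {
    active = active[:maxCampaignsPerRun]
  }

  // 2. Check which of those campaigns MySQL already marks as completed.
  completed, err := ds.CompletedCampaignIDs(ctx, active)
  if err != nil {
    return err
  }
  if len(completed) == 0 {
    return nil
  }

  // 3. Remove the completed campaigns from the Redis active set.
  args := redis.Args{}.Add("livequery:active").AddFlat(completed)
  _, err = conn.Do("SREM", args...)
  return err
}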

lucasmrod commented 4 months ago

@getvictor Assigning this to you because this is related to the PR you've taken over.

xpkoala commented 4 months ago

@getvictor @lucasmrod Could either of you help me understand whether this was part of the fix that went out with 4.46.1? I see an attached PR for the new frequent_cleanups job, which was lightly tested before 4.46.1 was sent to a customer.

The frequent_cleanups job was tested to ensure it runs on the 15-minute schedule, can be run manually, and in both cases cleans orphaned campaigns from Redis. The new job was also run simultaneously with other queued jobs with no issues.

If there is more to this PR that I'm missing and that needs additional validation, please let me know!

getvictor commented 4 months ago

@xpkoala Yes, these fixes and the frequent_cleanups switch (defaulting to off) went out with 4.46.1.

fleet-release commented 4 months ago

Redis strains under load,
Queries like rain on leaves fall,
Fleet's calm in the storm.