fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Add short-duration (~1 second) caching of redis results to Fleet backend. #17379

Open rfairburn opened 4 months ago

rfairburn commented 4 months ago

Problem

As an Infrastructure Engineer I would like to have a short (~5 second) cache of Redis data in Fleet so that I can reduce unnecessary load on Redis from Fleet, particularly when performing or checking for live queries.

Live queries consume many times more Redis resources than might otherwise be expected, and with caching in place, I could probably scale down every provisioned Fleet Redis instance by several sizes.

To describe the nature of the problem: the first high-traffic item is devices checking whether there are any live queries and whether they are targeted by them. There are a couple of pieces to this, but it happens every distributed_interval in osquery (default 10s), and every device does it on that interval.

When a device checks in for queries, Fleet makes a request to Redis for SMEMBERS livequery:active. If I have 100k devices with a distributed interval of 10s, this is already 10k requests to redis per second before we ever have a live query.

Then for every live query in the livequery:active set, there is a request to GETBIT livequery:active:{livequery_id} <host_id> to see if the host is a target of the live query AND CONCURRENTLY there is also a request to GET sql:livequery:{livequery_id} to get a copy of the SQL for the query just in case we actually need it (whether we are a target or not). For a single live query, this also happens every subsequent distributed interval again whether we are a target or not until the distributed campaign ends.
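As a rough sketch of the pattern described above (with a stand-in for the Redis client; the names and types here are illustrative, not Fleet's actual code):

```go
package main

import "fmt"

// fakeRedis stands in for a real Redis client so the call pattern is
// runnable without a server; it counts commands to show per-check-in load.
type fakeRedis struct {
	calls int
}

func (f *fakeRedis) SMembers(key string) []string { f.calls++; return []string{"42"} }
func (f *fakeRedis) GetBit(key string, offset int) int {
	f.calls++
	if offset == 7 { // pretend only host 7 is targeted
		return 1
	}
	return 0
}
func (f *fakeRedis) Get(key string) string { f.calls++; return "SELECT * FROM osquery_info;" }

// checkInQueries sketches what happens for ONE host on ONE distributed interval.
func checkInQueries(r *fakeRedis, hostID int) map[string]string {
	out := map[string]string{}
	// 1. SMEMBERS livequery:active — issued once per host per interval,
	//    whether or not any live query is running.
	for _, id := range r.SMembers("livequery:active") {
		// 2. GETBIT livequery:active:{id} <host_id> — is this host targeted?
		targeted := r.GetBit(fmt.Sprintf("livequery:active:%s", id), hostID) == 1
		// 3. GET sql:livequery:{id} — fetched whether or not the host is targeted.
		sql := r.Get(fmt.Sprintf("sql:livequery:%s", id))
		if targeted {
			out[id] = sql
		}
	}
	return out
}

func main() {
	r := &fakeRedis{}
	checkInQueries(r, 7)  // targeted host: 3 commands
	checkInQueries(r, 99) // untargeted host: the same 3 commands anyway
	fmt.Println(r.calls)  // 6
}
```

The point is that steps 2 and 3 run for every active query on every host's check-in, targeted or not, which is what makes the SQL transfer scale with host count rather than with targeted-host count.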

We have some customers that use fairly complex queries with the live query system, sometimes as many as 15k characters of SQL. So if we take 15 KB of SQL data times our 100k hosts divided by the distributed interval, for a single query we end up with something that looks like this:

15000 bytes * 100000 hosts / 10 distributed_interval seconds = 150,000,000 bytes per second per query

This is for GET sql:livequery:{livequery_id} ONLY and no other redis traffic! This is true EVEN IF only 1 host is targeted out of the 100k. With more queries this can get even worse. Here is a network bytes usage graph on redis of a real customer who has less than 10,000 hosts, but leverages live queries fairly heavily:

[Image: Redis network bytes usage graph]

Now let's say that we cache for a short interval, even 5 seconds. This reduces the number of requests back to Redis to 2 per 10-second distributed interval per container. We do have to multiply by the number of Fleet containers, because each container would have its own cache, but we currently recommend only 20 containers for 100k hosts. So the traffic for sql:livequery:{livequery_id} would be something like sql_bytes * (distributed_interval / cache_seconds) * fleet_containers bytes per distributed interval. The numbers would look like this:

15000 sql_bytes * (10 distributed_interval_seconds / 5 second cache_interval) * 20 fleet_containers = 600,000 bytes per distributed interval (60,000 bytes per second)

This is over 3 orders of magnitude less data on that request only, and everything else would benefit as well. We could potentially bypass caching of GETBIT livequery:active:{id} <host_id>, as it wouldn't be meaningful to cache a per-host bit lookup in any case.
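To sanity-check the arithmetic, here is a quick sketch using the figures from this scenario (the 5s TTL is the value proposed at this point in the thread; it was later reduced to 1s, which shrinks the cached figure further):

```go
package main

import "fmt"

func main() {
	const (
		sqlBytes        = 15000  // ~15 KB of SQL in one live query
		hosts           = 100000 // fleet size in the example
		intervalSeconds = 10     // osquery distributed_interval
		cacheSeconds    = 5      // proposed cache TTL at this point in the thread
		fleetContainers = 20     // recommended container count for 100k hosts
	)

	// Uncached: every host fetches the SQL once per distributed interval.
	uncachedBytesPerSec := sqlBytes * hosts / intervalSeconds

	// Cached: each container refreshes intervalSeconds/cacheSeconds times
	// per distributed interval instead of once per host.
	cachedBytesPerInterval := sqlBytes * (intervalSeconds / cacheSeconds) * fleetContainers
	cachedBytesPerSec := cachedBytesPerInterval / intervalSeconds

	fmt.Println(uncachedBytesPerSec)    // 150000000
	fmt.Println(cachedBytesPerInterval) // 600000
	fmt.Println(cachedBytesPerSec)      // 60000
}
```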

We could potentially use the smallest Redis cache sizes for almost all Fleet configurations. The memory overhead is rather small; we are talking a few kilobytes per query per container. I've never seen more than 10 live queries running at once under legitimate use, and as things stand, Redis can't handle much more load regardless of how big you make it anyhow.

sharon-fdm commented 4 months ago

@lukeheath This is an Engineering Initiated feature we would like to discuss. @lucasmrod Let's estimate and discuss in tomorrow's (Mar 6th) estimation meeting.

lucasmrod commented 4 months ago

I agree with Robert we should start with the caching of livequery:active and sql:livequery:{livequery_id} (and leave caching of livequery:active:{livequery_id} for a later iteration).

sharon-fdm commented 4 months ago

Slack discussion

lukeheath commented 4 months ago

@rfairburn Thank you for filing this issue and including all the great info. Can you help me understand the underlying business impact of prioritizing this story? Specifically:

  1. Can you estimate the percentage reduction in Redis costs if this is implemented? I know it will vary depending on usage but best guess estimate.

  2. How will this impact the performance of Fleet? Will customers see a noticeable performance improvement, or is this primarily a stability and cost-saving measure?

lukeheath commented 4 months ago

@sharon-fdm I'm removing the :product label until this is officially prioritized for an upcoming sprint. Please let me know once it's estimated; that will help determine when we can prioritize. Thanks!

rfairburn commented 4 months ago

> @rfairburn Thank you for filing this issue and including all the great info. Can you help me understand the underlying business impact of prioritizing this story? Specifically:
>
> 1. Can you estimate the percentage reduction in Redis costs if this is implemented? I know it will vary depending on usage but best guess estimate.
> 2. How will this impact the performance of Fleet? Will customers see a noticeable performance improvement, or is this primarily a stability and cost-saving measure?

1. I would be immediately comfortable with making changes that would be worth at least $1000/month in cloud spend without blinking an eye. It is possible that we could shave off more than that.
2. Performance-wise, if Redis doesn't end up bogged down during the query, there might not be much to this. If latency starts going up on Redis before it becomes unresponsive from load, the gain would mainly be that time saved. I'm not sure how much overhead re-downloading the query itself from Redis every distributed interval for every host actually adds overall. It can't be zero, with the extra open sockets, transfers, etc. None of those things are particularly compute-expensive, but they might be more time-expensive than we give them credit for.

The real time in a live query coming back, though, is the distributed interval itself plus the time it takes for the host to run the query.

One possible performance benefit is that it might be possible to lower the distributed interval without Fleet and the underlying redis being automatically overwhelmed. This is just speculation though and would need load testing to validate.

lukeheath commented 4 months ago

@rfairburn Got it! Thanks for the info.

lucasmrod commented 4 months ago

We've agreed that 1-second caching is a good start (so as not to add 5s of overhead to live queries).

zwass commented 4 months ago

Thank you for bringing this up @rfairburn. I agree with the idea of doing this using the approach proposed by @lucasmrod. It's important that we minimize the additional latency added to live queries, and 1s seems like a reasonable amount when we have up to 10s expected latency (under default configurations). Caching only the set of active queries and the SQL seems like a reasonable compromise, especially given the single-bit lookups/responses for the active (targeted) hosts.

noahtalerman commented 1 week ago

Zay: Broke one big query into six smaller queries.

Kathy: They did this because the one long query takes longer than 15 seconds. They don't want the user waiting longer than 15 seconds.

Noah: Beefing up the infrastructure doesn't help?

Zay: We would have to beef up the infrastructure 10x.

Noah: How much?

Hey @lukeheath what are your thoughts? Should we go w/ the infra beef-up? Or should we go w/ the code change? Is it a bug?

noahtalerman commented 1 week ago

cc @zayhanlon

noahtalerman commented 1 week ago

@zwass: I think this could be done (code change) in like less than 1 day.

lukeheath commented 1 week ago

@noahtalerman I think we should make the code change instead of 10x infra resources. It's currently estimated at 8 points, which implies roughly one week for one engineer. @zwass may have an approach in mind that would reduce the level of effort.

noahtalerman commented 1 week ago

@lukeheath sounds good!

I can take updating this to a user story to get it ready for specs. @zwass please feel free to drop your approach as a comment and I'll bring it into the user story.

noahtalerman commented 1 week ago

cc @zayhanlon ^^

zwass commented 1 week ago

What could be done in a day or so I think is what was described above:

> we should start with the caching of livequery:active and sql:livequery:{livequery_id} (and leave caching of livequery:active:{livequery_id} for a later iteration).

This would involve just caching a handful of values in-memory, but perhaps more work was estimated due to refactoring that would be required here?

I'm not sure if I'm missing something, but here's what I think should work:

  1. Refactor LoadActiveQueryNames to load from an in-memory cache and reload the cache if expired (this is livequery:active). It would also be useful to load the SQL at the same time to simplify the cache update (sql:livequery:{livequery_id}). https://github.com/fleetdm/fleet/blob/9a3506cc311fcfd171d6ffd260d5ef7ae3eb0db8/server/live_query/redis_live_query.go#L318-L327

  2. Refactor collectBatchQueriesForHost to do just the GETBIT calls and not the GET calls. The GET calls that were used to get the query SQL could just be lookups in the cache updated in step 1 for any query that is found to be active for the host.

https://github.com/fleetdm/fleet/blob/9a3506cc311fcfd171d6ffd260d5ef7ae3eb0db8/server/live_query/redis_live_query.go#L181-L199

Side note: Funny that we even documented our assumptions here "likely has little cost due to the small number of queries and limited size of SQL". Clearly both of these assumptions are no longer the case (at least for this particular customer's use case).

https://github.com/fleetdm/fleet/blob/9a3506cc311fcfd171d6ffd260d5ef7ae3eb0db8/server/live_query/redis_live_query.go#L192-L195

nonpunctual commented 2 days ago

Just FYI:

Brock Walters: so customer X's design is extravagant & custom, but are we at a point where we should tell people not to do stuff like this? Because I don't see how we could ever match scale to customer workflows in a way that's so tightly coupled...

Robert Fairburn: The problem... is a design flaw in Fleet when it comes to how Fleet->Redis communication happens during a live query.

If a live query targets a single host, we shouldn't be downloading the query from redis for EVERY host on EVERY distributed interval regardless of whether that host is targeted or not.

We can also cache the bitmap of what hosts are targeted. Even the 1s cache I recommended would at-scale likely reduce redis load by multiple orders of magnitude (I could easily see 100x less usage or more on larger install bases).

...we sit at what appears to me to be 3 GB/sec from Redis to Fleet for this customer under the current way things work.