rfairburn opened 4 months ago
@lukeheath This is an Engineering-Initiated feature we would like to discuss. @lucasmrod Let's estimate and discuss in tomorrow's (Mar 6th) estimation meeting.
I agree with Robert: we should start with the caching of `livequery:active` and `sql:livequery:{livequery_id}` (and leave caching of `livequery:active:{livequery_id}` for a later iteration).
@rfairburn Thank you for filing this issue and including all the great info. Can you help me understand the underlying business impact of prioritizing this story? Specifically:
Can you estimate the percentage reduction in Redis costs if this is implemented? I know it will vary depending on usage but best guess estimate.
How will this impact the performance of Fleet? Will customers see a noticeable performance improvement, or is this primarily a stability and cost-saving measure?
@sharon-fdm I'm removing the `:product` label until this is officially prioritized for an upcoming sprint. Please let me know once it's estimated; that will help determine when we can prioritize. Thanks!
> @rfairburn Thank you for filing this issue and including all the great info. Can you help me understand the underlying business impact of prioritizing this story? Specifically:
> - Can you estimate the percentage reduction in Redis costs if this is implemented? I know it will vary depending on usage but best guess estimate.
> - How will this impact the performance of Fleet? Will customers see a noticeable performance improvement, or is this primarily a stability and cost-saving measure?
The real drivers of a live query's round-trip time, though, are the distributed interval itself and the time it takes the host to run the query.
One possible performance benefit is that it might become possible to lower the distributed interval without Fleet and the underlying Redis being overwhelmed. This is just speculation, though, and would need load testing to validate.
@rfairburn Got it! Thanks for the info.
We've agreed that 1-second caching is a good start (so we don't add 5s of overhead to live queries).
Thank you for bringing this up @rfairburn. I agree with the idea of doing this using the approach proposed by @lucasmrod. It's important that we minimize the additional latency added to live queries, and 1s seems like a reasonable amount when we have up to 10s of expected latency (under default configurations). Caching only the set of active queries and their SQL seems like a reasonable compromise, especially given the single-bit lookups/responses for the active (targeted) hosts.
Zay: Broke one big query into six smaller queries.
Kathy: They did this because the one long query takes longer than 15 seconds. They don't want the user waiting longer than 15 seconds.
Noah: Beefing up the infrastructure doesn't help?
Zay: We would have to beef up the infrastructure 10x.
Noah: How much?
Hey @lukeheath what are your thoughts? Should we go w/ the infra beef-up? Or should we go w/ the code change? Is it a bug?
cc @zayhanlon
@zwass: I think this could be done (code change) in like less than 1 day.
@noahtalerman I think we should make the code change instead of 10x infra resources. It's currently estimated at 8 points, which implies roughly one week for one engineer. @zwass may have an approach in mind that would reduce the level of effort.
@lukeheath sounds good!
I can take updating this to a user story to get it ready for specs. @zwass please feel free to drop your approach as a comment and I'll bring it into the user story.
cc @zayhanlon ^^
What could be done in a day or so, I think, is what was described above:

> we should start with the caching of `livequery:active` and `sql:livequery:{livequery_id}` (and leave caching of `livequery:active:{livequery_id}` for a later iteration).
This would involve just caching a handful of values in-memory, but perhaps more work was estimated due to refactoring that would be required here?
I'm not sure if I'm missing something, but here's what I think should work:
1. Refactor `LoadActiveQueryNames` to load from an in-memory cache and reload the cache if expired (this is `livequery:active`). It would also be useful to load the SQL at the same time to simplify the cache update (`sql:livequery:{livequery_id}`).
https://github.com/fleetdm/fleet/blob/9a3506cc311fcfd171d6ffd260d5ef7ae3eb0db8/server/live_query/redis_live_query.go#L318-L327
2. Refactor `collectBatchQueriesForHost` to do just the `GETBIT` calls and not the `GET` calls. The `GET` calls that were used to fetch the query SQL could instead become lookups in the cache updated in step 1, for any query that is found to be active for the host.
Side note: Funny that we even documented our assumptions here "likely has little cost due to the small number of queries and limited size of SQL". Clearly both of these assumptions are no longer the case (at least for this particular customer's use case).
Just FYI:
Brock Walters: so customer X's design is extravagant & custom, but, are we at a point where we should tell people not to do stuff like this? Because I don't see how we could ever match scale to customer workflows in a way that's so tightly coupled...
Robert Fairburn: The problem... is a design flaw in Fleet when it comes to how Fleet->Redis communication happens during a live query.
If a live query targets a single host, we shouldn't be downloading the query from redis for EVERY host on EVERY distributed interval regardless of whether that host is targeted or not.
We can also cache the bitmap of what hosts are targeted. Even the 1s cache I recommended would at-scale likely reduce redis load by multiple orders of magnitude (I could easily see 100x less usage or more on larger install bases).
...we sit at what appears to me to be 3 GB/sec from Redis to Fleet for this customer under the current way things work.
Problem
As an Infrastructure Engineer, I would like to have a short (~5 second) cache of Redis data in Fleet so that I can reduce unnecessary load on Redis from Fleet, particularly when performing or checking for live queries.
Live queries utilize many times more resources than might otherwise be expected, and with caching in place, I could probably scale down every Fleet Redis instance we provision by several times.
First, to describe the nature of the problem: the first high-traffic item is devices checking whether there are any live queries and whether they are targeted by them. There are a couple of pieces to this, but it happens every `distributed_interval` in osquery (default 10s), and every device does it on that interval.

When a device checks in for queries, Fleet makes a request to Redis for `SMEMBERS livequery:active`. If I have 100k devices with a distributed interval of 10s, this is already 10k requests to Redis per second before we ever have a live query.

Then, for every live query in the `livequery:active` set, there is a request to `GETBIT livequery:active:{livequery_id} <host_id>` to see if the host is a target of the live query, AND CONCURRENTLY there is also a request to `GET sql:livequery:{livequery_id}` to get a copy of the SQL for the query just in case we actually need it (whether we are a target or not). For a single live query, this repeats every subsequent distributed interval, again whether we are a target or not, until the distributed campaign ends.

We have some customers that use fairly complex queries with the live query system, sometimes as many as 15k characters of SQL. So if we use 15kb of SQL data times our 100k hosts divided by the distributed interval, on a single query alone we end up with something that looks like this:
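Spelling out that arithmetic with the figures above (15 kB of SQL, 100k hosts, 10s interval; a back-of-the-envelope estimate, and the helper name is mine):

```go
package main

// sqlFetchBytesPerSec estimates the Redis->Fleet bandwidth spent on
// `GET sql:livequery:{livequery_id}` for a single live query, assuming
// every host re-fetches the full SQL once per distributed interval.
func sqlFetchBytesPerSec(sqlBytes, hosts, distributedIntervalSec int) int {
	return sqlBytes * hosts / distributedIntervalSec
}
```

Plugging in `sqlFetchBytesPerSec(15_000, 100_000, 10)` yields 150,000,000 B/s, i.e. roughly 150 MB/s of Redis egress for one query.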
This is for `GET sql:livequery:{livequery_id}` ONLY, and no other Redis traffic! This is true EVEN IF only 1 host is targeted out of the 100k. With more queries this gets even worse. Here is a network bytes usage graph on Redis from a real customer who has fewer than 10,000 hosts but leverages live queries fairly heavily:

Now let's say that we cache for a short interval. Even 5 seconds. This reduces the number of requests back to Redis to 2 per 10-second distributed interval. We do have to multiply by the number of Fleet containers, because each container would have its own cache, but we currently recommend only 20 containers for 100k hosts. So now the traffic would be something like `sql_bytes * fleet_containers / cache_seconds` bytes per second for `sql:livequery:{livequery_id}`. The numbers would look like this:

This is nearly 3 orders of magnitude less data on that request alone, and everything else would benefit too. We could potentially bypass caching of `GETBIT livequery:active:{id} <host_id>`, as it wouldn't be meaningful to cache in any case. We could potentially use the smallest Redis cache sizes for almost all Fleet configurations. The memory overhead is rather small; we are talking a few kilobytes per query per container. I've never seen more than 10 live queries running at once under legitimate use, and right now Redis can't handle much more regardless of how big you make it anyhow.
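The cached-case numbers work out as follows (again a back-of-the-envelope estimate with a helper name of my own choosing, using 15 kB of SQL, 20 Fleet containers, and a 5s cache):

```go
package main

// cachedSQLFetchBytesPerSec estimates the same `GET sql:livequery:{...}`
// traffic once each Fleet container serves the SQL from a short-lived
// local cache: one fetch per container per cache window, regardless of
// how many hosts check in.
func cachedSQLFetchBytesPerSec(sqlBytes, fleetContainers, cacheSeconds int) int {
	return sqlBytes * fleetContainers / cacheSeconds
}
```

Plugging in `cachedSQLFetchBytesPerSec(15_000, 20, 5)` yields 60,000 B/s (60 KB/s), a 2,500x reduction from the uncached ~150 MB/s, which is the "nearly 3 orders of magnitude" claimed above.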