fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com
Other
3.11k stars 429 forks source link

Degraded scheduled query performance after heavy live query use #22094

Closed dantecatalfamo closed 3 weeks ago

dantecatalfamo commented 1 month ago

Fleet version: 4.56.0


💥  Actual behavior

Scheduled query reads causes high CPU and memory usage on instances that have had many ad-hoc queries.

image

image

https://github.com/fleetdm/fleet/blob/2d78943b05d735bb656ea013f5d2e7b1b7a7b198/server/datastore/mysql/hosts.go#L406-L456

MySQL [fleet]> select max(id) from queries;
+---------+
| max(id) |
+---------+
| 2868152 |
+---------+
1 row in set (0.001 sec)

🧑‍💻  Steps to reproduce

  1. TODO
  2. TODO

🕯️ More info (optional)

Old queries are kept around because they are associated with activities.

image

Deleting activities appears to also clear unused queries associated with them, 5k at a time.

https://github.com/fleetdm/fleet/blob/463b0048fc07b0dc85deac4356fdfac030c45a02/server/datastore/mysql/activities.go#L543-L567

This can also be manually triggered using fleetctl trigger --name cleanups_then_aggregation

sharon-fdm commented 1 month ago

Hey team! Please add your planning poker estimate with Zenhub @getvictor @iansltx @lucasmrod @mostlikelee

sharon-fdm commented 1 month ago

Hey team! Please add your planning poker estimate with Zenhub @getvictor @lucasmrod @mostlikelee

mostlikelee commented 1 month ago

@dantecatalfamo @rfairburn was the above mysql query the only slow query? I found it's doing a full table scan on queries. Before solutioning a query purging strategy, we can dramatically reduce the execution time with a new index and small rewrite.

rfairburn commented 1 month ago

the 1 query accounted for over 90% of all db usage

mostlikelee commented 1 month ago

@zayhanlon would the customer be willing to change their workflow a bit if we added the ability to specify a queryID in the request to /hosts/{id}/query?

mostlikelee commented 1 month ago

the queryID addition looks to be a promising longer term solution, but in the short term it seems like we can do 2 things:

1) Query optimization to account for ~2M rows in queries 2) Update the cron that deletes queries to ensure it delete's a full day of rows based on creation timestamp.

Since the cron runs 1x/hr we can target the cleanup job to remove rows older than 24hrs. and in the above case, the expected load is ~60K queries per day so that's ~2500/hr. I suspect the first run may take longer as the amount of queries > 30days is larger.

Thoughts @rfairburn @ksatter ?

xpkoala commented 1 month ago

Tested with 1MM, 2MM, and 3MM rows added to the queries table via the associated tool.

fleet-release commented 3 weeks ago

Heavy queries pass, Performance blooms anew, Fleet's strength in each task.