Deadlock 5XX/Performance issue

rfairburn commented 9 months ago

Fleet version: v4.43.3

Web browser and operating system: N/A

💥 Actual behavior

5xx occurring hourly, most likely when vuln scans are running given the timing and the relative CPU usage on the containers. Large spikes of poor performance show up on the writer instance in RDS performance insights:

Corresponding deadlocks:

Focusing in on the most recent set of deadlocks and performance, this is the most performance-impacting sql:

A potential confounder is that in the instance, there are currently a number of hosts that registered with duplicate UUIDs. Examples of this can be provided, but will not be included here as I do not want to reveal customer data in the ticket

🧑‍💻 Steps to reproduce

Have Duplicate UUIDs (is this required)?
Wait for vuln scans to run
Have deadlocks.

🕯️ More info (optional)

I am not sure if the duplicate UUIDs across several hosts is the confounder here, but there is definitely something going on in the env and it is triggering these 5xx every hour on the hour.

I know that duplicate UUIDs is a non-desired situation, but if this is the deadlock cause, we should handle the failure case better.

xpkoala commented 9 months ago

@rfairburn ~Was this issue found in a customer environment?~

Lucas just informed me this was occurring in a customer environment. After discussing in standup we are trying to understand what actions we can take to reproduce the issue, if needed, and what the proposed fix should be.

~I don't believe duplicate UUIDs is something we would expect in environments. Is this related to osquery-perf?~

Last edit, could you please add the appropriate customer-label to this ticket.

sharon-fdm commented 9 months ago

Estimation of a few hours to check with Robert and reproduce/understand the issue and where it happens.

rfairburn commented 9 months ago

My best-guess (which could be totally wrong) is that because of the duplicate UUID issue that re-enrolls are happening during the vulnerability processing which changes everything about the host. These enrolls were happening about 500-600k times per hour in the env prior to setting FLEET_OSQUERY_ENROLL_COOLDOWN = "5m"

I am still seeing a series of deadlocks with a fairly high performance impact every hour on the hour. I am guessing that is a vuln scan thing, but even that could be wrong.

It is possible that something else is at fault here. Hopefully the most costly SQL queries above give the insight into what was at play at the time.

rfairburn commented 9 months ago

It's worth noting that v4.44.0 did not resolve the deadlocks, so we didn't get lucky there for any reason. I'll provide additional info as soon as I have it.

rfairburn commented 9 months ago

I believe I just found a match for an enroll and a deadlock on the hour. In this case it was an orbit/enroll. Relevant slack thread with the log entry: https://fleetdm.slack.com/archives/C03EG80BM2A/p1707527490975409?thread_ts=1707526938.067899&cid=C03EG80BM2A

getvictor commented 9 months ago

FLEET_OSQUERY_ENROLL_COOLDOWN only applies to osquery/enroll and not to orbit/enroll. Perhaps orbit/enroll should also have a cooldown. @rfairburn How often is orbit/enroll being hit by the same host?

The UPDATE hosts attempt is supposed to retry if it hits 1213 deadlock.

    bo.MaxElapsedTime = 5 * time.Second

We can probably reproduce this by sending continuous osquery/enroll and orbit/enroll requests while running vuln scanning.

sharon-fdm commented 8 months ago

@getvictor Please note that as per the new process we need to merge into the patch branch patch-fleet-v4.47.1

fleet-release commented 8 months ago

Deadlock, storm now passed, Fleet's clear sky holds no clouds, Scan with peace at last.

rfairburn commented 7 months ago

This looks to still happen and with legitimate enrolls same customer is enrolling additional hosts today that overlapped with the vuln scan window:

There were deadlocks triggering 5xx on orbit/enroll that caused a really long row lock time in the db:

All 5xx trigger on orbit/enroll only when this happens.

sharon-fdm commented 6 months ago

@getvictor : Need to reproduce locally and root-cause (and solve).

Probably need loadtest

getvictor commented 6 months ago

@rfairburn I couldn't reproduce this issue locally using osquery-perf, and I'm not confident we can reproduce this in load test (since it also uses osquery-perf).

Looking at the SQL statements for orbit/enroll for new hosts, I don't understand how they can deadlock:

INSERT INTO hosts (auto increment, which should not lock)
INSERT INTO host_display_names with host_id (host_id is the unique key, so should not lock either)

The preceding SELECT statement that checks if the host already exists in the DB may cause a lock, but I'd like to confirm with logs.

So, I'd like to get some more info. Now and/or in the future:

What is the error message from the server?
Can you get the last deadlock on the DB using SHOW ENGINE INNODB STATUS; or using a similar method?
Can we get even more info on deadlocks? Perhaps using innodb_print_all_deadlocks?

lucasmrod commented 6 months ago

Can we get even more info on deadlocks? Perhaps using innodb_print_all_deadlocks?

We turned this ON in load testing environment a few months ago to help me reproduce deadlocks. https://github.com/fleetdm/fleet/blob/08b5e1c7b5853deffd6e1237e3c106b1015a15a0/infrastructure/loadtesting/terraform/rds.tf#L61

rfairburn commented 6 months ago

This looks to also cause a significant spike in HTTP_4XX traffic during the time it happens as reported by the customer and also reflected in the Athena logs. Running the query:

SELECT FROM_UNIXTIME(CEILING(TO_UNIXTIME(parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z'))/300)*300) AS timeslice
     , COUNT(*) AS req_count FROM partitioned_alb_logs WHERE elb_status_code >= 400 and parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') > DATE_ADD('hour', -24, CURRENT_TIMESTAMP) and day >= '2024/05/01' GROUP BY FROM_UNIXTIME(CEILING(TO_UNIXTIME(parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z'))/300)*300) ORDER BY timeslice DESC

I noticed a ~10x spike in 4XX errors that matches the 5-minute interval that the row lock occurs. Here is a sample showing a couple of hours of the pattern:

This coincides with the alerts that trigger between :00 and :05 of the hour and the spike in the row lock time and 2 deadlocks as previously outlined.

While this doesn't provide any additional context and the actual locks I shared with @getvictor are probably the biggest insight into this issue so far, it does show the blast radius/impact of this may be larger than originally expected.

getvictor commented 6 months ago

@rfairburn I have a PR to fix the lock caused by policy_stats insert: https://github.com/fleetdm/fleet/pull/18720

getvictor commented 6 months ago

@xpkoala @sharon-fdm

We should load test this fix with many global policies (like 100) and at least 10 teams. 100,000 total hosts should be fine.

We should look for performance spikes during the cleanups_then_aggregation job, like database load, deadlocks, row lock time, etc.

xpkoala commented 6 months ago

Deadlocks appear to have dropped drastically. And still appear to only be occurring when vuln scan is running. the cleanups_then_aggregation job was triggered manually a few times outside of the vuln scan windows and doesn't appear to have increased strain on the DB or increased the deadlocks.

fleet-release commented 5 months ago

Deadlocks break the flow, Vuln scans shine light below, Fleet's calm stream can grow.

fleetdm / fleet