CivMC / Civ

Monorepo for development of and running a Civ Server.
MIT License
4 stars 10 forks source link

Fix some snitches crashing the server #538

Closed okx-code closed 1 month ago

okx-code commented 1 month ago

The problem with snitches crashing the server is that the MariaDB query planner decides that it is going to be smart, and actually do a query from the (very small) ja_snitch_actions table, and then join the (humongous) ja_snitch_entries table.

To do this, it has the genius idea of filtering the ja_snitch_entries table by the type_id column. This column represents the type of snitch entry, be it entering the snitch, breaking a block, placing a block, etc. Naturally, MariaDB thinks that it can reduce the 10M+ row database to just 1000 rows with this filter. But it turns out that there are LOTS of entries for each type_id, so it actually get about 600K columns, and has to linearly search through them.

Only a handful of snitches have been found to crash the server this way. I don't know how the MariaDB query planner works, nor do I want to, but I can speculate that only very heavily trafficked snitches would crash the server in this way, or perhaps those that show an unusual distribution of entry types.

Here is the MariaDB ANALYZE output for when the query planner thinks it's really smart:

+------+-------------+-------+-------+-----------------------------+---------+---------+------------------+------+-----------+----------+------------+-------------+
| id   | select_type | table | type  | possible_keys               | key     | key_len | ref              | rows | r_rows    | filtered | r_filtered | Extra       |
+------+-------------+-------+-------+-----------------------------+---------+---------+------------------+------+-----------+----------+------------+-------------+
|    1 | SIMPLE      | jsa   | range | PRIMARY                     | PRIMARY | 4       | NULL             | 17   | 1.00      |   100.00 |     100.00 | Using where |
|    1 | SIMPLE      | jse   | ref   | type_id,snitch_action_index | type_id | 5       | jukealert.jsa.id | 1241 | 671443.00 |     0.51 |       0.00 | Using where |
+------+-------------+-------+-------+-----------------------------+---------+---------+------------------+------+-----------+----------+------------+-------------+

As you can see, the query planner thinks it will get 1241 rows, but it actually gets 600K. Big miscalculation.

In contrast, here is a normal snitch:

| id   | select_type | table | type   | possible_keys               | key                 | key_len | ref                   | rows  | r_rows | filtered | r_filtered | Extra       |
+------+-------------+-------+--------+-----------------------------+---------------------+---------+-----------------------+-------+--------+----------+------------+-------------+
|    1 | SIMPLE      | jse   | ref    | type_id,snitch_action_index | snitch_action_index | 5       | const                 | 54054 | 10.00  |    50.00 |     100.00 | Using where |
|    1 | SIMPLE      | jsa   | eq_ref | PRIMARY                     | PRIMARY             | 4       | jukealert.jse.type_id | 1     | 1.00   |   100.00 |     100.00 |             |
+------+-------------+-------+--------+-----------------------------+---------------------+---------+-----------------------+-------+--------+----------+------------+-------------+

Interesting, the query planner drastically over-estimates the amount of rows it will scan. It thinks it will need to comb through 50K logs, but it actually ends up looking through 11.

Perhaps an alternative solution could be trying to update the data analysis the query planner gets - it may be severely inaccurate. However, this fix should be sufficient as STRAIGHT_JOIN forces the join to be done in a specific order (big table joins small table), instead of the wrong order (small table joins big table), which is ALWAYS wrong.

Database query planner bugs are just the best aren't they?

github-actions[bot] commented 1 month ago

badge

Build Successful! You can find a link to the downloadable artifact below.

Name Link
Commit bb949ac6cdfbbe072c990d3970a8222ce570d8fa
Logs https://github.com/CivMC/Civ/actions/runs/9087028824
Download https://github.com/CivMC/Civ/suites/23787093084/artifacts/
Protonull commented 1 month ago

Question, to what extent, if at all, do ancient snitch logs (logs older than 30 days) play into this?

okx-code commented 1 month ago

I don't think snitch log creation time would have any impact on the query planner. Perhaps insertion order could, but I doubt it.

AngrySoundTech commented 1 month ago

@RedDevel2 Code looks good and logic is sound, moved to ready for testing although testing may be somewhat difficult.

okx-code commented 1 month ago

I've been able to manually test this by querying the bugged snitches directly with the old and new statements. Not on the server itself but it should be good enough to deploy straight to production @AngrySoundTech

Protonull commented 1 month ago

The only thing, you may want to add a comment about specifically using STRAIGHT_JOIN to force a specific order of joining.