alliedmodders / sourcemod

SourceMod - Source Engine Scripting and Administration
http://www.sourcemod.net/
973 stars 422 forks source link

MySQL database driver likes to crash #1207

Open asherkin opened 4 years ago

asherkin commented 4 years ago

Following up from #1135, filing this for tracking.

https://crash.limetech.org/stats/dbi.mysql.ext.%25?frame=any

The ones of particular interest are:

These are all caused by the internal VIO pointer getting corrupted, likely across multiple connection instances - the reconnect ones are particularly interesting as in most of the ones I've debugged it has flip-flopped between null and a valid ptr. The read/write ones are mostly just it mysteriously ending up as null.

Only around 10% of reports have MySQL code executing simultaneously on multiple threads, so I don't think that is a requirement - although mixing threaded and non-threaded queries definitely seems to be part of it (possibly even on multiple connections - the specific cases I've been shown have had proper locking on a single connection, but may not have actually been the cause). Some non-threaded queries appear to be unavoidable (SQL_SetCharset).

Clientprefs + MySQL seems to be very, very crashy on connection, but I don't think that is the same issue - something else to look at later, fortunately most people use it with SQLite.

Plan of attack is roughly:

  1. Attempt to get something resembling a reliable reproduction, probably just hammering queries in a loop across a couple of connections, with mixed threaded and non-threaded.
  2. Revive the lock-everything-internally changes (#704), see if that helps.
  3. Start debugging the MySQL connector library, figure out what's changing the VIO ptrs.

It's gotten hard to find the corresponding source code to the MySQL libs we use, download locations that currently exist:

Upgrading the MySQL connector does not look like it would help, this very much appears to be API misuse on the SM side.

asherkin commented 4 years ago

Wrote a fairly comprehensive test plugin: https://gist.github.com/asherkin/db88c10dd6b7a80f46953134602f4c8d

Haven't been able to get it to crash without mixing threaded and non-threaded statements on the same connection (but I can reproduce the my_real_read crash by mixing them). It looks like we should be able to do a slightly better job there than crashing, but as far as I can tell there is no way we can make it just work as Lock/Unlock needs to encompass multiple function calls to be safe (but maybe we can opportunistically pull data out of the driver).

Found two unrelated issues:

Next area to explore testing is transactions (Database.Execute), and SQL queries returning multiple result sets (CALL) - it is likely the latter has issues with the threaded query implementation (Bug 3775), but they're rarely used by plugins.

asherkin commented 4 years ago

as far as I can tell there is no way we can make it just work as Lock/Unlock needs to encompass multiple function calls to be safe (but maybe we can opportunistically pull data out of the driver).

After trawling through the MySQL docs and some older SM issues, I think this is probably the right path to go down. If we pull all the data from the MySQL connector and keep it in our own query objects, we can do all the required safety locking internally and there is no more need for the Lock/UnlockDatabase natives so they can become no-ops. It'll also make CALL work correctly with threaded queries, at the moment reading anything but the first result set is a non-threaded operation.

The MySQL 8.0 connector adds an async API which could be another route to go down (although would be quite a significant upgrade and require some API changes), but I think results set buffering on our side is workable.

Of course with no reproduction that doesn't involve API misuse it isn't a guarantee this changes will fix things, but it seems like a good bet they'll help.

asherkin commented 4 years ago

Made a few in-roads on that plan today, but hit a lot of dead ends (just getting the MySQL driver extension to compile was quite an adventure).

The end result we want here is for the DB API to be "atomic" across functions at the connection API level (as that is the only one used across threads once we hoist all the result loading logic up there). However, at the moment there are some unavoidable API calls on the connection that return data associated with the previously run query (GetError, GetInsertId, GetAffectedRows), so these need changing which'll require breaking the C++ API.

The flaws here have already been fixed in the design of the new methodmap API, so it is mainly just a case of moving the compatibility code for the older APIs out of the drivers and into the natives - where it is guaranteed to only be invoked on the main thread for non-threaded query usage.

These API changes should be able to happen in parallel to the result set storage changes, and once they're both complete it looks like removing user-visible locking should be unblocked.

As always, some unrelated points:

While discussing this on IRC, an alternative solution was presented to instead process the blocking natives on the worker thread in a highest priority queue - that should solve the crashes also and allow removing all locking, but could slow down blocking queries quite significantly and would still require the prerequisite changes to make the plugin API work correctly (or we'd need to block the worker thread after it executed the query until the SourcePawn context finished, to avoid it executing any more queries in the middle).

MySQL client libs required for building the MySQL driver, compiled with VS2017 for x86 Windows, built from the source code package linked previously: https://www.dropbox.com/s/rc3ldzyfo644mev/mysql-5.5.58-win32.zip?dl=0

T1MOXA commented 4 years ago

Half an hour ago, I caught a crash with MySQL and decided to sort it out, judging by the crash dump (Thanks for the great tool for it), the code that is in the screenshot below is to blame. Perhaps this will help in solving the problem.

image

peace-maker commented 4 years ago

It's mixing threaded with non-threaded queries. Do you have the link to the crash on accelerator?

dvander commented 4 years ago

I forget if we talked about this publicly, but probably the best thing moving forward is to perform all queries off the main thread. Then the "threaded" versions simply become "asynchronous" and the non-threaded versions would have a join. The locking stuff would go away and everything would become much simpler.

I can try to prototype this soon.

T1MOXA commented 4 years ago

It's mixing threaded with non-threaded queries. Do you have the link to the crash on accelerator?

Yup, https://crash.limetech.org/vp3f6kdrpgbn

Kxnrl commented 4 years ago

Anyone still working on this?

geominorai commented 4 years ago

I am also looking forward to any updates on this.

Alexeyt89 commented 3 years ago

I seem to have regular crashes for the same reason: https://forums.alliedmods.net/showthread.php?t=328437

slowthgt commented 2 years ago

While risking a bit of a necroposting, I wanted to point out that the MySQL driver has had issues related to multiple threaded stored procedure queries since 2014, as demonstrated from this issue on the old bug tracker, which seriously hamper things like timers, which are forced to do most procedural work in memory, rather than the MySQL server.

asherkin commented 2 years ago

That is already referenced earlier in this issue - the proposed rework of the driver design will resolve it.

https://github.com/alliedmodders/sourcemod/issues/1207#issuecomment-596213400

It'll also make CALL work correctly with threaded queries, at the moment reading anything but the first result set is a non-threaded operation.

Adrianilloo commented 1 year ago

Not having enough time to check if the following subcase is already reported, I'd like to point out that, despite e.g. #1576 was fixed, server also crashes when using clientprefs + mysql under the following conditions/steps (when hosting everything on Linux, at least):

  1. MySQL service being stopped
  2. Start SRCDS server with clientprefs set to use MySQL and the stopped service
  3. Type quit in console

Result: server crashes ("Add "-debug" to the ./srcds_run command line to generate a debug.log to help with solving this problem")

Likely this was happening when I reported #1576, too.

asherkin commented 1 year ago

1576 was never fixed, you closed it yourself. As I explained in that issue, your problems with Clientprefs + MySQL are #778 rather than this driver issue.