Percona-Lab / query-playback

Query Playback

Performance and reliability improvements to thread-per-connection #36

Closed: dsmythe closed this 7 years ago

dsmythe commented 7 years ago

When playing back long captures and/or captures from high-throughput systems, playback ran into problems. The thread-pool mode is insufficient for reproducing max-connections or system resource exhaustion because it forces the entire playback through a fixed connection pool. Our requirement was therefore one thread per connection, preserving query timing as closely as possible, so the replay matches reality as closely as possible.

A number of problems were encountered when attempting to do this. They fell into two main categories: resource exhaustion on the load-generating server (the one running percona-playback), and inconsistent behavior once MySQL started throwing "Too many connections" or "MySQL server has gone away" errors.

In reality, when a MySQL server is struggling to keep up with an unbounded inbound workload, clients retry their transactions/connections until they succeed. We needed to simulate this behavior to make sure the queries actually got run, but we couldn't at first because of the aforementioned resource exhaustion (e.g., Boost thread resource constraints).

To solve these problems we made some modifications to playback that we would like to contribute back. We will likely continue to make improvements as necessary, but we have reached an acceptable level of functionality, hence this pull request.

To emulate real client behavior, we make the mysql client retry connections (with increasingly longer back-off sleeps) specifically when "Too many connections" comes up. What we found, though, was that newly created db threads connected to the database immediately, regardless of whether the query log dispatcher was about to sleep to sync up the playback timing. We therefore removed the immediate connection and postponed it until the queries actually need to execute; this also helps minimize the initial blast of "too many connections". In doing so, we also have to track whether we have connected, to protect against mysql_close being called on a never-initialized db_handle.
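The two pieces above, capped exponential back-off and a lazily-initialized, state-tracked connection, can be sketched as follows. This is a minimal illustration, not the PR's actual code: `backoff_ms` and `LazyConnection` are hypothetical names, and the connect step is injected as a callback so the idea is visible without a MySQL server (the real code would wrap `mysql_real_connect`/`mysql_close` on a `MYSQL*` handle).

```cpp
#include <cassert>
#include <functional>

// Capped exponential back-off delay for connection retries (sketch).
// The cap keeps a long "Too many connections" outage from stalling
// a thread for an unbounded amount of time.
static int backoff_ms(int attempt, int base_ms = 100, int cap_ms = 10000) {
    long long d = base_ms;
    for (int i = 0; i < attempt && d < cap_ms; ++i) d *= 2;
    return d > cap_ms ? cap_ms : static_cast<int>(d);
}

// Lazily-connected handle: connect only when a query actually needs to
// run, and remember whether we ever connected so close() is safe.
class LazyConnection {
public:
    explicit LazyConnection(std::function<bool()> do_connect)
        : do_connect_(std::move(do_connect)) {}

    // Retry the connect with back-off while the server rejects us.
    bool ensure_connected(int max_attempts) {
        if (connected_) return true;
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            if (do_connect_()) { connected_ = true; return true; }
            // Real code would sleep backoff_ms(attempt) milliseconds here
            // before retrying.
        }
        return false;
    }

    // Guarded close: never tear down a handle that was never initialized
    // (the mysql_close-on-uninitialized-handle problem described above).
    void close() { if (connected_) connected_ = false; }

    bool connected() const { return connected_; }

private:
    std::function<bool()> do_connect_;
    bool connected_ = false;
};
```

The key property is that constructing a `LazyConnection` performs no I/O at all; the first `ensure_connected` call at query-execution time does.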

Solving the thread resource issues on the load-generating machine required a close look at how threads were being managed. Thread creation was unbounded and proceeded as fast as the log parser could go, regardless of how fast the database was responding. Especially once "too many connections" was reached, the query log dispatcher could get a fair bit ahead and eventually exhaust Boost's thread resource limits (we assume). On workloads that use a connection pool in reality this might not be much of a problem, but on workloads that do NOT use a connection pool it can be disastrous for percona-playback.

Our solution was to bound the number of db_threads; 10000 seems to be a fair trade-off between stability and throughput. We keep track of how many threads we create, and when we exceed 10000 total, we go through a "non-blocking" join pass: joining any threads that are done and removing them from the DBExecutorsTable before carrying on to create more. This way the query log dispatcher can get "ahead" of the db_threads, but not so far ahead as to exhaust our resources.
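The bounding scheme can be sketched like this. The names (`BoundedThreadSet`, `reap_finished`) are hypothetical, not the PR's actual classes, and `std::thread` plus a per-worker done flag stands in for the Boost threads and the DBExecutorsTable: when the cap is hit, only threads that have already finished are joined (so the join pass never blocks on a busy worker) before new ones are spawned.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <list>
#include <memory>
#include <thread>

// Sketch of a cap on concurrently-live worker threads. Each worker sets
// a shared "done" flag when it finishes; reap_finished() joins only
// workers whose flag is set, so a reap pass never waits on a thread
// that is still running a query.
class BoundedThreadSet {
    struct Worker {
        std::thread thread;
        std::shared_ptr<std::atomic<bool>> done;
    };
    std::list<Worker> workers_;
    size_t cap_;

public:
    explicit BoundedThreadSet(size_t cap) : cap_(cap) {}

    void spawn(std::function<void()> task) {
        // At the cap, reclaim finished workers before creating more.
        // (If none are finished yet, we still proceed; the cap is a
        // throttle on the dispatcher, not a hard ceiling.)
        if (workers_.size() >= cap_) reap_finished();
        auto done = std::make_shared<std::atomic<bool>>(false);
        workers_.push_back({std::thread([task = std::move(task), done] {
                                task();
                                done->store(true);
                            }),
                            done});
    }

    // Join and remove only the workers that have already finished;
    // join() on a finished thread returns immediately.
    void reap_finished() {
        for (auto it = workers_.begin(); it != workers_.end();) {
            if (it->done->load()) {
                it->thread.join();
                it = workers_.erase(it);
            } else {
                ++it;
            }
        }
    }

    // Blocking join of everything, used at end of playback.
    void join_all() {
        for (auto& w : workers_) w.thread.join();
        workers_.clear();
    }

    size_t live() const { return workers_.size(); }
};
```

In the real patch the cap is 10000 rather than the small numbers a demo would use, and the table of executors is what gets pruned.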

After a number of iterations, we have been able to sustain 4+ hours of full-load playback, running upwards of 40M queries with QPS in the upper 2000s. Gzipped full slow query logs (30+ GB uncompressed) were piped into percona-playback with the options --query-log-preserve-query-time --query-log-accurate-mode --mysql-test-connect=off --queue-depth 10000 to attain these results.
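For reference, an invocation along these lines (the log filename is hypothetical; the flags are exactly the ones listed above):

```shell
# Stream the gzipped slow query log into percona-playback without
# decompressing it to disk first.
zcat slow-query.log.gz | percona-playback \
  --query-log-preserve-query-time \
  --query-log-accurate-mode \
  --mysql-test-connect=off \
  --queue-depth 10000
```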

It is important to note that a queue-depth of 1 will hang the entire playback the moment it attempts to push a query into a db_thread that is busy running a long-running query. To get a faithful replay of a workload that contains occasional slow queries, queue-depth must therefore be large enough not to restrict the dispatcher from doing its job. Workloads that do NOT have occasional slow queries will NOT suffer from this effect.
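The hang mechanism is the usual bounded-queue one. As a minimal sketch (this is a generic bounded blocking queue, not percona-playback's actual per-thread queue class): with depth 1, the dispatcher's very next push after handing a slow query to a busy worker blocks until that worker drains its slot, stalling every other thread's feed; a larger depth lets the dispatcher run ahead.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal bounded blocking queue illustrating why --queue-depth matters.
template <typename T>
class BoundedQueue {
    std::deque<T> items_;
    size_t depth_;
    std::mutex mu_;
    std::condition_variable not_full_, not_empty_;

public:
    explicit BoundedQueue(size_t depth) : depth_(depth) {}

    // Blocks while the queue is full -- with depth 1 this is where the
    // dispatcher hangs behind a worker stuck on a slow query.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mu_);
        not_full_.wait(lock, [&] { return items_.size() < depth_; });
        items_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    // Blocks while the queue is empty (worker side).
    T pop() {
        std::unique_lock<std::mutex> lock(mu_);
        not_empty_.wait(lock, [&] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return item;
    }

    // Non-blocking push, returns false when the queue is full; used
    // below to demonstrate the full condition without deadlocking.
    bool try_push(T item) {
        std::lock_guard<std::mutex> lock(mu_);
        if (items_.size() >= depth_) return false;
        items_.push_back(std::move(item));
        not_empty_.notify_one();
        return true;
    }
};
```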

vadimtk commented 7 years ago

@undingen I'd like your opinion on this PR

vadimtk commented 7 years ago

@dsmythe can you please sign our CLA https://goo.gl/forms/pfjaTq2akPDLqtaJ2 ?

dsmythe commented 7 years ago

Of course, I just did it.