brianmario / mysql2

A modern, simple and very fast Mysql library for Ruby - binding to libmysql
http://github.com/brianmario/mysql2
MIT License

Mysql2::Error: This connection is still waiting for a result, try again once you have the result #772

Closed mmozuras closed 7 years ago

mmozuras commented 8 years ago

We've been experiencing this issue in production to varying degrees for the last two years:

Mysql2::Error: This connection is still waiting for a result, try again once you have the result

I've looked at previous issues (like https://github.com/brianmario/mysql2/issues/566, https://github.com/brianmario/mysql2/issues/532, https://github.com/brianmario/mysql2/issues/99), but didn't find anything that helps. We've also tried multiple things ourselves, but did not manage to solve it.

We're running:

mysql2 (0.4.4)
rails (4.2.5.2)
unicorn (4.9.0)

Ruby 2.2

Percona XtraDB Cluster 5.6
(based on Percona Server 5.6, which in turn is based on MySQL 5.6)

Stacktraces don't show anything interesting, here's one example:

app/domain/user_msgs/for_thread.rb:35:in `user_msg_thread'
app/domain/user_msgs/for_thread.rb:27:in `among_receivers?'
app/domain/user_msgs/for_thread.rb:14:in `fetch'
app/domain/user_msgs/for_thread.rb:3:in `fetch'
app/controllers/api/v2/msg_threads_controller.rb:145:in `user_msg_thread'
app/controllers/api/v2/msg_threads_controller.rb:35:in `show'
config/initializers/1.patches/action_controller.rb:7:in `process_action'
app/controllers/api/v2/base_controller.rb:112:in `api'

We didn't notice any particular pattern to when it happens either:

[screenshot from 2016-07-12 omitted]

Any help would be appreciated 😄

dmitry-g commented 8 years ago

After some investigation it appeared that the issue was caused by incorrect signal handling.

  1. A signal (like SIGINT or SIGQUIT) interrupted a Mysql2::Client while a query was in progress.
  2. mysql_client_wrapper->active_thread was not reset to Qnil, so it kept pointing to the current Unicorn thread.
  3. Custom signal handling didn't allow the Unicorn worker to shut down/restart, so the worker continued to handle requests.
  4. Each request that came to the worker then failed, because mysql_client_wrapper->active_thread still matched rb_thread_current() (Mysql2::Error: This connection is still waiting for a result, try again once you have the result).
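
As a rough illustration of the sequence above (a sketch, not from the original report: it assumes mysql2 0.4.4 and placeholder local credentials, uses a trap handler that raises in place of the custom signal handling, and may not reproduce on patched versions):

#!/usr/bin/env ruby

require 'mysql2'

conn = Mysql2::Client.new(:host => "localhost",
                          :username => "root",
                          :password => 'p4ssw0rd')

# 1. A custom handler raises out of the blocking query, as an app-level
#    shutdown hook might do on SIGQUIT.
Signal.trap('QUIT') { raise 'shutting down' }
Thread.new { sleep 1; Process.kill('QUIT', Process.pid) }

begin
  conn.query('SELECT SLEEP(10)') # the query is in flight when the signal arrives
rescue StandardError => e
  # 2. On affected versions, active_thread has not been reset at this point.
  puts "query interrupted: #{e.message}"
end

# 3./4. The next query on the same connection then fails with
# "This connection is still waiting for a result, try again once you have the result"
conn.query('SELECT 1')
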
richardlarocque commented 8 years ago

I ran into another, closely related variant of this issue. It can be triggered when someone uses Thread#kill to interrupt a long-running query. Here's a small repro:

#!/usr/bin/env ruby

require 'mysql2'

def long_query(connection)
  begin
    connection.query('SELECT SLEEP(1000);')
  rescue StandardError => e
    puts "Caught #{e}"
  ensure
    puts "Ensured"
  end
  puts 'done'
end

conn = Mysql2::Client.new(:host => "localhost",
                          :username => "root",
                          :password => 'p4ssw0rd')

t = Thread.new { long_query(conn) }
sleep 1
puts "Killing #{t}, #{t.status}"
t.kill
sleep 1
puts "Killed #{t}, #{t.status}"

conn.query('SELECT 1;')

On my machine, the output is:

Killing #<Thread:0x000000022f7ef0>, sleep
Ensured
Killed #<Thread:0x000000022f7ef0>, false
/var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:107:in `_query': This connection is in use by: #<Thread:0x000000022f7ef0@./main.rb:20 dead> (Mysql2::Error)
        from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:107:in `block in query'
        from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:106:in `handle_interrupt'
        from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:106:in `query'
        from ./main.rb:27:in `<main>'

i.e. the killed thread is still marked as actively using the connection, even after it has been killed.


This variant is a little bit worse than the ones caused by signals or remotely-triggered exceptions, because there is no reasonable way to mask the Thread#kill signal. The handle_interrupt trick from https://github.com/brianmario/mysql2/pull/731/files won't work here.


I imagine this has been discussed before, but I haven't found any reference to it. Would it be reasonable to try to solve this with an exception-handler in the C extension code? The exception handlers should get run in the Thread#kill case. Would it be possible to register a handler to reset the connection state when an exception is thrown while waiting for a response to some query?

sodabrew commented 8 years ago

@richardlarocque I cannot think of any reasonable workaround for a Thread#kill. At best, the next time you try to use the connection it can say, "The handle is marked in-use by a dead thread, so you need to reconnect."
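
Until then, the best an application can do is detect the situation after the fact. A rough sketch of that kind of fallback (not mysql2 API; the credentials and the query_with_recovery helper are placeholders): rescue the error, drop the stuck handle, and open a fresh client.

require 'mysql2'

def fresh_client
  Mysql2::Client.new(:host => "localhost", :username => "root", :password => 'p4ssw0rd')
end

# Returns [client, result] so the caller can keep the replacement handle.
def query_with_recovery(client, sql)
  [client, client.query(sql)]
rescue Mysql2::Error => e
  raise unless e.message =~ /connection is (in use|still waiting for a result)/
  client = fresh_client # the old handle is stuck; replace it
  [client, client.query(sql)]
end

client = fresh_client
client, result = query_with_recovery(client, 'SELECT 1')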

curtp commented 8 years ago

For those who are using Passenger to serve their app and use a script to kill Passenger processes: check your script. We're running a script via cron to kill processes that are consuming too much RAM. Thanks to the research by @dmitry-g, I looked into our script and was able to improve it by removing the use of kill. Passenger 5.0 comes with the passenger-config command, which can be used to gracefully shut down Passenger processes. We switched to that command recently, and since we started using it the exception has gone away. Here is the script we're now using, for those who are interested (thanks to whoever came up with this script, it has been very helpful):

for i in $(/usr/local/rvm/bin/rvm-shell -c "passenger-memory-stats" | grep RubyApp | awk '{if ($4 > 299 && $7 == "RubyApp:") print $1}'); do passenger-config detach-process $i; done

sodabrew commented 7 years ago

It looks like #811 would provide a fix for this problem.

mmozuras commented 7 years ago

Thanks for the help, everyone! We no longer encounter this issue. 👍

eredi93 commented 7 years ago

This happened again; we are running the latest gem. Here is the log:

» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx-rails:ActiveRecord::StatementInvalid (Mysql2::Error: This connection is still waiting for a result, try again once you have the result: SELECT `xxxxxx`.* FROM `xxxxxxxx` WHERE `xxxxxx`.`type` IN ('xxxxxxx', 'xxxxxxxx') AND `xxxxxxx`.`id` = 111111111 ORDER BY `xxxxxxxx`.`id` ASC LIMIT 1 /*application:IntercomApp,line:/app/controllers/intercom_base_controller.rb:28:in `append_info_to_payload'*/):
» 15 May 2017 17:10:35.192  i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:10:in `block in instrument'
» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:8:in `instrument'
» 15 May 2017 17:10:35.192  i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_exception_bucket.rb:4:in `execute'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_readonly_reconnect.rb:4:in `execute'
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: app/controllers/xxxxxxx_base_controller.rb:28:in `append_info_to_payload'
» 15 May 2017 17:10:35.192  i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:10:in `block in instrument' Context
» 15 May 2017 17:10:35.192  i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:8:in `instrument' Context
» 15 May 2017 17:10:35.192  i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_exception_bucket.rb:4:in `execute' Context
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_readonly_reconnect.rb:4:in `execute' Context
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: app/controllers/xxxxxxx_base_controller.rb:28:in `append_info_to_payload' Context
» 15 May 2017 17:10:35.193 i i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:10:in `block in instrument' Context
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:8:in `instrument' Context
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/version_header.rb:10:in `call' Context
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/version_metrics.rb:9:in `call' Context
» 15 May 2017 17:10:35.193  i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/bad_request_for_broken_json.rb:11:in `call' Context
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxxx/rack/character_encoding_coercion.rb:20:in `call' Context

Can you help me debug it and see why this is still happening? @sodabrew @darnaut

sodabrew commented 7 years ago

What are your initializers? mysql2_exception_bucket.rb and mysql_readonly_reconnect.rb?

austinhalpin commented 7 years ago

mysql2_exception_bucket.rb is basically just a catch-all for mysql2 exceptions: we're recording metrics whenever we see certain MySQL exceptions, for alerting purposes.

Example:

module ActiveRecord
  module Mysql2ExceptionBucket
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message =~ /This connection is still waiting for a result/
        record_metric
      end
      raise
    end
  end
end

# Hooked into the adapter the same way as the read-only initializer below:
ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ExceptionBucket)

mysql_readonly_reconnect.rb is something we wrote to force the connection to reconnect!. We found that after we moved to AWS Aurora, failovers were no longer handled gracefully and needed a redeploy to bounce connections. To avoid that, we wrote this initializer:

module ActiveRecord
  module Mysql2ReadonlyReconnect
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message =~ /read-only/
        Rails.logger.info("MySQL running in read-only mode: reconnecting before raising error")
        reconnect!
      end

      raise
    end
  end
end

ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ReadonlyReconnect)

sodabrew commented 7 years ago

Interesting re: Aurora failovers! Do you have metrics for Aurora failover events? I can imagine a relationship between these two scenarios:

  1. Rails app sends a query.
  2. Aurora fails over; the connection errs silently.
  3. Rails app tries to send the next query on the connection that erred silently: 'still waiting for a result'.
  4. Rails app tries to send the next query on a connection that was in the pool: 'in read-only mode'.

austinhalpin commented 7 years ago

The event that @eredi93 reported was isolated to a single host and wasn't triggered during any database failover, which is why we're at a bit of a loss trying to figure out what the cause could be here. :(

stefansundin commented 6 years ago

@austinhalpin Thanks for sharing your initializer, it helped me! I'm also curious if you've encountered any other gotchas in the last year since your post. :)

I have been using Aurora a bit recently, and encountered the same issue with failovers. At first I thought that connections were not reset when the failover happened. After some more testing it does appear that all connections are reset, but that Rails often manages to reconnect before the DNS update has propagated fully. The Aurora cluster endpoint is a CNAME record with a 5 second TTL.

So my conclusion is that the mysql process on the old master restarts too quickly and is able to receive connections again (but now as a slave) before the DNS has propagated fully. Clients that manage to connect are oblivious to what happened and try to write to the slave.

I changed your initializer slightly, using pool.disconnect! instead of reconnect! in order to force the whole connection pool to disconnect at once (this should hopefully reduce the number of errors that end users may receive). I also switched the regex to search for a string.

require 'active_record'

module ActiveRecord
  module Mysql2ReadonlyReconnect
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message.include?('The MySQL server is running with the --read-only option')
        pool.disconnect!
      end
      raise
    end
  end
end

ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ReadonlyReconnect)

Finally, this document has been helpful to adopt best practices for Aurora: https://d1.awsstatic.com/whitepapers/RDS/amazon-aurora-mysql-database-administrator-handbook.pdf

The document mentions "smart drivers" for Aurora. Basically, the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table lets you discover other replicas directly in SQL. You can also query SELECT @@innodb_read_only; to determine whether you're actually on the master, if that is what you expect.
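
For example, a small check along these lines (a sketch; the cluster endpoint and credentials are placeholders) can verify right after connecting that you actually reached the writer:

require 'mysql2'

# innodb_read_only is 1 on Aurora readers and 0 on the writer, so a client can
# confirm after (re)connecting that DNS actually pointed it at the writer.
def connected_to_writer?(client)
  client.query('SELECT @@innodb_read_only AS ro').first['ro'].zero?
end

client = Mysql2::Client.new(:host => "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
                            :username => "app",
                            :password => 'secret')
warn 'connected to a reader; the cluster DNS may not have propagated yet' unless connected_to_writer?(client)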

There are some other Aurora-specific variables available:

> show variables like 'aurora%';
+------------------------------------+------------------+
| Variable_name                      | Value            |
+------------------------------------+------------------+
| aurora_lab_mode                    | OFF              |
| aurora_load_from_s3_role           |                  |
| aurora_max_alter_table_log_entries | 1000             |
| aurora_max_connections_limit       | 16000            |
| aurora_select_into_s3_role         |                  |
| aurora_server_id                   | staging-aurora-1 |
| aurora_use_key_prefetch            | ON               |
| aurora_version                     | 2.02             |
+------------------------------------+------------------+