mmozuras closed this issue 7 years ago.
After some investigation it appeared that the issue was caused by incorrect signal handling. A signal (SIGINT or SIGQUIT) interrupted a Mysql2::Client query which was in progress: mysql_client_wrapper->active_thread was not reset to Qnil and was still pointing to the current Unicorn thread. Because mysql_client_wrapper->active_thread still pointed to rb_thread_current, the next query on that connection raised Mysql2::Error: This connection is still waiting for a result, try again once you have the result.

I ran into another, closely related variant of this issue. It can be triggered when someone uses Thread#kill to interrupt a long-running query. Here's a small repro:
#!/usr/bin/env ruby
require 'mysql2'

def long_query(connection)
  begin
    connection.query('SELECT SLEEP(1000);')
  rescue StandardError => e
    puts "Caught #{e}"
  ensure
    puts "Ensured"
  end
  puts 'done'
end

conn = Mysql2::Client.new(:host => "localhost",
                          :username => "root",
                          :password => 'p4ssw0rd')

t = Thread.new { long_query(conn) }

sleep 1
puts "Killing #{t}, #{t.status}"
t.kill
sleep 1
puts "Killed #{t}, #{t.status}"

conn.query('SELECT 1;')
On my machine, the output is:
Killing #<Thread:0x000000022f7ef0>, sleep
Ensured
Killed #<Thread:0x000000022f7ef0>, false
/var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:107:in `_query': This connection is in use by: #<Thread:0x000000022f7ef0@./main.rb:20 dead> (Mysql2::Error)
from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:107:in `block in query'
from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:106:in `handle_interrupt'
from /var/lib/gems/2.3.0/gems/mysql2-0.4.4/lib/mysql2/client.rb:106:in `query'
from ./main.rb:27:in `<main>'
That is, the killed thread is still marked as actively using the connection, even after it's been killed.
This variant is a little bit worse than the ones caused by signals or remotely-triggered exceptions, because there is no reasonable way to mask Thread#kill. The handle_interrupt trick from https://github.com/brianmario/mysql2/pull/731/files won't work here.
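For context, the handle_interrupt approach is roughly the sketch below (a simplified illustration, not the exact code from that PR). It defers asynchronous exceptions delivered via Thread#raise (e.g. Timeout) until the query has finished, so the client's bookkeeping is never left half-updated; as noted above, it does not help against Thread#kill.

require 'mysql2'

# Sketch only: defer asynchronous exceptions until the query has returned,
# so active_thread is always reset before an interrupt is delivered.
# This does not protect against Thread#kill.
def guarded_query(client, sql)
  Thread.handle_interrupt(Exception => :never) do
    client.query(sql)
  end
end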
I imagine this has been discussed before, but I haven't found any reference to it. Would it be reasonable to try to solve this with an exception handler in the C extension code? Exception handlers should still run in the Thread#kill case. Would it be possible to register a handler that resets the connection state when an exception is thrown while waiting for the response to a query?
@richardlarocque I cannot think of any reasonable workaround for a Thread#kill. At best, the next time you try to use the connection it can say, "The handle is marked in-use by a dead thread, so you need to reconnect."
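As an application-level sketch of that idea (the error-message check and the way the client is rebuilt here are assumptions, not documented mysql2 behaviour):

require 'mysql2'

# Hypothetical helper: if a query fails because the handle is still marked
# in-use by a dead thread, discard the connection and build a fresh client
# from explicitly supplied options, then retry once.
def query_or_reconnect(client, sql, connect_opts)
  [client, client.query(sql)]
rescue Mysql2::Error => e
  raise unless e.message =~ /connection is in use by .*dead/
  client.close rescue nil
  fresh = Mysql2::Client.new(connect_opts)
  [fresh, fresh.query(sql)]
end

The caller would keep the returned client for subsequent queries.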
For those who are using Passenger to serve their app and use a script to kill Passenger processes, check your script. We're running a script via cron to kill processes which are consuming too much RAM. Thanks to the research by @dmitry-g, I looked into our script and was able to improve it to remove the use of kill. Passenger 5.0 comes with the passenger-config command, which can be used to gracefully shut down Passenger processes. We switched to that command recently, and since we started using it this exception has gone away. Here is the script we're using to detach the processes, for those who are interested (thanks to whoever came up with this script, it has been very helpful).
for i in $(/usr/local/rvm/bin/rvm-shell -c "passenger-memory-stats" | grep RubyApp | awk '{if ($4 > 299 && $7 == "RubyApp:") print $1}'); do passenger-config detach-process $i; done
It looks like #811 would provide a fix for this problem.
Thanks for the help, everyone! We no longer encounter this issue. 👍
This happened again; we are running the latest gem. Here is the log:
» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx-rails:ActiveRecord::StatementInvalid (Mysql2::Error: This connection is still waiting for a result, try again once you have the result: SELECT `xxxxxx`.* FROM `xxxxxxxx` WHERE `xxxxxx`.`type` IN ('xxxxxxx', 'xxxxxxxx') AND `xxxxxxx`.`id` = 111111111 ORDER BY `xxxxxxxx`.`id` ASC LIMIT 1 /*application:IntercomApp,line:/app/controllers/intercom_base_controller.rb:28:in `append_info_to_payload'*/):
» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:10:in `block in instrument'
» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx: config/initializers/stacktrace.rb:8:in `instrument'
» 15 May 2017 17:10:35.192 i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_exception_bucket.rb:4:in `execute'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: config/initializers/mysql2_readonly_reconnect.rb:4:in `execute'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/controllers/xxxxxxx_base_controller.rb:28:in `append_info_to_payload'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/version_header.rb:10:in `call'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/version_metrics.rb:9:in `call'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxx/rack/bad_request_for_broken_json.rb:11:in `call'
» 15 May 2017 17:10:35.193 i-07a3c182330826af5 xxxxxxx: app/lib/xxxxxxxx/rack/character_encoding_coercion.rb:20:in `call'
Can you help me debug it and see why this is still happening? @sodabrew @darnaut
What are your initializers, mysql2_exception_bucket.rb and mysql_readonly_reconnect.rb?
mysql2_exception_bucket.rb is basically just a catch-all for mysql2 exceptions: we record metrics whenever we see certain MySQL exceptions, for alerting purposes. Example:
module ActiveRecord
  module Mysql2ExceptionBucket
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message =~ /This connection is still waiting for a result/
        record_metric # internal metrics helper, not shown here
      end
      raise
    end
  end
end
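The snippet above is trimmed; presumably it is hooked into the adapter the same way as the read-only initializer below, along the lines of the following (assumed wiring, not shown in the original excerpt):

# Assumed wiring, mirroring the read-only initializer shown further down:
ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ExceptionBucket)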
mysql_readonly_reconnect.rb is something we wrote to force the connection to call reconnect!. We found that after we moved to AWS Aurora, failovers were no longer handled gracefully and needed a redeploy to bounce connections. To avoid that, we wrote this initializer:
module ActiveRecord
  module Mysql2ReadonlyReconnect
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message =~ /read-only/
        Rails.logger.info("MySQL running in read-only mode: reconnecting before raising error")
        reconnect!
      end
      raise
    end
  end
end

ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ReadonlyReconnect)
Interesting re: Aurora failovers! Do you have metrics for Aurora failover events? I can imagine a relationship between these two scenarios:
1. Rails app sends a query.
2. Aurora fails over; the connection errs silently.
3. Rails app tries to send the next query on the connection that erred silently: "still waiting for a result".
4. Rails app tries to send the next query on a connection that was in the pool: "in read-only mode".
The event that @eredi93 reported was isolated to a single host & wasn't triggered during any database failover, which is why we're at a bit of a loss trying to figure out what the cause could be here. :(
@austinhalpin Thanks for sharing your initializer; it helped me! I'm also curious if you've encountered any other gotchas in the last year since your post. :)
I have been using Aurora a bit recently, and encountered the same issue with failovers. At first I thought that connections were not reset when the failover happened. After some more testing it does appear that all connections are reset, but that Rails often manages to reconnect before the DNS update has propagated fully. The Aurora cluster endpoint is a CNAME record with a 5 second TTL.
So my conclusion is that the mysql process on the old master restarts too quickly and is able to receive connections again (but now as a slave) before the DNS has propagated fully. Clients that manage to connect are oblivious to what happened and try to write to the slave.
I changed your initializer slightly, using pool.disconnect! instead of reconnect!, in order to force the whole connection pool to disconnect at once (this should hopefully reduce the number of errors that end users may receive). I also switched from the regex to a plain string match.
require 'active_record'

module ActiveRecord
  module Mysql2ReadonlyReconnect
    def execute(*)
      super
    rescue Mysql2::Error, ActiveRecord::StatementInvalid => e
      if e.message.include?('The MySQL server is running with the --read-only option')
        pool.disconnect!
      end
      raise
    end
  end
end

ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ActiveRecord::Mysql2ReadonlyReconnect)
Finally, this document has been helpful for adopting best practices for Aurora: https://d1.awsstatic.com/whitepapers/RDS/amazon-aurora-mysql-database-administrator-handbook.pdf
The document mentions "smart drivers" for Aurora. Basically, the table INFORMATION_SCHEMA.REPLICA_HOST_STATUS lets you discover other replicas directly in SQL. You can also query SELECT @@innodb_read_only; to determine whether you're actually on the master, if that is what you expect.
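As a quick sketch of that check from Ruby (the connection options below are placeholders, not a real endpoint):

require 'mysql2'

# Sketch: after (re)connecting, verify we actually landed on the writer.
client = Mysql2::Client.new(:host => 'my-cluster.cluster-xxxx.rds.amazonaws.com',
                            :username => 'app', :password => 'secret')
row = client.query('SELECT @@innodb_read_only AS ro').first
raise 'Connected to a read-only replica, expected the writer' if row['ro'] == 1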
There are some other Aurora-specific variables available:
> show variables like 'aurora%';
+------------------------------------+------------------+
| Variable_name | Value |
+------------------------------------+------------------+
| aurora_lab_mode | OFF |
| aurora_load_from_s3_role | |
| aurora_max_alter_table_log_entries | 1000 |
| aurora_max_connections_limit | 16000 |
| aurora_select_into_s3_role | |
| aurora_server_id | staging-aurora-1 |
| aurora_use_key_prefetch | ON |
| aurora_version | 2.02 |
+------------------------------------+------------------+
We've been experiencing this issue in production to varying degrees for the last two years:
Mysql2::Error: This connection is still waiting for a result, try again once you have the result
I've looked at previous issues (like https://github.com/brianmario/mysql2/issues/566, https://github.com/brianmario/mysql2/issues/532, https://github.com/brianmario/mysql2/issues/99), but didn't find anything that helps. We've also tried multiple things ourselves, but did not manage to solve it.
We're running:
Stacktraces don't show anything interesting; here's one example:
We didn't notice any particular pattern to when it happens either:
Any help would be appreciated 😄