0.3.16 goes into an infinite loop when using SSL

thwarted commented 10 years ago

0.3.16 goes into an infinite loop caused by the the invalidate_fd function (commit 6206d2f8d3f525e8f89cbaeb5a6ae02a538b8a3d) which creates an unconnected socket and then reassigns it to the same fd used to connect to mysql. This causes the SSL transactions to get out of sync, as far as the client is concerned, and openssl goes into an infinite loop trying to write to an unconnected socket. I believe this happens during garbage collection.

What follows is a testcase that fails. I'm actually experiencing this in a larger rails app where the database exists on another machine connecting over TCP and SSL. In that case it occurs during unicorn startup, at some point when ActiveRecord first attempts to connect to the database (that is, before unicorn even gets around to forking any workers).

Gemfile

source 'https://rubygems.org'
ruby '1.9.3'
# choose one or the other
gem "mysql2", '= 0.3.15'
# gem "mysql2", '= 0.3.16'

test.rb

require 'mysql2'

client = Mysql2::Client.new(
    :host => "localhost",
    :username => "root",
    :sslca => 'ssl/ca-cert.pem',
    :sslkey => 'ssl/client-key.pem',
    :sslcert => 'ssl/client-cert.pem'
)

client.query("show variables like 'have%ssl'").each do |row|
    puts row
end

Execution under 0.3.15:

$ time bundle exec ruby test.rb 
{"Variable_name"=>"have_openssl", "Value"=>"YES"}
{"Variable_name"=>"have_ssl", "Value"=>"YES"}

real    0m0.506s
user    0m0.313s
sys     0m0.175s

(execution takes half a second and exits normally)

Execution under 0.3.16:

$ time bundle exec ruby test.rb 
{"Variable_name"=>"have_openssl", "Value"=>"YES"}
{"Variable_name"=>"have_ssl", "Value"=>"YES"}
^C

real    0m3.512s
user    0m1.673s
sys     0m1.783s

(control-c after about 3 seconds)

Comment out the 3 SSL related options being passed to Mysql2::Client.new and it doesn't occur on either 0.3.15 nor 0.3.16.

the interesting portion of strace while running under 0.3.16:

$ bundle exec strace -f -o /tmp/st.out ruby ./test.rb
...
10354 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER|SA_SIGINFO, 0x3f1740f710}, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x3f1740f710}, 8) = 0
10354 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 8
10354 dup2(8, 7)                        = 7
10354 close(8)                          = 0
10354 write(7, "\27\3\1\0 \25i\342Rg\232\350\220\36\206\267r\273\234\327\227N\307#\205\213\264\336V\22\31F"..., 74) = -1 ENOTCONN (Transport endpoint is not connected)
10354 poll([{fd=7, events=POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT|POLLHUP}])
10354 write(7, "\27\3\1\0 \25i\342Rg\232\350\220\36\206\267r\273\234\327\227N\307#\205\213\264\336V\22\31F"..., 74) = -1 ENOTCONN (Transport endpoint is not connected)
10354 poll([{fd=7, events=POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT|POLLHUP}])
10354 write(7, "\27\3\1\0 \25i\342Rg\232\350\220\36\206\267r\273\234\327\227N\307#\205\213\264\336V\22\31F"..., 74) = -1 ENOTCONN (Transport endpoint is not connected)
...

Various version information:

$ ruby --version
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]

$ bundle --version
Bundler version 1.3.5

$ rpm -qa | grep Percona
Percona-Server-client-56-5.6.15-rel63.0.519.rhel6.x86_64
Percona-Server-shared-compat-5.5.35-rel33.0.611.rhel6.x86_64
Percona-Server-shared-56-5.6.15-rel63.0.519.rhel6.x86_64
Percona-Server-server-56-5.6.15-rel63.0.519.rhel6.x86_64
Percona-Server-devel-56-5.6.15-rel63.0.519.rhel6.x86_64

$ mysqld --version
mysqld  Ver 5.6.15-56 for Linux on x86_64 (Percona Server (GPL), Release rel63.0, Revision 519)

root@localhost [db (none)] mysql> show variables like '%ssl%';
+---------------+--------------------------------+
| Variable_name | Value                          |
+---------------+--------------------------------+
| have_openssl  | YES                            |
| have_ssl      | YES                            |
| ssl_ca        | /srv/mysql/ssl/ca-cert.pem     |
| ssl_capath    |                                |
| ssl_cert      | /srv/mysql/ssl/server-cert.pem |
| ssl_cipher    |                                |
| ssl_crl       |                                |
| ssl_crlpath   |                                |
| ssl_key       | /srv/mysql/ssl/server-key.pem  |
+---------------+--------------------------------+
9 rows in set (0.00 sec)

$ rpm -qa | grep openssl
openssl-1.0.1e-16.el6_5.7.x86_64
openssl-devel-1.0.1e-16.el6_5.7.x86_64

$ openssl version
OpenSSL 1.0.1e-fips 11 Feb 2013

$ cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

$ ldd vendor/bundle/ruby/1.9.1/gems/mysql2-0.3.1*/lib/mysql2/*.so
vendor/bundle/ruby/1.9.1/gems/mysql2-0.3.15/lib/mysql2/mysql2.so:
        linux-vdso.so.1 =>  (0x00007fffcafff000)
        libruby.so.1.9 => /usr/lib64/libruby.so.1.9 (0x00007f66138b1000)
        libmysqlclient.so.18 => /usr/lib64/libmysqlclient.so.18 (0x00007f661335c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6613136000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f6612eb2000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f6612caa000)
        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f6612a3d000)
        libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f661265d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f6612459000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f6612221000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f6611e8d000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f16400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f6611b87000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6611970000)
        libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f661172c000)
        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f6611446000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f6611241000)
        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f6611015000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f6610dff000)
        libfreebl3.so => /usr/lib64/libfreebl3.so (0x00007f6610b87000)
        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f661097c000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f6610779000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f661055e000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f661033f000)
vendor/bundle/ruby/1.9.1/gems/mysql2-0.3.16/lib/mysql2/mysql2.so:
        linux-vdso.so.1 =>  (0x00007fff9378d000)
        libruby.so.1.9 => /usr/lib64/libruby.so.1.9 (0x00007f5e183a6000)
        libmysqlclient.so.18 => /usr/lib64/libmysqlclient.so.18 (0x00007f5e17e51000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5e17c2b000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5e179a7000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5e1779f000)
        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f5e17532000)
        libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f5e17152000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f5e16f4e000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f5e16d16000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5e16982000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f16400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f5e1667c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5e16465000)
        libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f5e16221000)
        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f5e15f3b000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f5e15d36000)
        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f5e15b0a000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f5e158f4000)
        libfreebl3.so => /usr/lib64/libfreebl3.so (0x00007f5e1567c000)
        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f5e15471000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f5e1526e000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f5e15053000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f5e14e34000)

Related links: Setting up mysql for ssl http://dev.mysql.com/doc/refman/5.6/en/ssl-connections.html

Be sure to set different CNs for the CA, server and client when generating the certs, otherwise openssl refuses to connect.

thwarted commented 10 years ago

On second thought, it may not be due to that exact commit, since the same socket/dup2/close was done in an earlier commit (but post-0.3.15) but without the close-on-exec flag being set.

sodabrew commented 10 years ago

Ouch. Thank you for the detailed bug report - I'll digest this and start looking into it asap.

sodabrew commented 10 years ago

The problem shouldn't stem from that particular commit, but rather the work from #498 / #463 / #213 / a2bcfd6.

I am pretty sure that it is completely wrong for OpenSSL to sit in a loop there, trying to write to an unconnected socket. It should break out. There is no way to get a suddenly unconnected socket back into application protocol level synchronization. I'd say this goes onto the recent pile of "wow, OpenSSL is doing THAT!?" issues.

thwarted commented 10 years ago

Yeah, this is some questionable errno handling on the part of openssl.

Reading up on those other tickets and commits, it may be valuable to point out that my unicorn.rb has

before_fork do |server, worker|
...
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

and

after_fork do |server, worker|
...
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.establish_connection
end

So I'm already attempting to avoid mysql socket reuse across forking. That my test script has a problem at exit is what lead me to suggest it occurring during garbage collection, which would also be the case of the now disconnected connection getting GC'ed in the before_fork hook.

thwarted commented 10 years ago

There's some good stuff in #463.

Just spit-balling here:

Rather than creating a new socket that's unconnected, open a new file descriptor on /dev/null. This should work for the writing in that case, since it's using what appears to be a plain write(2) call to send bytes (vs a socket-fd specific write function like send(2)). The possible drawback here is that openssl may want to do something other than write bytes as part of the SSL shutdown transaction. But that may end up being no problem because when it goes to read it'll get EBADF.
close(2) it before calling mysql_close. openssl (and mysql_close) might handle EPIPE or EBADF better than ENOTCONN (since ENOTCONN can occur during a write where connect(2) hasn't completed on a non-blocking socket, and is thus a retry signal). But I think this is where the race condition concern comes from that the sodabrew:close-race-fix branch (b1c089c617aa71b2c07e9a41057f5a6f75f99531) was created for.

jnimety commented 10 years ago

could this also be causing "Mysql2::Error: MySQL server has gone away" errors because of the file descriptor reuse? Our delayed job workers started crashing hard after updating to 0.3.16.

sodabrew commented 10 years ago

The problem is not descriptor reuse, we're using an unconnected socket and OpenSSL is handling this by looping.

Can you capture a backtrace of a DJ worker failing?

sodabrew commented 10 years ago

@thwarted Spitball: I'd like to try changing SOCK_STREAM to SOCK_DGRAM so that there is no connection state to worry about:

-  int sockfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+  int sockfd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);

https://github.com/brianmario/mysql2/blob/master/ext/mysql2/client.c#L179-191

sodabrew commented 10 years ago

I am trying to replicate the OP issue, but I'm not seeing it yet. SSL teardown happens cleanly so far.

thwarted commented 10 years ago

You can't replicate the issue using a datagram socket? Sounds like a winner then, same thing that I think would happen if the fd was opened to /dev/null. What does it look like is happening with an strace when openssl writes to the socket?

sodabrew commented 10 years ago

I can't replicate it either way :| so I don't know if this is a fix. Could you test my branch here and let me know if it resolves the issue for you? https://github.com/sodabrew/mysql2/tree/dgram_socket

Dev steps:

git clone https://github.com/sodabrew/mysql2.git
cd mysql2
bundle install --path vendor/bundle --without benchmarks
bundle exec rake clean compile
bundle exec ruby test.rb # your test code from above, should fail

git checkout dgram_socket
bundle exec rake clean compile
bundle exec ruby test.rb # your test might pass now?

thwarted commented 10 years ago

Sorry for the slow response.

I've since updated to ruby-2.1.2, and the problem still exists there on 0.3.16 but not on 0.3.15 (which is expected, since we've accepted this an openssl issue). This test was also done over tcp (but the same issue existed using SSL to connect to mysql over a UNIX domain socket).

The dgram_socket branch doesn't work. It doesn't work because an unconnected socket needs to be written to with sendmsg(2) or sendto(2) (so a destination address can be specified), not write(2), and openssl is doing write(2).

Changing the socket(..SOCK_DGRAM..) calls to sockfd = open("/dev/null", O_CLOEXEC, O_RDWR); does seem to avoid the problem. I get this series of syscalls then:

9836  open("/dev/null", O_RDWR|O_CLOEXEC) = 8
9836  dup2(8, 7)                        = 7
9836  close(8)                          = 0
9836  write(7, "\27\3\1\0 M6\\\314I\224\371\353h\t\271\374\335\341\264\272\31\2348\364ia_\354\272\276%"..., 74) = 74
9836  shutdown(7, 2 /* send and receive */) = -1 ENOTSOCK (Socket operation on non-socket)
9836  close(7)                          = 0

(I had previously written this in such a way that /dev/null was opened read-only, which returned EBADF from the write call, and openssl did not go into an infinite loop when receiving EBADF). The logic around this for setting close on exec using fcntl I'm not familiar with, so I'm not comfortable making a pull request and encouraging a blind merging.

You're unable to reproduce this with either version? You've got ssl enabled on the mysql server and the account you're connecting with had REQUIRE SSL in the grant statement?

sodabrew commented 10 years ago

I have another idea, need to go back and see if it was tried/discussed in the other PRs, but here it is: use a socketpair(), dup2 the first end, then close the second end.

thwarted commented 10 years ago

Is there a reason that opening up the fd on /dev/null isn't an acceptable solution?

sodabrew commented 10 years ago

Well, it's not a socket type, it's a file type. I'm not sold that it's the "right" hack.

sodabrew commented 10 years ago

Please grab a fresh copy of my sock_dgram branch - it's using socketpair() now (with STREAM not DGRAM, despite the branch name). Let me know if it works?

thwarted commented 10 years ago

To play devils advocate:

The type is "file descriptor". The only reason a new, effectively dead, file descriptor is being filled in in place of the working, real one is so that it can be manipulated dirtily before being explicitly closed without impacting the real connection. The only operation being performed on it that is socket specific is shutdown(2), which is the exact operation (other than the SSL shutdown transaction) that's attempting to be avoided. Interestingly enough, the action that should be taken when shutdown(2) is called on a socket and returns failure is the same action to take when calling shutdown(2) on a non-socket (that also returns failure): nothing. Is there a failed shutdown(2) call other than EBADF that should result in doing anything other than close(2) on the file descriptor?

sodabrew commented 10 years ago

But as you pointed out, the fd to /dev/null results in an ENOTSOCK. My thinking is that error is less likely to be handled cleanly. The socket-already-closed error is the most "normal" of all of these.

thwarted commented 10 years ago

My theory is that the mysql library is calling the shutdown after openssl says its done, and then ignoring the error, since it intends on closing it immediately anyway. Which it does. My logic is that we want the SSL layer to think it's succeeding (which writing to /dev/null does), since the socket handling layer seems to do something desirable (ignore that it's trying to shutdown a non-socket), even if not "correct" (although, what one is expecting to do with a socket that is shutdown with SHUT_RDWR other than close it, independent of the error, is lost on me).

I suspect that the socketpair solution will either return EPIPE or will raise SIGPIPE to the SSL layer, since the other end isn't available for reading once it is closed.

thwarted commented 10 years ago

The invalidate_socketpair branch works, but does exactly what I suspected: returns EPIPE on the write, and SIGPIPE is raised.

5660  socketpair(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0, [8, 9]) = 0
5660  dup2(8, 7)                        = 7
5660  close(8)                          = 0
5660  close(9)                          = 0
5660  write(7, "\27\3\1\0 \241\225\245\214\237\"v\0\tx\21\21\206\311\275\317\204\360\263#\20`\332\317\251\10\36"..., 74) = -1 EPIPE (Broken pipe)
5660  --- SIGPIPE (Broken pipe) @ 0 (0) ---
5660  shutdown(7, 2 /* send and receive */) = 0
5660  close(7)                          = 0
5660  write(4, "!", 1 <unfinished ...>

I'm only running the simple ruby code in the OP. Earlier in the strace, SIGPIPE is being ignored, I'm not sure if that's a regular ruby thing or a mysql-client-lib thing or what.

I agree it does seem more proper to use an actual socket in this case, but without knowing what is setting up SIGPIPE actions, it seems risky (say openssl is doing the ignore on SIGPIPE, it could be raised when not using an SSL connection).

sodabrew commented 10 years ago

Oh, triggering a SIGPIPE is not a good outcome. I'll take another look today.

sodabrew commented 10 years ago

I was finally able to reproduce this, on Travis, with MariaDB 10.0, while working on #535.

I also realized that because the invalidate_fd function is not used on Windows, that /dev/null can be relied upon. (If ever it were needed on Windows, the equivalent file is nul - https://gcc.gnu.org/ml/gcc-patches/2005-05/msg01793.html)

You mention in https://github.com/brianmario/mysql2/issues/516#issuecomment-47394616 that O_RDWR results in ENOTSOCK and O_RDONLY results in EBADF. I don't have an instinct on which is a better failure mode for the workaround. Do you have a preference or recommendation on this?

thwarted commented 10 years ago

I think getting ENOTSOCK is better because the only operation that seems to fail then is shutdown(2), all the other observed actions on the fd "appear" to succeed, which is the intent of using invalidate_fd.

phantasm66 commented 10 years ago

+1 on this issue:

Ruby version: 1.9.3-p545 OS: CentOS release 6.5 (via a Bashton AMI in Amazon EC2) Mysql2 gem version: mysql2-0.3.16

Example scenario:

require 'mysql2'

connection_hash = {
  :host => instance,
  :port => 3306,
  :database => db,
  :sslca => "/etc/pki/tls/certs/ec2-db.pem",
  :username => username,
  :password => password
}

client = Mysql2::Client.new(connection_hash)
 => #<Mysql2::Client:0x00000000f7d190>

rows = []

The client.query using ssl goes fine...

client.query(db_query, :symbolize_keys => true).each { |row| rows << row }
 => [{:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@charter.net"}, {:client=>"derp", :recipient=>"email@bluemail.ch"}, {:client=>"derp", :recipient=>"email@ppdocs.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@suddenlink.net"}, {:client=>"derp", :recipient=>"email@msn.com"}]

The perpetual loop occurs when i call client.close on the ssl connection. client.close hangs forever. When checking activity w/ strace, it's just endlessly returning "ENOTCONN (Transport endpoint is not connected)" while client.close just hangs.

When i remove mysql2 0.3.16 and replace it with 0.3.14 or 0.3.15 and perform the same tests as above, everything completes and closes without issue.

sodabrew commented 10 years ago

@phantasm66 Thanks for the confirming report. PR #535 is queued up to resolve this problem.

phantasm66 commented 10 years ago

@sodabrew Right on.. :+1:

sodabrew commented 10 years ago

Resolved by #535.

brianmario / mysql2

0.3.16 goes into an infinite loop when using SSL #516