Closed thwarted closed 10 years ago
On second thought, it may not be due to that exact commit, since the same socket/dup2/close was done in an earlier commit (but post-0.3.15) but without the close-on-exec flag being set.
Ouch. Thank you for the detailed bug report - I'll digest this and start looking into it asap.
The problem shouldn't stem from that particular commit, but rather the work from #498 / #463 / #213 / a2bcfd6.
I am pretty sure that it is completely wrong for OpenSSL to sit in a loop there, trying to write to an unconnected socket. It should break out. There is no way to get a suddenly unconnected socket back into application protocol level synchronization. I'd say this goes onto the recent pile of "wow, OpenSSL is doing THAT!?" issues.
Yeah, this is some questionable errno handling on the part of openssl.
Reading up on those other tickets and commits, it may be valuable to point out that my unicorn.rb has
before_fork do |server, worker|
...
defined?(ActiveRecord::Base) and
ActiveRecord::Base.connection.disconnect!
end
and
after_fork do |server, worker|
...
defined?(ActiveRecord::Base) and
ActiveRecord::Base.establish_connection
end
So I'm already attempting to avoid mysql socket reuse across forking. That my test script has a problem at exit is what lead me to suggest it occurring during garbage collection, which would also be the case of the now disconnected connection getting GC'ed in the before_fork
hook.
There's some good stuff in #463.
Just spit-balling here:
/dev/null
. This should work for the writing in that case, since it's using what appears to be a plain write(2) call to send bytes (vs a socket-fd specific write function like send(2)). The possible drawback here is that openssl may want to do something other than write bytes as part of the SSL shutdown transaction. But that may end up being no problem because when it goes to read it'll get EBADF
.close(2)
it before calling mysql_close
. openssl (and mysql_close) might handle EPIPE
or EBADF
better than ENOTCONN
(since ENOTCONN
can occur during a write where connect(2) hasn't completed on a non-blocking socket, and is thus a retry signal). But I think this is where the race condition concern comes from that the sodabrew:close-race-fix branch (b1c089c617aa71b2c07e9a41057f5a6f75f99531) was created for.could this also be causing "Mysql2::Error: MySQL server has gone away" errors because of the file descriptor reuse? Our delayed job workers started crashing hard after updating to 0.3.16.
The problem is not descriptor reuse, we're using an unconnected socket and OpenSSL is handling this by looping.
Can you capture a backtrace of a DJ worker failing?
@thwarted Spitball: I'd like to try changing SOCK_STREAM to SOCK_DGRAM so that there is no connection state to worry about:
- int sockfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+ int sockfd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
https://github.com/brianmario/mysql2/blob/master/ext/mysql2/client.c#L179-191
I am trying to replicate the OP issue, but I'm not seeing it yet. SSL teardown happens cleanly so far.
You can't replicate the issue using a datagram socket? Sounds like a winner then, same thing that I think would happen if the fd was opened to /dev/null
. What does it look like is happening with an strace when openssl writes to the socket?
I can't replicate it either way :| so I don't know if this is a fix. Could you test my branch here and let me know if it resolves the issue for you? https://github.com/sodabrew/mysql2/tree/dgram_socket
Dev steps:
git clone https://github.com/sodabrew/mysql2.git
cd mysql2
bundle install --path vendor/bundle --without benchmarks
bundle exec rake clean compile
bundle exec ruby test.rb # your test code from above, should fail
git checkout dgram_socket
bundle exec rake clean compile
bundle exec ruby test.rb # your test might pass now?
Sorry for the slow response.
I've since updated to ruby-2.1.2, and the problem still exists there on 0.3.16 but not on 0.3.15 (which is expected, since we've accepted this an openssl issue). This test was also done over tcp (but the same issue existed using SSL to connect to mysql over a UNIX domain socket).
The dgram_socket branch doesn't work. It doesn't work because an unconnected socket needs to be written to with sendmsg(2) or sendto(2) (so a destination address can be specified), not write(2), and openssl is doing write(2).
Changing the socket(..SOCK_DGRAM..)
calls to sockfd = open("/dev/null", O_CLOEXEC, O_RDWR);
does seem to avoid the problem. I get this series of syscalls then:
9836 open("/dev/null", O_RDWR|O_CLOEXEC) = 8
9836 dup2(8, 7) = 7
9836 close(8) = 0
9836 write(7, "\27\3\1\0 M6\\\314I\224\371\353h\t\271\374\335\341\264\272\31\2348\364ia_\354\272\276%"..., 74) = 74
9836 shutdown(7, 2 /* send and receive */) = -1 ENOTSOCK (Socket operation on non-socket)
9836 close(7) = 0
(I had previously written this in such a way that /dev/null was opened read-only, which returned EBADF from the write call, and openssl did not go into an infinite loop when receiving EBADF). The logic around this for setting close on exec using fcntl I'm not familiar with, so I'm not comfortable making a pull request and encouraging a blind merging.
You're unable to reproduce this with either version? You've got ssl enabled on the mysql server and the account you're connecting with had REQUIRE SSL
in the grant statement?
I have another idea, need to go back and see if it was tried/discussed in the other PRs, but here it is: use a socketpair()
, dup2
the first end, then close the second end.
Is there a reason that opening up the fd on /dev/null isn't an acceptable solution?
Well, it's not a socket type, it's a file type. I'm not sold that it's the "right" hack.
Please grab a fresh copy of my sock_dgram branch - it's using socketpair()
now (with STREAM not DGRAM, despite the branch name). Let me know if it works?
To play devils advocate:
The type is "file descriptor". The only reason a new, effectively dead, file descriptor is being filled in in place of the working, real one is so that it can be manipulated dirtily before being explicitly closed without impacting the real connection. The only operation being performed on it that is socket specific is shutdown(2), which is the exact operation (other than the SSL shutdown transaction) that's attempting to be avoided. Interestingly enough, the action that should be taken when shutdown(2) is called on a socket and returns failure is the same action to take when calling shutdown(2) on a non-socket (that also returns failure): nothing. Is there a failed shutdown(2) call other than EBADF that should result in doing anything other than close(2) on the file descriptor?
But as you pointed out, the fd to /dev/null
results in an ENOTSOCK
. My thinking is that error is less likely to be handled cleanly. The socket-already-closed error is the most "normal" of all of these.
My theory is that the mysql library is calling the shutdown after openssl says its done, and then ignoring the error, since it intends on closing it immediately anyway. Which it does. My logic is that we want the SSL layer to think it's succeeding (which writing to /dev/null does), since the socket handling layer seems to do something desirable (ignore that it's trying to shutdown a non-socket), even if not "correct" (although, what one is expecting to do with a socket that is shutdown with SHUT_RDWR other than close it, independent of the error, is lost on me).
I suspect that the socketpair solution will either return EPIPE or will raise SIGPIPE to the SSL layer, since the other end isn't available for reading once it is closed.
The invalidate_socketpair
branch works, but does exactly what I suspected: returns EPIPE on the write, and SIGPIPE is raised.
5660 socketpair(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0, [8, 9]) = 0
5660 dup2(8, 7) = 7
5660 close(8) = 0
5660 close(9) = 0
5660 write(7, "\27\3\1\0 \241\225\245\214\237\"v\0\tx\21\21\206\311\275\317\204\360\263#\20`\332\317\251\10\36"..., 74) = -1 EPIPE (Broken pipe)
5660 --- SIGPIPE (Broken pipe) @ 0 (0) ---
5660 shutdown(7, 2 /* send and receive */) = 0
5660 close(7) = 0
5660 write(4, "!", 1 <unfinished ...>
I'm only running the simple ruby code in the OP. Earlier in the strace, SIGPIPE is being ignored, I'm not sure if that's a regular ruby thing or a mysql-client-lib thing or what.
I agree it does seem more proper to use an actual socket in this case, but without knowing what is setting up SIGPIPE actions, it seems risky (say openssl is doing the ignore on SIGPIPE, it could be raised when not using an SSL connection).
Oh, triggering a SIGPIPE is not a good outcome. I'll take another look today.
I was finally able to reproduce this, on Travis, with MariaDB 10.0, while working on #535.
I also realized that because the invalidate_fd
function is not used on Windows, that /dev/null
can be relied upon. (If ever it were needed on Windows, the equivalent file is nul
- https://gcc.gnu.org/ml/gcc-patches/2005-05/msg01793.html)
You mention in https://github.com/brianmario/mysql2/issues/516#issuecomment-47394616 that O_RDWR results in ENOTSOCK and O_RDONLY results in EBADF. I don't have an instinct on which is a better failure mode for the workaround. Do you have a preference or recommendation on this?
I think getting ENOTSOCK is better because the only operation that seems to fail then is shutdown(2), all the other observed actions on the fd "appear" to succeed, which is the intent of using invalidate_fd.
+1 on this issue:
Ruby version: 1.9.3-p545 OS: CentOS release 6.5 (via a Bashton AMI in Amazon EC2) Mysql2 gem version: mysql2-0.3.16
Example scenario:
require 'mysql2'
connection_hash = {
:host => instance,
:port => 3306,
:database => db,
:sslca => "/etc/pki/tls/certs/ec2-db.pem",
:username => username,
:password => password
}
client = Mysql2::Client.new(connection_hash)
=> #<Mysql2::Client:0x00000000f7d190>
rows = []
The client.query using ssl goes fine...
client.query(db_query, :symbolize_keys => true).each { |row| rows << row }
=> [{:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@charter.net"}, {:client=>"derp", :recipient=>"email@bluemail.ch"}, {:client=>"derp", :recipient=>"email@ppdocs.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@hotmail.com"}, {:client=>"derp", :recipient=>"email@suddenlink.net"}, {:client=>"derp", :recipient=>"email@msn.com"}]
The perpetual loop occurs when i call client.close on the ssl connection. client.close hangs forever. When checking activity w/ strace, it's just endlessly returning "ENOTCONN (Transport endpoint is not connected)" while client.close just hangs.
When i remove mysql2 0.3.16 and replace it with 0.3.14 or 0.3.15 and perform the same tests as above, everything completes and closes without issue.
@phantasm66 Thanks for the confirming report. PR #535 is queued up to resolve this problem.
@sodabrew Right on.. :+1:
Resolved by #535.
0.3.16 goes into an infinite loop caused by the the
invalidate_fd
function (commit 6206d2f8d3f525e8f89cbaeb5a6ae02a538b8a3d) which creates an unconnected socket and then reassigns it to the same fd used to connect to mysql. This causes the SSL transactions to get out of sync, as far as the client is concerned, and openssl goes into an infinite loop trying to write to an unconnected socket. I believe this happens during garbage collection.What follows is a testcase that fails. I'm actually experiencing this in a larger rails app where the database exists on another machine connecting over TCP and SSL. In that case it occurs during unicorn startup, at some point when ActiveRecord first attempts to connect to the database (that is, before unicorn even gets around to forking any workers).
Gemfile
test.rb
Execution under 0.3.15:
(execution takes half a second and exits normally)
Execution under 0.3.16:
(control-c after about 3 seconds)
Comment out the 3 SSL related options being passed to
Mysql2::Client.new
and it doesn't occur on either 0.3.15 nor 0.3.16.the interesting portion of strace while running under 0.3.16:
Various version information:
Related links: Setting up mysql for ssl http://dev.mysql.com/doc/refman/5.6/en/ssl-connections.html
Be sure to set different CNs for the CA, server and client when generating the certs, otherwise openssl refuses to connect.