Closed digitalextremist closed 9 years ago
No idea. You might try sprinkling the require process with some putses and trying to narrow it down to which particular thing is taking so long.
Yeah, I'm strace-ing and going to step through line by line if necessary with debug output. Don't worry, I'm not so much putting this here for a solution, but to track diagnostics as I go, work with DigitalOcean on whatever their end might be, and have a reference for code commits having to do with the delay.
Actively debugging this now, but will need to just leave debug output in place for a while until I catch it again. Of course, now that debugging output points are in place, it's starting up within seconds rather than minutes.
For my own reference later:
s0 = Time.now
puts "IO: 001"
require 'celluloid/io/version'
puts "IO: 002 #{Time.now - s0}"; s0 = Time.now
require 'celluloid'
puts "IO: 003 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/dns_resolver'
puts "IO: 004 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/mailbox'
puts "IO: 005 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/reactor'
puts "IO: 006 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/stream'
puts "IO: 007 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/tcp_server'
puts "IO: 008 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/tcp_socket'
puts "IO: 009 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/udp_socket'
puts "IO: 00A #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/unix_server'
puts "IO: 00B #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/unix_socket'
puts "IO: 00C #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/ssl_server'
puts "IO: 00D #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/ssl_socket'
puts "IO: 00E #{Time.now - s0}"
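The repeated stopwatch-and-puts pattern above could be factored into a small helper. A minimal sketch, where the helper name, tag format, and sample feature list are illustrative rather than Celluloid's actual code:

```ruby
# Hypothetical helper: time a single require and report the elapsed time,
# replacing the repeated "puts ...; s0 = Time.now" pairs.
def timed_require(tag, feature)
  start = Time.now
  loaded = require feature          # true if newly loaded, false if cached
  elapsed = Time.now - start
  puts format("%s %-20s %.3fs%s", tag, feature, elapsed,
              loaded ? "" : " (already loaded)")
  elapsed
end

# Usage: time each require and flag anything suspiciously slow.
%w[logger set securerandom].each_with_index do |feature, i|
  elapsed = timed_require(format("C: %03d", i + 1), feature)
  warn "SLOW: #{feature}" if elapsed > 1.0
end
```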
Finally caught it delayed again ( see, that didn't take long at all ).
This time it delayed for a long while when loading Celluloid itself, which is what I feared:
IO: 001
IO: 002 0.003
IO: 003 135.131
IO: 004 0.058
IO: 005 0.002
IO: 006 0.059
IO: 007 0.005
IO: 008 0.004
IO: 009 0.011
IO: 00A 0.018
IO: 00B 0.009
IO: 00C 0.012
IO: 00D 0.01
IO: 00E 2.506
With that being known, it is Celluloid itself. Here are the debug-output tagged call sites there:
s1 = Time.now
puts "C: 001"
require 'logger'
puts "C: 002 #{Time.now-s1}"; s1 = Time.now
require 'thread'
puts "C: 002 #{Time.now-s1}"; s1 = Time.now
require 'timeout'
puts "C: 003 #{Time.now-s1}"; s1 = Time.now
require 'set'
puts "C: 004 #{Time.now-s1}"; s1 = Time.now
#de Static, inline definitions...
require 'celluloid/exceptions'
puts "C: 005 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/calls'
puts "C: 006 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/call_chain'
puts "C: 007 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/condition'
puts "C: 008 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/thread'
puts "C: 009 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/core_ext'
puts "C: 010 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/cpu_counter'
puts "C: 011 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/fiber'
puts "C: 012 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/fsm'
puts "C: 013 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/internal_pool'
puts "C: 014 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/links'
puts "C: 015 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/logger'
puts "C: 016 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/mailbox'
puts "C: 017 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/evented_mailbox'
puts "C: 018 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/method'
puts "C: 019 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/properties'
puts "C: 020 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/handlers'
puts "C: 021 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/receivers'
puts "C: 022 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/registry'
puts "C: 023 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/responses'
puts "C: 024 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/signals'
puts "C: 025 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/stack_dump'
puts "C: 026 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/system_events'
puts "C: 027 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/tasks'
puts "C: 028 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/task_set'
puts "C: 029 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/thread_handle'
puts "C: 030 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/uuid'
puts "C: 031 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/abstract_proxy'
puts "C: 032 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/sync_proxy'
puts "C: 033 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/cell_proxy'
puts "C: 034 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/actor_proxy'
puts "C: 035 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/async_proxy'
puts "C: 036 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/future_proxy'
puts "C: 037 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/block_proxy'
puts "C: 038 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/actor'
puts "C: 039 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/cell'
puts "C: 040 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/future'
puts "C: 041 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/actor_system'
puts "C: 042 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/pool_manager'
puts "C: 043 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/supervision_group'
puts "C: 044 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/supervisor'
puts "C: 045 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/notifications'
puts "C: 046 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/logging'
puts "C: 047 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/legacy' unless defined?(CELLULOID_FUTURE)
puts "C: 048 #{Time.now-s1}"; s1 = Time.now
Found the culprit, generally speaking -- and it supports my theory that @digitalocean may need to intervene somehow, or at least know about this:
IO: 001
IO: 002 0.004
C: 001
C: 002 0.002
C: 002 0.001
C: 003 0.007
C: 004 0.001
C: 005 0.01
C: 006 0.007
C: 007 0.005
C: 008 0.007
C: 009 0.03
C: 010 0.007
C: 011 0.009
C: 012 0.0
C: 013 0.007
C: 014 0.006
C: 015 0.005
C: 016 0.005
C: 017 0.007
C: 018 0.01
C: 019 0.016
C: 020 0.009
C: 021 0.008
C: 022 0.524
C: 023 0.003
C: 024 0.002
C: 025 0.002
C: 026 0.004
C: 027 0.003
C: 028 0.007
C: 029 0.005
C: 030 0.003
C: 031 51.873 #de due to require 'celluloid/uuid'
C: 032 0.01
C: 033 0.005
C: 034 0.005
C: 035 0.005
C: 036 0.007
C: 037 0.004
C: 038 0.004
C: 039 0.018
C: 040 0.014
C: 041 0.012
C: 042 0.012
C: 043 0.026
C: 044 0.013
C: 045 0.003
C: 046 0.01
C: 047 0.034
C: 048 0.007
IO: 003 52.838
IO: 004 0.111
IO: 005 0.004
IO: 006 0.109
IO: 007 0.01
IO: 008 0.008
IO: 009 0.012
IO: 00A 0.018
IO: 00B 0.008
IO: 00C 0.01
IO: 00D 0.008
IO: 00E 1.837
Added call site debugging output to celluloid/uuid:
s2 = Time.now
puts "UU: 001"
require 'securerandom'
puts "UU: 002 #{Time.now-s2}"
module Celluloid
# Clearly Ruby doesn't have enough UUID libraries
# This one aims to be fast and simple with good support for multiple threads
# If there's a better UUID library I can use with similar multithreaded
# performance, I certainly wouldn't mind using a gem for this!
module UUID
s2 = Time.now
values = SecureRandom.hex(9).match(/(.{8})(.{4})(.{3})(.{3})/)
PREFIX = "#{values[1]}-#{values[2]}-4#{values[3]}-8#{values[4]}".freeze
BLOCK_SIZE = 0x10000
puts "UU: 003 #{Time.now-s2}"; s2 = Time.now
@counter = 0
@counter_mutex = Mutex.new
puts "UU: 004 #{Time.now-s2}"; s2 = Time.now
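To confirm the stall lives in the RNG itself rather than anywhere else in the require graph, SecureRandom.hex can be timed on its own, outside Celluloid entirely. A minimal sketch of that isolation test (the one-second threshold is an arbitrary assumption):

```ruby
# Time the same call Celluloid::UUID makes, in isolation.
require 'securerandom'

start = Time.now
bytes = SecureRandom.hex(9)         # 9 random bytes as 18 hex characters
elapsed = Time.now - start

puts "SecureRandom.hex(9) => #{bytes} in #{format('%.3f', elapsed)}s"
puts "possibly blocked on entropy" if elapsed > 1.0
```

If this call alone hangs for minutes on a fresh VM, the require graph is exonerated and the problem is the random source.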
Will keep watching this, but further research now shows it's getting stuck badly around the invocation of SecureRandom.hex:
IO: 001
IO: 002 0.002
C: 001
C: 002 0.001
C: 002 0.001
C: 003 0.004
C: 004 0.0
C: 005 0.005
C: 006 0.003
C: 007 0.002
C: 008 0.002
C: 009 0.014
C: 010 0.004
C: 011 0.004
C: 012 0.0
C: 013 0.004
C: 014 0.004
C: 015 0.003
C: 016 0.003
C: 017 0.004
C: 018 0.005
C: 019 0.008
C: 020 0.005
C: 021 0.004
C: 022 0.458
C: 023 0.004
C: 024 0.003
C: 025 0.002
C: 026 0.004
C: 027 0.002
C: 028 0.007
C: 029 0.005
C: 030 0.002
UU: 001
UU: 002 0.008
UU: 003 194.583 #de The offender.
UU: 004 0.0
C: 031 194.597
C: 032 0.01
C: 033 0.006
C: 034 0.007
C: 035 0.006
C: 036 0.004
C: 037 0.004
C: 038 0.004
C: 039 0.015
C: 040 0.006
C: 041 0.006
C: 042 0.01
C: 043 0.019
C: 044 0.013
C: 045 0.004
C: 046 0.009
C: 047 0.033
C: 048 0.007
IO: 003 195.373
IO: 004 0.11
IO: 005 0.004
IO: 006 0.106
IO: 007 0.009
IO: 008 0.007
IO: 009 0.015
IO: 00A 0.016
IO: 00B 0.008
IO: 00C 0.01
IO: 00D 0.008
IO: 00E 2.884
Knowing the above, is there anything obvious that stands out to you @tarcieri?
Perhaps SecureRandom is trying to use /dev/random and it's blocking? Perhaps your VM doesn't have enough entropy... SecureRandom should probably use /dev/urandom if this is the case... but just a guess.
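One way to test the entropy hypothesis directly is to read the kernel's entropy counter. A Linux-only sketch using the standard procfs path (the 200-bit warning threshold is an illustrative assumption):

```ruby
# On Linux, the kernel exposes the current entropy pool size. If this number
# sits near zero while the process hangs, blocking on /dev/random is the
# likely cause.
ENTROPY_FILE = "/proc/sys/kernel/random/entropy_avail"

if File.readable?(ENTROPY_FILE)
  bits = File.read(ENTROPY_FILE).to_i
  puts "available entropy: #{bits} bits"
  puts "pool is nearly depleted; /dev/random reads may block" if bits < 200
else
  puts "#{ENTROPY_FILE} not present (non-Linux system?)"
end
```

Watching this value while the slow require runs would show whether the stall correlates with pool depletion.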
@tarcieri so you mean rather than "too much load" there might not be enough load on the server to actually animate the randomizer?
This is the code behind random_bytes in the 1.9.3 reference material:
https://gist.github.com/digitalextremist/802c0104f769b9c4e9b5
Possibly. An strace would be the most helpful thing here, IMO
Alas, it's back to strace then... Thanks for the entropy hint so far.
@iamorphen you were just talking about entropy, specifically under Linux. Even though it might not be relevant to KVM ( the virtualization platform DigitalOcean uses ) do you have any useful references?
@digitalextremist As you know, a primary way to introduce entropy to /dev/random is through hardware interfaces such as the network interface and keyboard. These are virtual interfaces on a VM, so you're not necessarily farming as much activity as /dev/random would be had the OS been running completely natively.
Entropy will still be introduced to /dev/random through the network and disk access, among other sources, if available. However, regarding networking, if most of the traffic received by the VM network interface is generated on the same machine, that will introduce little entropy since the send-receive timespan is likely very uniform. Disk access is still a good source of entropy but it is only a subset of what you'd get from all sources had the OS been running natively, thus the long delays when trying to use /dev/random, if that really is the source of this issue.
Even when running Ubuntu natively, if I try to generate a PGP key right after I boot, it can take nearly 5 minutes (or just hang if I'm not generating network traffic or typing). You can search "/dev/random pgp key slow" for many examples of this notorious issue.
Here are some sources I think are enlightening:
- Entropy and Random Number Generators in Linux
- Low Entropy on VMs (with suggested workaround)
- ServerFault: working with /dev/random on VMs
For networking needs, shouldn't /dev/urandom be better on Linux? Also, http://www.issihosts.com/haveged/index.html might be a good fit for a virtual machine. But I am just guessing here.
I have seen haveged recommended multiple times. I cannot vouch for it, but others say it is a sound workaround.
The choice between /dev/random and /dev/urandom depends on the amount of robustness you need. /dev/urandom is pseudo-random (so is /dev/random, but to a lesser extent). /dev/urandom is non-blocking. It polls small bits of "random" information from /dev/random. Doing so, it potentially depletes /dev/random, but when it does, it will just reuse the last bit of information it polled from /dev/random, making it much more predictable. /dev/random, in the case of being depleted, will block until enough entropy is introduced.
So if the most robust randomness isn't needed, /dev/urandom will do. For the most random data, though, rely on /dev/random. In the treatise I linked to (Entropy and Random Number Generators in Linux, above), the author notes "/dev/urandom is widely considered safe for all cryptographic purposes, except by the most paranoid people."
@digitalextremist Can you force your system to use /dev/urandom in place of /dev/random momentarily and then re-time the executions and let us know if this changed anything? This should immediately tell us whether or not the issue is with /dev/[u]random. However you achieve this, please restore /dev/random functionality when you're finished (:
Edit: credit where credit is due -- @tarcieri first suggested this.
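A hedged sketch of that experiment: read the same nine bytes Celluloid's UUID needs straight from /dev/urandom and time it. Since /dev/urandom never blocks, an instant return here while SecureRandom stalls would point squarely at /dev/random:

```ruby
# Bypass SecureRandom entirely: read raw bytes from the non-blocking device
# and time the read, for comparison against the SecureRandom.hex timings.
start = Time.now
bytes = File.open("/dev/urandom", "rb") { |f| f.read(9) }
elapsed = Time.now - start

puts "read #{bytes.bytesize} bytes in #{format('%.6f', elapsed)}s"
```

This only demonstrates which device blocks; it doesn't change what SecureRandom itself does.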
@iamorphen wish it were that easy! The call is made by a Ruby library, so we'll need to route entirely around that to a secondary gem that uses /dev/urandom!
@digitalextremist Doesn't it ultimately make a system call? Can't you just change the target of the system call to /dev/urandom? There are ways to map /dev/random to /dev/urandom or change the functionality of /dev/random to non-blocking, I believe.
Edit: As an alternative, you could instead just use an artificial seeder for /dev/random to test it.
This seems to be the relevant issue/bug/topic/call site discussion for ruby-lang @iamorphen:
https://bugs.ruby-lang.org/issues/9569
@tarcieri there's a note in celluloid/uuid talking about using a gem instead of SecureRandom. Is now the time perhaps? Or do you think this is virtualization rooted and worthy of OS/hypervisor config tuning?
@digitalextremist I'm curious about your environment's initial conditions. It sounds like while you were trying to reproduce the issue for this thread you were loading Celluloid::IO
multiple times back-to-back. How did you try and reproduce this? Straight from a reboot? Or was the system running for some time? I think this info is important since you labeled this as a bug.
@iamorphen, it's loading once only. This really ought to be moved to Celluloid though; it'll happen however Celluloid is called (as I now see).
It happens after reboot as well as after many days online. What's most troubling is that this happens when my production environment has new code deployed after every 13-day Sprint (though it's unpredictable whether it'll delay or not). It also happens repeatedly, day after day, when restarting my development environment after a hot reload triggers an unrecoverable fault.
If SecureRandom is reading from anything but /dev/urandom, whatever is doing that is broken. It'd be good to know if something else is going on first.
Well, I've successfully isolated the issue to available-entropy depletion. It takes a very long time for entropy to become available on my virtual machine. From what I've read, /dev/urandom is non-blocking, and to be non-blocking it can at times reuse bits of entropy. So if it's blocking, I'm assuming it's using /dev/random somehow. I will research further and determine for sure if it is, and what we can do.
Per @Asmod4n's suggestion, with @iamorphen's cautious encouragement above, I installed and started haveged, which increased my available-bits pool from an average of 130-210 at any given time to 1500-2000. I'm sure that number can be worked with, but now the magical number of about 240 bits of entropy needed to start Celluloid is always there, so startup takes an ordinary amount of time again. I think this issue ought to stay open, though, until we figure out how this can operate properly without haveged and without blocking.
Using /dev/urandom is fine. I suppose the bigger question is which codepath of SecureRandom is being taken, and why it isn't using /dev/urandom.
If SecureRandom is slow, that is a problem for me too, i.e. https://github.com/ioquatix/rubydns/blob/b52354852aeeeb86aa407be9e5bc06bd8a82bc45/lib/rubydns/resolver.rb#L53
Ruby's SecureRandom uses OpenSSL first, if it's available. All OpenSSL versions try to use /dev/urandom by default. If OpenSSL isn't available, Ruby's SecureRandom falls back to using /dev/urandom directly. urandom never blocks.
But this user is using JRuby, not MRI. The JVM apparently uses /dev/random by default on Linux.
So this can be fixed by either setting "securerandom.source=file:/dev/urandom" in java.security: http://docs.oracle.com/cd/E13209_01/wlcp/wlss30/configwlss/jvmrand.html
Or by upgrading to JRuby 1.7.11, which has rewritten SecureRandom to seed only once per thread: http://blog.mogotest.com/2014/03/11/faster-securerandom-in-jruby-1.7.11/
I'd probably do both.
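For reference, the java.security override can also be applied at runtime from inside JRuby, as long as it runs before anything seeds the RNG. A sketch of that approach (the method name is illustrative; the snippet is a verified no-op outside JRuby):

```ruby
# Force the JVM's SecureRandom source to /dev/urandom at boot, equivalent
# to setting securerandom.source=file:/dev/urandom in java.security.
def force_jvm_urandom!
  return false unless RUBY_PLATFORM == "java"

  require 'java'
  # Must run before the JVM's SecureRandom is first used.
  java.security.Security.setProperty("securerandom.source",
                                     "file:/dev/urandom")
  true
end

puts force_jvm_urandom! ? "JVM RNG source overridden" : "not JRuby; no-op"
```

Upgrading to JRuby 1.7.11+ remains the cleaner fix, since it seeds only once per thread.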
You are exactly right @johnl. This is a JRuby-specific issue. Those two workarounds are great to know. Without doing either of them, I was able to remedy the situation using haveged several months back, but I will implement those two measures alongside the more unorthodox approach I've been using.
As it's not a bug in Celluloid, can you close the ticket? (I'm presuming the ticket owner can close/reject it -- I can't.)
Closing, this is not a bug in Celluloid.
It sometimes takes 180.0230 seconds just to load Celluloid::IO... sometimes longer.

On DigitalOcean with an 8-core droplet (and previously on a 12-core droplet also), purely running require 'celluloid/io' under JRuby 1.7.* can range anywhere from 0.003 to 300+ seconds in duration.

(For DigitalOcean: this started in Ticket #275167, and is being strace'd as I go.)

/cc: @mitayai, @tonyborries, @joonas, @imbriaco, @bkhamitov, @johnedgar, @raiyu