Closed digitalextremist closed 9 years ago
No idea. You might try sprinkling the require process with some putses and trying to narrow it down to which particular thing is taking so long.
Yeah, I'm strace-ing and going to step through line by line if necessary with debug output. Don't worry, I'm not so much putting this here for a solution, but to track diagnostics as I go, work with DigitalOcean on whatever their end might be, and have a reference for code commits having to do with the delay.
Actively debugging this now, but will need to just leave debug output in place for a while until I catch it again. Of course, now that debugging output points are in place, it's starting up within seconds rather than minutes.
For my own reference later:
s0 = Time.now
puts "IO: 001"
require 'celluloid/io/version'
puts "IO: 002 #{Time.now - s0}"; s0 = Time.now
require 'celluloid'
puts "IO: 003 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/dns_resolver'
puts "IO: 004 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/mailbox'
puts "IO: 005 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/reactor'
puts "IO: 006 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/stream'
puts "IO: 007 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/tcp_server'
puts "IO: 008 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/tcp_socket'
puts "IO: 009 #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/udp_socket'
puts "IO: 00A #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/unix_server'
puts "IO: 00B #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/unix_socket'
puts "IO: 00C #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/ssl_server'
puts "IO: 00D #{Time.now - s0}"; s0 = Time.now
require 'celluloid/io/ssl_socket'
puts "IO: 00E #{Time.now - s0}"
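The repeated stopwatch-and-puts pattern above could be factored into a small helper. A minimal sketch, where the helper name, tag format, and sample feature list are illustrative rather than Celluloid's actual code:

```ruby
# Hypothetical helper: time a single require and report the elapsed time,
# replacing the repeated "puts ...; s0 = Time.now" pairs.
def timed_require(tag, feature)
  start = Time.now
  loaded = require feature          # true if newly loaded, false if cached
  elapsed = Time.now - start
  puts format("%s %-20s %.3fs%s", tag, feature, elapsed,
              loaded ? "" : " (already loaded)")
  elapsed
end

# Usage: time each require and flag anything suspiciously slow.
%w[logger set securerandom].each_with_index do |feature, i|
  elapsed = timed_require(format("C: %03d", i + 1), feature)
  warn "SLOW: #{feature}" if elapsed > 1.0
end
```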
Finally caught it delayed again ( see, that didn't take long at all ).
This time it delayed for a long while when loading Celluloid itself, which is what I feared:
IO: 001
IO: 002 0.003
IO: 003 135.131
IO: 004 0.058
IO: 005 0.002
IO: 006 0.059
IO: 007 0.005
IO: 008 0.004
IO: 009 0.011
IO: 00A 0.018
IO: 00B 0.009
IO: 00C 0.012
IO: 00D 0.01
IO: 00E 2.506
With that being known, it is Celluloid itself. Here are the debug-output tagged call sites there:
s1 = Time.now
puts "C: 001"
require 'logger'
puts "C: 002 #{Time.now-s1}"; s1 = Time.now
require 'thread'
puts "C: 002 #{Time.now-s1}"; s1 = Time.now
require 'timeout'
puts "C: 003 #{Time.now-s1}"; s1 = Time.now
require 'set'
puts "C: 004 #{Time.now-s1}"; s1 = Time.now
#de Static, inline definitions...
require 'celluloid/exceptions'
puts "C: 005 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/calls'
puts "C: 006 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/call_chain'
puts "C: 007 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/condition'
puts "C: 008 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/thread'
puts "C: 009 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/core_ext'
puts "C: 010 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/cpu_counter'
puts "C: 011 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/fiber'
puts "C: 012 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/fsm'
puts "C: 013 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/internal_pool'
puts "C: 014 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/links'
puts "C: 015 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/logger'
puts "C: 016 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/mailbox'
puts "C: 017 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/evented_mailbox'
puts "C: 018 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/method'
puts "C: 019 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/properties'
puts "C: 020 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/handlers'
puts "C: 021 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/receivers'
puts "C: 022 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/registry'
puts "C: 023 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/responses'
puts "C: 024 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/signals'
puts "C: 025 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/stack_dump'
puts "C: 026 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/system_events'
puts "C: 027 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/tasks'
puts "C: 028 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/task_set'
puts "C: 029 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/thread_handle'
puts "C: 030 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/uuid'
puts "C: 031 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/abstract_proxy'
puts "C: 032 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/sync_proxy'
puts "C: 033 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/cell_proxy'
puts "C: 034 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/actor_proxy'
puts "C: 035 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/async_proxy'
puts "C: 036 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/future_proxy'
puts "C: 037 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/proxies/block_proxy'
puts "C: 038 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/actor'
puts "C: 039 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/cell'
puts "C: 040 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/future'
puts "C: 041 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/actor_system'
puts "C: 042 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/pool_manager'
puts "C: 043 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/supervision_group'
puts "C: 044 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/supervisor'
puts "C: 045 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/notifications'
puts "C: 046 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/logging'
puts "C: 047 #{Time.now-s1}"; s1 = Time.now
require 'celluloid/legacy' unless defined?(CELLULOID_FUTURE)
puts "C: 048 #{Time.now-s1}"; s1 = Time.now
Found the culprit, generally speaking -- and it supports my theory that @digitalocean may need to intervene somehow, or at least know about this:
IO: 001
IO: 002 0.004
C: 001
C: 002 0.002
C: 002 0.001
C: 003 0.007
C: 004 0.001
C: 005 0.01
C: 006 0.007
C: 007 0.005
C: 008 0.007
C: 009 0.03
C: 010 0.007
C: 011 0.009
C: 012 0.0
C: 013 0.007
C: 014 0.006
C: 015 0.005
C: 016 0.005
C: 017 0.007
C: 018 0.01
C: 019 0.016
C: 020 0.009
C: 021 0.008
C: 022 0.524
C: 023 0.003
C: 024 0.002
C: 025 0.002
C: 026 0.004
C: 027 0.003
C: 028 0.007
C: 029 0.005
C: 030 0.003
C: 031 51.873 #de due to require 'celluloid/uuid'
C: 032 0.01
C: 033 0.005
C: 034 0.005
C: 035 0.005
C: 036 0.007
C: 037 0.004
C: 038 0.004
C: 039 0.018
C: 040 0.014
C: 041 0.012
C: 042 0.012
C: 043 0.026
C: 044 0.013
C: 045 0.003
C: 046 0.01
C: 047 0.034
C: 048 0.007
IO: 003 52.838
IO: 004 0.111
IO: 005 0.004
IO: 006 0.109
IO: 007 0.01
IO: 008 0.008
IO: 009 0.012
IO: 00A 0.018
IO: 00B 0.008
IO: 00C 0.01
IO: 00D 0.008
IO: 00E 1.837
Added call site debugging output to celluloid/uuid:
s2 = Time.now
puts "UU: 001"
require 'securerandom'
puts "UU: 002 #{Time.now-s2}"
module Celluloid
# Clearly Ruby doesn't have enough UUID libraries
# This one aims to be fast and simple with good support for multiple threads
# If there's a better UUID library I can use with similar multithreaded
# performance, I certainly wouldn't mind using a gem for this!
module UUID
s2 = Time.now
values = SecureRandom.hex(9).match(/(.{8})(.{4})(.{3})(.{3})/)
PREFIX = "#{values[1]}-#{values[2]}-4#{values[3]}-8#{values[4]}".freeze
BLOCK_SIZE = 0x10000
puts "UU: 003 #{Time.now-s2}"; s2 = Time.now
@counter = 0
@counter_mutex = Mutex.new
puts "UU: 004 #{Time.now-s2}"; s2 = Time.now
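To confirm the stall lives in the RNG itself rather than anywhere else in the require graph, SecureRandom.hex can be timed on its own, outside Celluloid entirely. A minimal sketch of that isolation test (the one-second threshold is an arbitrary assumption):

```ruby
# Time the same call Celluloid::UUID makes, in isolation.
require 'securerandom'

start = Time.now
bytes = SecureRandom.hex(9)         # 9 random bytes as 18 hex characters
elapsed = Time.now - start

puts "SecureRandom.hex(9) => #{bytes} in #{format('%.3f', elapsed)}s"
puts "possibly blocked on entropy" if elapsed > 1.0
```

If this call alone hangs for minutes on a fresh VM, the require graph is exonerated and the problem is the random source.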
Will keep watching this, but further research now shows it's getting stuck badly around the invocation of SecureRandom.hex:
IO: 001
IO: 002 0.002
C: 001
C: 002 0.001
C: 002 0.001
C: 003 0.004
C: 004 0.0
C: 005 0.005
C: 006 0.003
C: 007 0.002
C: 008 0.002
C: 009 0.014
C: 010 0.004
C: 011 0.004
C: 012 0.0
C: 013 0.004
C: 014 0.004
C: 015 0.003
C: 016 0.003
C: 017 0.004
C: 018 0.005
C: 019 0.008
C: 020 0.005
C: 021 0.004
C: 022 0.458
C: 023 0.004
C: 024 0.003
C: 025 0.002
C: 026 0.004
C: 027 0.002
C: 028 0.007
C: 029 0.005
C: 030 0.002
UU: 001
UU: 002 0.008
UU: 003 194.583 #de The offender.
UU: 004 0.0
C: 031 194.597
C: 032 0.01
C: 033 0.006
C: 034 0.007
C: 035 0.006
C: 036 0.004
C: 037 0.004
C: 038 0.004
C: 039 0.015
C: 040 0.006
C: 041 0.006
C: 042 0.01
C: 043 0.019
C: 044 0.013
C: 045 0.004
C: 046 0.009
C: 047 0.033
C: 048 0.007
IO: 003 195.373
IO: 004 0.11
IO: 005 0.004
IO: 006 0.106
IO: 007 0.009
IO: 008 0.007
IO: 009 0.015
IO: 00A 0.016
IO: 00B 0.008
IO: 00C 0.01
IO: 00D 0.008
IO: 00E 2.884
Knowing the above, is there anything obvious that stands out to you @tarcieri?
Perhaps SecureRandom is trying to use /dev/random and it's blocking? Perhaps your VM doesn't have enough entropy... SecureRandom should probably use /dev/urandom if this is the case... but just a guess.
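One way to test the entropy hypothesis directly is to read the kernel's entropy counter. A Linux-only sketch using the standard procfs path (the 200-bit warning threshold is an illustrative assumption):

```ruby
# On Linux, the kernel exposes the current entropy pool size. If this number
# sits near zero while the process hangs, blocking on /dev/random is the
# likely cause.
ENTROPY_FILE = "/proc/sys/kernel/random/entropy_avail"

if File.readable?(ENTROPY_FILE)
  bits = File.read(ENTROPY_FILE).to_i
  puts "available entropy: #{bits} bits"
  puts "pool is nearly depleted; /dev/random reads may block" if bits < 200
else
  puts "#{ENTROPY_FILE} not present (non-Linux system?)"
end
```

Watching this value while the slow require runs would show whether the stall correlates with pool depletion.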
@tarcieri so you mean rather than "too much load" there might not be enough load on the server to actually animate the randomizer?
This is the code behind random_bytes in the 1.9.3 reference material:
https://gist.github.com/digitalextremist/802c0104f769b9c4e9b5
Possibly. An strace would be the most helpful thing here, IMO
Alas, it's back to strace then... Thanks for the entropy hint so far.
@iamorphen you were just talking about entropy, specifically under Linux. Even though it might not be relevant to KVM ( the virtualization platform DigitalOcean uses ) do you have any useful references?
@digitalextremist As you know, a primary way to introduce entropy to /dev/random is through hardware interfaces such as the network interface and keyboard. These are virtual interfaces on a VM, so you're not necessarily farming as much activity as /dev/random would be had the OS been running completely natively.
Entropy will still be introduced to /dev/random through the network and disk access, among other sources, if available. However, regarding networking, if most of the traffic received by the VM network interface is generated on the same machine, that will introduce little entropy since the send-receive timespan is likely very uniform. Disk access is still a good source of entropy but it is only a subset of what you'd get from all sources had the OS been running natively, thus the long delays when trying to use /dev/random, if that really is the source of this issue.
Even when running Ubuntu natively, if I try to generate a PGP key right after I boot, it can take nearly 5 minutes (or just hang if I'm not generating network traffic or typing). You can search "/dev/random pgp key slow" for many examples of this notorious issue.
Here are some sources I think are enlightening:
- Entropy and Random Number Generators in Linux
- Low Entropy on VMs (with suggested workaround)
- ServerFault: working with /dev/random on VMs
For networking needs, shouldn't /dev/urandom be better on Linux? Also, http://www.issihosts.com/haveged/index.html might be a good fit for a virtual machine. But I am just guessing here.
I have seen haveged recommended multiple times. I cannot vouch for it, but others say it is a sound workaround.
The choice between /dev/random and /dev/urandom depends on the amount of robustness you need. /dev/urandom is pseudo-random (so is /dev/random, but to a lesser extent). /dev/urandom is non-blocking. It polls small bits of "random" information from /dev/random. Doing so, it potentially depletes /dev/random, but when it does, it will just reuse the last bit of information it polled from /dev/random, making it much more predictable. /dev/random, in the case of being depleted, will block until enough entropy is introduced.
So if the most robust randomness isn't needed, /dev/urandom will do. For the most random data, though, rely on /dev/random. In the treatise I linked to (Entropy and Random Number Generators in Linux, above), the author notes "/dev/urandom is widely considered safe for all cryptographic purposes, except by the most paranoid people."
@digitalextremist Can you force your system to use /dev/urandom in place of /dev/random momentarily and then re-time the executions and let us know if this changed anything? This should immediately tell us whether or not the issue is with /dev/[u]random. However you achieve this, please restore /dev/random functionality when you're finished (:
Edit: credit where credit is due -- @tarcieri first suggested this.
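A hedged sketch of that experiment: read the same nine bytes Celluloid's UUID needs straight from /dev/urandom and time it. Since /dev/urandom never blocks, an instant return here while SecureRandom stalls would point squarely at /dev/random:

```ruby
# Bypass SecureRandom entirely: read raw bytes from the non-blocking device
# and time the read, for comparison against the SecureRandom.hex timings.
start = Time.now
bytes = File.open("/dev/urandom", "rb") { |f| f.read(9) }
elapsed = Time.now - start

puts "read #{bytes.bytesize} bytes in #{format('%.6f', elapsed)}s"
```

This only demonstrates which device blocks; it doesn't change what SecureRandom itself does.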
@iamorphen wish it were that easy! The call is made by a Ruby library, so we'll need to route entirely around that to a secondary gem that uses /dev/urandom!
@digitalextremist Doesn't it ultimately make a system call? Can't you just change the target of the system call to /dev/urandom? There are ways to map /dev/random to /dev/urandom or change the functionality of /dev/random to non-blocking, I believe.
Edit: As an alternative, you could instead just use an artificial seeder for /dev/random to test it.
This seems to be the relevant issue/bug/topic/call site discussion for ruby-lang @iamorphen:
https://bugs.ruby-lang.org/issues/9569
@tarcieri there's a note in celluloid/uuid talking about using a gem instead of SecureRandom. Is now the time perhaps? Or do you think this is virtualization rooted and worthy of OS/hypervisor config tuning?
@digitalextremist I'm curious about your environment's initial conditions. It sounds like while you were trying to reproduce the issue for this thread you were loading Celluloid::IO
multiple times back-to-back. How did you try and reproduce this? Straight from a reboot? Or was the system running for some time? I think this info is important since you labeled this as a bug.
@iamorphen, it's loading once only. This really ought to be moved to Celluloid though; it'll happen however Celluloid is called (as I now see).
It happens after reboot as well as after many days online. What's most troubling is that this happens when my production environment has new code deployed after every 13-day Sprint (though it's unpredictable whether it'll delay or not). It also happens repeatedly, day after day, when restarting my development environment after a hot reload triggers an unrecoverable fault.
If SecureRandom is reading from anything but /dev/urandom, whatever is doing that is broken. It'd be good to know if something else is going on first.
Well, I've successfully isolated the issue to available-entropy depletion. It takes a very long time for entropy to become available on my virtual machine. From what I've read, /dev/urandom is non-blocking, and to be non-blocking it can at times reuse bits of entropy. So if it's blocking, I'm assuming it's using /dev/random somehow. I will research further and determine for sure if it is, and what we can do.
Per @Asmod4n's suggestion, with @iamorphen's cautious encouragement above, I installed and started haveged, which increased my available-bits pool from an average of 130-210 at any given time to 1500-2000. I'm sure that number can be worked with, but now the magical number of about 240 bits of entropy needed to start Celluloid is always there, so startup takes an ordinary amount of time again. I think this issue ought to stay open, though, until we figure out how this can operate properly without haveged and without blocking.
Using /dev/urandom is fine. I suppose the bigger question is which codepath of SecureRandom is being taken, and why it isn't using /dev/urandom.
If SecureRandom is slow, that is a problem for me too, i.e. https://github.com/ioquatix/rubydns/blob/b52354852aeeeb86aa407be9e5bc06bd8a82bc45/lib/rubydns/resolver.rb#L53
Ruby's SecureRandom uses OpenSSL first, if it's available. All OpenSSL versions try to use /dev/urandom by default. If OpenSSL isn't available, Ruby's SecureRandom falls back to using /dev/urandom directly. urandom never blocks.
But this user is using JRuby, not MRI. The JVM apparently uses /dev/random by default on Linux.
So this can be fixed by either setting "securerandom.source=file:/dev/urandom" in java.security: http://docs.oracle.com/cd/E13209_01/wlcp/wlss30/configwlss/jvmrand.html
Or by upgrading to JRuby 1.7.11, which has rewritten SecureRandom to seed only once per thread: http://blog.mogotest.com/2014/03/11/faster-securerandom-in-jruby-1.7.11/
I'd probably do both.
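For reference, the java.security override can also be applied at runtime from inside JRuby, as long as it runs before anything seeds the RNG. A sketch of that approach (the method name is illustrative; the snippet is a verified no-op outside JRuby):

```ruby
# Force the JVM's SecureRandom source to /dev/urandom at boot, equivalent
# to setting securerandom.source=file:/dev/urandom in java.security.
def force_jvm_urandom!
  return false unless RUBY_PLATFORM == "java"

  require 'java'
  # Must run before the JVM's SecureRandom is first used.
  java.security.Security.setProperty("securerandom.source",
                                     "file:/dev/urandom")
  true
end

puts force_jvm_urandom! ? "JVM RNG source overridden" : "not JRuby; no-op"
```

Upgrading to JRuby 1.7.11+ remains the cleaner fix, since it seeds only once per thread.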
You are exactly right @johnl. This is a JRuby-specific issue. Those two workarounds are great to know. Without doing either of them, I was able to remedy the situation using haveged several months back, but I will implement those two measures alongside the more unorthodox approach I've been using.
As it's not a bug in Celluloid, can you close the ticket? (I'm presuming the ticket owner can close/reject it -- I can't.)
Closing, this is not a bug in Celluloid.
It sometimes takes 180.0230 seconds just to load Celluloid::IO... sometimes longer.

On DigitalOcean with an 8-core droplet (and previously on a 12-core droplet also), purely running require 'celluloid/io' under JRuby 1.7.* can range anywhere from 0.003 to 300+ seconds in duration.

(For DigitalOcean: this started in Ticket #275167, and is being strace'd as I go.)

/cc: @mitayai, @tonyborries, @joonas, @imbriaco, @bkhamitov, @johnedgar, @raiyu