elastic / elasticsearch-ruby

Ruby integrations for Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/client/ruby-api/current/index.html
Apache License 2.0
1.96k stars 596 forks source link

RoundRobin selector failed to get connection from pool in JRUBY due to parallel execution #662

Closed him3010 closed 5 years ago

him3010 commented 5 years ago

Currently we shifter our search application from ruby to jruby environment, and start getting the error "Cannot get connection from the pool" even if we have provided a large number of hosts to Elasticsearch::Client object which is shared across all the threads. After digging in the code

class RoundRobin
            include Base

            # Returns the next connection from the collection, rotating them in round-robin fashion.
            #
            # @return [Connections::Connection]
            #
            def select(options={})
              # On Ruby 1.9, Array#rotate could be used instead
              @current = !defined?(@current) || @current.nil? ? 0 : @current+1
              @current = 0 if @current >= connections.size
              connections[@current]
            end
  end

we found the @current variable is not synchronized. So we rewrote the selector class and issue went away with it.

Fix Code

class RoundRobinSelector
  attr_reader :connections

  # @option arguments [Connections::Collection] :connections Collection with connections.
  #
  def initialize(arguments={})
    @connections = arguments[:connections]
    @connection_select_mutex = Mutex.new
  end

  # Returns the next connection from the collection, rotating them in round-robin fashion.
  #
  # @return [Connections::Connection]
  #
  def select(options={})
    @connection_select_mutex.synchronize do
      @current = !defined?(@current) || @current.nil? ? 0 : @current+1
      @current = 0 if @current >= connections.size
      return connections[@current]
    end
  end
end
estolfo commented 5 years ago

@him3010 thanks for opening this issue. Do you have a simple script or code sample that reproduces the issue?

estolfo commented 5 years ago

@him3010 I have been trying to reproduce the error and haven't been able to. So can you please tell me the following:

him3010 commented 5 years ago

@estolfo we are using gem 'elasticsearch-persistence' 6.0.0', 'elasticsearch-transport' '6.3.1'. JRUBY Version - 9.2.5.0 with ruby version - 2.5.0 rails version - 5.1.4

In production we are using puma as a application server with the Puma Config :- Workers - 1 MinThreads - 10 MaxThreads - 60

Resolved hosts is the array of the elasticsearch nodes - [es.abc.com:9200, es1.abc.com:9200, es.abc.com:9200, ....] of size 60. Since 60 threads atmax can wait for I/O operation happening on elasticsearch

Connecting to elasticsearch client via
$elasticsearch = Elasticsearch::Client.new(hosts: resolved_hosts)

him3010 commented 5 years ago
Elasticsearch::Transport::Transport::Error: Cannot get new connection from pool.

Stack trace (hide Rails)
…rt-6.3.1/lib/elasticsearch/transport/transport/
base.rb: 258:in `perform_request'
…/lib/elasticsearch/transport/transport/http/
faraday.rb:  20:in `perform_request'
…-transport-6.3.1/lib/elasticsearch/transport/
client.rb: 143:in `perform_request'
…csearch-api-6.3.1/lib/elasticsearch/api/actions/
get.rb:  49:in `get'
…6.0.0/lib/elasticsearch/persistence/repository/
find.rb:  71:in `__find_one'
…6.0.0/lib/elasticsearch/persistence/repository/
find.rb:  30:in `find'
…/1mg/jsearch/search/app/models/active_document/
base.rb:  81:in `find'
…jsearch/search/app/services/sku_module/
skus_manager.rb: 256:in `by_id'
…rch/app/controllers/api/internal/v1/
skus_controller.rb:  23:in `by_id'
…4/lib/action_controller/metal/
basic_implicit_render.rb:   4:in `send_action'
…/gems/actionpack-5.1.4/lib/abstract_controller/
base.rb: 186:in `process_action'
…ionpack-5.1.4/lib/action_controller/metal/
rendering.rb:  30:in `process_action'
…/actionpack-5.1.4/lib/abstract_controller/
callbacks.rb:  20:in `block in process_action'
…ms/activesupport-5.1.4/lib/active_support/
callbacks.rb: 108:in `block in run_callbacks'
…lib/raven/integrations/rails/
controller_transaction.rb:   7:in `block in included'
…ruby/RubyBasicObject.java:2666:in `instance_exec' (

…ms/activesupport-5.1.4/lib/active_support/
callbacks.rb: 117:in `block in run_callbacks'
…ms/activesupport-5.1.4/lib/active_support/
callbacks.rb: 135:in `run_callbacks'
…/actionpack-5.1.4/lib/abstract_controller/
callbacks.rb:  19:in `process_action'
…actionpack-5.1.4/lib/action_controller/metal/
rescue.rb:  20:in `process_action'
…k-5.1.4/lib/action_controller/metal/
instrumentation.rb:  36:in `block in process_action'
…ctivesupport-5.1.4/lib/active_support/
notifications.rb: 166:in `block in instrument'
…5.1.4/lib/active_support/notifications/
instrumenter.rb:  21:in `instrument'
…ctivesupport-5.1.4/lib/active_support/
notifications.rb: 166:in `instrument'
…k-5.1.4/lib/action_controller/metal/
instrumentation.rb:  30:in `process_action'
<truncated 28 additional frames>
…tivesupport-5.1.4/lib/active_support/
tagged_logging.rb:  26:in `tagged'
…tivesupport-5.1.4/lib/active_support/
tagged_logging.rb:  69:in `tagged'
…y-9.2.5.0/gems/railties-5.1.4/lib/rails/rack/
logger.rb:  24:in `call'
…pack-5.1.4/lib/action_dispatch/middleware/
remote_ip.rb:  79:in `call'
…ms/request_store-1.4.1/lib/request_store/
middleware.rb:  19:in `call'
…ack-5.1.4/lib/action_dispatch/middleware/
request_id.rb:  25:in `call'
…gems/jruby-9.2.5.0/gems/rack-2.0.7/lib/rack/
runtime.rb:  22:in `call'
…ctive_support/cache/strategy/
local_cache_middleware.rb:  27:in `call'
…npack-5.1.4/lib/action_dispatch/middleware/
executor.rb:  12:in `call'
…gems/sentry-raven-2.9.0/lib/raven/integrations/
rack.rb:  51:in `call'
…/jruby-9.2.5.0/gems/railties-5.1.4/lib/rails/
engine.rb: 522:in `call'
….2.5.0/gems/puma-3.12.1-java/lib/puma/
configuration.rb: 227:in `call'
…jruby-9.2.5.0/gems/puma-3.12.1-java/lib/puma/
server.rb: 660:in `handle_request'
…jruby-9.2.5.0/gems/puma-3.12.1-java/lib/puma/
server.rb: 474:in `process_client'
…jruby-9.2.5.0/gems/puma-3.12.1-java/lib/puma/
server.rb: 334:in `block in run'
…-9.2.5.0/gems/puma-3.12.1-java/lib/puma/
thread_pool.rb: 135:in `block in spawn_thread'
him3010 commented 5 years ago

@him3010 thanks for opening this issue. Do you have a simple script or code sample that reproduces the issue?

Hi i have been trying to replicate this issue on staging environment but so far no luck, but whenever it is tested on production it reproduces the error. And after applying the fix no connection pool errors were seen

him3010 commented 5 years ago
require 'elasticsearch'

class RoundRobinSelector
  attr_reader :connections

  # @option arguments [Connections::Collection] :connections Collection with connections.
  #
  def initialize(arguments={})
    @connections = arguments[:connections]
    @connection_select_mutex = Mutex.new
  end

  # Returns the next connection from the collection, rotating them in round-robin fashion.
  #
  # @return [Connections::Connection]
  #
  def select(options={})
    @connection_select_mutex.synchronize do
      @current = !defined?(@current) || @current.nil? ? 0 : @current+1
      @current = 0 if @current >= connections.size
      return connections[@current]
    end
  end
end

hosts = ["127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200", "127.0.0.1:9200"]

if ARGV[0].eql?('use_selector')
  elasticsearch = Elasticsearch::Client.new(hosts: hosts, selector_class: RoundRobinSelector)
else
  elasticsearch = Elasticsearch::Client.new(hosts: hosts)
end

connections = elasticsearch.transport.connections
Thread.report_on_exception = false

10.times do |x|
  threads = []

  20.times do |i|
    threads << Thread.new do
      conn = connections.get_connection || (raise Exception.new("cannot get connection from pool"))
    rescue Exception
       puts 'cannot get connection from pool'
    end
  end

  threads.collect {|t| t.join}

  puts "Completed #{x} round"
end

this script will reproduce the error

Steps to run this script -

First run jruby script.rb ( you will see an error cannot get connection from pool). try to run it 2-3 times to see the error. (If lucky can get error in first time)

then run jruby script.rb use_selector , no matter how many times you run this script with use_selector the script will not fail

estolfo commented 5 years ago

@him3010 Thank you for the detailed info. I was able to reproduce the issue and identify the problem. I've opened a pull request, as you've seen. Would you mind taking a look at the updates I've made today? https://github.com/elastic/elasticsearch-ruby/pull/665/files Thanks!

him3010 commented 5 years ago

@him3010 Thank you for the detailed info. I was able to reproduce the issue and identify the problem. I've opened a pull request, as you've seen. Would you mind taking a look at the updates I've made today? https://github.com/elastic/elasticsearch-ruby/pull/665/files Thanks!

Sure will look into the PR

estolfo commented 5 years ago

Fixed by https://github.com/elastic/elasticsearch-ruby/pull/665