aerospike / aerospike-client-ruby

Ruby client for the Aerospike database
https://www.aerospike.com/
Apache License 2.0
61 stars 40 forks source link

Remove Thread#abort_on_exception from batch_index_command #92

Closed madejejej closed 4 years ago

madejejej commented 4 years ago

Most Ruby apps run inside threaded processes, like Sidekiq or Puma. Your code is only executed in a Worker thread spawned by the main thread. The gotcha is, if you spawn a thread with abort_on_exception = true and this thread raises an exception, the exception will always propagate to the main thread. This means that Sidekiq/Puma will exit due to an unexpected error.

There are be at least a few cases when the Aerospike Client might raise errors in the batch commands:

  1. Timeouts
  2. Network errors
  3. Cluster rebalancing / adding, or removing nodes

This PR only fixes the batch index command, but I think it'd be great to fix patch the other places where abort_on_exception is used.

Please see the two examples below of the intricacies of abort_on_exception.

The exception will always propagate to the main thread, and sometimes also to the Thread that spawned it

Try running this code multiple times. I get different results, most probably depending on which threads gets scheduled after raising the exception.

def get_batch
  threads = []

  # Aerospike fetching threads
  threads << Thread.new do
    Thread.current.abort_on_exception = true
    raise RuntimeError, '1'
  end

  threads.each(&:join)
end

begin
  # Sidekiq worker Thread
  t = Thread.new do
    begin
      begin
        get_batch
      rescue RuntimeError => e
        # you might be able to catch the error here sometimes, 
        # but it will always propagate to the main thread anyway
        puts 'catching RuntimeError in Worker Thread'
        puts e.message
      end
    end
  end

  t.join
rescue RuntimeError
  # Imagine we are in the main Sidekiq process now... We wouldn't be able to rescue this inside a Worker. 
  # The Sidekiq process will catch it instead and exit. That's quite bad. 
  puts 'rescuing from main thread'
end

puts 'working normally, lalala'

One possible output:

#<Thread:0x00007f8b5c925a88@wtf2.rb:7 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
test.rb:9:in `block in get_batch': 1 (RuntimeError)
rescuing from main thread
working normally, lalala

Second possible output:

#<Thread:0x00007fbe9f0496c8@wtf2.rb:7 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
test.rb:9:in `block in get_batch': 1 (RuntimeError)
catching RuntimeError in Worker Thread
1
rescuing from main thread
working normally, lalala

If more than one exception is thrown, you are not able to catch all of them, even in the main thread

This will result in a crash, for example when 2 of the nodes time out.

def fun(thread_count)
  threads = []

  thread_count.times do |i|
    threads << Thread.new do
      Thread.current.abort_on_exception = true
      raise RuntimeError, i
    end
  end

  threads.each(&:join)
end

begin
  fun(1)
rescue RuntimeError
  puts 'Rescuing...'
end

begin
  fun(2)
rescue RuntimeError
  puts 'We are not able to rescue this one.'
end
codecov-io commented 4 years ago

Codecov Report

Merging #92 into master will increase coverage by 0.02%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   91.58%   91.61%   +0.02%     
==========================================
  Files         159      159              
  Lines        7561     7560       -1     
==========================================
+ Hits         6925     6926       +1     
+ Misses        636      634       -2
Impacted Files Coverage Δ
lib/aerospike/client.rb 79.21% <ø> (-0.06%) :arrow_down:
lib/aerospike/cluster.rb 89.39% <0%> (+0.75%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1ec46c0...f44ac16. Read the comment docs.