davissp14 / etcdv3-ruby

Etcd v3 Ruby Client
MIT License

Remove automatic retries in case of errors #132

Closed by graywolf-at-work 5 years ago

graywolf-at-work commented 5 years ago

Currently this gem tries to automatically retry against the next (or the same) endpoint in case of an error. This has two issues: stack overflow (the retry is done via recursion) and not respecting timeouts. The stack overflow is solvable; the timeout issue is not, given the current architecture.

The cleanest solution is to just rotate the endpoints and move the responsibility for retrying to the calling application.
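
Roughly, caller-side retry could then look like this (the error classes and the retry policy here are just illustration, not part of this PR):

require "etcdv3"

conn = Etcdv3.new(endpoints: "http://127.0.0.1:2379, http://127.0.0.1:22379")

# The retry policy now lives in the application: bounded attempts, no
# recursion, and the caller controls how much total time it is willing to spend.
attempts = 0
begin
  conn.put("foo", "bar", timeout: 1)
rescue GRPC::Unavailable, GRPC::DeadlineExceeded => e
  attempts += 1
  retry if attempts < 3
  raise e
end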

Fixes davissp14/etcdv3-ruby#130 Fixes davissp14/etcdv3-ruby#131

codecov[bot] commented 5 years ago

Codecov Report

Merging #132 into master will increase coverage by 0.07%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
+ Coverage   99.08%   99.15%   +0.07%     
==========================================
  Files          30       30              
  Lines        1639     1662      +23     
==========================================
+ Hits         1624     1648      +24     
+ Misses         15       14       -1
Impacted Files                            Coverage Δ
spec/etcdv3/connection_wrapper_spec.rb    100% <100%> (ø)
spec/helpers/connections.rb               93.75% <100%> (ø)
lib/etcdv3/connection_wrapper.rb          97.29% <100%> (+3.54%)
lib/etcdv3.rb                             100% <100%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 3a542e8...27153fa.

graywolf-at-work commented 5 years ago

Looks like the Travis build passed (https://travis-ci.org/davissp14/etcdv3-ruby/), but it's not updated here for some reason?

graywolf-at-work commented 5 years ago

Could we possibly get some feedback on this PR? :)

davissp14 commented 5 years ago

I'll do my best to take a look this evening. Sorry for the delay!


davissp14 commented 5 years ago

The stack overflow is solvable; the timeout issue is not, given the current architecture.

What's currently in place should align pretty closely with what etcd defines as the command-timeout. The dial-timeout, which I think is what you're referring to, hasn't been implemented yet. The generic naming of timeout probably makes things a little confusing.

Would the implementation of a dial-based timeout address your major concerns, aside from the obvious stack-overflow issue?

--command-timeout=5s -  timeout for short running command (excluding dial timeout)
--dial-timeout=2s   -  dial timeout for client connections
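
Roughly, those would map onto client construction like this (dial_timeout is the not-yet-implemented option discussed here, and both keyword names are illustrative rather than the gem's current API):

require "etcdv3"

# Hypothetical: a dial timeout bounding connection establishment, set once at
# initialization, plus a command timeout bounding each RPC.
conn = Etcdv3.new(
  endpoints: "http://127.0.0.1:2379",
  dial_timeout: 2,     # analogous to etcdctl --dial-timeout=2s
  command_timeout: 5   # analogous to etcdctl --command-timeout=5s
)
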
graywolf-at-work commented 5 years ago

The problem is that the behaviour we want is for a call to etcd to finish within X seconds. Sure, if X is for example 2, I could give it command_timeout: 1, dial_timeout: 1. That would probably work fine for, say, .put or .get.

But I'm not sure how to make it work for e.g. the watch API. I mean, I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.
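
To be concrete, the fragile split I have in mind would look roughly like this (dial_timeout being the hypothetical option from this discussion, not something the gem supports today):

require "etcdv3"

# A 2-second budget split between dialing and watching; guessing 0.25s for
# the dial is exactly the fragile part.
conn = Etcdv3.new(endpoints: "http://127.0.0.1:2379", dial_timeout: 0.25)
conn.watch("foo", timeout: 1.75)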

How would you recommend to implement this using the two timeouts you propose?

EDIT: Also, based on the grpc documentation, it seems that timeout currently works more like --total-timeout from etcdctl v2 than --command-timeout from etcdctl v3 (at least when there is just one endpoint). But it's possible I'm wrong.

davissp14 commented 5 years ago

But I'm not sure how to make it work for e.g. the watch API. I mean, I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.

If you're looking for a strict holistic timeout that covers both the time it takes to dial the connection and the execution time of the command, I would probably just wrap the command in a Timeout block in your code.
http://ruby-doc.org/stdlib-2.6/libdoc/timeout/rdoc/Timeout.html

There are simply too many factors that could affect the time it takes to establish a new connection, and none of those factors are related to the watch command itself. So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

The dial_timeout implementation should be set on initialization and isn't something that should be dynamically set on a per-command basis.

I hope that makes sense!

graywolf-at-work commented 5 years ago

I'm looking into Timeout.timeout and it looks weird:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"
require "timeout"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        Timeout.timeout(1) do
                conn.watch("foo", timeout: 5)
        end
rescue GRPC::DeadlineExceeded, Timeout::Error
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline
E0221 08:55:18.278105569     252 backup_poller.cc:110]       run_poller: {"created":"@1550739318.278025670","description":"Shutting down timer system","file":"src/core/lib/iomgr/timer_generic.cc","file_line":704}

real    0m11.178s
user    0m0.129s
sys     0m0.020s

It takes 11 seconds compared to the expected ~1 second. And there is the run_poller: line in the console.

Compared with using timeout: 1 instead:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        conn.watch("foo", timeout: 1)
rescue GRPC::DeadlineExceeded
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline

real    0m1.126s
user    0m0.127s
sys     0m0.007s

That looks much more like what I expected.

Could you point me in the right direction of what I'm doing wrong?

davissp14 commented 5 years ago

@graywolf-at-work

Your results will be off because you're timing the execution of the entire script, rather than just the timeout block.

#!/usr/bin/env ruby

require 'etcdv3'
require 'timeout'
require 'benchmark'

cl = Etcdv3.new(endpoints: "http://localhost:2379")

Benchmark.bm do |x|
  # 1 second timeout
  x.report do
    begin
      Timeout::timeout(1) do
        cl.watch('foo')
      end
    rescue Timeout::Error
    end
  end

  # 5 second timeout
  x.report do
    begin
      Timeout::timeout(5) do
        cl.watch('foo2')
      end
    rescue Timeout::Error
    end
  end

  # 10 second timeout
  x.report do
    begin
      Timeout::timeout(10) do
        cl.watch('foo3')
      end
    rescue Timeout::Error
    end
  end

end

Results

$ ruby timeout_test.rb
       user     system      total        real
   0.010000   0.010000   0.020000 (  1.007412)
   0.000000   0.010000   0.010000 (  5.023335)
   0.010000   0.010000   0.020000 ( 10.019187)

dcepelik commented 5 years ago

Hello again @davissp14,

So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

I believe this is the source of confusion.

In our perception, a method which takes a timeout argument is a signal to the caller that the function will not delay them for any reason by more than timeout seconds. Should the time-out occur, execution will be gracefully aborted and control will be returned to the caller to deal with the situation.

(It can usually delay them a little bit longer, but that extra overhead should be constant and cannot really be practically avoided, because you have to return from the function, perhaps do some cleanup etc.)

Providing strict guarantees to the caller is a great feature of code, because then other code which needs to provide strict guarantees can be built on top of it. On the other hand, you cannot call code which isn't dependable from something that ought to be dependable.

If you're looking for a strict holistic timeout that includes both the time it takes to dial the connection and execution time for the command, I would probably just wrap the command in a Timeout block within your code.

Yes, you can achieve something that looks similar using Timeout#timeout, but you don't get the same guarantees.

Since the lower layer (GRPC in this case) is ultimately the part which implements the timeout-case behavior, it has enough information to do proper clean-up of resources after failed connection attempts and unsuccessful commands. On the other hand, with Timeout#timeout you get an exception thrown literally anywhere, possibly somewhere in native code, and to the best of my knowledge that is never really a safe way to cancel.

(There are great articles which describe the issues in detail. Even if Timeout#timeout worked flawlessly, it's a complicated device I'd rather avoid, because it may not work tomorrow.)
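
To illustrate with a plain-socket example of my own (nothing GRPC-specific): the Timeout::Error is injected into the thread asynchronously, so it can arrive after a resource has been acquired but before the code protecting it is entered:

require "socket"
require "timeout"

begin
  Timeout.timeout(1) do
    sock = TCPSocket.new("example.com", 80)  # the error may fire right here...
    begin
      sock.read                              # ...or while blocked in here
    ensure
      sock.close
    end
  end
rescue Timeout::Error
  # We cannot tell whether the socket (or, in GRPC's case, a native poller)
  # was actually cleaned up before the exception propagated.
end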


But I have to say that I understand your point too. The reconnect logic comes in handy a lot of the time when you don't care all that much about having proper time-outs, and having to retry yourself can be a bit of overkill.

Hence I'd propose a compromise: keep the current behavior and add an allow_reconnect argument to Etcdv3#new which would default to true. true would basically mean the current behavior (a command may not take longer than timeout seconds, but if the reconnection logic kicks in, it can add an arbitrary delay).

If allow_reconnect is set to false and the current etcd connection is bad, the endpoints will be rotated and an exception will be raised, returning control to the caller to either retry or give up, whichever works better for them at the moment (because they have the information needed to decide that).
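
Usage under the proposal would look something like this (allow_reconnect is the option being proposed here, and the exact exception classes would depend on the failure):

require "etcdv3"

conn = Etcdv3.new(
  endpoints: "http://127.0.0.1:2379, http://127.0.0.1:22379",
  allow_reconnect: false
)

begin
  conn.get("foo", timeout: 1)
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  # Endpoints have already been rotated; the caller decides whether to retry
  # or give up.
end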

Would you find that acceptable?

In any case, thank you very much.

dcepelik commented 5 years ago

@davissp14 Hi Shaun, would you be willing to further discuss this, please? The proposed behavior in my last comment is IMHO a nice compromise and we're blocked on this.

As usual, thanks for the time and energy you put into maintenance of this gem!

davissp14 commented 5 years ago

@dcepelik Sorry for the delay, life has been a little crazy as of late. The allow_reconnect flag sounds like a solid plan to me!

graywolf-at-work commented 5 years ago

Will implement tomorrow, thank you for greenlighting :)

graywolf-at-work commented 5 years ago

Is it acceptable like this? If yes, could we also ask for a release again?

If further changes are required, let me know.

@dcepelik please also have a look.

graywolf-at-work commented 5 years ago

I think that should be all? :)

davissp14 commented 5 years ago

Thanks for working through this! Nice work! Will get a release up today.

graywolf-at-work commented 5 years ago

Just a reminder, if you would be so kind as to do the release for us :) Thank you

davissp14 commented 5 years ago

Sorry for the delay, just pushed it up. Thanks again!