contribsys / faktory_worker_ruby

Faktory worker for Ruby
GNU Lesser General Public License v3.0
214 stars 31 forks source link

NoMethodError: undefined method `[]' for nil:NilClass #51

Open scottrobertson opened 4 years ago

scottrobertson commented 4 years ago
Screenshot 2020-07-22 at 11 13 23
mperham commented 4 years ago

I don't know how this could happen. result should never return nil, only a string or raise. Can you easily reproduce the issue? If you can turn on debugging and reproduce the issue, it will help but the debugging will add a LOT of output.

Faktory.configure_worker do |config|
   config.faktory = { debug: true }
end
jmirmina commented 4 years ago

We're currently digging into an unrelated issue (not sure yet if it's something on our end or not) where we were digging into the result method. I remembered seeing this in the middle of that result function and figured I'd chime in: return nil if count == -1

mperham commented 4 years ago

@jmirmina Great find, I was completely blind to that line. This will occur if we call c.Result(nil) on the Faktory server-side but I can't see a code path where that is done in the ack command. It's definitely possible in the code path for other commands, for instance it is a normal part of fetch, when we want to return nil because there is no job to process right now.

I wonder if this issue can come about if we use a Faktory::Client in a thread-unsafe way; maybe it's possible for one thread to request fetch and another thread to receive the reply to ack.

jason-he-3 commented 4 years ago

Hello! I've been working with @jmirmina debugging this heavily on our side internally and with regards to what @mperham said about Faktory::Client being used in thread-unsafe ways I highly suspect this to be the case. So a bit of background about our setup, we run our rails setup inside of a standard container that boots up puma and as apart of that setup, we instantiate both as a producer (think: launcher class in this gem in a separate thread in an after initializer, we pass wid and pid to faktory). However, we also push jobs to faktory in this same rails container using the standard workflow where Faktory::Client#push is invoked as apart of a perform_async call and we also process jobs in this same container as well (or as apart of different pods in the same deployment in k8s depending on scaling). However, this seems to have had the unintended effect of sometimes having jobs error on push to faktory because the socket (and therefore connection) is either reaped on the faktory server side (because the producers and consumers are one connection pool now) or as I found today, the socket is somehow still writable on the client side despite being reaped which then doesn't throw an expected Errno::EPIPE (Broken pipe - No message available) when writing to that socket. This was after I had confirmed via the faktory server logs that the connection had been reaped almost 5 minutes prior.

The other thing I've witnessed while sharing a connection pool is that sometimes, there is data in the TCPSocket that I would not otherwise expect to be there. For example, when doing a job push yesterday, I somehow got a message back from faktory server that indicated an Unknown Worker: name message. I know that ultimately, Faktory.server_pool indicates that it is implemented on top of the connection_pool gem which blockingly yields a Faktory::Client object when one is available or otherwise lazily spins one up if the current size is below max. Could this weirdness possibly be attributed to the pool somehow yielding the connection back too soon after previous usage and the buffer not having a chance to fully execute IO#sync (which I could only see happening if we don't read the data in the socket fully, or exception being thrown before it can read) before the next thread uses it and starts reading from that same socket? We never access Faktory::Client directly, only via perform_async and other built-in methods that ultimately access it through Faktory.server/server_pool which ultimately leads to the connection pool. Any thoughts on this would totally be appreciated.

mperham commented 4 years ago

I'll read through the FWR code and make sure it is using connection_pool as expected. If you can give me any more clues or reproduce the issue, that would help a lot too.

mperham commented 4 years ago

What do you think about this?

+++ b/lib/faktory/client.rb
@@ -295,9 +295,12 @@ module Faktory

       begin
         yield
-      rescue Errno::EPIPE, Errno::ECONNRESET
+      rescue SystemCallError, SocketError, TimeoutError
         if retryable
           retryable = false
+
+          @sock.close rescue nil
+          @sock = nil
           open(@timeout)
           retry
         else

If the underlying socket raises an exception that is not caught, it could have unread data that will be read on the next use. This change:

  1. Adds more exceptions to the set that will result in the socket being discarded.
  2. Closes the existing socket so we don't try to use it further.
jason-he-3 commented 4 years ago

I think that this should cover the cases I outlined above with things potentially being left over in the buffer - these changes would be perfect. However, I did notice something weird today, here is a stack trace of us getting something unexpected from the buffer when doing a push:

Faktory::CommandError: Unknown worker 4c2b51b7d3572f5a
.../lib/faktory/client.rb:  333:in `result'
.../lib/faktory/client.rb:  343:in `ok'
.../lib/faktory/client.rb:  151:in `block in push'
.../lib/faktory/client.rb:  297:in `transaction'
.../lib/faktory/client.rb:  149:in `push'
.../lib/faktory/job.rb:   93:in `block in client_push'
….5.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:   63:in `block in with'
org/jruby/RubyThread.java:759:in `handle_interrupt'
….5.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:   62:in `block in with'
org/jruby/RubyThread.java:759:in `handle_interrupt'
….5.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:   59:in `with'

Now what was exceptionally weird about this was - as far as I can tell via the faktory-server go code - there is nothing in the push command route that should ever respond with Unknown worker. But if you re-use the same connection for another push, it keeps throwing this exception (side note: This is sourced based on commit e1f61d0). Assuming your calls to read_nonblock in the io handler are working correctly, what I fail to understand is how this could continuously produce this exception on push. Also - I was able to verify that despite this exception being thrown on push, faktory server is accepting the push data with the JID etc and the job does run correctly on an available worker which sort of indicates to me that the issue is still most likely isolated to the ruby side.

mperham commented 4 years ago

That error would happen if the client tries to heartbeat for a worker which has been reaped. It's another case where the connection should be closed so the client opens a new socket and re-authenticates. Sounds like your worker process got into a bad state somehow and FWR's client connection logic isn't yet good enough to gracefully recover from every edge case.

I've committed that change; feel free to run master and see if it improves things.

arjun810 commented 9 months ago

We've been running into this recently. I'm not sure what's causing it, but it seems potentially to happen more when we deploy a new version. We're running in ECS, so I don't really understand why that would matter..

mperham commented 9 months ago

@arjun810 Please post a full backtrace when you get one.

arjun810 commented 9 months ago

Sorry for the terrible formatting.. copying out of our error tracker.

Details

``` undefined method [] for nil:NilClass lib/faktory/client.rb:375 ok lib/faktory/client.rb:174 block in push lib/faktory/client.rb:320 transaction lib/faktory/client.rb:172 push lib/active_job/queue_adapters/faktory_adapter.rb:38 block (2 levels) in enqueue_at lib/connection_pool.rb:65 block (2 levels) in with lib/connection_pool.rb:64 handle_interrupt lib/connection_pool.rb:64 block in with lib/connection_pool.rb:61 handle_interrupt lib/connection_pool.rb:61 with lib/active_job/queue_adapters/faktory_adapter.rb:37 block in enqueue_at lib/faktory/middleware/chain.rb:129 block in invoke lib/faktory/middleware/batch.rb:14 call lib/faktory/middleware/chain.rb:131 block in invoke lib/faktory/middleware/chain.rb:134 invoke lib/active_job/queue_adapters/faktory_adapter.rb:36 enqueue_at lib/active_job/queue_adapters/faktory_adapter.rb:14 enqueue lib/active_job/enqueuing.rb:67 block in enqueue lib/active_support/callbacks.rb:118 block in run_callbacks lib/active_job/uniqueness/active_job_patch.rb:53 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_job/instrumentation.rb:25 block in instrument lib/appsignal/hooks/active_support_notifications.rb:19 block in instrument lib/active_support/notifications/instrumenter.rb:24 instrument lib/appsignal/hooks/active_support_notifications.rb:18 instrument lib/active_job/instrumentation.rb:35 instrument lib/active_job/instrumentation.rb:9 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_support/tagged_logging.rb:99 block in tagged lib/active_support/tagged_logging.rb:37 tagged lib/active_support/tagged_logging.rb:99 tagged lib/active_job/logging.rb:25 tag_logger lib/active_job/logging.rb:14 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_support/callbacks.rb:138 run_callbacks lib/active_job/enqueuing.rb:63 enqueue lib/active_job/enqueuing.rb:30 perform_later app/jobs/check_finetuning_job.rb:27 perform lib/active_job/execution.rb:59 block in _perform_job lib/active_support/callbacks.rb:118 block in run_callbacks lib/active_job/uniqueness/active_job_patch.rb:54 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/i18n.rb:322 with_locale lib/active_job/translation.rb:9 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_support/core_ext/time/zones.rb:66 use_zone lib/active_job/timezones.rb:9 block (2 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_support/callbacks.rb:138 run_callbacks lib/active_job/execution.rb:58 _perform_job lib/active_job/instrumentation.rb:20 _perform_job lib/active_job/execution.rb:46 perform_now lib/active_job/instrumentation.rb:14 block in perform_now lib/active_job/instrumentation.rb:25 block in instrument lib/appsignal/hooks/active_support_notifications.rb:19 block in instrument lib/active_support/notifications/instrumenter.rb:24 instrument lib/appsignal/hooks/active_support_notifications.rb:18 instrument lib/active_job/instrumentation.rb:35 instrument lib/active_job/instrumentation.rb:14 perform_now lib/active_job/logging.rb:18 block in perform_now lib/active_support/tagged_logging.rb:99 block in tagged lib/active_support/tagged_logging.rb:37 tagged lib/active_support/tagged_logging.rb:99 tagged lib/active_job/logging.rb:25 tag_logger lib/active_job/logging.rb:18 perform_now lib/active_job/execution.rb:24 block in execute lib/active_support/callbacks.rb:118 block in run_callbacks lib/active_job/railtie.rb:54 block (4 levels) in lib/active_support/execution_wrapper.rb:92 wrap lib/active_support/reloader.rb:72 block in wrap lib/active_support/execution_wrapper.rb:88 wrap lib/active_support/reloader.rb:71 wrap lib/active_job/railtie.rb:53 block (3 levels) in lib/active_support/callbacks.rb:127 instance_exec lib/active_support/callbacks.rb:127 block in run_callbacks lib/active_support/callbacks.rb:138 run_callbacks lib/active_job/execution.rb:22 execute lib/appsignal/hooks/active_job.rb:56 execute lib/active_job/queue_adapters/faktory_adapter.rb:47 perform lib/faktory/processor.rb:145 block (2 levels) in process lib/faktory/middleware/chain.rb:129 block in invoke lib/faktory/middleware/batch.rb:21 call lib/faktory/middleware/chain.rb:131 block in invoke lib/faktory/middleware/chain.rb:134 invoke lib/faktory/processor.rb:144 block in process lib/faktory/processor.rb:134 block (3 levels) in dispatch lib/faktory/rails.rb:50 block in call lib/active_support/execution_wrapper.rb:92 wrap lib/active_support/reloader.rb:72 block in wrap lib/active_support/execution_wrapper.rb:92 wrap lib/active_support/reloader.rb:71 wrap lib/faktory/rails.rb:49 call lib/faktory/processor.rb:130 block (2 levels) in dispatch lib/faktory/job_logger.rb:6 call lib/faktory/processor.rb:125 block in dispatch lib/faktory/logging.rb:43 with_context lib/faktory/logging.rb:37 with_job_hash_context lib/faktory/processor.rb:124 dispatch lib/faktory/processor.rb:143 process lib/faktory/processor.rb:88 process_one lib/faktory/processor.rb:71 run lib/faktory/util.rb:17 watchdog lib/faktory/util.rb:26 block in safe_thread ```

arjun810 commented 9 months ago

Little bit more information for you...

It seems like after our deploy, one of the workers started up unhappy. Here are the logs from that worker's startup.

It eventually does successfully process a job, but then will randomly and intermittently get the same OK failure pasted above.

Details

2023-12-07T17:10:15.008-08:00 7 TID-3fw INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux] 2023-12-07T17:10:15.009-08:00 7 TID-3fw INFO: See LICENSE and the LGPL-3.0 for licensing details.2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: Error fetching job: Faktory::TimeoutError 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:41:in `block in read_timeout' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `loop' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `read_timeout' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:23:in `gets' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:342:in `result' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:379:in `result!' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:183:in `block in fetch' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:320:in `transaction' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:181:in `fetch' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:25.483-08:00 7 TID-2d9g ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: Error fetching job: Faktory::TimeoutError 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:41:in `block in read_timeout' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `loop' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `read_timeout' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:23:in `gets' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:342:in `result' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:379:in `result!' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:183:in `block in fetch' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:320:in `transaction' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:181:in `fetch' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:25.504-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: Error fetching job: Faktory::TimeoutError 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:41:in `block in read_timeout' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `loop' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `read_timeout' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:23:in `gets' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:342:in `result' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:379:in `result!' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:183:in `block in fetch' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:320:in `transaction' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:181:in `fetch' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:25.533-08:00 7 TID-2dak ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: Error fetching job: unexpected token at 'OK' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:185:in `fetch' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:27.339-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:29.341-08:00 7 TID-2d9g INFO: Faktory is online, 3.857455488 sec downtime 2023-12-07T17:10:29.546-08:00 7 TID-2da0 INFO: Faktory is online, 4.042336297 sec downtime 2023-12-07T17:10:31.361-08:00 7 TID-2db4 INFO: Faktory is online, 4.02209626 sec downtime 2023-12-07T17:10:31.361-08:00 7 TID-2dak INFO: Faktory is online, 5.828125108 sec downtime 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: Error fetching job: unexpected token at 'OK' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:185:in `fetch' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:40.496-08:00 7 TID-2db4 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:44.517-08:00 7 TID-2db4 INFO: Faktory is online, 4.021023754 sec downtime 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: Error fetching job: unexpected token at 'OK' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:185:in `fetch' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:10:49.581-08:00 7 TID-2dbo ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:10:52.587-08:00 7 TID-2dbo INFO: Faktory is online, 3.006070199 sec downtime 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: Error fetching job: unexpected token at 'OK' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:185:in `fetch' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-07T17:11:10.752-08:00 7 TID-2da0 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-07T17:11:16.775-08:00 7 TID-2da0 INFO: Faktory is online, 6.023234352 sec downtime 2023-12-07T17:11:23.790-08:00 7 TID-2dak ERROR: Error fetching job: unexpected token at 'OK' 7 TID-2dak ERROR: Error fetching job: unexpected token at 'OK'

mperham commented 9 months ago

It sounds like you've got something really dodgy in your network messing up connections. I'm not sure how that happens. Faktory::Timeout is a bad sign. Did you by any chance take down Faktory and then start it 4-5 seconds later without first shutting down the workers?

Ruby is really flexible so for these things, I usually monkeypatch methods to get more runtime data associated with the crash, e.g.

class Faktory::Client
    def ok(retval = true)
      resp = result
      return retval if resp == "OK"
      p [:crash, resp] if !resp.respond_to?(:[]) # added
      resp[0].to_sym
    end
end

Now deploy and search the logs for "crash". Keep iterating on that debugging until you get more clarity.

mperham commented 9 months ago

My mistake, since we know resp is nil, that debug output isn't useful. You'll need to monkeypatch result to print the raw line data.

arjun810 commented 9 months ago

We think the issue is that our faktory server was under resourced. We gave it more CPU/RAM and haven't seen the issue anymore. It's a little odd because it never got to 100% cpu or memory..

In any case, it'd be nice if the client were more robust here -- not sure if you can reset the connection so the worker isn't in a permanently bad state.

It does seem like there is a client issue, because 3/4 workers were totally fine, and the one that started out unhappy stayed pretty unhappy.

ibrahima commented 8 months ago

Hmm, we are still experiencing this. I did notice that Faktory::Client already has a @debug instance variable that would in theory print out the line data, but I am not sure how to pass it in if we are using Faktory via ActiveJob. I'm guessing we'd have to construct our connection pool manually maybe?

https://github.com/contribsys/faktory_worker_ruby/blob/fdf75823c79c50f1e8f07544f292d8d82db1a77f/lib/faktory/client.rb#L346

Seems like this works:

Faktory.faktory = { debug: true }

So maybe no need to monkey patch!

ibrahima commented 8 months ago

Hm, we haven't tried the above yet because it seems like it'd produce a ton of logging data during normal operation. But we're considering monkey patching the client to set @debug=true when the error condition comes up, since the error state seems to stick around when that happens. Unfortunately that might not log the conditions that led up to the error though... but we'll see.

One thing I'm noticing now, looking at the result method: It seems like the main way it returns nil is if it gets a line of $-1. This corresponds to a null string in the RESP protocol. I'm kinda confused as to why the server would return a null string in response to a job push.

It seems like the only time Faktory returns nil is if there are no jobs to fetch (either no jobs on the queues, or the worker is supposed to be "quiet")

So it kinda feels like messages are getting mixed up and we're getting responses for different requests when this happens.

I will note that at least for the incident that occurred today, the worker process started out with 4 Faktory::TimeoutErrors when trying to fetch work. Then, it got a job successfully, and appeared to complete it, but stuff started breaking down when it tried to ACK the job or enqueue new jobs.

Selected logs ``` 2023-12-25 09:57:29.230 7 TID-3w0 INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux] 2023-12-25 09:57:29.230 7 TID-3w0 INFO: See LICENSE and the LGPL-3.0 for licensing details. 2023-12-25 09:57:42.348 7 TID-1j14 ERROR: Error fetching job: Faktory::TimeoutError 2023-12-25 09:57:42.348 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:41:in `block in read_timeout' 2023-12-25 09:57:42.348 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `loop' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:38:in `read_timeout' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/io.rb:23:in `gets' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:342:in `result' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:379:in `result!' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:183:in `block in fetch' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:320:in `transaction' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:181:in `fetch' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:110:in `block (2 levels) in with' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `handle_interrupt' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `block in with' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `handle_interrupt' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `with' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-25 09:57:42.349 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' x4 2023-12-25 09:57:44.351 7 TID-1j1o INFO: Faktory is online, 2.000483358 sec downtime 2023-12-25 09:57:44.374 7 TID-1j2s INFO: Faktory is online, 2.000531267 sec downtime 2023-12-25 09:57:46.352 7 TID-1j14 INFO: Faktory is online, 4.004027534 sec downtime 2023-12-25 09:57:46.353 7 TID-1j28 INFO: Faktory is online, 3.989520063 sec downtime 2023-12-25 09:57:47.323 7 TID-1j2s SomeApiJob JID-58f6c445a3b2bda4a637d068 INFO: start 2023-12-25 09:57:47.325 I, [2023-12-25T17:57:47.325786 #7] INFO -- : [ActiveJob] [SomeApiJob] [28d19e17-320e-44a5-9efa-07e090b591ac] Performing SomeApiJob (Job ID: 28d19e17-320e-44a5-9efa-07e090b591ac) from Faktory(default) enqueued at 2023-12-25T17:57:47.320902731Z with arguments: {...} 2023-12-25 09:57:47.475 I, [2023-12-25T17:57:47.475073 #7] INFO -- : [ActiveJob] [SomeApiJob] [28d19e17-320e-44a5-9efa-07e090b591ac] Performed SomeApiJob (Job ID: 28d19e17-320e-44a5-9efa-07e090b591ac) from Faktory(default) in 149.57ms 2023-12-25 09:57:47.477 7 TID-1j2s SomeApiJob JID-58f6c445a3b2bda4a637d068 INFO: done: 0.154 sec 2023-12-25 09:57:48.387 7 TID-1j2s WARN: 2023-12-25 09:57:48.388 7 TID-1j2s WARN: NoMethodError: undefined method `[]' for nil:NilClass 2023-12-25 09:57:48.388 7 TID-1j2s WARN: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:375:in `ok' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:191:in `block in ack' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:320:in `transaction' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:189:in `ack' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:6:in `block in acknowledge' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:110:in `block (2 levels) in with' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `handle_interrupt' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `block in with' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `handle_interrupt' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `with' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:6:in `acknowledge' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:148:in `process' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:88:in `process_one' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-25 09:57:48.388 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: Error fetching job: unexpected token at 'OK' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /usr/local/lib/ruby/3.2.0/json/common.rb:216:in `parse' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/client.rb:185:in `fetch' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `block in retrieve_work' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:79:in `block in server' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:110:in `block (2 levels) in with' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `handle_interrupt' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `block in with' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `handle_interrupt' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `with' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory.rb:78:in `server' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/fetch.rb:26:in `retrieve_work' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:100:in `fetch' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:81:in `process_one' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/processor.rb:71:in `run' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:17:in `watchdog' 2023-12-25 09:57:49.388 7 TID-1j14 ERROR: /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.2.0/lib/faktory/util.rb:26:in `block in safe_thread' ```

Will try and get more info if we can and see what we can find... this does seem pretty odd.

One thing I was confused about previously was that it seems like the read_timeout method might lead to data getting misaligned if it times out trying to read an expected number of bytes (and those bytes later show up) but it looks like commands are wrapped in a transaction that closes down the socket and opens a new one if there are TimeoutErrors, so it should be fine? Presumably if a client got a partial message and then timed out, the rest of the message would not get sent when it opens a new socket.

Though I'm not sure what happens on Line 331, if it fails to close the socket does it still try to open a new socket? Not sure what the nil does, it doesn't seem like it's actually returned or assigned to anything.

mperham commented 8 months ago

there appears to be an edge case in the Ruby client which corrupts the state of the connection. No one has been able to get me a reproduction so I haven’t been able to track it down and fix it but it usually happens after some exception is raised.

mperham commented 8 months ago

If I had an infinite amount of time I’d write a stress tester for the protocol in order to stress test all of the different Faktory clients with various scenarios. Things like timeouts, errors, malformed responses, etc.

ibrahima commented 8 months ago

Hmm thanks, that's helpful to know! Is it always the same exception that we experienced? Do you happen to know of any other issue reports like this off-hand? I wonder if there's a way to detect the bad state and "reset" the connection somehow...

I guess the other operational solution would be to just detect the bad state and remove the bad worker from the pool and spawn another one, hmm...

Edit: d'oh, I re-read all the comments on this ticket and realized this is the issue where other people ran into the same problem, and the code I was looking at to disconnect the socket was added in response to this issue.

FWIW, we are seeing this in Ruby Faktory workers and some of our Jobs also push other Jobs, so it's similar to the scenario others had mentioned where consumers and producers are in the same process.

Also... while we haven't been able to figure out how to reproduce it, it does seem to keep reoccurring for us every few days, and it usually seems to happen with a deploy - it seems like a worker will boot up in a bad way and get stuck in that bad state, but if it starts out fine it usually seems to stay fine.

ibrahima commented 7 months ago

One small update: We enabled debug logging for the Faktory protocol and saw this sequence of events when the error came up:

``` 2024-02-02 18:13:20.909> ACK 2024-02-02 18:13:20.910< +OK 2024-02-02 18:13:20.911> FETCH appname-critical appname-default 2024-02-02 18:13:20.931> PUSH 2024-02-02 18:13:20.933< +OK 2024-02-02 18:13:20.933I, [2024-02-03T02:13:20.933759 #6] INFO -- : [ActiveJob] [MyJob] [89ee96a1-f90d-4b80-9e3b-8edee919c828] Enqueued MyJob (Job ID: b2884d4d-2208-4e41-bd26-2430329cc14e) to Faktory(appname-default) at 2024-02-03 02:13:50 UTC with arguments: #>, {:method=>:check!} 2024-02-02 18:13:20.934I, [2024-02-03T02:13:20.934743 #6] INFO -- : [ActiveJob] [MyJob] [89ee96a1-f90d-4b80-9e3b-8edee919c828] Performed MyJob (Job ID: 89ee96a1-f90d-4b80-9e3b-8edee919c828) from Faktory(appname-default) in 383.38ms 2024-02-02 18:13:20.9356 TID-1fiqk MyJob JID-d20553506f8847c63358891b INFO: done: 0.384 sec 2024-02-02 18:13:20.935> ACK 2024-02-02 18:13:20.937< +OK 2024-02-02 18:13:20.937> FETCH appname-critical appname-default 2024-02-02 18:13:20.963> PUSH 2024-02-02 18:13:20.966< +OK 2024-02-02 18:13:20.966I, [2024-02-03T02:13:20.966509 #6] INFO -- : [ActiveJob] [MyJob] [4402f416-1b7b-4e83-a73e-7a69c2a85473] Enqueued MyJob (Job ID: 302d8a11-a111-412a-be03-6c2c5d63f5c6) to Faktory(appname-default) at 2024-02-03 02:13:50 UTC with arguments: #>, {:method=>:check!} 2024-02-02 18:13:20.967I, [2024-02-03T02:13:20.967535 #6] INFO -- : [ActiveJob] [MyJob] [4402f416-1b7b-4e83-a73e-7a69c2a85473] Performed MyJob (Job ID: 4402f416-1b7b-4e83-a73e-7a69c2a85473) from Faktory(appname-default) in 423.87ms 2024-02-02 18:13:20.9686 TID-1fe3c MyJob JID-1cd78d40edd3c0077bf5d83e INFO: done: 0.425 sec 2024-02-02 18:13:20.968> ACK 2024-02-02 18:13:20.969< +OK 2024-02-02 18:13:20.970> FETCH appname-critical appname-default 2024-02-02 18:13:21.095> PUSH 2024-02-02 18:13:21.095< $-1 2024-02-02 18:13:21.096E, [2024-02-03T02:13:21.096527 #6] ERROR -- : [ActiveJob] [MyJob] [15894a4e-2748-41e1-bf99-d3588ba56405] Failed enqueuing MyJob to Faktory(appname-default): NoMethodError (undefined method `[]' for nil:NilClass) 2024-02-02 18:13:21.097E, [2024-02-03T02:13:21.097585 #6] ERROR -- : [ActiveJob] [MyJob] [15894a4e-2748-41e1-bf99-d3588ba56405] Error performing MyJob (Job ID: 15894a4e-2748-41e1-bf99-d3588ba56405) from Faktory(appname-default) in 548.23ms: NoMethodError (undefined method `[]' for nil:NilClass): 2024-02-02 18:13:21.097/rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:378:in `ok' .... ```

This seems to show that a PUSH got a -1 as a response (possibly intended as a response to a FETCH). However, it's kinda hard to know what this was a response to because there's no thread ID or other identifying info in these debug logs. It would be nice to use Faktory::Logging.logger instead of puts here so that we know if the thread getting -1 is the same as the one that issued an PUSH. Gonna try to monkey patch that in...

I'm not sure if the above actually tells us anything new, because we kind of already knew that that was what was happening given that the error was that something was nil where it wasn't expected to be a possible response.

I also noticed in the above sequence that some FETCHes don't seem to have a response immediately after them. I also noticed this in another sequence of logs, where there's a mismatch between the number of FETCHes and responses. (Random tiny sample, but I hope it illustrates the situation.) I don't think this is necessarily a problem, unless the server is responding multiple times for the same FETCH and some of those responses go to a wrong client.

``` 2024-02-02 18:12:25.885 > ACK 2024-02-02 18:12:25.887 < +OK 2024-02-02 18:12:25.887 > FETCH appname-critical appname-default 2024-02-02 18:12:25.941 < $-1 2024-02-02 18:12:26.038 > FETCH appname-critical appname-default 2024-02-02 18:12:26.038 > FETCH appname-critical appname-default 2024-02-02 18:12:26.039 > FETCH appname-critical appname-default 2024-02-02 18:12:26.039 < $-1 2024-02-02 18:12:26.040 < $-1 2024-02-02 18:12:26.040 < $-1 2024-02-02 18:12:26.138 > FETCH appname-critical appname-default 2024-02-02 18:12:26.140 < $-1 2024-02-02 18:12:26.140 < $-1 2024-02-02 18:12:26.140 < $-1 2024-02-02 18:12:26.140 < $-1 2024-02-02 18:12:26.140 < $-1 2024-02-02 18:12:26.211 > FETCH appname-critical appname-default 2024-02-02 18:12:26.213 < $-1 2024-02-02 18:12:26.213 < $-1 2024-02-02 18:12:26.214 < $-1 ```

Will continue to investigate... it seems like one of our apps reproduces the issue more readily than the other, so that might help. Also, the app is in pre-production so there are long quiet periods (e.g. which explains the stretch of -1 responses to fetches).

For reference, here's the monkey patch I used:

class Faktory::Client
  private

  def debug(line)
    Faktory::Logging.logger.info(line)
  end
end

Resulting in slightly more comprehensible logs (in that we can see which response is a response to which request).

``` 2024-02-06 13:19:14.519 7 TID-16us INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.520 7 TID-1aig INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.520 7 TID-16t4 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.619 7 TID-16vc INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.620 7 TID-16vw INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.621 7 TID-1900 INFO: < $-1 2024-02-06 13:19:14.621 2024-02-06 13:19:14.720 7 TID-16wg INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.720 7 TID-1cm0 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.820 7 TID-16to INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.820 7 TID-16x0 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.821 7 TID-1cc0 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.821 7 TID-1cd4 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:14.821 7 TID-1chk INFO: < $-1 2024-02-06 13:19:14.821 2024-02-06 13:19:15.020 7 TID-1c48 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.121 7 TID-1b4o INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.420 7 TID-18r4 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.421 7 TID-1914 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.422 7 TID-18vk INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.621 7 TID-1900 INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:15.822 7 TID-1chk INFO: > FETCH appname-critical appname-default 2024-02-06 13:19:16.230 7 TID-1bt4 INFO: < $-1 2024-02-06 13:19:16.230 2024-02-06 13:19:16.330 7 TID-16sk INFO: < $-1 2024-02-06 13:19:16.330 2024-02-06 13:19:16.331 7 TID-1ags INFO: < $-1 2024-02-06 13:19:16.331 2024-02-06 13:19:16.331 7 TID-1ahc INFO: < $-1 2024-02-06 13:19:16.331 2024-02-06 13:19:16.431 7 TID-19nc INFO: < $-1 2024-02-06 13:19:16.431 2024-02-06 13:19:16.431 7 TID-19m8 INFO: < $-1 2024-02-06 13:19:16.431 2024-02-06 13:19:16.434 7 TID-16rg INFO: < $-1 2024-02-06 13:19:16.434 2024-02-06 13:19:16.531 7 TID-16us INFO: < $-1 2024-02-06 13:19:16.531 2024-02-06 13:19:16.531 7 TID-1aig INFO: < $-1 2024-02-06 13:19:16.531 2024-02-06 13:19:16.535 7 TID-16t4 INFO: < $-1 2024-02-06 13:19:16.535 2024-02-06 13:19:16.631 7 TID-16vc INFO: < $-1 2024-02-06 13:19:16.631 2024-02-06 13:19:16.632 7 TID-16vw INFO: < $-1 2024-02-06 13:19:16.632 2024-02-06 13:19:16.732 7 TID-16wg INFO: < $-1 2024-02-06 13:19:16.732 2024-02-06 13:19:16.732 7 TID-1cm0 INFO: < $-1 2024-02-06 13:19:16.732 2024-02-06 13:19:16.832 7 TID-16x0 INFO: < $-1 2024-02-06 13:19:16.832 2024-02-06 13:19:16.832 7 TID-1cd4 INFO: < $-1 2024-02-06 13:19:16.832 2024-02-06 13:19:16.832 7 TID-16to INFO: < $-1 2024-02-06 13:19:16.832 2024-02-06 13:19:16.834 7 TID-1cc0 INFO: < $-1 2024-02-06 13:19:16.834 2024-02-06 13:19:17.033 7 TID-1c48 INFO: < $-1 2024-02-06 13:19:17.033 2024-02-06 13:19:17.133 7 TID-1b4o INFO: < $-1 ```
ibrahima commented 7 months ago

^ Hah, was just about to send in a patch for that, thanks!

ibrahima commented 7 months ago

I think I found something relevant. So far, in some of the cases where we see this error state of commands getting responses meant for other commands, I have noticed an SSL error that happens before it starts getting these other errors:

2024-02-02 12:59:25.866 E, [2024-02-02T20:59:25.866463 #6] ERROR -- : [ActiveJob] [MyJob] [e707f64b-9fd4-4bf2-af51-220ce6b6179c] Failed enqueuing MyJob to Faktory(appname-default): OpenSSL::SSL::SSLError (SSL_read: shutdown while in init)
2024-02-02 12:59:25.867 E, [2024-02-02T20:59:25.867476 #6] ERROR -- : [ActiveJob] [MyJob] [e707f64b-9fd4-4bf2-af51-220ce6b6179c] Error performing MyJob (Job ID: e707f64b-9fd4-4bf2-af51-220ce6b6179c) from Faktory(appname-default) in 327.3ms: OpenSSL::SSL::SSLError (SSL_read: shutdown while in init):
2024-02-02 12:59:25.867 /usr/local/lib/ruby/3.2.0/openssl/buffering.rb:214:in `sysread_nonblock'
2024-02-02 12:59:25.867 /usr/local/lib/ruby/3.2.0/openssl/buffering.rb:214:in `read_nonblock'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/io.rb:39:in `block in read_timeout'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/io.rb:38:in `loop'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/io.rb:38:in `read_timeout'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/io.rb:23:in `gets'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:345:in `result'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:376:in `ok'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:176:in `block in push'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:323:in `transaction'
2024-02-02 12:59:25.867 /rails/vendor/ruby/3.2.0/gems/faktory_worker_ruby-1.3.0/lib/faktory/client.rb:174:in `push'
...

It seems that when calling read_nonblock, we try to disable exceptions, but there are still some exceptions that can come up, at least when OpenSSL is involved. If an exception is raised here, it seems like whatever partial reads happen get left on the buffer.

I wonder if it would help to add this SSLError to the transaction function, so that if this error comes up it can close the socket and start a new one? That doesn't explain why it's happening (maybe a dodgy network issue on our end?) but it might help recover so that we don't get stuck in a bad state?

Submitted https://github.com/contribsys/faktory_worker_ruby/pull/83 in case that is a sensible thing to do. At the very least, it feels like this is an error that needs to be handled somehow for Faktory to recover successfully.

Thanks for your consideration!

ibrahima commented 7 months ago

Hmm... the above might still be a little off the mark. It seems like something must be calling the OpenSSL shutdown function for this to come up, and it seems like one of the times that FWR tries to close the socket is in this error handling case... so maybe transaction is the wrong place for this error handling. Though in the past we did see this error near a deploy, so maybe it could occur when workers are cycling in and out, but that doesn't really make sense because if a worker was shutting down we wouldn't continue getting errors from that worker.

Also... I'm looking at https://github.com/contribsys/faktory_worker_ruby/blob/e163c28cc53d5b0abb8d7d8006c0b26c6c0215e2/lib/faktory/processor.rb#L77 and it seems like if I understand correctly, if Processor raises an unhandled exception it should throw away the whole thread and make a new one? Though I am not seeing where, if this happens, the Processor is shut down, just that it gets removed from the Set of @threads in manager.rb. I wonder if it's possible for a bad thread to keep fetching jobs if it gets into this state? I think though that if the transaction method handles an exception it should close the Connection and associated bad socket, so maybe the change I submitted will help with that.

Unfortunately most of the logs from the earlier incidents we had have passed out of our log retention policy, so it's hard for me to determine if it continued spawning threads, but I do remember that in the admin UI we would see one processor with way more jobs than the configured concurrency, so that might be related... We did have one such incident on 2/6 and I am seeing a bunch of different TIDs in the logs with messages like TID-1gbh0 ERROR: Error fetching job: unexpected token at 'OK' so it doesn't look like it's confined to one thread, which makes sense because different threads could get the bad connection from the connection pool.

We'll be keeping an eye on this and will report back based on what we learn... unfortunately with these Heisenbugs it's hard to know if the issue is fixed or not when we don't know how to reproduce it reliably 😕.

mperham commented 7 months ago

main has a fix to gracefully handle and reopen sockets which raise OpenSSL errors. See #83. If someone wants to try it out and see if it improves your experience, I can package it up and release v2.0.1.

ibrahima commented 7 months ago

That would be awesome, thanks!

I'm running it from git in one of our staging environments so we'll see if anything comes up. We're just waiting for now... But the release would be handy for another app that sees more real traffic and might be more likely to reproduce the issue.