Store thread local data in thread local storage to avoid race condition with unicorn

mopp commented 3 years ago

What I did

I made Client store the socket and socket_started in thread local storage to avoid a race condition. This fix changes the behavior. the same socket was used after forked. But the client recreates the socket periodically. I believe that it will not be a serious problem.

Environment

Ruby: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux] Ruby on Rails: 6.0.3.4 unicorn: 5.7.0 Docker image: ruby:2.7.2 prometheus_exporter: 0.5.3

Background

I create a new Rails application the last week. I often see the error message below. Once it occurs, it was repeating until I stop the container.

Prometheus Exporter, failed to send message undefined method `+' for nil:NilClass

worker_loop method outputs the error https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L156-L161 And the root is here https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L194 @socket_started has nil. It's weird. The worker_loop method caches all kinds of exceptions. As a result, it cannot recover automatically.

I found the root cause of the thing after debugging. I use Rails with unicorn. it forks the Ruby process. I configured the prometheus exporter following the README.md to monitor unicorn metrics.

# config/unicorn.rb
PrometheusExporter::Instrumentation::Process.start(type: 'master')

after_fork do |_server, _worker|
  PrometheusExporter::Instrumentation::Process.start(type: 'web')
end

First, the master process starts the PrometheusExporter::Client. Some metrics will be sent using the client (Actually, I'm not sure. But the client was called soon after instantiated.) Then, the worker process is spawned.

https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L202-L209 These lines rarely cause the race condition. The client uses thread internally to send the metrics to prometheus server. For instance, please consider context switching (from the client thread to the master process main thread) happens at line 203, and the master process spawns a worker process. Then, the thread in the spaned worker process will be dead. I don't know the reason. You can confirm the snippet below.

thread = Thread.new do 
  while true
    1 + 1
  end
end
puts("[Parent] before fork Thread: #{thread.inspect}")
child_pid = fork do
  puts("[Child] Thread: #{thread.inspect}")
  sleep(1)
end
puts("[Parent] after fork Thread: #{thread.inspect}")
Process.waitpid(child_pid)
puts("[Parent] child process got shutdown.")
puts("[Parent] Thread: #{thread.inspect}")

[Parent] before fork Thread: #<Thread:0x00005622d24053a8@/tmp/nvim0ypa1l/4:1 run>
[Child] Thread: #<Thread:0x00005622d24053a8@/tmp/nvim0ypa1l/4:1 dead>
[Parent] after fork Thread: #<Thread:0x00005622d24053a8@/tmp/nvim0ypa1l/4:1 run>
[Parent] child process got shutdown.
[Parent] Thread: #<Thread:0x00005622d24053a8@/tmp/nvim0ypa1l/4:1 run>

However, the instance variable @socket and @socket_started is stored in the process memory space. The worker process starts a new thread but nobody cleans up them after forked. I moved them into thread local storage to isolate between process and threads.

How did I debug it

I put this snippet at config/unicorn.rb to dump the state.

require 'socket'
require 'thread'

class PrometheusExporter::Client
  attr_reader :name, :type, :help, :socket, :socket_started, :worker_thread
  def self.new(*args)
    tmp = super *args
    puts("#{$$} #{Thread.current} [new] begin @socket = #{tmp.socket.inspect}, @socket_started = #{tmp.socket_started.inspect}, @worker_thread = #{tmp.worker_thread.inspect}")
    tmp
  end

  def self.default
    puts("#{$$} #{Thread.current} [default] begin #{@default.inspect}")
    @default ||= new
    puts("#{$$} #{Thread.current} [default] begin @socket = #{@default.socket.inspect}, @socket_started = #{@default.socket_started.inspect}, @worker_thread = #{@default.worker_thread.inspect}")
    @default
  end

  def worker_loop
    puts("#{$$} #{Thread.current} [worker_loop] begin metrics: #{@type}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    close_socket_if_old!
    process_queue
  rescue => e
    puts("#{$$} #{Thread.current} [worker_loop] rescue metrics: #{@metrics}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    STDERR.puts "Prometheus Exporter, failed to send message #{e}"
    # @socket = nil
  end

  def close_socket_if_old!
    puts("#{$$} #{Thread.current} [close_socket_if_old!] metrics: #{@metrics}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    if @socket && ((@socket_started + MAX_SOCKET_AGE) < Time.now.to_f)
      close_socket!
    end
  end

  def ensure_socket!
    puts("#{$$} #{Thread.current} [ensure_socket!] begin #{@metrics}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    close_socket_if_old!
    if !@socket
      @socket = TCPSocket.new @host, @port
      puts("#{$$} #{Thread.current} [ensure_socket!] CONNECT! @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
      @socket.write("POST /send-metrics HTTP/1.1\r\n")
      @socket.write("Transfer-Encoding: chunked\r\n")
      @socket.write("Host: #{@host}\r\n")
      @socket.write("Connection: Close\r\n")
      @socket.write("Content-Type: application/octet-stream\r\n")
      @socket.write("\r\n")
      @socket_started = Time.now.to_f
      puts("#{$$} #{Thread.current} [ensure_socket!] START! @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    end
    puts("#{$$} #{Thread.current} [ensure_socket!] end #{@metrics}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    nil
  rescue
    @socket = nil
    @socket_started = nil
    puts("#{$$} #{Thread.current} [ensure_socket!] ***raised*** #{@metrics}, @socket = #{@socket.inspect}, @socket_started = #{@socket_started.inspect}")
    raise
  end
end

Here is the result that shows the

root@14b0b65bda94:/app# bundle exec unicorn --config-file ./config/unicorn.rb
3792 #<Thread:0x0000563d57462d88 run> [new] begin @socket = nil, @socket_started = nil, @worker_thread = nil
3792 #<Thread:0x0000563d57462d88 run> master process
3792 #<Thread:0x0000563d57462d88 run> [default] begin nil
3792 #<Thread:0x0000563d57462d88 run> [new] begin @socket = nil, @socket_started = nil, @worker_thread = nil
3792 #<Thread:0x0000563d57462d88 run> [default] begin @socket = nil, @socket_started = nil, @worker_thread = nil
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = nil, @socket_started = nil
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = nil, @socket_started = nil
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [ensure_socket!] begin [], @socket = nil, @socket_started = nil
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = nil, @socket_started = nil
I, [2020-11-02T09:21:00.152255 #3792]  INFO -- : listening on addr=0.0.0.0:8080 fd=13
I, [2020-11-02T09:21:00.152509 #3792]  INFO -- : worker=0 spawning...
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [ensure_socket!] CONNECT! @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
I, [2020-11-02T09:21:00.153436 #3792]  INFO -- : master process ready
3799 #<Thread:0x0000563d57462d88 run> worker process
I, [2020-11-02T09:21:00.153755 #3799]  INFO -- : Refreshing Gem list
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [ensure_socket!] START! @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [ensure_socket!] end [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3799 #<Thread:0x0000563d57462d88 run> [default] begin #<PrometheusExporter::Client:0x0000563d583911d8 @metrics=[], @queue=#<Thread::Queue:0x0000563d58391188>, @socket=#<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started=nil, @max_queue_size=10000, @host="localhost", @port=9394, @worker_thread=#<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 dead>, @mutex=#<Thread::Mutex:0x0000563d58391138>, @thread_sleep=0.5, @json_serializer=JSON, @custom_labels=nil>
3799 #<Thread:0x0000563d57462d88 run> [default] begin @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil, @worker_thread = #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 dead>
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] rescue metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
Prometheus Exporter, failed to send message undefined method `+' for nil:NilClass
3799 #<Thread:0x0000563d57462d88 run> [default] begin #<PrometheusExporter::Client:0x0000563d583911d8 @metrics=[], @queue=#<Thread::Queue:0x0000563d58391188>, @socket=#<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started=nil, @max_queue_size=10000, @host="localhost", @port=9394, @worker_thread=#<Thread:0x0000563d57451268 ./config/unicorn.rb:131 sleep>, @mutex=#<Thread::Mutex:0x0000563d58391138>, @thread_sleep=0.5, @json_serializer=JSON, @custom_labels=nil>
3799 #<Thread:0x0000563d57462d88 run> [default] begin @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil, @worker_thread = #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 sleep>
I, [2020-11-02T09:21:00.464552 #3799]  INFO -- : worker=0 ready
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] rescue metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
Prometheus Exporter, failed to send message undefined method `+' for nil:NilClass
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] rescue metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
Prometheus Exporter, failed to send message undefined method `+' for nil:NilClass
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] begin metrics: , @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [close_socket_if_old!] metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
3799 #<Thread:0x0000563d57451268 ./config/unicorn.rb:131 run> [worker_loop] rescue metrics: [], @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = nil
^CI, [2020-11-02T09:21:03.369889 #3792]  INFO -- : reaped #<Process::Status: pid 3799 exit 0> worker=0
I, [2020-11-02T09:21:03.369984 #3792]  INFO -- : master complete

You can see the @socket_started is only updated in the master process.

3792 #<Thread:0x0000563d5799f080 ./config/unicorn.rb:131 run> [ensure_socket!] START! @socket = #<TCPSocket:fd 10, AF_INET, 127.0.0.1, 50122>, @socket_started = 1604308860.1540303

Thanks

SamSaffron commented 3 years ago

I sorted this out in 718ebd2, thanks heaps for the debugging here @mopp

Let me know if the fix works

mopp commented 3 years ago

Thank you for your response.

I will test your change in my local :+1:

mopp commented 3 years ago

@SamSaffron I confirmed that your change has not resolved this issue. The way to debug is the same as the RP comment I described. I simply inserted the puts

https://github.com/discourse/prometheus_exporter/blob/718ebd249ea586d333cbee9087971ed73d8aa033/lib/prometheus_exporter/client.rb#L210-L230

Please consider context switching occurs between lines 222 and 227. Here the context switching means that switching the execution from the client thread to unicorn master process main thread.

Let's see a more concrete example to explain my understanding of this issue. Please assume the context switching occurs at the line 227. The state of the default client in the unicorn master process might be

@socket = #<TCPSocket:fd 13, AF_INET, 127.0.0.1, 35478>
@socket_started = nil
@socket_pid = nil

Example steps when this issue occurs

[master process] creates default client
[client thread in master process] initializes @socket but @socket_started and @socket_pid has nil still.
[master process] fork itself to create unicorn worker process
Here the race condition occurs!
[worker process] create an internal thread in the client because the thread which was created by the master has gone after the fork.
[client thread in worker process] try to start worker_loop and crashes!
- worker_loop -> close_socket_if_old! -> if @socket && ((@socket_started + MAX_SOCKET_AGE) < Time.now.to_f)
- @socket is not nil but @socket_started is nil here.

I recommend that to store the data, which the client thread required, into thread local storage in order to avoid this kind of bugs.

SamSaffron commented 3 years ago

Thanks! I just pushed a minor change to catch the fork happening in the middle of ensure_socket.

mopp commented 3 years ago

@SamSaffron The issue has occurred. I confirmed again.

Why you wouldn't merge my changes? :thinking: Please tell me if you have a concern, there are any problems with my changes, or my description was not enough. I will try to fix it.

SamSaffron commented 3 years ago

I am extremely uncomfortable using thread local storage to solve this problem

I am super happy refining this fix as needed

On Thu, 5 Nov 2020 at 6:35 pm, mopp notifications@github.com wrote:

@SamSaffron https://github.com/SamSaffron The issue has occurred. I confirmed again.

Why you wouldn't merge my changes? 🤔 Please tell me if you have a concern, there are any problems with my changes, or my description was not enough. I will try to fix it.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/discourse/prometheus_exporter/pull/136#issuecomment-722199487, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABIXNQO5PJ26FSI5MVDGTSOJISNANCNFSM4THLTWEQ .

mopp commented 3 years ago

@SamSaffron Okay, you are the owner of this repository. I respect your intention. I created a super dirty test case to just reproduce this issue artificially. My change passed it. So, please use it to fix this issue. :wave:

diff --git a/lib/prometheus_exporter/client.rb b/lib/prometheus_exporter/client.rb
index 68cbab5..c7d06de 100644
--- a/lib/prometheus_exporter/client.rb
+++ b/lib/prometheus_exporter/client.rb
@@ -166,9 +166,12 @@ module PrometheusExporter

     def worker_loop
       close_socket_if_old!
-      process_queue
-    rescue => e
-      STDERR.puts "Prometheus Exporter, failed to send message #{e}"
+
+      begin
+        process_queue
+      rescue => e
+        STDERR.puts "Prometheus Exporter, failed to send message #{e}"
+      end
     end

     def ensure_worker_thread!
diff --git a/test/client_test.rb b/test/client_test.rb
index 9202ca6..efef3f8 100644
--- a/test/client_test.rb
+++ b/test/client_test.rb
@@ -51,4 +51,35 @@ class PrometheusExporterTest < Minitest::Test
     summary_metric = client.register(:summary, 'summary_metric', 'helping', expected_quantiles)
     assert_equal(expected_quantiles, summary_metric.standard_values('value', 'key')[:opts])
   end
+
+  def test_internal_state_broken
+    # start prometheus server for the client
+    server = PrometheusExporter::Server::Runner.new(bind: '0.0.0.0', port: 9394).start
+
+    # register a metrics for testing
+    client = PrometheusExporter::Client.new
+    counter_metric = client.register(:counter, 'counter_metric', 'helping')
+
+    # the internal worker thread will establish the connection to the server
+    counter_metric.increment('counter_metric')
+
+    # wait for thread working
+    sleep(1)
+
+    # Reproduce the state of the instance after process fork (race condition).
+    client.instance_variable_set('@socket_started', nil)
+    Thread.kill(client.instance_variable_get('@worker_thread').kill)
+
+    # wait for thread working
+    sleep(1)
+
+    # the internal worker thread will be created and try to send the metrics
+    counter_metric.increment('counter_metric')
+    sleep(1)
+
+    # the internal worker thread should not crash.
+    assert_equal(client.instance_variable_get('@worker_thread').alive?, true)
+
+    server.kill
+  end
 end

https://github.com/mopp/prometheus_exporter/commit/97f17dd5ea17d6b8bab1c51184ccff4015e80858

SamSaffron commented 3 years ago

Thank you! Will try it tomorrow

On Thu, 5 Nov 2020 at 9:34 pm, mopp notifications@github.com wrote:

@SamSaffron https://github.com/SamSaffron Okay, you are the owner of this repository. I respect your intention. I created a super dirty test case to just reproduce this issue artificially. My change passed it. So, please use it to fix this issue. 👋 mopp@97f17dd https://github.com/mopp/prometheus_exporter/commit/97f17dd5ea17d6b8bab1c51184ccff4015e80858

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/discourse/prometheus_exporter/pull/136#issuecomment-722291637, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABIXLCEXIDL3MXES44PPTSOJ5UPANCNFSM4THLTWEQ .

SamSaffron commented 3 years ago

@mopp can you try the latest commit?

mopp commented 3 years ago

@SamSaffron I confirmed that the latest commit

passed the test I wrote
didn't cause the error with my Rails application.

Thanks for your support.

SamSaffron commented 3 years ago

no worries, thanks for the excellent repro here and for confirming the fix!

On Tue, Nov 10, 2020 at 10:42 AM mopp notifications@github.com wrote:

@SamSaffron https://github.com/SamSaffron I confirmed that the latest commit

passed the test I wrote

didn't cause the error with my Rails application.

Thanks for your support.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/discourse/prometheus_exporter/pull/136#issuecomment-724350124, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABIXLSWJZI4VHESP6EBV3SPB47ZANCNFSM4THLTWEQ .

discourse / prometheus_exporter