I sorted this out in 718ebd2, thanks heaps for the debugging here @mopp
Let me know if the fix works
Thank you for your response.
I will test your change in my local environment :+1:
@SamSaffron I confirmed that your change has not resolved this issue.
The way to debug is the same as I described in the PR comment; I simply inserted `puts` statements.
Please consider the case where a context switch occurs between lines 222 and 227. Here, a context switch means switching execution from the client thread to the unicorn master process's main thread.

Let's look at a more concrete example to explain my understanding of this issue. Assume the context switch occurs at line 227. The state of the default client in the unicorn master process might be:
@socket = #<TCPSocket:fd 13, AF_INET, 127.0.0.1, 35478>
@socket_started = nil
@socket_pid = nil
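To make the race more concrete, here is a simplified sketch of the shape of the socket setup (my paraphrase of the idea, not the gem's exact code at those line numbers): the socket is assigned first, and the bookkeeping fields only afterwards, so a fork that lands in between copies an object with `@socket` set while `@socket_started` and `@socket_pid` are still `nil`.

```ruby
require 'socket'

# Simplified shape of the socket setup (an illustration, not the gem's exact
# code at lines 222-227): the socket is assigned before the bookkeeping fields.
class SketchClient
  def initialize(host, port)
    @host = host
    @port = port
  end

  def ensure_socket!
    @socket = TCPSocket.new(@host, @port) # step 1: @socket is now set
    # <-- if the master process forks while the client thread sits here,
    #     the child inherits @socket but @socket_started / @socket_pid are nil
    @socket_started = Time.now.to_f       # step 2
    @socket_pid     = Process.pid         # step 3
  end
end
```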
Example steps when this issue occurs:

- `@socket` is set, but `@socket_started` and `@socket_pid` still have `nil`.
- `worker_loop` runs and crashes!

`worker_loop` -> `close_socket_if_old!` -> `if @socket && ((@socket_started + MAX_SOCKET_AGE) < Time.now.to_f)`

`@socket` is not nil, but `@socket_started` is nil here.

I recommend storing the data that the client thread requires in thread local storage in order to avoid this kind of bug.
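For what it's worth, the general shape of that recommendation looks roughly like this (a sketch of the idea, with made-up key names, not the actual patch in this PR): connection state lives in `Thread.current`, so a worker thread created after a fork starts with no socket and no timestamp at all instead of inheriting half-initialized instance variables.

```ruby
require 'socket'

# Rough sketch of the thread-local-storage idea (illustrative key names).
class ThreadLocalSketchClient
  def initialize(host, port)
    @host = host
    @port = port
  end

  # Each thread sees only its own socket; a thread spawned after fork
  # finds nothing here and opens a fresh connection.
  def socket
    Thread.current[:pe_sketch_socket] ||= begin
      Thread.current[:pe_sketch_socket_started] = Time.now.to_f
      TCPSocket.new(@host, @port)
    end
  end

  def socket_started
    Thread.current[:pe_sketch_socket_started]
  end
end
```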
Thanks! I just pushed a minor change to catch the fork happening in the middle of ensure_socket.
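A common pattern for catching that, sketched below as an illustration (not necessarily the exact change that was pushed), is to record the pid that opened the socket and discard the socket whenever the current pid no longer matches:

```ruby
require 'socket'

# Illustrative pid-based fork detection: if the recorded pid differs from
# the current one, the socket was inherited across a fork and is thrown away.
def close_socket_if_forked!
  if @socket && @socket_pid != Process.pid
    @socket = nil
    @socket_started = nil
    @socket_pid = nil
  end
end

def ensure_socket!
  close_socket_if_forked!
  if @socket.nil?
    @socket = TCPSocket.new(@host, @port)
    @socket_started = Time.now.to_f
    @socket_pid = Process.pid
  end
end
```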
@SamSaffron The issue has occurred again; I confirmed it.

Why won't you merge my changes? :thinking: Please tell me if you have a concern, if there is any problem with my changes, or if my description was not enough. I will try to fix it.
I am extremely uncomfortable using thread local storage to solve this problem.

I am super happy to refine this fix as needed.
@SamSaffron Okay, you are the owner of this repository, and I respect your intention. I created a super dirty test case just to reproduce this issue artificially, and my change passed it. So please use it to fix this issue. :wave:
```diff
diff --git a/lib/prometheus_exporter/client.rb b/lib/prometheus_exporter/client.rb
index 68cbab5..c7d06de 100644
--- a/lib/prometheus_exporter/client.rb
+++ b/lib/prometheus_exporter/client.rb
@@ -166,9 +166,12 @@ module PrometheusExporter
     def worker_loop
       close_socket_if_old!
-      process_queue
-    rescue => e
-      STDERR.puts "Prometheus Exporter, failed to send message #{e}"
+
+      begin
+        process_queue
+      rescue => e
+        STDERR.puts "Prometheus Exporter, failed to send message #{e}"
+      end
     end
 
     def ensure_worker_thread!
diff --git a/test/client_test.rb b/test/client_test.rb
index 9202ca6..efef3f8 100644
--- a/test/client_test.rb
+++ b/test/client_test.rb
@@ -51,4 +51,35 @@ class PrometheusExporterTest < Minitest::Test
     summary_metric = client.register(:summary, 'summary_metric', 'helping', expected_quantiles)
     assert_equal(expected_quantiles, summary_metric.standard_values('value', 'key')[:opts])
   end
+
+  def test_internal_state_broken
+    # start prometheus server for the client
+    server = PrometheusExporter::Server::Runner.new(bind: '0.0.0.0', port: 9394).start
+
+    # register a metrics for testing
+    client = PrometheusExporter::Client.new
+    counter_metric = client.register(:counter, 'counter_metric', 'helping')
+
+    # the internal worker thread will establish the connection to the server
+    counter_metric.increment('counter_metric')
+
+    # wait for thread working
+    sleep(1)
+
+    # Reproduce the state of the instance after process fork (race condition).
+    client.instance_variable_set('@socket_started', nil)
+    Thread.kill(client.instance_variable_get('@worker_thread').kill)
+
+    # wait for thread working
+    sleep(1)
+
+    # the internal worker thread will be created and try to send the metrics
+    counter_metric.increment('counter_metric')
+    sleep(1)
+
+    # the internal worker thread should not crash.
+    assert_equal(client.instance_variable_get('@worker_thread').alive?, true)
+
+    server.kill
+  end
 end
```
https://github.com/mopp/prometheus_exporter/commit/97f17dd5ea17d6b8bab1c51184ccff4015e80858
Thank you! Will try it tomorrow
@mopp can you try the latest commit?
@SamSaffron I confirmed that the latest commit:

- passed the test I wrote
- didn't cause the error with my Rails application.

Thanks for your support.
no worries, thanks for the excellent repro here and for confirming the fix!
What I did
I made `Client` store the `socket` and `socket_started` in thread local storage to avoid a race condition. This fix changes the behavior: previously the same socket kept being used after a fork, but since the client recreates the socket periodically, I believe it will not be a serious problem.

Environment
- Ruby: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
- Ruby on Rails: 6.0.3.4
- unicorn: 5.7.0
- Docker image: ruby:2.7.2
- prometheus_exporter: 0.5.3
Background
I created a new Rails application last week, and I often see the error message below. Once it occurs, it keeps repeating until I stop the container.
The `worker_loop` method outputs the error:

https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L156-L161

And the root is here:

https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L194

`@socket_started` has `nil`. It's weird. The `worker_loop` method catches all kinds of exceptions, so it cannot recover automatically.

After debugging, I found the root cause. I use Rails with unicorn, which forks the Ruby process. I configured the prometheus exporter following the README.md to monitor unicorn metrics.
First, the master process starts the `PrometheusExporter::Client`. Some metrics will be sent using the client (actually, I'm not sure, but the client was called soon after it was instantiated). Then, the worker process is spawned.

https://github.com/discourse/prometheus_exporter/blob/cf5997fa72b65a27dfe7a71db79eb072d1360ed7/lib/prometheus_exporter/client.rb#L202-L209

These lines rarely cause the race condition. The client internally uses a thread to send the metrics to the prometheus server. For instance, consider the case where a context switch (from the client thread to the master process's main thread) happens at line 203, and the master process spawns a worker process. Then, the thread in the spawned worker process will be dead; I don't know the reason, but you can confirm it with the snippet below.
However, the instance variables `@socket` and `@socket_started` are stored in the process memory space. The worker process starts a new thread, but nobody cleans them up after the fork. I moved them into thread local storage to isolate them between processes and threads.

How did I debug it
I put this snippet in `config/unicorn.rb` to dump the state.
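For illustration, a state-dumping hook along these lines (assumed names and callbacks, not the original snippet) can go in `config/unicorn.rb`; unicorn's `before_fork`/`after_fork` callbacks print the default client's internals together with the current pid, which is enough to see which process updates `@socket_started`:

```ruby
# config/unicorn.rb (illustrative sketch): dump the default client's state
# around fork, assuming the app has already set up PrometheusExporter::Client.
dump_client_state = lambda do |label|
  client = PrometheusExporter::Client.default
  puts "[#{label} pid=#{Process.pid}] " \
       "socket=#{client.instance_variable_get(:@socket).inspect} " \
       "socket_started=#{client.instance_variable_get(:@socket_started).inspect}"
end

before_fork do |_server, _worker|
  dump_client_state.call('master before_fork')
end

after_fork do |_server, _worker|
  dump_client_state.call('worker after_fork')
end
```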
The result shows that `@socket_started` is only updated in the master process.

Thanks