Hello @kramuenke,
Could you share a snippet of how you are using the library? Are those long-running apps, or applications that only run for a short amount of time? You may need to add an explicit call to the flush method. Please see the Ruby example in this section: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/#enable-buffering-on-your-client
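For a short-lived script, the pattern the linked documentation describes looks roughly like this (a minimal sketch; the host, port, and metric name are placeholders):

```ruby
require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125)

statsd.increment('example.short_script.runs')

# Force any buffered metrics onto the wire before the process exits.
statsd.flush(sync: true)
statsd.close
```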
We are also seeing this issue. We will try to package up some reproduction steps today or next week.
Hi, I'm also an engineer at Stitch Fix, and I helped @stitchfixsam investigate the issue we were seeing.
Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were). Our applications also use a single, global instance of Datadog::Statsd.
When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all.
Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).
In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.
- With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
- With dogstatsd-ruby 5.0.1, it never prints the metric.
- With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.
Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.
For now, at Stitch Fix, we need to remain on dogstatsd-ruby 4.8.3 until we have a resolution to this issue.
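The linked Gist is not reproduced here, but a rough sketch of that kind of reproduction script (the layout and names are assumptions, not the Gist itself) would be:

```ruby
require 'socket'
require 'datadog/statsd'

# "Agent": a UDP server that prints every datagram it receives on port 8125.
server = UDPSocket.new
server.bind('127.0.0.1', 8125)
listener = Thread.new do
  loop do
    data, _addr = server.recvfrom(4096)
    puts "received: #{data}"
  end
end

# Emit one metric from a second, short-lived thread.
emitter = Thread.new do
  statsd = Datadog::Statsd.new('127.0.0.1', 8125)
  statsd.increment('example.metric')
  # Without an explicit statsd.flush(sync: true) here, v5.0.1 can buffer the
  # metric and never send it before this thread terminates.
end
emitter.join

listener.join # run continuously, printing whatever the UDP server receives
```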
> Hello @kramuenke,
> Could you share a snippet of how you are using the library? Are those long-running apps, or applications that only run for a short amount of time? You may need to add an explicit call to the flush method. Please see the Ruby example in this section: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/#enable-buffering-on-your-client
One of the applications is a Rails 4 ( 😢 ) app running on Unicorn. We create one instance in a Rails initializer:
DATADOG_STATSD_CLIENT = Datadog::Statsd.new(ENV['DATADOG_STATSD_HOST'], ENV['DATADOG_STATSD_PORT'])
Then we use that instance everywhere:
DATADOG_STATSD_CLIENT.increment('some.metric', tags: ["environment:#{Rails.env}"])
The other app is a Rack app running on Puma where we also create one global instance.
Hello @emilyst!
> Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were). Our applications also use a single, global instance of Datadog::Statsd.
>
> When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all.
>
> Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).
Thanks for those investigations. This is exactly why we encourage flushing and closing the client at the end of short scripts. #180 will help with closing such clients, and #181 will add some syntactic sugar so that clients don't have to be closed explicitly.
> In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.
>
> - With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
> - With dogstatsd-ruby 5.0.1, it never prints the metric.
> - With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.
>
> Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.
This actually depends on what kind of flush you are doing: is it synchronous or not? I'm adding some documentation about how you can synchronize with the companion thread in #183.
If you want metrics to be sent explicitly after each metric push, you can set the buffer_max_pool_size parameter to 1.
If you think the companion thread is failing to write and something odd is happening, you can pass a logger to the statsd instance when building it to get more detail about what's happening in the background.
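Putting both suggestions together, a minimal sketch could look like this (host, port, and the metric name are placeholders):

```ruby
require 'logger'
require 'datadog/statsd'

statsd = Datadog::Statsd.new(
  'localhost', 8125,
  buffer_max_pool_size: 1,     # the buffer holds at most one metric, so every push is sent right away
  logger: Logger.new($stderr)  # surfaces what the companion thread is doing in the background
)

statsd.increment('example.metric') # sent immediately instead of waiting for a full buffer
```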
> Hello @emilyst!
>
> > Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were). Our applications also use a single, global instance of Datadog::Statsd. When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all. Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).
>
> Thanks for those investigations. This is exactly why we encourage flushing and closing the client at the end of short scripts. #180 will help with closing such clients, and #181 will add some syntactic sugar so that clients don't have to be closed explicitly.
Just to clarify, the small script I linked to was just my attempt to reproduce the issue with as simple a test case as possible. We're really asking about the behavior of the agent in our applications, which are not short scripts.
In the case of our applications, #flush does not work for reasons I previously explained, and we cannot #close the global instance as it's being reused over the lifetime of the application.
> > In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.
> >
> > - With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
> > - With dogstatsd-ruby 5.0.1, it never prints the metric.
> > - With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.
> >
> > Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.
>
> This actually depends on what kind of flush you are doing: is it synchronous or not? I'm adding some documentation about how you can synchronize with the companion thread in #183.
As you can see from the script, we're doing a synchronous flush.
> If you want metrics to be sent explicitly after each metric push, you can set the buffer_max_pool_size parameter to 1.
Is there a difference between doing this and simply calling #flush ourselves?
We are seeing this exact same issue; our setup also uses a Datadog::Statsd instance assigned to a globally accessible constant.
Standard metrics were fine, but all custom metrics failed to flush. We took the same steps described by @emilyst to verify. For now we're going to stay on a version < 5.x.
Hello @emilyst,
> Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.
From my understanding, you may actually want to do an asynchronous flush here (you do not want to block your app while the client is flushing the metrics)? However, it may be a good idea to do a final synchronous flush at the end of your process's execution, to be sure the last metrics are flushed before the process terminates.
The sync: true is here to ensure that once the call to flush returns, the data has properly been transmitted through the network. The call may then block for milliseconds depending on the latency of the network.
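A minimal sketch of that pattern (names and placement are illustrative, not from this thread): non-blocking flushes while the app is running, plus one final synchronous flush before the process exits.

```ruby
require 'datadog/statsd'

STATSD = Datadog::Statsd.new('localhost', 8125)

# Somewhere in request/job handling code:
STATSD.increment('example.requests')
STATSD.flush(sync: false) # hand the buffer to the companion thread; do not block

# Registered once, so the very last metrics are sent before the process exits:
at_exit do
  STATSD.flush(sync: true) # blocks until the data is on the wire
  STATSD.close
end
```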
I've tried to look for race conditions that would block the whole application (I wrote a small application running 5 threads: 1 main thread flushing at a given frequency, 3 flushing synchronously, and 1 flushing asynchronously) and I have not hit any race or blocking situation so far. Of course, that doesn't mean there are none; that's the hard part with race conditions...
> Is there a difference between doing this and simply calling #flush ourselves?
It means that every time you submit a metric, the next submission will trigger a flush, since the maximum number of metrics the internal buffer can contain is 1.
Hi @muzfuz, thanks for your feedback as well.
Were you able to reproduce a complete blocking of your application while calling the flush method with 5.0.1?
I just want to be sure that there is no misunderstanding about the flush method and its sync parameter. I will improve the documentation as a follow-up to this thread.
I'm having the same issue. After upgrading to 5.0.1, all metrics stopped reporting in the dashboard; reverting to 4.8.3 solved the issue. I'm not using #batch anywhere, which seems to be the only breaking change in the documentation. I'm using dogstatsd in a Docker image.
The only difference I see is the return value of the increment method:
$datadog = Datadog::Statsd.new(namespace: "test", socket_path: ENV['DD_DOGSTATSD_SOCKET'])
5.0.1:
$datadog.increment('my.metric')
=> ["datadog.dogstatsd.client.metrics:1|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.events:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.service_checks:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.bytes_sent:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.bytes_dropped:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.packets_sent:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.packets_dropped:0|c|#client:ruby,client_version:5.0.1,client_transport:uds"]
and the metric is not reported.
4.8.3:
$datadog.increment('my.metric')
=> 1
and the metric is reported correctly.
Do you need any other info to debug the issue?
@remeh I've not extensively tested using #flush with sync: true in a high-traffic environment, so I can't comment on whether or not it showed the same issues mentioned by @emilyst.
However, I can say that explicitly using #flush after #increment did submit the metrics to the server.
Much like @alor, we were not using #increment in a batch though, so I'm confused as to why this would now be required. Has there been a change to the public API, and has using an explicit #flush now become a requirement as of v5? If so, then it would be useful to have that change documented, please.
Hi all. My team has just gone through this issue with Rails running on Unicorn and we have a working explanation and solution.
Firstly, this issue may not affect every app running on Unicorn. It is most likely to happen if an instance of Statsd is created and shared throughout the app. The problem will then manifest if that instance is created in the master process, before the worker processes are forked.
The issue is that the Sender#send_loop thread is created in the master process, and threads are not copied across a fork. So although the worker processes have a reference to a Statsd instance, that instance has no thread dispatching the messages.
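A tiny, dogstatsd-free illustration of that behavior (assuming CRuby semantics, where only the thread that calls fork survives in the child):

```ruby
# A thread created before fork stands in for the client's dispatch thread.
t = Thread.new { sleep }

pid = fork do
  # The Thread object was copied into the child, but its thread is not running.
  puts "child:  t.alive? => #{t.alive?}" # expected: false
end
Process.wait(pid)

puts "parent: t.alive? => #{t.alive?}"   # expected: true
t.kill
```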
The solution we used to deal with the issue was to explicitly clear the shared instance at the start of the after_fork callback.
Here are the relevant excerpts from our code:
```ruby
# .../datadogstats.rb (our wrapper class that owns the instance of Statsd)
...

def self.statsd
  @statsd ||= Datadog::Statsd.new(...)
end

def self.forked
  @statsd = nil
end

def self.close
  @statsd&.flush
  @statsd&.close
  @statsd = nil
end

...
```
```ruby
# config/unicorn.rb
...

after_fork do |server, worker|
  DatadogStats.forked
  ...
end

after_worker_exit do |_, _, _|
  DatadogStats.close
end
```
The important point is that #forked clears out the reference to the Statsd instance. The next attempt to generate metrics will create a new instance in that process (which will create a message dispatch thread).
I've included the additional code we added to ensure the queue is flushed when a process is exiting. We did a similar thing in our Sidekiq workers:
```ruby
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  ...
  config.on(:shutdown) { DatadogStats.close }
end
```
Note: this issue is not Unicorn-specific; it could affect any server that uses fork.
I hope this helps!
@brushbox Thanks a lot for your message and the very detailed explanation, that's very helpful! It indeed makes sense and could be an issue with frameworks that fork the main process... Meanwhile, re-opening the connection in your wrapper is probably the best solution in this case 👍
@muzfuz @alor I've improved the migration documentation and made it clearer about the #flush method; I hope this is helpful.
We are working on a release 5.1.0, with auto-close helpers when possible, and making it clearer that an opened instance has to be closed to release resources. Edit: 5.1.0 is not shipping an auto-close feature, as it would introduce even more complexity as soon as DogStatsD is used with libraries using forks. However, 5.1.0 ships a flush-on-close that will help for some use-cases of missing metrics.
About that fork issue, we indeed have to dig into this, thanks a lot @brushbox
Hey everyone,
We added a section about the migration from v4.x to v5.x and a section mentioning the common pitfalls with v5.x.
Please let me know if a proper usage of #flush(sync: true) or #flush(sync: false) (depending on your use-case, whether you need a blocking or non-blocking flush) solves that issue for you. I hope the added documentation will avoid further trouble regarding this; I agree it should have been available in the first place, sorry for that.
About the fork issue mentioned by @brushbox, I'll create a separate issue to track this work, as it leads to the same problem but is due to a different cause.
Hey everyone,
If you can't manually flush/close the instance, please consider using the single_thread mode added in v5.2.0 (release note) and setting buffer_max_pool_size: 1 in order to replicate the behavior of the v4.x versions (no companion thread and a flush on every metric submission).
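As a minimal sketch of that configuration (host, port, and the metric name are placeholders):

```ruby
require 'datadog/statsd'

statsd = Datadog::Statsd.new(
  'localhost', 8125,
  single_thread: true,    # no companion thread: sends happen on the calling thread
  buffer_max_pool_size: 1 # flush on every metric submission, like v4.x
)

statsd.increment('example.metric')
```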
Please let me know if that doesn't fix the issue for you.
@remeh We were at 4.4.0 and I just tried 5.2.0 with these options:
buffer_max_payload_size: 2 << 16, # 128KB Buffer
single_thread: true,
We use Puma and a global Dogstatsd client, but we still saw the issue of no metrics being reported.
I know your recommendation is to use buffer_max_pool_size: 1, but I kept our old value because we had been setting max_buffer_bytes: 2 << 16 in the 4.4 version. Looking at the changelog, I thought it was simply renamed and still worked like the old logic.
Wondering if buffer_max_payload_size was the issue with 5.2.0 not reporting?
For now, it seems going to the latest 4.x makes the most sense for us.
Hey @rahul342,
No, this is correct and should be a proper usage of buffer_max_payload_size, even with the single_thread mode on.
Are you willing to help us by testing with buffer_max_pool_size: 1 (or a small number) and validating that your metrics are visible? It would be super helpful. I suspect that since the change in how the messages are buffered, the forks done by Puma may duplicate an empty message buffer, which you fill by submitting metrics, but not enough to reach 128KB for them to be automatically flushed.
Also, please make sure to flush them (see #flush) when it is appropriate (e.g. at the end of request processing / worker execution).
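One way to flush at the end of every request is a tiny Rack middleware; this is a hedged sketch, not part of the library, and STATSD_CLIENT / StatsdFlusher are assumed names:

```ruby
# Flushes the shared client after each request has been handled.
class StatsdFlusher
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    STATSD_CLIENT.flush(sync: false) # non-blocking flush once the request is done
  end
end
```

It would be registered with `use StatsdFlusher` in config.ru or the framework's middleware stack.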
@remeh 128KB is probably not that big for our volume and the period we had this live; I doubt that the buffer wouldn't have filled. Sorry, I won't be able to test further anytime soon though.
I've now encountered this issue as well. I'm testing reverting to 4.8.x for the moment to see if that resolves things. The suggestions for tcpdump were really useful.
Hi @tibbon,
Have you had the chance to try with single_thread: true? Here's a guide to migrate from v4.x to v5.x: https://github.com/DataDog/dogstatsd-ruby#migrating-from-v4x-to-v5x
Please let me know if that's not helping either.
I was able to make this work with Puma by handling the worker hooks (thanks everyone for all the info BTW!).
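For context, handling the Puma worker hooks could look roughly like the following (a hedged sketch reusing the DatadogStats wrapper shown earlier in this thread, not the commenter's actual configuration):

```ruby
# config/puma.rb
on_worker_boot do
  # Drop the instance inherited from the master so the forked worker builds a
  # fresh client (and a fresh dispatch thread) on first use.
  DatadogStats.forked
end

on_worker_shutdown do
  DatadogStats.close
end
```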
The only thing not working for me now is the runtime_metrics integration with the ddtrace gem.
```ruby
Datadog.configure do |c|
  c.runtime_metrics.enabled = true
  c.runtime_metrics.statsd = Stats.instance
end

class Stats
  def self.instance
    @client ||= Datadog::Statsd.new(host, port)
  end
end
```
Does anyone know what might be missing for that to work?
If you still miss metrics with v5.x, please make sure to properly use #close and #flush where needed, or to construct the instances using single_thread: true and buffer_max_pool_size: 1 to flush on every metric submission. There is also this guide for the v4.x -> v5.x migration and v5.x common pitfalls that may help you migrate from v4.x.
Note that 5.3.0 ships with changes that automatically re-create the companion thread, which should avoid having to manually handle the thread lifecycle in child processes; still, re-creating an instance in the child process is probably the best solution.
@tegon Is the runtime_metrics integration with ddtrace still not working for you? If that's the case, can you please open a new issue about this, documenting your setup? Thanks in advance.
I'm closing this issue since most of what's mentioned here should be addressed by either the automatic re-spawn of the companion thread shipped in 5.3.0 or the use of the single-thread mode. Please open a new issue if you still have problems 🙇.
I need to come back to the computer tomorrow and dig in more to get proper details and try some of the above suggestions, but we encountered an issue where thousands of threads were getting created, both with Puma and Shoryuken in separate deployments.
It was enough to make allocation of more threads on other neighboring k8s deployments impossible! We were using 5.3.2.
@tibbon That's... not good 😰. If you can dig up more info and create an issue with it, we'd definitely like to look into it.
Hi all,
After adopting 5.0.1 (from 4.8.3) in two of our applications, we stopped seeing metrics coming through. Both apps do not use #batch. We just create a single, global instance of Datadog::Statsd that we keep using throughout the app.
Anything else to look out for when adopting 5.x?