DataDog / dogstatsd-ruby

A Ruby client for DogStatsd
https://www.datadoghq.com/
MIT License

No metrics after adopting 5.0.1 #179

Closed kramuenke closed 3 years ago

kramuenke commented 3 years ago

Hi all,

after adopting 5.0.1 (from 4.8.3) in two of our applications, we stopped seeing metrics coming through. Both apps do not use #batch. We just create a single global instance of Datadog::Statsd that we keep using throughout the app.

Anything else to look out for when adopting 5.x?

remeh commented 3 years ago

Hello @kramuenke,

Could you share a snippet of how you are using the library? Are those long-running apps, or applications running only for a short amount of time? You may need to add an explicit call to the flush method. Please see the Ruby example in this section: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/#enable-buffering-on-your-client
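For example (a minimal sketch; the host and port values are placeholders), a short-lived script could flush and close the client explicitly before it exits:

require 'datadog/statsd'

# Placeholder host/port; a real app would take these from its configuration.
statsd = Datadog::Statsd.new('localhost', 8125)

statsd.increment('jobs.processed')

# Explicitly flush (and close) before the process exits so that buffered
# metrics are not lost when a short-lived process terminates.
statsd.flush(sync: true)
statsd.close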

stitchfixsam commented 3 years ago

We are also seeing this issue. We will try to package up some reproduction steps today or next week.

emilyst commented 3 years ago

Hi, I'm also an engineer at Stitch Fix, and I helped @stitchfixsam investigate the issue we were seeing.

Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were).

Our applications also use a single, global instance of Datadog::Statsd.

When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all.

Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).

In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.

  • With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
  • With dogstatsd-ruby 5.0.1, it never prints the metric.
  • With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.

Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.

For now, at Stitch Fix, we need to remain on dogstatsd-ruby 4.8.3 until we have a resolution to this issue.

kramuenke commented 3 years ago

Hello @kramuenke,

Could you share a snippet of how you are using the library? Are those long-running apps, or applications running only for a short amount of time? You may need to add an explicit call to the flush method. Please see the Ruby example in this section: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/#enable-buffering-on-your-client

One of the applications is a Rails 4 ( 😢 ) app running on Unicorn. We create one instance in a Rails initializer:

DATADOG_STATSD_CLIENT = Datadog::Statsd.new(ENV['DATADOG_STATSD_HOST'], ENV['DATADOG_STATSD_PORT'])

Then we use that instance everywhere:

DATADOG_STATSD_CLIENT.increment('some.metric', tags: ["environment:#{Rails.env}"])

The other app is a Rack app running on Puma, where we also create one global instance.

kbogtob commented 3 years ago

Hello @emilyst!

Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were).

Our applications also use a single, global instance of Datadog::Statsd.

When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all.

Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).

Thanks for those investigations. This is exactly why we are encouraging users to flush and close the client at the end of short scripts. #180 will help with the closing of such clients, and #181 will add some syntactic sugar to avoid having to close clients explicitly.

In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.

  • With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
  • With dogstatsd-ruby 5.0.1, it never prints the metric.
  • With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.

Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.

This actually depends on what kind of flush you are doing: is it synchronous or not? I'm adding some documentation about how you can synchronize with the companion thread in #183.

If you want to flush explicitly after each metric push, you can configure the buffer_max_pool_size parameter to the value 1.
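A minimal sketch of that configuration (the host and port are placeholders):

require 'datadog/statsd'

# With buffer_max_pool_size: 1, the buffer holds a single message, so every
# submission is handed to the sender without waiting for more metrics.
statsd = Datadog::Statsd.new('localhost', 8125, buffer_max_pool_size: 1)

statsd.increment('some.metric') # no explicit #flush call needed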

kbogtob commented 3 years ago

If you think the companion thread is failing to do some writing and something weird is happening, you can give a logger to the statsd instance when building it to get more details on what's happening in the background.
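A minimal sketch of passing a logger at construction time (the host and port are placeholders):

require 'logger'
require 'datadog/statsd'

logger = Logger.new($stdout)
logger.level = Logger::DEBUG

# The logger lets the client report errors and debug details from the
# background instead of failing silently.
statsd = Datadog::Statsd.new('localhost', 8125, logger: logger)
statsd.increment('some.metric')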

emilyst commented 3 years ago

Hello @emilyst!

Specifically, what we observed was that when an application at Stitch Fix upgraded to the newest version of dogstatsd-ruby (from 4.8.3 to 5.0.1), many metrics were not being emitted (though a few were).

Our applications also use a single, global instance of Datadog::Statsd.

When I investigated with @stitchfixsam, we were able to reproduce the issue locally, capturing traffic destined for the statsd agent using the following tcpdump command: sudo tcpdump -i lo0 udp port 8125 -vv -X (note that we use macOS). This allowed us to confirm that metrics were not being sent at all.

Using a debugger, we were able to determine that dogstatsd-ruby 5.0.1 was not emitting metrics when the thread on which the metrics events were emitted would terminate before the buffer of events had filled. This is commonly the case in short-lived processes (used to handle Resque or Sidekiq jobs) or in short-lived threads (used to handle incoming Rack requests).

Thanks for those investigations. This is exactly why we are encouraging users to flush and close the client at the end of short scripts. #180 will help with the closing of such clients, and #181 will add some syntactic sugar to avoid having to close clients explicitly.

Just to clarify, the small script I linked to was just my attempt to reproduce the issue with as simple a test case as possible. We're really asking about the behavior of the agent in our applications, which are not short scripts.

In the case of our applications, #flush does not work for reasons I previously explained, and we cannot #close the global instance as it's being reused over the lifetime of the application.

In this Gist, you can see the issue. When that script is run, it will start a UDP server on one thread, emit a metric on another thread, and then run continuously, printing any messages arriving to the UDP server.

  • With dogstatsd-ruby 4.8.3, it immediately prints the metric emitted on the second thread.
  • With dogstatsd-ruby 5.0.1, it never prints the metric.
  • With dogstatsd-ruby 5.0.1 when configured to use Datadog::Statsd#flush, it does print the metric.

Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.

This actually depends on what kind of flush you are doing: is it synchronous or not? I'm adding some documentation about how you can synchronize with the companion thread in #183.

As you can see from the script, we're doing a synchronous flush.

If you want to flush explicitly after each metric push, you can configure the buffer_max_pool_size parameter to the value 1.

Is there a difference between doing this and simply calling #flush ourselves?

muzfuz commented 3 years ago

We are seeing this exact same issue, our setup also uses a Datadog::Statsd instance assigned to a globally accessible constant.

Standard metrics were fine, but all custom metrics failed to flush. We took the same steps described by @emilyst to verify. For now we're going to stay on a version < 5.x.

remeh commented 3 years ago

Hello @emilyst,

Based on this finding, we attempted to remedy the issue in our applications by adding a call to Datadog::Statsd#flush after each time we emit any metric, but in practice, we found that this led to another issue. We saw that sometimes the #flush call would block when hitting this line. Since the Queue#pop method blocks until something is in the Queue, this means that any call to #flush apparently blocks until there's a metric in the queue to emit.

From my understanding, you may actually want to do an asynchronous flush here (you do not want to block your app while the client is flushing the metrics)? However, a good idea may be to have a final synchronous flush at the end of the execution of your process, to be sure that the last metrics are flushed before your process terminates.

The sync: true option is here to ensure that once the call to flush returns, the data has properly been transmitted over the network. The call may then block for some milliseconds depending on the network latency.
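A minimal sketch of both flushing modes (the host and port are placeholders):

require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125)
statsd.increment('some.metric')

# Non-blocking: wake up the companion thread and return immediately.
statsd.flush(sync: false)

# Blocking: return only once the buffered metrics have been written to the
# network, e.g. as a final flush before the process terminates.
statsd.flush(sync: true)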

I've tried to look for race conditions that would block the whole application (I wrote a small application running 5 threads: 1 main thread flushing at a given frequency, 3 flushing synchronously, and 1 flushing asynchronously) and I've not hit any race / blocking situation so far. Of course, I know that doesn't mean there are none; that's the hard part with race conditions...

Is there a difference between doing this and simply calling #flush ourselves?

It means that every time you submit a metric, the next submission will trigger a flush, since the maximum number of metrics the internal buffer can contain is 1.


Hi @muzfuz, thanks for your feedback as well. Were you able to reproduce a complete blocking of your application while calling the flush method with 5.0.1?

I just want to be sure that there is no misunderstanding about the flush method and its sync parameter. I will improve the documentation as a follow-up to this thread.

alor commented 3 years ago

I'm having the same issue. After upgrading to 5.0.1, all metrics stopped reporting in the dashboard; reverting to 4.8.3 solved the issue. I'm not using #batch anywhere, which seems to be the only breaking change in the documentation. I'm using dogstatsd in a Docker image.

The only difference I see is the return value of the increment method:

$datadog = Datadog::Statsd.new(namespace: "test", socket_path: ENV['DD_DOGSTATSD_SOCKET'])

5.0.1:

$datadog.increment('my.metric')
=> ["datadog.dogstatsd.client.metrics:1|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.events:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.service_checks:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.bytes_sent:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.bytes_dropped:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.packets_sent:0|c|#client:ruby,client_version:5.0.1,client_transport:uds", "datadog.dogstatsd.client.packets_dropped:0|c|#client:ruby,client_version:5.0.1,client_transport:uds"]

and the metric is not reported.

4.8.3:

$datadog.increment('my.metric')
=> 1

and the metric is reported correctly.

Do you need any other info to debug the issue?

muzfuz commented 3 years ago

@remeh I've not extensively tested using #flush with sync: true in a high-traffic environment, so I can't comment on whether or not it shows the same issues mentioned by @emilyst.

However, I can say that explicitly calling #flush after #increment did submit the metrics to the server.

Much like @alor, we were not using #increment in a batch, so I'm confused as to why this would now be required. Has there been a change to the public API, and has an explicit #flush become a requirement as of v5? If so, it would be useful to have that change documented, please.

brushbox commented 3 years ago

Hi all. My team has just gone through this issue with Rails running on Unicorn and we have a working explanation and solution.

Firstly, this issue may not affect every app running on Unicorn. It is most likely to happen if an instance of Statsd is created and shared throughout the app. The problem will then manifest if that instance is created in the master process, before the worker processes are forked.

The issue is that the Sender#send_loop thread is created in the master process and threads are not copied across a fork. So although the worker processes have a reference to a Statsd instance, that instance has no thread dispatching the messages.

The solution we used to deal with the issue was to explicitly clear the shared instance at the start of the after_fork callback.

Here are the relevant excerpts from our code:

# .../datadogstats.rb (our wrapper class that owns the instance of StatsD)
...
  def self.statsd
    @statsd ||= Datadog::Statsd.new(...)
  end

  def self.forked
    @statsd = nil
  end

  def self.close
    @statsd&.flush
    @statsd&.close
    @statsd = nil
  end
...

# config/unicorn.rb

...
after_fork do |server, worker|
  DatadogStats.forked
  ...
end

after_worker_exit do |_,_,_|
  DatadogStats.close
end

The important point is that #forked clears out the reference to the Statsd instance. The next attempt to generate metrics will create a new instance in that process (which will create a message dispatch thread).

I've included the additional code we added to ensure the queue is flushed when a process is exiting. We did a similar thing in our sidekiq workers:

# config/initializers/sidekiq.rb

Sidekiq.configure_server do |config|
  ...
  config.on(:shutdown) { DatadogStats.close }
end

Note: this issue is not Unicorn-specific; it could affect any server that uses fork.
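For example, a Puma config could do something similar with its worker hooks (a sketch using Puma's on_worker_boot / on_worker_shutdown and the DatadogStats wrapper above; we have only verified the Unicorn setup ourselves):

# config/puma.rb (sketch)

on_worker_boot do
  # Drop the instance inherited from the master process so the next metric
  # call creates a fresh client (and dispatch thread) in this worker.
  DatadogStats.forked
end

on_worker_shutdown do
  DatadogStats.close
end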

I hope this helps!

remeh commented 3 years ago

@brushbox Thanks a lot for your message and the very detailed explanation, that's very helpful! It indeed makes sense and could be an issue with frameworks that fork the main process... Meanwhile, re-opening the connection in your wrapper is probably the best solution in this case 👍

@muzfuz @alor I've improved the migration documentation and clarified the documentation about the #flush method; I hope this is helpful.

We are working on a release 5.1.0, with auto-close helpers when possible, making it clearer that an opened instance has to be closed to release resources. Edit: 5.1.0 is not shipping an auto-close feature, as it would introduce even more complexity as soon as DogStatsD is used with libraries using forks. However, 5.1.0 ships a flush-on-close that will help with some cases of missing metrics.

About that fork issue, we indeed have to dig into this. Thanks a lot, @brushbox!

remeh commented 3 years ago

Hey everyone,

We added a section about the migration from v4.x to v5.x and a section mentioning the common pitfalls with v5.x.

Please let me know if proper usage of #flush(sync: true) or #flush(sync: false) (depending on whether your use-case needs a blocking or non-blocking flush) solves the issue for you. I hope the added documentation will avoid further trouble regarding this; I agree it should have been available in the first place, sorry about that.

About the fork issue mentioned by @brushbox, I'll create a separate issue to track that work, as it leads to the same symptom but has a different cause.

remeh commented 3 years ago

Hey everyone,

If you can't manually flush/close the instance, please consider using the single_thread mode added in v5.2.0 (release note) and setting buffer_max_pool_size: 1 in order to replicate the behavior of the v4.x versions (no companion thread and a flush on every metric submission).
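A minimal sketch of that configuration (the host and port are placeholders):

require 'datadog/statsd'

statsd = Datadog::Statsd.new(
  'localhost',
  8125,
  single_thread: true,     # send from the calling thread, no companion thread
  buffer_max_pool_size: 1  # flush on every metric submission
)

statsd.increment('some.metric')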

Please let me know if that doesn't fix the issue for you.

rahul342 commented 3 years ago

@remeh We were at 4.4.0 and I just tried 5.2.0 with these options:

buffer_max_payload_size: 2 << 16, # 128KB Buffer
single_thread: true,

We use Puma and global Dogstatsd client, but we still saw the issue of no metrics being reported.

I know your recommendation is to use buffer_max_pool_size: 1, but I kept our old value because we had been setting max_buffer_bytes: 2 << 16 on 4.4. Looking at the changelog, I thought it was simply renamed and still works like the old logic.

Wondering if buffer_max_payload_size was the reason 5.2.0 was not reporting?

For now, it seems going to the latest 4.x makes most sense for us.

remeh commented 3 years ago

Hey @rahul342,

No, this is correct and should be a proper usage of buffer_max_payload_size, even with the single_thread mode on.

Are you willing to help us by testing with buffer_max_pool_size: 1 (or a small number) and validating that your metrics are visible? It would be super helpful. I suspect that, since the change in how the messages are buffered, the forks done by Puma may duplicate an empty message buffer, which you then fill by submitting metrics, but not enough to reach 128KB and trigger an automatic flush. Also, please make sure to flush them (see #flush) when it is appropriate (e.g. at the end of the request processing / worker execution).
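For example, a small Rack middleware could take care of that flush at the end of each request (a sketch; the middleware name and wiring are just an illustration):

# A sketch: flush buffered metrics at the end of every Rack request.
class StatsdFlushMiddleware
  def initialize(app, statsd)
    @app = app
    @statsd = statsd
  end

  def call(env)
    @app.call(env)
  ensure
    # Non-blocking flush: hand buffered metrics to the companion thread.
    @statsd.flush(sync: false)
  end
end

It could then be registered with something like use StatsdFlushMiddleware, DATADOG_STATSD_CLIENT in config.ru.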

rahul342 commented 3 years ago

@remeh 128KB is probably not that big for our volume and the period we had this live; I doubt the buffer wouldn't have filled. Sorry, I won't be able to test further anytime soon though.

tibbon commented 3 years ago

I've now encountered this issue as well. I'm testing reverting to 4.8.x for the moment to see if that resolves things. The suggestions for tcpdump were really useful.

remeh commented 3 years ago

Hi @tibbon, have you had the chance to try with single_thread: true? Here's a guide for migrating from v4.x to v5.x: https://github.com/DataDog/dogstatsd-ruby#migrating-from-v4x-to-v5x Please let me know if that doesn't help either.

tegon commented 3 years ago

I was able to make this work with Puma by handling the worker hooks (thanks everyone for all the info BTW!). The only thing not working for me now is the runtime_metrics integration with the ddtrace Gem.

class Stats
  def self.instance
    @client ||= Datadog::Statsd.new(host, port)
  end
end

Datadog.configure do |c|
  c.runtime_metrics.enabled = true
  c.runtime_metrics.statsd = Stats.instance
end

Does anyone know what might be missing for that to work?

remeh commented 3 years ago

If you are still missing metrics with v5.x, please make sure to properly use #close and #flush where needed, or to construct the instances with single_thread: true and buffer_max_pool_size: 1 to flush on every metric submission. There is also this guide on the v4.x -> v5.x migration and the v5.x common pitfalls that may help you migrate from v4.x.

Note that 5.3.0 ships with changes that automatically re-create the companion thread, which should avoid having to manually handle the thread lifecycle in child processes; still, re-creating an instance in the child process is probably the best solution.

@tegon is the runtime_metrics integration with ddtrace still not working for you? If that's the case, can you please open a new issue documenting your setup? Thanks in advance.

I'm closing this issue, since most of what's mentioned here should be addressed by either the automatic re-spawn of the companion thread shipped in 5.3.0 or the use of the single-thread mode. Please open a new issue if you still have problems 🙇.

tibbon commented 2 years ago

I need to come back to the computer tomorrow and dig in again to get proper details and try some of the above suggestions, but we encountered an issue where thousands of threads were getting created, both with Puma and Shoryuken, in separate deployments.

It was enough to make allocating more threads on neighboring k8s deployments impossible! We were using 5.3.2.

ivoanjo commented 2 years ago

@tibbon That's... not good 😰. If you can dig up more info and create an issue with it, we'd definitely like to look into it.