Can we trace Puma request body wait time?

tlynam commented 3 years ago

Hello!

I hope you're well! We'd like to trace Puma's request body wait time. This is the time spent waiting for the client to upload their PUT/POST request bodies. Our application has larger bodies for a lot of PUT/POST requests and some customers have lower bandwidth. I don't know if we'd be able to update this queue time value itself but for monitors or dashboards we could subtract the request body wait time right? Could we make this an opt-in option?

It's available in the middleware via env['puma.request_body_wait']. Would you be open to this? I'd be happy to work on a PR. I'm learning the terminology and unclear on how to implement but see where it logs the request start time: https://github.com/DataDog/dd-trace-rb/blob/master/lib/ddtrace/contrib/rack/middlewares.rb#L58

Here's some background on the Puma body wait time: https://github.com/puma/puma/blob/master/docs/architecture.md https://github.com/puma/puma/blob/master/docs/deployment.md

delner commented 3 years ago

Interesting. I think we do something like this in Rack (what we call "HTTP request queue time") as you pointed out. This would be good to add as a tag, I think we just need to find a good place for it. (Is rack a good place for it, or should it be elsewhere?)

Thanks for the suggestion!

tlynam commented 3 years ago

Great, thank you! I'm not sure of all the available options on where to place this. It seems like Rack is reasonable since Puma is a Rack web server? Okay, if we add it as a tag for Rack, could set_request_tags! be appropriate?

https://github.com/DataDog/dd-trace-rb/blob/master/lib/ddtrace/contrib/rack/middlewares.rb#L135

Just guessing something like?

if configuration[:body_wait_enabled] && request_span.get_tag(Datadog::Ext::HTTP::BODY_WAIT).nil?
  request_span.set_tag(Datadog::Ext::HTTP::BODY_WAIT, env['puma.request_body_wait'])
end

delner commented 3 years ago

Basic strategy seems fine to me. I think naming we'll want to be thoughtful about though.

Do you know if other servers (Unicorn, Passenger, etc) have a similar metric, and if so, what do they call it? Does "body wait time" narrowly apply to PUT/POST, or is this metric mean different things in another context?

Whatever we come up with, I'd like it to be relatively "inclusive" such that if we added such support to other servers, it would fit nicely into the scheme and be generally useful, while not being too loose with the definition (e.g. same metric timing different things in different servers.)

Using "body_wait" as a part of it seems like a reasonable starting point though.

tlynam commented 3 years ago

Cool, thanks a lot! I looked into this before when we were using New Relic. I was working on adding this there and found other Ruby servers currently don't have this value.

Falcon

Just read through some docs and code and I don't see that it measures how long it takes to read request bodies.

Webrick

Webrick doesn't buffer large requests. I also didn't see any passed headers that log the buffering time.

Thin

Thin can buffer large requests but I didn't see any passed headers that log buffering time.

Unicorn

I didn't see any passed headers that log buffering time. It looks like it's recommended to use with nginx to buffer requests. I don't know nginx but wondering if it's possible set a header when a request has finish buffering so we could calculate what the buffering time is.

Phusion Passenger

Phusion Passenger also uses nginx. I didn't see any headers logging the request buffer time. Also wondering if it would be possible to set a header in nginx when a request has completed buffering so we could then calculate the buffering time.

From what I understand, I think body wait time would only apply to PUT/POST/PATCH requests.

Makes a lot of sense about the naming. What about something like body_wait_time? How about units? Do we assume everything is in ms? Can I go ahead and start writing something?

delner commented 3 years ago

Thanks for looking all these up!

I talked with a colleague about this use case; there might be a bit of a snag with how metrics work. Evidently submitting this as a tag or metric on the Rack span will not produce a metric that can be graphed or "subtracted" from that other metric (which I think was the intent in your original post?) This is because tags are just facets added to spans, and don't compute any metrics within Datadog (at least not today to my knowledge.)

There's maybe a few ways of working around this:

Interim solution: Configure and submit this metric via Dogstatsd instead. Requires an agent and port 8125 UDP, so requirements may not be trivial, but it would definitely produce a metric you could combine on a graph. A Rack middleware module might be a good way to do this (not sure if it should be a part of this trace middleware or not.)
Longer term solution?: Add a Puma integration that generates Puma spans. Spans auto-generate metrics (that can be graphed) and would allow for richer tagging on Puma specific things. Current proposed solution (Rack middleware) is a little weird in the sense that we're doing Puma stuff in Rack, but this integration would a great place to put this instead.

marcotc commented 3 years ago

I think that

Add a Puma integration that generates Puma spans

would be the cleanest solution here: it would capture any time spent in Puma-only code and, like @delner mentioned, spans natively generate metrics.

When it comes down to drilling into specific parts of the Puma lifecycle (e.g. differentiate between the whole Puma request code path vs only the request_body_wait time), we have better options with a dedicated Puma span. The request_body_wait time could be a sub-span that's a child of the parent Puma span and happens before the request is handed over to the upper application stack, or it could simply be a tag if we deem having it as a first-class span is not needed. Regardless of this decision, a Puma span empowers us to freely choose which one provides the right level of visibility.

tlynam commented 3 years ago

A Puma integration sounds great, thank you! I'm not sure how this will look like but I'll investigate.

marcotc commented 3 years ago

👋 @tlynam, we are currently actively investigating how to implement a Puma integration: given that a Puma span would be the parent of a Rack span, we want to make sure the impact of having puma.request being the new top-level span (instead of rack.request) is managed correctly.

In the meantime, you can use this snippet to get Puma wait time value reported on each Rack span and as a StatsD metric. This script can be inserted anywhere after require 'ddtrace' is invoked:

module PumaRackIntegration
  def set_request_tags!(request_span, env, *args)
    if (request_body_wait = env['puma.request_body_wait'])
      # Set tag on Rack span
      request_span.set_tag('puma.request_body_wait', request_body_wait)

      # Send metric through StatsD client:
      # https://docs.datadoghq.com/developers/dogstatsd/?tab=hostagent#client-instantiation-parameters
      your_dogstatsd_instance.distribution('puma.request_body_wait', request_body_wait)
    end

    super(request_span, env, *args)
  end
end

Datadog::Contrib::Rack::TraceMiddleware.prepend(PumaRackIntegration)

Let us know if this script works for you. Please feel free to tweak the tag name and metric name as you see fit as well.

tlynam commented 3 years ago

Thank you so much @marcotc, that's really helpful!

ryush00 commented 1 year ago

Any news on this topic?

DataDog / dd-trace-rb