DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/
Other
304 stars 372 forks source link

Allow before_flush processor to access trace metadata #2655

Open oskarpearson opened 1 year ago

oskarpearson commented 1 year ago

Current behaviour

When using a before_flush processor to process a trace, I'm unable to access either tags or metadata of the trace. This is because in Datadog::Tracing::TraceSegment the meta attribute is marked as protected, and Datadog::Tracing::Metadata::Tagging has not been included. My understanding is that the latter would mean I can use get_tag and other such methods.

I would like to be able to process a TraceSegment in a before_flush and selectively copy some tags/metadata from the TraceSegment to some of the spans within that trace.

Expected behaviour

I should be able to get the metadata or tags from the trace via a publically-callable method. For example trace.has_tag?(tag) or trace.get_tag(tag), or trace[tag].

Alternatively, via trace.meta[tag]

Steps to reproduce

In your rails controller or similar code, call something like this to set a custom tag on the trace:

Datadog::Tracing.active_trace.set_tag("component", "component_name")

In your datadog setup, configure a custom tag processor:

Datadog::Tracing.before_flush(
  DatadogTracePropogationProcessor.new
)

In your DatadogTracePropogationProcessor class:

def call(trace)
  ... try and use `trace.get_tag()`...
end

How does ddtrace help you?

I'd like to be able to pre-process traces and selectively set tags on child spans - most notably service-entry spans https://docs.datadoghq.com/tracing/glossary/#service-entry-span

The intention is that when a rails request occurs I tag it in the controller. Then any calls to postgres or redis as part of that trace should also be tagged with those custom tags.

For example, any http requests that hit our rails controllers are tagged with the squad that owns it. This currently works fine. That allows me to filter APM results, logs, and similar "by squad". I can then build squad-specific dashboards.

However, the spans related to postgres SQL requests are not similarly tagged. My understanding is that this is because they are "service entry" spans - and the tags don't normally populate down to those spans. This means that I can't go to the "apm" dashboard, select our postgres service, and then create filters by squad. This would be useful - because it would allow me to find the most common requests by a squad, or the slowest running requests per squad, or similar.

Environment

Proposed Fix

We could probably fix this by adding include Metadata::Tagging to the top of Datadog::Tracing::TraceSegment

As an monkey-patch, I've added Datadog::Tracing::TraceSegment.send(:include, Datadog::Tracing::Metadata::Tagging) to my initialiser and it appears to work as-expected.

delner commented 1 year ago

In general, I'm supportive of the idea of being able to add tags to the trace during post processing.

However, this runs into a technical gray area.

There are two different sets of trace data constructs:

  1. The "in-flight" ones: TraceOperation and SpanOperation
  2. The "completed" ones: TraceSegment and Span

TraceSegment and Span are meant to represent the "final" state of their "in flight" counterparts.

Therein lies the problem: the reason you can't modify it is because it's meant to be immutable. We take advantage of the immutability property to (1) prevent race conditions where the user might attempt to modify the trace data in an unexpected way, and (2) make it possible to process the data in the background in a thread-safe & Ractor-safe way.


Read access I don't think should be an issue... but write access is a different matter.

There might be a solution somewhere here, in modifying the behavior of the processing pipeline. One possibility is providing a mutable form of the trace/span to the processing pipeline, which is then reconverted back to an immutable object. I knew of this when I rewrote the behavior for 1.0, but avoided doing this because I think there could be a significant memory/CPU cost to post-processing this way (in allocating and discarding memory for every span in a trace.) This was complexity I did not want to mess with at the time.

Perhaps there's another way. But I think prevention is better than a cure here, if possible. Is there a way to avoid post-processing altogether? What prevents you from doing this without post-processing?

delner commented 1 year ago

For example, any http requests that hit our rails controllers are tagged with the squad that owns it. This currently works fine. That allows me to filter APM results, logs, and similar "by squad". I can then build squad-specific dashboards.

I think I want to explore this use case more. It sounds interesting... particularly the multi-tenant environment and making the right things visible to the right teams. I'd prefer we provide a feature that specifically addresses what you need/want out of the box, if at all possible.

Sure, if we have to use the post-processor as a stop-gap measure, then that might be okay for now... but it would be better to not need it at all.

oskarpearson commented 1 year ago

Thanks for the useful and thoughtful reply @delner - it's good to know this was intentional rather than accidental. It would be great to find a different solution for my use case though!

We use packwerk extensively in our rails codebase - along with the equivalent of https://github.com/rubyatscale/code_ownership It allows us to split our rails app into components, and each component has an owning squad.

Our code largely consists of Rails controllers and sidekiq jobs (with a few rake tasks and scheduled events thrown in for good measure).

Both of these are working well. I can filter/search by team/squad and even ruby component on both logging and on APM and find problematic requests per-squad.

However, all 'service entry' spans that occur during this time are handled by the dd-trace-rb activerecord, redis, and other existing integrations. And on the datadog UI if I view the SQL resources I don't have the same tags. My understanding is that this is by design - tags from parent spans and traces aren't copied into service-entry spans.

I would love to be able to go into apm and search for "SQL queries by squad", or "Redis cache requests" by squad or rails component.

On a similar note, I'd also like to be able to selectively tag traces based on certain criteria. For example I'd like to be able to look at the trace and determine that it's triggered 100 SQL requests.. and then tag it with may_contain_n_plus_ones or similar.

It seemed to me that the place for this sort of thing would best be placed in the post-processor. However, I'm more than happy to find another option here. Do you have any suggestions?

One thing occurred to me is adding a after action in the controller/sidekiq engine and then looping through Datadog::Tracing.active_trace spans. Would that better fit with your intentions?

oskarpearson commented 1 year ago

One thing occurred to me is adding a after action in the controller/sidekiq engine and then looping through Datadog::Tracing.active_trace spans. Would that better fit with your intentions?

FWIW I've investigated this and it doesn't seem feasible. Datadog::Tracing.active_trace is a Datadog::Tracing::TraceOperation which doesn't provide access to tags or indeed spans, from what I can see.

delner commented 1 year ago

The code ownership gem is interesting.

  • On the sidekiq side I've created middleware that finds the code owner and tags every job with the owning squad.
  • On the web side I've used prepend_before_action around all rails controllers to tag the request in the same way.

Can you either share a code snippet or explain how this works? What tag(s) are you adding? I'm particularly interested in how the code retrieves the code owner from the context of each of these instrumentation points. Are you calling an API in the code_ownership gem?

My instinct here is medium to long term, if we can find the "right" model that fits well with the Datadog UI/product, we could add an integration for code_ownership and automate this tagging. It would be a bit different than how our other integrations work, but would be an interesting feature. For now, let's table this idea though... there is too much work to do on that to be able to deliver some kind of timely resolution.


However, all 'service entry' spans that occur during this time are handled by the dd-trace-rb activerecord, redis, and other existing integrations. And on the datadog UI if I view the SQL resources I don't have the same tags. My understanding is that this is by design - tags from parent spans and traces aren't copied into service-entry spans.

Some brief context for service naming worth clarifying on:

How service naming currently behaves...

service on a span historically has been free-assignment: name it to whatever you want. Our instrumentation (for 3rd party libraries like activerecord, redis) mirrored this... each allows service names (for spans) to be reconfigured by the user.

The above practice has problems: particularly because it muddles "what is a service?". It makes service an arbitrary field, meaning we can't meaningfully use the field to drive "smarter" Datadog features (like service mapping.)

How we want service naming to behave...

Moving forward, service should reflect "name of application which generated the span". Ideally this means all spans always reflect service as DD_SERVICE/Datadog.configuration.service. This effectively eliminates "service entry spans".

There are obstacles to this still: some operations, like DB queries and external web requests (e.g. AWS), are not or cannot be instrumented (because instrumentation is not present in those resources). Instead, the Ruby app produces "external"/"edge" spans that wrap the entire operation blocking on that external resource. If we set the service of these spans to the Ruby application name, it would incorrectly attribute time spent blocking on external resources to the Ruby application.

This is why activerecord and redis allow for service names to be configured, and produce "service entry spans". I call this out because in the future, this behavior may change if our database monitoring provides a better alternative.


Although your intent is clear ("I would love to be able to go into apm and search for "SQL queries by squad", or "Redis cache requests" by squad or rails component") I'm not sure I fully grasp the plan here.

Is it...

...when an operation is executed within code owned by a particular team, then tag all the resulting descendant spans with the code owner? Such that an activerecord query will be attributed to the code owner of the Rails request that triggered it?

delner commented 1 year ago

One thing occurred to me is adding a after action in the controller/sidekiq engine and then looping through Datadog::Tracing.active_trace spans. Would that better fit with your intentions?

FWIW I've investigated this and it doesn't seem feasible. Datadog::Tracing.active_trace is a Datadog::Tracing::TraceOperation which doesn't provide access to tags or indeed spans, from what I can see.

I would discourage the use of active_trace in an attempt to modify the trace.

I deliberately obfuscated the access to spans in the TraceOperation for this reason: previously we allowed spans to be accessible from any context, and this led to spaghetti code where side effects could be produced anywhere in an unpredictable fashion. Instrumentation hacking other instrumentation, etc...

In addition, we have "partial flushing", which emits spans as they complete, to keep memory pressure down on long running traces. So you have no guarantee that all the spans generated by the trace would be accessible via active_trace anyways.

You should be able to access/set trace tags via active_trace though. That behavior is safe and intended. Do keep in mind, your trace tags will be added to the "root span" when the trace completes. Not sure how this will help the use case.

delner commented 1 year ago

...when an operation is executed within code owned by a particular team, then tag all the resulting descendant spans with the code owner? Such that an activerecord query will be attributed to the code owner of the Rails request that triggered it?

If this is indeed the case, then I can't help but think this sounds like "scope" or a "stack frame" like behavior. Our own Context object is meant to act like this, and its how we track the active_trace from stack frame to stack frame.

Maybe we can wrap the executing code with a "scope", tagged with the ownership information. This "scope" would be accessible by the descendant instrumentation, which can read the active state. That instrumentation (activerecord, redis, etc) could then read the owner and tag it appropriately.

If this sort of thing to work, it could likely involve changes to Context and each of the instrumentation libraries in which we'd like to see code owner tagging (e.g. activerecord). Not trivial, but this would be one way of avoiding the post-processor.

brafales commented 1 year ago

Hi @delner,

...when an operation is executed within code owned by a particular team, then tag all the resulting descendant spans with the code owner? Such that an activerecord query will be attributed to the code owner of the Rails request that triggered it?

Yes, this is exactly the behaviour we'd like to see. We opened a support request a while ago and a solutions engineer confirmed that unfortunately tags set at the top level trace would not automatically propagate to children spans, and they pointed us in the direction of using the pipeline processors, hence why we explored this.

oskarpearson commented 1 year ago

Thanks again @delner for the long reply!

The code ownership gem is interesting.

  • On the sidekiq side I've created middleware that finds the code owner and tags every job with the owning squad.
  • On the web side I've used prepend_before_action around all rails controllers to tag the request in the same way.

Can you either share a code snippet or explain how this works? What tag(s) are you adding? I'm particularly interested in how the code retrieves the code owner from the context of each of these instrumentation points. Are you calling an API in the code_ownership gem?

It's a bit complicated, unfortunately, because in our scenario we're actually using our own internal code, rather than that gem (it's on my todo list to switch over to it). It'd probably work like this, though:

class ApplicationController < ActionController::Base
  include DatadogAutoTagging
  prepend_before_action -> { datadog_autotag_setup(self) }
end
module DatadogAutoTagging
  # This code is actually shared between rails controllers
  # and sidekiq jobs, so it's a little more complex in reality
  def datadog_autotag_setup(obj)
    active_trace = Datadog::Tracing.active_trace
    return unless active_trace

    # obj is the instance of the running rails controller, so we find the code owner
    # for that class.
    owner = CodeOwnership.for_class(obj.class)

    # The tag name here is arbitrary. We set some additional
    # tags here - like which packwerk component the
    # file lives in too. bw is our company name: Bloom & Wild
    active_trace.set_tag("bw_owner", owner)
  end
end

Some brief context for service naming worth clarifying on:

[Useful stuff snipped - thanks]

I was wondering if you perhaps had some sort of diagram/flow of how things flow/are correlated in the dd-trace-rb code? I'm not familiar with it, so I don't grok the differences between context, TraceOperation, and similar?

If not, it's all fine - I'm sure we'll find a solution somehow.

Although your intent is clear ("I would love to be able to go into apm and search for "SQL queries by squad", or "Redis cache requests" by squad or rails component") I'm not sure I fully grasp the plan here.

Is it...

...when an operation is executed within code owned by a particular team, then tag all the resulting descendant spans with the code owner? Such that an activerecord query will be attributed to the code owner of the Rails request that triggered it?

:) I've gone full https://xyproblem.info/ haven't I

Context

We have a Ruby on Rails monolith, and occasionally run into a few problems:

  1. High load on our core read-write database
  2. Low usage of our read-only-replica database
  3. Poorly-chosen or missing database indexes
  4. N+1 queries (where one rails controller or one sidekiq job) does large large numbers of SQL queries
  5. Occasional high load on Redis

The monolith is managed by numerous squads. Thanks to packwerk we have split the monolith into components. Each component is owned by a squad. Rails controllers, Sidekiq jobs, admin pages, and similar are thus owned by single squad.

Right now, we can't surface each of the issues above to a squad. We have to look at entries in APM and go "oh, this is part of a Squad1 trace, they need to do it". Then look at the next one, and go "Oh, this is Squad2".

I'd love to build saved-views/dashboards/etc/etc/etc based on a tag to solve this.

Current Status

Over the last year or so we've made massive strides in being able to use code like the example above to attach tags to every incoming rails controller request, and every sidekiq job.

So, for example, we have dashboards that show API performance by squad. Or slow-running sidekiq jobs by-squad. And we can filter logs and rails APM traces by squad. Indeed, thanks to work Bernat did we also have views of apdex per-squad, which is pretty neat.

So far, so good. I can go into https://app.datadoghq.com/apm/traces and filter by our API service (bw-monolith-api), then filter by owning squad, and then by duration, error status or similar. And squads can then own their endpoints.

But

However, I can't filter our postgres, redis or similar services by squad. If I go into the postgres service and find a SQL query that was executed there, I can click on it and see that it's part of a trace - and if I go up to the top of that trace I can see it's owned by a particular squad.. But the span showing the SQL statement in the "postgres service" hasn't been tagged with that squad.

If it was tagged with the squad, I could do things like this per squad for postgres:

I could do similarly with redis or similar.

My Hack

Since doing the hack above (forcing things into the spans) APM traces for things like SQL queries and similar do have tags.

This hack could also let me do things like count the number of "postgres spans" in a given trace, and then tag the top-level trace with things like sql_queries: 10.

If our tech leadership team then says "hey, squads - spend some time in the upcoming sprint fixing your n+1s" the squad can go to datadog, filter by their squad, and then filter for "sql_queries > 10".

delner commented 1 year ago

a solutions engineer confirmed that unfortunately tags set at the top level trace would not automatically propagate to children spans

@brafales this is correct. The purpose of "trace tags" wasn't to propagate values down, it was to propagate values up, such that you could decorate the first span more easily without having to access/tag it directly. I hope that down the road we will have a UX that makes these trace tags even more discoverable.

It does beg the question of whether we should introduce a mechanism to propagate down too. My instinct goes to something like this:

# Provides the current trace scope, like a "stack frame", which makes state accessible.
Datadog::Tracing.in_scope do |scope|
  # Returns active trace (TraceOperation)
  scope.active_trace

  # Allows storing/retrieving scoped variables
  # These do not exist outside the lifetime of the block,
  # They are not automatically tagged onto spans.
  scope[:engine] = :accounts

  # Contains helper functions, possibly including `set_tag`.
  # which will set_tag at the beginning of every new span within the scope.
  scope.set_tag('codeowner', 'Account Operations')

  Datadog::Tracing.in_scope do |scope|
    # Variables defined in a parent are accessible to the child
    scope[:engine] # => :accounts

    # Child scopes can define their own values, or override the parent's
    scope[:engine] = :billing
    scope[:action] = :show_billing_history
  end

  # Variables defined in a child do not propagate to the parent
  scope[:engine] # => :accounts
  scope[:action] # => nil
end

Then in a Rails application, it could look something like this:

class BillingHistoryController
  around_action :set_codeowner

  def index
    # ...
  end

  private

  def set_codeowner
    Datadog::Tracing.in_scope do |scope|
      scope[:engine] = :accounts
      scope.set_tag('codeowner', 'Account Operations')
      yield
    end
  end
end

@oskarpearson @brafales what do you think?


If this seems like its on the right track, then there are some outstanding design questions on my part:

delner commented 1 year ago

@oskarpearson This context is very helpful... this is a great user story which will help us communicate the need internally.

On a side note, we are actually working on some ideas around N+1 queries. I'm going to want to tag in @marcotc who owns this one.