Distributed tracing: continue multiple traces

HoneyryderChuck commented 1 year ago

Is your feature request related to a problem? Please describe.

The problem I'm having relates to a set of concurrent operations (under a given Datadog trace) that push a data structure into a queue; the queue is then flushed into a single batch, which I send to Firehose via put_record_batch (AWS SDK Firehose).

Describe the goal of the feature

As described above, I have a set of operations that propagate tracing context into the data structure in the intermediate queue. Once they have all been fetched into a batch, I'd like to resume all of their traces before making the AWS API call mentioned above (which will start a new aws-sdk related trace, and should pick up all the traces from the messages in the batch).

Describe alternatives you've considered

I didn't really think about an alternative. Not sure if it's a bug either, or something accomplished by another datadog feature.

How does ddtrace help you?

The distributed tracing feature certainly helps me keep tabs on a given flow split into chunks of separate concurrent / distributed workloads.

delner commented 1 year ago

I'm interested in your use case, but I'm having trouble understanding the trace flow you're describing, and visualizing the desired output in the UI. Particularly, I'm unfamiliar with "firehose" and how this batching procedure works.

Is there any way to diagram this? Could you share pseudo-code that demonstrates how things should work? What have you tried using in the existing distributed tracing? Where did it fall short?

HoneyryderChuck commented 1 year ago

The AWS SDK function I mean is this one.

Essentially, I have a background thread, let's say in a sidekiq process, collecting data from a queue, and then batch-sending:

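require "aws-sdk-firehose"

firehose = Aws::Firehose::Client.new
$queue = Queue.new

Thread.start do
  loop do
    records = []
    while !$queue.empty? && records.size < 40
      records << $queue.pop
    end
    next if records.empty? # nothing to send yet
    firehose.put_record_batch(
      delivery_stream_name: "my-stream", # placeholder stream name
      records: records.map { |data| { data: data } }
    )
  end
end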

Meanwhile, any sidekiq worker may be writing to that queue:

def perform(arg)
  data = heavy_computation(arg)
  $queue << data
end

What I would like is to link the trace from the sidekiq worker all the way down to the firehose put record call writing that data field. It's sorta-kinda what the distributed tracing modules do (i.e. pass trace id to the next request / job queue / smth else), but in this case, because the sdk call involves many units of work (in this case 40 of them), I'd like a way to say "please continue these 40 traces in the next call to the aws sdk which starts a trace":

def perform(arg)
  data = heavy_computation(arg)
  $queue << add_trace_context(data)
end

# and later

while !$queue.empty? && records.size < 40
  data = $queue.pop
  DDtrace.continue_trace(data) # proposed API: 40 calls at once!
  records << data[:data]
end
# and now the aws sdk call starts a new trace, continuing the 40 above.
firehose.put_record_batch(records)
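
For readers who want something concrete, here is a minimal plain-Ruby sketch of the fan-in step that does not depend on any tracer API. `TraceContext`, `Envelope`, `wrap` and `drain_batch` are hypothetical names made up for this illustration (they are not part of ddtrace or the AWS SDK), and the IDs are invented. It only shows each producer attaching its trace context to the payload, and the consumer splitting a batch back into raw payloads plus the list of upstream contexts it would want to continue.

```ruby
# Hypothetical types for illustration only; not part of ddtrace.
TraceContext = Struct.new(:trace_id, :span_id)
Envelope = Struct.new(:context, :data)

BATCH_SIZE = 40

# Producer side: pair the payload with the trace context of the job that made it.
def wrap(data, trace_id:, span_id:)
  Envelope.new(TraceContext.new(trace_id, span_id), data)
end

# Consumer side: pop up to `limit` envelopes without blocking and split them
# into raw payloads and the upstream contexts they came from.
def drain_batch(queue, limit: BATCH_SIZE)
  envelopes = []
  envelopes << queue.pop while !queue.empty? && envelopes.size < limit
  [envelopes.map(&:data), envelopes.map(&:context)]
end

queue = Queue.new
45.times { |i| queue << wrap("payload-#{i}", trace_id: 1000 + i, span_id: 1) }

records, upstream = drain_batch(queue)
puts "#{records.size} records, #{upstream.map(&:trace_id).uniq.size} upstream traces"
```

The `upstream` array is what a "continue these traces" call would need to receive; how a tracer would actually attach those contexts to the next span is exactly the open question in this issue.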

delner commented 1 year ago

I see; this definitely helps a lot!

If I'm understanding correctly, it's basically a "fan-in", where the trace of the process reading from the queue is downstream of 40 other traces. In other words, the trace has 40+ parents (the 40 sidekiq jobs and whatever started this process itself.)

firehose.process ---------------------------> queue.pop
                                         /
sidekiq.job -----------------------------
                                       /
sidekiq.job ---------------------------
                                     /
38 others... ------------------------

I think it effectively boils down to a problem of multiple inheritance... what is a parent? And if you have multiple parents (each of which is its own trace), which trace do you belong to?

The trace of the "queue reading operation" should almost certainly annotate all of its ancestors (the firehose process and the 40 Sidekiq jobs preceding its own work.) It's just a question of how it should be annotated, and how it should be displayed in the UX.

Regarding annotation...

It probably would make the most sense that the parent is any operation that synchronously preceded the current operation within the same execution context. This would mean there would only ever be, at most, one parent: in this case, the "firehose process".
Any operation that precedes the current in a separate execution context (aka asynchronously) would not be a parent. Instead the current operation would be annotated to have "followed from" that operation. In this case, the queue reading operation will have "followed from" 40 Sidekiq jobs.
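
To make the annotation idea concrete, here is a rough plain-Ruby sketch of a span that has at most one parent and any number of "followed from" references. `SketchSpan` and `SpanRef` are hypothetical names for this illustration; they are not ddtrace classes, and the IDs are invented.

```ruby
# Hypothetical span model for illustration; not the ddtrace Span class.
SpanRef = Struct.new(:trace_id, :span_id)

class SketchSpan
  attr_reader :name, :parent, :follows_from

  def initialize(name, parent: nil)
    @name = name
    @parent = parent     # at most one synchronous parent
    @follows_from = []   # any number of asynchronous predecessors
  end

  # Record an operation from another execution context that preceded this one.
  def follow_from(ref)
    @follows_from << ref
    self
  end
end

firehose_process = SketchSpan.new("firehose.process")
queue_read = SketchSpan.new("queue.pop", parent: firehose_process)

# The 40 Sidekiq jobs ran in other execution contexts, so they are recorded
# as "followed from" references rather than as parents.
40.times { |i| queue_read.follow_from(SpanRef.new(2000 + i, 1)) }

puts "parent: #{queue_read.parent.name}, follows from: #{queue_read.follows_from.size}"
```

Keeping `parent` singular preserves the tree shape each individual trace has today, while the separate list carries the cross-trace relationships.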

Regarding display...

...I don't know if we have a good way of displaying this in a trace view (aka "flamegraph"). This visualization is great for showing synchronous processes, but doesn't do well with asynchronous behavior. And it definitely doesn't show multiple traces today. IMO, it would be more compelling to show a "trace graph" where you could see the "queue read operation" as a node on a directed graph, with the 40 sidekiq job nodes pointing to it. Drilling into each node would display the "flamegraph" for that individual trace.
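
As a rough illustration of the "trace graph" idea, the sketch below stores traces as nodes of a directed graph in plain Ruby. `TraceGraph` and the node names are hypothetical and only meant to show the shape of the data such a view might be built on, not any existing Datadog feature.

```ruby
# Hypothetical adjacency-list graph of traces; node names are illustrative.
class TraceGraph
  def initialize
    @edges = Hash.new { |hash, node| hash[node] = [] }
  end

  # Record that `downstream` followed from `upstream`.
  def link(upstream, downstream)
    @edges[upstream] << downstream
    self
  end

  # All traces that point at `node`.
  def predecessors(node)
    @edges.select { |_, targets| targets.include?(node) }.keys
  end
end

graph = TraceGraph.new
40.times { |i| graph.link("sidekiq.job-#{i}", "queue.read") }

puts graph.predecessors("queue.read").size
```

Each node would then stand for one trace, and drilling into it would open that trace's own flamegraph.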

Today, this behavior doesn't exist, and it goes beyond the Ruby library. But it's an interesting use case; I'd love to float this to our internal team to have them examine it further and consider what we could do to support it. Any description of the visual output/UX you might expect to see from this would help contextualize the feature request and make its value clear.

HoneyryderChuck commented 1 year ago

Thx for taking interest!

Visually, I don't expect anything wildly different from what exists today: for instance, if you jump into a "child trace" and zoom out, you may see multiple traces above, as the root trace may have spawned many "child traces", each with their own children, and you can only visually identify the hierarchy by the lines which vertically cut them. So in the case of a firehose trace in the example above, that line will connect with several other "uber traces", which aren't related to each other (which may be difficult considering the limitations of the current implementation).

delner commented 1 year ago

I tried brainstorming what a visualization could look like (granted I'm not a designer)...

[Image: Tracing use cases - Follows from flamegraph]

Does that roughly model what you're looking for? (For the sake of argument?)

I think the team has some interest in adding support for more interconnection between related traces like this. It could be a while before this sort of thing is supported first class in Ruby, pending those designs. I'll update you on this as there are developments.

HoneyryderChuck commented 1 year ago

It looks like a decent start, for sure 👌thx

delner commented 1 year ago

Just an update: our team is internally looking at ways to support this kind of behavior more broadly in tracing & APM. I'm sharing this example as a use case to help contextualize our designs. I'd like to see if we can figure out some first-class support on that end before implementing anything in Ruby. It could take a bit, though. We'll try to keep you updated when we have something to share!

Thanks for highlighting this @HoneyryderChuck!

DataDog / dd-trace-rb

Distributed tracing: continue multiple traces #2964