ioquatix opened 3 years ago
I started investigating this on a different app.
Here is the current implementation of the application:
```ruby
on 'index' do
  numbers = 10.times.map{rand}
  internet = Async::HTTP::Internet.instance
  barrier = Async::Barrier.new

  numbers.each do |number|
    barrier.async do
      internet.get("http://httpbin.org/delay/#{number}") do |response|
        response.read
      end
    end
  end

  barrier.wait

  succeed! content: "Hello World!"
end
```
The `barrier.async` call creates a new fiber. It essentially does the following:
```ruby
def trace_context=(context)
  if context
    ::Datadog.tracer.provider.context = ::Datadog::Context.new(
      trace_id: context.trace_id,
      span_id: context.span_id,
      sampled: context.sampled?,
    )
  end
end

def trace_context(span = Datadog.tracer.active_span)
  return nil unless span

  flags = 0
  flags |= Context::SAMPLED if span.sampled

  return Context.new(
    span.trace_id,
    span.span_id,
    flags,
    nil,
    remote: false,
    span: span,
  )
end
```
```ruby
# barrier.async invokes this:
def make_fiber
  context = self.trace_context

  Fiber.new do
    self.trace_context = context
    yield # client http request
  end
end
```
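As a minimal, framework-free illustration of this pattern (plain Ruby, hypothetical `Context` struct, no ddtrace involved): the context is captured in the creating fiber's scope before `Fiber.new`, then restored when the child fiber first runs.

```ruby
# Hypothetical stand-in for the make_fiber pattern above.
Context = Struct.new(:trace_id, :span_id)

parent_context = Context.new(42, 7) # captured before the fiber exists

restored = nil
fiber = Fiber.new do
  restored = parent_context # the child restores the captured context
end
fiber.resume

restored.trace_id # => 42
```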
It produces the following picture:
What's the reason for the empty rows between the trace spans? Is it because they are considered decentralised traces? Because of how we copy the context?
I don't know why you're seeing blank lines between each call. If I were to hypothesize, each of these spans is probably a "root span" (parent ID 0) with the same trace ID. Not sure though; I would need to see the trace data (spans and their attributes) and reconcile that with how our UI works.
You could experiment by wrapping the contents of `on 'index'` with another `trace` call: I know async can be modeled this way, and it might look better. Still, I would expect to see roughly parallel groupings like this under one parent. Not sure if that'd be satisfactory.
I think this resulting visualization is largely a facet of how flamegraphs work. In the case above, you're doing a "fan-out" (to start each HTTP call), followed by a "fan-in" (to wait on each HTTP call to finish) before completing the original operation (`index`). Because you have multiple execution contexts (fibers) running concurrently in the same trace, the graph must somehow plot these operations on the same timeline while indicating they are not direct descendants of one another (because stacked bars typically imply parent/child relationships).
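That fan-out/fan-in shape can be sketched in plain Ruby (threads standing in for fibers; the delays are made up):

```ruby
# Fan-out: start each piece of work concurrently.
delays_ms = [10, 20, 30]
threads = delays_ms.map do |ms|
  Thread.new { sleep(ms / 1000.0); ms * 2 }
end

# Fan-in: wait for every worker before the parent operation completes.
results = threads.map(&:value)
results # => [20, 40, 60]
```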
Given this example, what would you (intuitively) expect the trace visualization to look like? (Something of a mock flamegraph would be a really powerful aid.) I think if we have an idea of how you expect the visualization to behave, we might be able to work backwards and figure out how the trace should be represented in the data model.
For what it's worth, I think this is a very interesting issue and I would like to make this async use-case far more intuitive to implement and visualize, so I'm happy to work with you on this!
Thanks for your detailed reply. I am sorry I did not update this earlier but I have been busy. I actually already experimented with what you suggested and it shows up as you expected:
As to your suggestion about what this should look like, I also already made a mock-up:
I think the idea is that we need to group units of execution (processes, threads, fibers), but these aren't the same as spans as traditionally used to represent "traces" of application code; they are groupings, i.e. you never have a thread/fiber/process "inside" another one.
Rather than going deep on this right now, I'd rather just provide these high level ideas and see what you think. So, I'll finish here.
I have been using dd-trace-rb to trace typhoeus's multi-threaded parallel HTTP requests, and the flamegraphs look nice, as shown below.
In the case of the ddtrace patch for typhoeus (ethon), individual requests (ethon.request) are children of a wrapper virtual span (ethon.multi.request). Ethon::Multi works as a multi-request aggregator, and ddtrace just applies trace patches to the aggregator.
So, to draw similar flamegraphs for async fiber HTTP requests, we might need to implement a wrapper, aggregator library for async http requests, which itself would be beyond the scope of ddtrace.
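A toy sketch of that aggregator idea (hypothetical `Span` struct and `aggregate` helper, not the real ddtrace API): each individual request span records the wrapper as its parent, which is what produces the grouped flamegraph.

```ruby
Span = Struct.new(:name, :parent)

# Hypothetical aggregator: open one wrapper span, then attach each
# request span to it, mirroring ethon.multi.request / ethon.request.
def aggregate(name, request_names)
  parent = Span.new(name, nil)
  children = request_names.map { |n| Span.new(n, parent) }
  [parent, children]
end

parent, children = aggregate("async.multi.request", %w[get-1 get-2 get-3])
children.all? { |c| c.parent == parent } # => true
```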
@mmizutani how do you get the vertical teal "prefix" on all the children requests?
Actually I did read through the ethon instrumentation code to try and understand how I could achieve similar traces.
Is it because of this, from `multi_patch.rb`?

```ruby
@datadog_multi_span ||= datadog_configuration[:tracer].trace(
  Ext::SPAN_MULTI_REQUEST, # THIS SPECIAL NAME?
  service: datadog_configuration[:service_name]
)
```
The other complex thing to consider here is that the lifetime of a thread or fiber can be longer than an individual request in the flame graph, i.e. thread pools, fiber pools, background fibers, etc. This could make visualisation a little bit tricky, but it should still try to capture the concurrent execution if possible.
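For example, a single long-lived worker (a pooled thread or fiber) can serve many requests, so its lifetime span would enclose several unrelated request spans. A plain-Ruby sketch:

```ruby
# One long-lived worker serves multiple jobs: its lifetime exceeds
# any single "request", which is what makes visualisation tricky.
queue = Queue.new
handled = []

worker = Thread.new do
  while (job = queue.pop)
    handled << job # each job would be its own request span
  end
end

3.times { |i| queue.push("request-#{i}") }
queue.push(nil) # sentinel: shut the worker down
worker.join

handled # => ["request-0", "request-1", "request-2"]
```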
I've made some progress on this as outlined here: https://github.com/DataDog/dd-trace-rb/issues/1136#issuecomment-945401624
I'm going to follow up again once I've confirmed this is working, at least in its most basic form. We can check the current visualisation and go from there.
Hello.
I'm building a gem for tracing Ruby: https://github.com/socketry/trace. This gem is vendor agnostic and the goal is to provide vendor agnostic interfaces sufficient for distributed tracing including across fibers and threads. It's not designed to be a full interface but rather to provide the shims so that libraries can implement the tracing interface rather than be monkey patched. I know there are options like OpenTelemetry but they are too heavy for gems which in many cases won't be used in environments where tracing is occurring.
We have been using `Datadog::Context` internally to try and connect these traces. We have an open PR for experimentation: https://github.com/socketry/trace/pull/1

Essentially, the goal is to propagate the state from the parent to the child fiber.
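One reason explicit propagation is needed: `Thread.current[]` storage in Ruby is fiber-local, so a child fiber does not automatically see values set by its parent fiber.

```ruby
# Thread.current[] is fiber-local in Ruby, so this value is invisible
# inside a newly created fiber; the context must be copied explicitly.
Thread.current[:trace_context] = "parent-context"

seen = :unset
Fiber.new { seen = Thread.current[:trace_context] }.resume
seen # => nil
```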
When we first did this, we tried using `Datadog::Context` and recreating it using the `trace_id` and `span_id`, but this produced some odd diagrams. Notice the blank rows.
I assumed it's because we are not providing enough details, so I tried experimenting with capturing the actual `span` and reusing it as `child_of: span`, and this produced better pictures.

To be honest, it's not clear to me what the right way to do this is. Clearly recreating the context is not sufficient (even with the `span_id`). Maybe this is a bug? Or maybe there is a better way to do this.

Finally, and this might just be a problem on my end: within those fibers we have `async.http.client.call` traces, but those are not connected. They end up getting logged as separate traces and I'm not sure why. When fibers are not used, the HTTP client request does show up correctly.
Any help making this work correctly would be most appreciated. I'll continue to work on it on my end too.