DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

How to use `Datadog::Context` to trace fiber execution. #1669

Open ioquatix opened 3 years ago

ioquatix commented 3 years ago

Hello.

I'm building a gem for tracing Ruby: https://github.com/socketry/trace. The gem is vendor agnostic; the goal is to provide interfaces sufficient for distributed tracing, including across fibers and threads. It's not designed to be a full tracing implementation, but rather to provide the shims so that libraries can implement the tracing interface themselves instead of being monkey patched. I know there are options like OpenTelemetry, but they are too heavy for gems which in many cases won't be used in environments where tracing is occurring.

We have been using Datadog::Context internally to try and connect these traces. We have an open PR for experimentation: https://github.com/socketry/trace/pull/1

Essentially:

parent_context = self.current_context # returns an instance of Trace::Context which contains trace_id and span_id at least.

Fiber.new do
  trace(..., parent_context, ...) do
    # .. do work
  end
end

The goal is to propagate the state from the parent to the child fiber.
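In ddtrace terms, the intent is roughly the following (a minimal sketch only, assuming the 0.x API where the active context lives on Datadog.tracer.provider; the span name is just a placeholder):

# Capture the parent's active context before creating the fiber...
parent_context = Datadog.tracer.provider.context

Fiber.new do
    # ...and reinstall it in the child fiber so that new spans join the same trace.
    Datadog.tracer.provider.context = parent_context

    Datadog.tracer.trace("fiber.work") do
        # .. do work
    end
end.resume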

When we first did this, we tried using Datadog::Context and recreating it using the trace_id and span_id but this produced some odd diagrams:

image

Notice the blank rows.

I assumed it's because we are not providing enough details, so I tried experimenting with capturing the actual span and reusing it as child_of: span and this produced better pictures:

image

To be honest, it's not clear to me what the right way to do this is. Clearly recreating the context is not sufficient (even with the span_id). Maybe this is a bug? Or maybe there is a better way to do this.
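For reference, the child_of experiment is roughly the following (a minimal sketch only, assuming the 0.x API where start_span accepts a child_of: option and spans are finished manually; the span name is just a placeholder):

# Capture the currently active span in the parent fiber.
parent_span = Datadog.tracer.active_span

Fiber.new do
    # Start the child fiber's work as an explicit child of the captured span.
    span = Datadog.tracer.start_span("fiber.work", child_of: parent_span)
    begin
        # .. do work
    ensure
        span.finish
    end
end.resume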

Finally, and this might just be a problem on my end, within those fibers, we have async.http.client.call traces, but those are not connected. They end up getting logged as separate traces and I'm not sure why:

image

When fibers are not used, the HTTP client request does show up correctly:

image

Any help making this work correctly would be most appreciated. I'll continue to work on it on my end too.

ioquatix commented 3 years ago

I started investigating this on a different app.

Here is the current implementation of the application:

on 'index' do
    numbers = 10.times.map{rand}
    internet = Async::HTTP::Internet.instance

    barrier = Async::Barrier.new

    numbers.each do |number|
        barrier.async do
            internet.get("http://httpbin.org/delay/#{number}") do |response|
                response.read
            end
        end
    end

    barrier.wait

    succeed! content: "Hello World!"
end

The barrier.async creates a new fiber. It essentially does the following:

def trace_context=(context)
    if context
        # Install a fresh Datadog::Context for this execution context, reusing the
        # parent's trace_id/span_id so that new spans attach to the same trace.
        ::Datadog.tracer.provider.context = ::Datadog::Context.new(
            trace_id: context.trace_id,
            span_id: context.span_id,
            sampled: context.sampled?,
        )
    end
end

def trace_context(span = Datadog.tracer.active_span)
    return nil unless span

    flags = 0

    if span.sampled
        flags |= Context::SAMPLED
    end

    # Capture the active span as a vendor-agnostic Trace::Context.
    return Context.new(
        span.trace_id,
        span.span_id,
        flags,
        nil,
        remote: false,
        span: span,
    )
end

# barrier.async invokes this:
def make_fiber
    # Capture the parent's trace context before creating the fiber...
    context = self.trace_context

    Fiber.new do
        # ...and restore it inside the child fiber before doing the work.
        self.trace_context = context
        yield # client http request
    end
end

It produces the following picture:

image

What's the reason for the empty rows between the trace spans? Is it because they are considered decentralised traces? Because of how we copy the context?

delner commented 3 years ago

I don't know why you're seeing blank lines between each call. If I were to hypothesize, each of these spans is probably a "root span" (parent ID 0) with the same trace ID. Not sure though; I would need to see the trace data (spans and their attributes) and reconcile that with how our UI works.

You could experiment by wrapping the contents of on 'index' in another trace call; I know async can be modeled this way and it might look better. Still, I would expect to see roughly parallel groupings like this under one parent. Not sure if that'd be satisfactory.
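For example (a rough sketch of that suggestion only, assuming the 0.x block form Datadog.tracer.trace(name) { ... }; the span name 'index.fan_out' is just a placeholder, and numbers/internet are taken from the earlier snippet):

on 'index' do
    numbers = 10.times.map{rand}
    internet = Async::HTTP::Internet.instance
    barrier = Async::Barrier.new

    # One parent span around the whole fan-out/fan-in, so the per-request
    # spans share a common ancestor.
    Datadog.tracer.trace('index.fan_out') do
        numbers.each do |number|
            barrier.async do
                internet.get("http://httpbin.org/delay/#{number}") do |response|
                    response.read
                end
            end
        end

        barrier.wait
    end

    succeed! content: "Hello World!"
end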

I think this resulting visualization is largely a facet of how flamegraphs work. In the case above, you're doing a "fan-out" (to start each HTTP call), followed by a "fan-in" (to wait on each HTTP call to finish) before completing the original operation (index). Because you have multiple execution contexts (fibers) running concurrently in the same trace, the graph must somehow plot these operations on the same timeline while indicating they are not direct descendants of one another (because stacked bars typically imply parent/child relationships).

Given this example, what would you (intuitively) expect the trace visualization to look like? (Something of a mock flamegraph would be a really powerful aid.) I think if we have an idea of how you expect the visualization to behave, we might be able to work backwards and figure out how the trace should be represented in the data model.

For what it's worth, I think this is a very interesting issue and I would like to make this async use case far more intuitive to implement and visualize, so happy to work with you on this!

ioquatix commented 3 years ago

Thanks for your detailed reply. I am sorry I did not update this earlier but I have been busy. I actually already experimented with what you suggested and it shows up as you expected:

image

As to your suggestion about what this should look like, I also already made a mock-up:

image

I think the idea is that we need to group units of execution (processes, threads, fibers). These aren't the same as spans as traditionally used to represent "traces" of application code; they are groupings, i.e. you never have a thread/fiber/process "inside" another one.

Rather than going deep on this right now, I'd prefer to just share these high-level ideas and see what you think. So, I'll finish here.

mmizutani commented 3 years ago

I have been using dd-trace-rb to trace typhoeus's multi-threaded parallel HTTP requests, and the flamegraphs look nice, as shown below.

In the case of the ddtrace patch for typhoeus (ethon), individual requests (ethon.request) are children of a wrapper virtual span (ethon.multi.request). Ethon::Multi works as a multi-request aggregator, and ddtrace just applies trace patches to the aggregator.

So, to draw similar flamegraphs for async fiber HTTP requests, we might need to implement a wrapper/aggregator library for async HTTP requests, which would itself be beyond the scope of ddtrace.

Screen Shot 2021-09-08 at 8 13 45
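Something along these lines might be the async equivalent (a rough sketch only, assuming the 0.x tracer API where trace without a block returns a span that must be finished manually and start_span accepts child_of:; the span names and the urls/internet variables are placeholders, not an existing ddtrace integration):

# Open one aggregator span around the barrier, mirroring ethon.multi.request...
multi_span = Datadog.tracer.trace('async.http.multi_request')

barrier = Async::Barrier.new

urls.each do |url|
    barrier.async do
        # ...and make each individual request span an explicit child of it.
        request_span = Datadog.tracer.start_span('async.http.request', child_of: multi_span)
        begin
            internet.get(url) do |response|
                response.read
            end
        ensure
            request_span.finish
        end
    end
end

barrier.wait
multi_span.finish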

ioquatix commented 3 years ago

@mmizutani how do you get the vertical teal "prefix" on all the child requests?

Actually I did read through the ethon instrumentation code to try and understand how I could achieve similar traces.

Is it because of:

            @datadog_multi_span ||= datadog_configuration[:tracer].trace(
              Ext::SPAN_MULTI_REQUEST,                                   # THIS SPECIAL NAME?
              service: datadog_configuration[:service_name]
            )

From multi_patch.rb.

The other complex thing to consider here is that the lifetime of a thread or fiber can be longer than an individual request in the flame graph, i.e. thread pools, fiber pools, background fibers, etc. This could make visualisation a little bit tricky, but it should still try to capture the concurrent execution if possible.

ioquatix commented 2 years ago

I've made some progress on this as outlined here: https://github.com/DataDog/dd-trace-rb/issues/1136#issuecomment-945401624

I'm going to follow up again once I've confirmed this is working, at least in its most basic form. We can check the current visualisation and go from there.