DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

Tracing inside Fibers #1389

Open rmosolgo opened 3 years ago

rmosolgo commented 3 years ago

👋 Hi! I'm coming here by way of a bug report in graphql-ruby: https://github.com/rmosolgo/graphql-ruby/issues/3366

A recent GraphQL-Ruby release included a Fiber-based batch loading plugin: it takes little "jobs" during GraphQL execution, then uses a bunch of Fibers to run the jobs concurrently. (Hopefully, it'll leverage the Ruby 3 scheduler soon so that I/O in those fibers happens in parallel, but that's not done yet.)
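For readers unfamiliar with the pattern, here is a rough, hypothetical sketch of "collect little jobs, then run them in Fibers" — not GraphQL-Ruby's actual implementation, just an illustration of the shape of it:

```ruby
# Hypothetical sketch of the pattern described above (not GraphQL-Ruby's code):
# collect small jobs, then run each one inside its own Fiber.
jobs = (1..3).map do |id|
  -> { puts "loading record #{id}" } # stand-in for I/O-bound work
end

# Any tracing state that was stored fiber-locally before these Fibers were
# created will not be visible inside them.
jobs.each { |job| Fiber.new { job.call }.resume }
```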

A user noticed that code run inside Fibers doesn't show up in DataDog APM, and that it's likely due to the use of Thread.current for storing DataDog state:

https://github.com/DataDog/dd-trace-rb/blob/079f18f410f590252030578aa240576f3fc4c6e5/lib/ddtrace/opentracer/thread_local_scope_manager.rb#L35-L37

(Inside a new Fiber, Thread.current's fiber-local storage is empty: Thread#[] and Thread#[]= are scoped to the current Fiber, not the thread.)
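A minimal, self-contained illustration of that storage behavior (plain Ruby, not taken from ddtrace):

```ruby
# Thread#[] / Thread#[]= are fiber-local, so a value set before a Fiber is
# created is not visible inside it:
Thread.current[:trace_state] = "parent trace"

Fiber.new do
  p Thread.current[:trace_state]                     # => nil
end.resume

# Thread#thread_variable_set / #thread_variable_get are truly thread-local
# and *are* shared with Fibers running on the same thread:
Thread.current.thread_variable_set(:trace_state, "parent trace")

Fiber.new do
  p Thread.current.thread_variable_get(:trace_state) # => "parent trace"
end.resume
```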

I'm sorry to say I don't have an idea for how to address this, but I thought I'd open an issue in case y'all have any suggestions. This certainly seems related to #1136, since both are about using Fibers.

Does anyone have a suggestion for how this issue might be addressed?

delner commented 3 years ago

Thanks for the heads up @rmosolgo. Indeed, ddtrace does depend on Thread.current to store trace context; when that storage isn't available, or changes, active traces will not cross execution contexts unless the previous context is explicitly loaded and accessed in the new context.
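Until the gem handles this automatically, a hedged sketch of that manual load-and-continue pattern is below. It borrows the cross-thread propagation approach from the gem's docs; `Datadog::Tracing.active_trace`, `TraceOperation#to_digest`, and `Datadog::Tracing.continue_trace!` are assumed from the 1.x public API, and whether the same pattern behaves correctly inside a Fiber is exactly what this issue is about, so treat it as a sketch rather than a confirmed fix:

```ruby
require "ddtrace" # newer releases ship as the "datadog" gem

Datadog::Tracing.trace("graphql.execute") do |span|
  # Capture a serializable snapshot of the active trace before creating the Fiber
  trace_digest = Datadog::Tracing.active_trace.to_digest

  Fiber.new do
    # Re-establish the trace context inside the Fiber, then trace as usual
    Datadog::Tracing.continue_trace!(trace_digest)
    Datadog::Tracing.trace("graphql.dataloader.fetch") do |fiber_span|
      # batched I/O for the Fiber's jobs would happen here
    end
  end.resume
end
```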

We expect this to be the case with Ractor and Fiber alike, so we're actively evaluating 1) what the trace behavior should be, and 2) how such behavior should be implemented. Sorry we don't have something for this right now, but feedback like this really does help us contextualize the right solution, so it's very helpful. If you or anyone else has suggestions, we'll happily consider those as well.

Once I round up with the team I expect I'll have more to share.

AlexRiedler commented 7 months ago

Is there any update you can provide around support for Ractors + Fibers in Ruby? It seems like a pretty vital feature for the future of Ruby development, particularly for performance.

narasimhaprasad-jiva commented 6 months ago

We are still seeing the same issue: there are gaps in the Datadog traces wherever we use Dataloader. Any update on this would help. @delner

delner commented 6 months ago

@AlexRiedler We spent a bit of time looking into this. Last I checked (some time ago), Ractors themselves were still a bit unstable and not in widespread use, and there are significant challenges in loading constants and sharing data across Ractor boundaries. It's still something I'm interested in adding in the future, but it isn't currently prioritized. We're open to changing that if there's significant community interest in this feature, though!

@narasimhaprasad-jiva can you clarify what Dataloader is? Sorry, I'm not familiar with what it refers to. (A library, perhaps?) Is it using Fibers or Ractors?

I think tracing should support workloads that span Fibers. There is some ambiguity as to how these concurrency patterns should be modeled in the trace data model; if we can clarify how that looks, I think there will be a clearer path forward. I'll raise this with the team for reprioritization.

AlexRiedler commented 6 months ago

@delner I might be able to answer the question about Dataloader: GraphQL-Ruby has a Dataloader module that uses Fibers... maybe something funky is going on there? (A rough sketch of how it's used is below.)
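For context, here is a rough, hypothetical sketch of what using GraphQL-Ruby's Dataloader looks like. The `fetch` method runs inside a Fiber managed by the Dataloader, which is where spans could go missing; names like `NumberSource` and `DemoSchema` are made up for illustration:

```ruby
require "graphql"

# Hypothetical batch-loading source; Dataloader runs #fetch inside a Fiber,
# so fiber-local tracing state from the surrounding request is not visible here.
class NumberSource < GraphQL::Dataloader::Source
  def fetch(ids)
    ids.map { |id| { id: id, name: "record-#{id}" } } # stand-in for a batched DB/API call
  end
end

class QueryType < GraphQL::Schema::Object
  field :record_name, String, null: false do
    argument :id, Integer, required: true
  end

  def record_name(id:)
    context.dataloader.with(NumberSource).load(id)[:name]
  end
end

class DemoSchema < GraphQL::Schema
  query QueryType
  use GraphQL::Dataloader
end

puts DemoSchema.execute("{ recordName(id: 1) }").to_h
```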

jakeonfire commented 5 months ago

We are noticing, after changing the app server for a deployment/service from unicorn to falcon (which uses fibers for greater concurrency), that the reported throughput is almost 50% lower. I'm fairly certain the throughput should be the same, since we are hitting the deployment with the same number of requests, and average latency is significantly lower. (Screenshot: Datadog graph, 2024-02-09.)

jakeonfire commented 5 months ago

I just realized that, despite the graph title, that metric is actually total request time, not a count of requests. When changing the graph operator from "sum by everything" to "count all values", the numbers make more sense. So never mind!