Support OpenTelemetry concepts on Activity

tarekgh commented 4 years ago

This issue tracking the work as the part of the issue https://github.com/dotnet/corefx/issues/42305

Please look at https://github.com/dotnet/designs/pull/98 for the detailed design.

noahfalk commented 4 years ago

@tarekgh is the primary dev owner, I am actively assisting. So far design feedback has been occurring in https://github.com/dotnet/runtime/pull/32660

reyang commented 4 years ago

Looks great! Thanks @tarekgh and @noahfalk!

One question related to

using (Activity foo = new Activity(...))
{
    throw new FoobarException(...);
}

If it is possible to capture the exception and associate it with the Activity?

In OpenTelemetry Python there is such feature (although I think it is more like a question to the language/runtime rather than API) where we know if there is exception and derive status code.

tarekgh commented 4 years ago

Thanks @reyang for your review.

I don't think we can automatically capture the exception inside using block without try-catch. The user can write the code like:

using (Activity foo = Activity.StartNew(...))
{
    try
    {
        throw new FoobarException(...);
    }
    catch (Exception e)
    {
        foo?.SetCustomProperty("Exception", e);
    }
}

and if there is any listener, will get the notification in stopping the activity and then can check if the exception is set there using GetCustomProperty. I'll think if we'll have a better way to handle it.

reyang commented 4 years ago

If developers perform exception handling in a central place (e.g. the evil global exception handler, or catching exception from top level code), is there a way for them to know what's the enclosing activity? (I guess no?)



try {
    using (var foo = new Activity("foo"))
    {
        using (var baz = new Activity("baz")
        {
            DoSomething();
        }
        DoSomething();
    }
    using (var bar = new Activity("bar"))
    {
        DoSomething();
    }
} catch (FoobarException ex) {
    log(
        ex,
        MagicalFunctionWhichReturnsTheEnclosingActivity(ex).Name, // this is hard since activity is disposed already
        MagicalFunctionWhichReturnsTheVeryContext(ex).SpanId,
    );
}

chandramouleswaran commented 4 years ago

Can you also share the examples for creating an Activity from a parent activity ID only? Specifically around async workflows.

There are lots of use cases for us where we might spawn off an async task while the sync portion of the API needs to resume back in existing Activity context. In that case, we would still want to link the async task's activity to the API activity.

There are couple of options I can think of Option 1 (taking inspiration from internal Ifx tools) Create an Activity and do not dispose it. Provide the ability to Unload it which returns to the older context. Activity created can be passed to the task Allow the option to Load the activity back and continue using a dispose pattern

Option 2 Get the serialized Activity data (may be just the current full ID is good enough if it has parent information) Pass that along to the async task Create a new activity with the serialized data as the parent Use the dispose pattern

noahfalk commented 4 years ago

Thanks Tarek! Its a little tricky to comment on in GitHub because all the text is in the body of an issue. I'll do quotes but if the comments grow it might be helpful to put the text as a markdown document in a PR to allow inline commenting.

We also enhance and simplify the usage of the Activity

I think there is another major value prop worth calling out explicitly: it is now reasonable for a listener such as OpenTelemetry to observe Activity instrumentation from any code component in the app. Previously OpenTelemetry would have needed a-priori knowledge of the strings that would be used for DiagnosticSource name and IsEnabled() callbacks for each Activity it wanted to receive callbacks for.

The current usage pattern of the Activity is not neat and not that performant

I don't think the current proposal is making any performance improvement? Previously when we had ActivitySource that offered a faster path, but we've removed it.

Also, the listener should be more performant than the current listening way (through DiagnosticListener)

Similar to above, I don't think the current proposal is faster?

// The following pattern is useful for the Dependency Injection scenario to allow easily creating the listener with the reflection without the need to subclass and override the listener methods.
        ActivityListener listener = ActivityListener.CreateAndStart(shouldCreateActivity: (name, ContextBoundObject, links) => true, onActivityStarted: (activity) => { ... }, onActivityStopped: (activity) => { ... });

I think ContextBoundObject was unintended? Lower in the spec it is ActivityContext
The strongly typed objects in the parameters may make it tricky to use via reflection but something like expression trees might make it doable. I'll play with it.

public virtual bool ShouldCreateActivity(string activitySourceName, ActivityContext context, IEnumerable? links)

Naming suggestions: activitySourceName -> activityName context -> parent

For ActivityListener I assume the callback behavior is that OnActivityStarted/OnActivityStopped get invoked for every Activity, but ShouldCreateActivity only gets invoked for Activities created via Activity.StartNew? We should clarify that.

I've been thinking a bit about how we handle cases where someone is using Hierarchial ids rather than W3C ids. I think there may be an easy path where we add an overload to Activity.StartNew and ActivityListener.ShouldCreateActivity that uses a string for the id instead of the ActivityContext.

noahfalk commented 4 years ago

If it is possible to capture the exception and associate it with the Activity?

If we had a defined place on the Activity where Exception should be stored then the BCL (or any of our users) could make a helper method that lets you write code like this:

Activity.RunWithActivity("MyActivityName", activity =>
{
   ... code that does whatever work inside activity scope
   throw new Exception("oopsie");
});

That helper could be implemented something like:

class Activity
{
    public static RunWithActivity(string name, Action<Activity?> action)
    {
        try
        {
             Activity? a = Activity.StartNew(name);
             action(a);
        }
        catch(Exception e) when AssociateException(a, e)
        { Assert.Fail("This should be unreachable code") }
        finally
        {
             a?.Dispose();
        }
    }

    static bool AssociateException(Activity a, Exception e)
    {
         a.SetCustomProperty("Exception", e);
         return false; // this function gets evaluated as an exception filter, we don't want to catch the exception
    }
}

If developers perform exception handling in a central place (e.g. the evil global exception handler, or catching exception from top level code), is there a way for them to know what's the enclosing activity? (I guess no?)

In a catch handler no, but in an exception filter potentially yes. However if the exception is caught and then rethrown at some intermediate frame (as often happens with async) then the filter won't help either.

tarekgh commented 4 years ago

@noahfalk thanks for the feedback. I have updated the description according to your comments. please let me know if I missed anything there.

@chandramouleswaran thanks for the feedback too, I'll try to get back to you soon.

pakrym commented 4 years ago

Is there a fast way to check if activity should be created? Sometimes collecting context and links might be expensive, it would be nice to have a performant way to know beforehand if someone is listening on the activity.

tarekgh commented 4 years ago

@pakrym we Activity.StartNew(string name) which doesn't take a context or links.

if you are talking about the Listener side and want to avoid dealing with context/links there, we may think to add a callback to the listener which doesn't take a context/links. but I am not sure if this case concerns you.

namespace System.Diagnostics
{
    public class ActivityListener : IDisposable
    {
        ...
        public virtual bool ShouldCreateActivity(string activityName, ActivityContext parent, IEnumerable<ActivityLink>? links)
        public virtual bool ShouldCreateActivity(string activityName)
        ...   
   }
}

pakrym commented 4 years ago

As a producer of activity I want to know when any particular activity is enabled so I don't have to extract links/ActivityContext from, for example an incoming request, when no one is listening.

tarekgh commented 4 years ago

got it. would adding something like:

public static bool ActivityListener.IsEnabled(string activityName)

will it address your scenario?

pakrym commented 4 years ago

would adding something like: public static bool ActivityListener.IsEnabled(string activityName)

Would it be as fast as DiagnosticSource.IsEnabled() check currently is (a single non-virtual field read)?

tarekgh commented 4 years ago

Would it be as fast as DiagnosticSource.IsEnabled() check currently is (a single non-virtual field read)?

If there are no registered listeners, then yes, we'll just check a static field value and if it is null we'll just return false. if there is any registered listener, we'll need to call them to know if any of the listeners interested in creating such an activity with the input name.

pakrym commented 4 years ago

Then I don't think it addresses my scenario, N virtual calls just to know if an activity should be created seems way to high for something like Kestrel.

tarekgh commented 4 years ago

ActivityListener.ShouldCreateActivity is added to the listener mainly a request from @noahfalk to enable the sampling scenario which he is interested in. We may try to turn off the sampling by default in the listener and be opt-in to turn it on. so by default the virtual calls to ShouldCreateActivity will be skipped except if someone enables it on any listener. would that be reasonable middle ground to support your scenario and in same time having a way for sampling?

SergeyKanzhelev commented 4 years ago

I think it will be great to align this design closer to the OpenTelemetry API.

Can we introduce a Tracer class that will start and stop activities instead of static methods? Tracer will also have a name and version so the entire Tracer may be disabled.
Let's have a "current" span always be present. It may be one with the invalid context.
ActivityContext.IsValid is a very useful method that allows to check whether the context returned by a class deserializing it from some external message was actually there. This way we can avoid using nullable ActivityContexts everywhere
Activity must allow to update it's name. I have a feeling that this design suggests that Activity name is closer to the Tracer name in OpenTelemetry speak.
Allow to provide start and end time of Activity in Start and End methods.

In general, can we just follow the open telemetry spec closer? Would it help to start collecting the use cases for the API?

tarekgh commented 4 years ago

I have looked at the OT specs and here are the differences:

SpanContext:

Tracestate is defined as key-value pairs and in our proposal is defined as a string containing the key-value pairs. should we stick to the specs? Activity already exposing this as string.
IsValid is defined in the spec and not in the proposal. I had it in the previous proposal and Liudmila suggested to remove it https://github.com/dotnet/runtime/pull/32660#pullrequestreview-362893928
IsRemote exists in the specs and not in the proposal. I had it in the previous proposal and I recall Liudmila mentioned this is going to be removed from the OT specs.

Span/Activity

OT specs have Span.GetContext which we can add to our proposal as ActivityContext Activity.Context { get; }. This property is going to throw if the Activity is not started as the context fields wouldn't be available.
Span in the OT spec has Status and Event (SetStatus/AddEvent) which we have decided not to add them from now before discussing with the community and try to change the specs.
Span allows creating a new object using the name, context, Kind, attributes, links, and start time. In the proposal, we are missing some of these and I'll add them to the proposal.
OT spec has IsRecording. I believe this can be equivalent to detecting if the Activity is stopped. please advise if I am wrong here. Activity currently has internal property IsFinished which we can expose the getter for this property (and possibly with IsRecording name, i.e. IsRecording = !IsFinsihed)
OT specs have UpdateName. the spec saying the following Alternatives for the name update may be late Span creation, when Span is started with the explicit timestamp from the past at the moment where the final Span name is known, or reporting a Span with the desired name as a child Span.. Which means we don't have to provide such a method. If we decide to add it, we'll need to figure out the consequences of that (like if listeners need to get notified? what happen if someone using the activity name in a dictionary? ...etc.) I am not seeing it is a good idea to provide this method and instead we should recommend creating a late span.

reyang commented 4 years ago

Tracestate is defined as key-value pairs and in our proposal is defined as a string containing the key-value pairs. should we stick to the specs? Activity already exposing this as string.

I think it is totally fine to leave it as a string. Most users won't consume it directly. Propagator authors will need to access this field for interop scenario, and there are more efficient ways to manipulate tracestate as a string rather than a dictionary.

IsValid is defined in the spec and not in the proposal. I had it in the previous proposal and Liudmila suggested to remove it #32660 (review)

IsRemote exists in the specs and not in the proposal. I had it in the previous proposal and I recall Liudmila mentioned this is going to be removed from the OT specs.

These seem to be helper methods that we can add later without breaking existing thing. Given OpenTelemetry specification is evolving, I think it is fine to not have them for now.

OT specs have Span.GetContext which we can add to our proposal as ActivityContext Activity.Context { get; }. This property is going to throw if the Activity is not started as the context fields wouldn't be available.

Span in the OT spec has Status and Event (SetStatus/AddEvent) which we have decided not to add them from now before discussing with the community and try to change the specs.

Sounds good. Adding @GeorgeJoy and @Kanwaldeep for awareness.

Span allows creating a new object using the name, context, Kind, attributes, links, and start time. In the proposal, we are missing some of these and I'll add them to the proposal.

OT spec has IsRecording. I believe this can be equivalent to detecting if the Activity is stopped. please advise if I am wrong here. Activity currently has internal property IsFinished which we can expose the getter for this property (and possibly with IsRecording name, i.e. IsRecording = !IsFinsihed)

It is a little bit different. IsRecording would return false if the span is not realized. For example, if a dummy tracer is used, all spans will be dummy spans and have IsRecording() == False.

Here goes a concrete example on how this would be useful.

OT specs have UpdateName. the spec saying the following Alternatives for the name update may be late Span creation, when Span is started with the explicit timestamp from the past at the moment where the final Span name is known, or reporting a Span with the desired name as a child Span.. Which means we don't have to provide such a method. If we decide to add it, we'll need to figure out the consequences of that (like if listeners need to get notified? what happen if someone using the activity name in a dictionary? ...etc.) I am not seeing it is a good idea to provide this method and instead we should recommend creating a late span.

I don't have a strong opinion here. Personally I would prefer to keep the UpdateName since I can think of several scenarios where it could be very useful. Regarding the semantic and listener behavior, I think as long as we have a well-defined behavior (so developers know what is the side effect if they change the name), it'll be okay.

tarekgh commented 4 years ago

It is a little bit different. IsRecording would return false if the span is not realized. For example, if a dummy tracer is used, all spans will be dummy spans and have IsRecording() == False.

thanks, @reyang for your quick feedback.

Does this suggest we don't need to provide IsRecording in BCL as we never have a dummy Activity?

reyang commented 4 years ago

It is a little bit different. IsRecording would return false if the span is not realized. For example, if a dummy tracer is used, all spans will be dummy spans and have IsRecording() == False.

thanks, @reyang for your quick feedback.

Does this suggest we don't need to provide IsRecording in BCL as we never have a dummy Activity?

I think the IsRecording is introduced to reduce instrumentation cost when the span is not used (e.g. as mentioned here). As long as we can address the problem - whether IsRecording, IsEnabled or IsSubscribed. From usability perspective, I would prefer Span.IsRecording though.

@pakrym what's your perspective?

tarekgh commented 4 years ago

From usability perspective, I would prefer Span.IsRecording though.

I am trying to understand how this property can be supported on the Activity. I am seeing we will have 2 options to do so:

Either having Activity.IsRecording as a static property or method taking activity name, which can tell if there are any enabled listeners. I think this will be kind of confusing.
Or having it as an instance property which will turn true if there are any added tags, links, ...etc. to the activity. otherwise will be false. This can be useful in the sampling scenario mentioned in OT specs but even without providing this property the callers can easily know that by checking the properties Links, Tags...etc. so the property would be provided for convenience and not lacking functionality.

SergeyKanzhelev commented 4 years ago

Activity.IsRecording is per every activity. It indicates whether attributes, events, status code, etc. should be set on Activity or not. It's very useful. It wasn't as useful before as Activities operated with raw objects which didn't require much parsing and allocations to be reported in DiagnosticsSource events

Activity must allow multiple states: Completely turned off (rare), Propagating context, but IsRecording is false (most often) , Active and collecting data. So I'd suggest to always create Activity with the context.

One difference between proposed API and OT - in OT starting of Activity was handled by Tracer implementation and recording decision was made there without an explicit callback from the library. So instrumentation code will be smaller. Introducing a separate callbcack makes code more complicated and require customer to store and propagate the result of the callback execution.

tarekgh commented 4 years ago

Propagating context, but IsRecording is false (most often)

does this mean when no one listening to the Activity?

Active and collecting data.

Does this mean when there is a listener?

I think we can add Activity.IsRecording with the behavior you are describing.

So I'd suggest to always create Activity with the context.

do you mean create an activity with a state? telling if recording or not?

One difference between proposed API and OT - in OT starting of Activity was handled by Tracer implementation and recording decision was made there without an explicit callback from the library. So instrumentation code will be smaller. Introducing a separate callbcack makes code more complicated and require customer to store and propagate the result of the callback execution.

Doesn't the tracer need to listen on Activity stopping? I mean should it already have a callback from the activity anyway?

If you have a specific model, please send more details with some pseudo-code show how things will work and why the current proposal not satisfying the scenarios you are addressing. I'll be happy to explore it more.

noahfalk commented 4 years ago

Sorry for a big reply on multiple threads of discussion. I'm working to get something in PR form so that discussion can happen inline as review comments or across multiple issues.

I think the IsRecording is introduced to reduce instrumentation cost when the span is not used (e.g. as mentioned here). As long as we can address the problem - whether IsRecording, IsEnabled or IsSubscribed. From usability perspective, I would prefer Span.IsRecording though.

Right now I think the proposal roughly maps OT concepts like this: Span.IsRecording = false -> Activity object was never allocated Span.IsRecording=true, IsSampling=false -> Activity is allocated and started, Activity.Recorded = false Span.IsRecording=true, Span.IsSampling=true -> Activity is allocated and started, Activity.Recorded = true

Beware the naming confusion - Activity.Recorded has nothing to do with Span.IsRecording. The BCL's name refers to the original naming used in the W3C spec at the time the BCL API was implemented. It is unfortunate that the spec has changed the name of this flag and more unfortunate that OpenTelemetry specification has repurposed the name to now represent a different concept.

The recommendation in this proposal is that consumers of the Activity API recognize that the Activity could be null and use the ?. operator to avoid trying to log state that will be ignored:

using(Activity? a = Activity.StartNew(...))
{
    a?.AddTag(...)
}

The spec isn't clear about what the value prop is to create Spans where IsRecording is false given that they won't be reported to SpanProcessors?

Can we introduce a Tracer class that will start and stop activities instead of static methods? Tracer will also have a name and version so the entire Tracer may be disabled.

As far as I can tell the Tracer is intended to support two scenarios that each appear to bring significant degrees of added complexity:

Multiple vendor implementations of the SDK objects
Multiple tracer objects in use inside the same app (multi-tenancy?)

In past discussions/comments I've warned that we didn't expect the BCL to do these things. Are you hoping for the capabilities or solely trying to align naming or something else? I'm happy to get these conversations written down rather than verbal so it doesn't feel like we are circling here.

Let's have a "current" span always be present. It may be one with the invalid context.

If you mean having Activity.Current return a non-null Activity where it currently returns null, I fear that would have bad compatibility implications. Creating an alternate property is possible but having multiple ways to check the current activity feels like it would create confusion about which API to use with limited benefit.

Activity must allow to update it's name. I have a feeling that this design suggests that Activity name is closer to the Tracer name in OpenTelemetry speak.

Activity.Name is a pre-existing property and a decade of precedent dictates the property is immutable. We can gladly try to find an alternate way to satisfy the use case but the spec does not provide insight into what that use-case is.

Then I don't think it addresses my scenario, N virtual calls just to know if an activity should be created seems way to high for something like Kestrel.

I'm not sure if the performance part of the discussion has been precise about where these costs are likely to come from and how it differs from the status-quo : ) Hopefully this helps everyone understand in theory where we are at now. If we want to go further we'll need to define and measure some benchmarks to see if practice matches theory.

If there are no listeners in the app (ie tech-empower) then there are no virtual calls. Both DiagnosticSource or the proposed Activity.StartNew implementation are maintaining a list of listeners in a field. We pay the cost of a null check on that field and if we failed to inline then we'd pay the cost of a non-virtual call to DiagnosticSource.IsEnabled() or Activity.StartNew(). In the specific case of an ingress point that will need to parse the traceparent field from the headers there is also some cost to prepare arguments for the Activity.StartNew call. I'm guessing that cost is order 10s of nanoseconds and getting rid of it would need a helper such as Activity.HasAnyListeners() to do the null check in advance and short circuit the work. This appears to be an easy enhancement we could do.
There are listeners in the app, but they aren't interested in listening to Kestrel's activities. I believe this would be an obscure scenario that isn't worth optimizing for. It would force Kestrel to parse traceparent always, then do 1 virtual call per such listener each of which return false. Cost 10s of nanoseconds for the http header field fetch + a few nanoseconds/listener. DiagnosticSource would win in this scenario by not paying those costs.
There are listeners in the app and they do want to listen to Kestrel activities. In this case the virtual calls inside of StartNew(...) are the equivalent of the delegate invocations that occur inside DiagnosticSource.IsEnabled(name, HttpContext) to let each subscriber decide if the IsEnabled() call should return true or not. The difference in cost comes down to how efficiently the subscriber can determine that they do want Activities with this name and then any optional work they do to make a sampling decision. ApplicationInsights does this. An ActivityListener certainly could fast-path a few name such as the one Kestrel uses but a fast general purpose mechanism would require a dictionary lookup costing ~10ns once the cache is hot. The DiagnosticSource probably saves 5-10ns here because being scoped to a single component means there are only a small number of potential strings making it possible to do hardcoded comparisons. The original ActivitySource design could eliminate the per-request string filtering recovering ~10ns, but its not clear to me if it is worth spending our time on that now (the current design doesn't appear to preclude adding those APIs in a later release). For comparison, Application Insights currently returns true from IsEnabled(name, httpContext) always which means Kestrel proceeds to pay ~700ns in costs creating, starting, and stopping that Activity. If we are going to spend time here I'd propose we ignore the 10ns name filtering for now and optimize a few 100ns out of Activity Start/Stop implementation instead? Or perhaps @benaadams wants to jump on the case and optimize all the things?

noahfalk commented 4 years ago

Sorry for a big reply on multiple threads of discussion. I'm working to get something in PR form so that discussion can happen inline as review comments or across multiple issues.

That PR is https://github.com/dotnet/designs/pull/98

pakrym commented 4 years ago

I agree that #1 and #3 would have pretty much same perf as in DiagnosticSource solution but #2 is what I have on my mind.

I believe this would be an obscure scenario that isn't worth optimizing for.

I think this scenario would become way less obscure when more activities would get added thought the library ecosystem.

I'd like the new solution to be as much pay for play for producers as possible and avoid a situation where adding a listener for one or two subsystems customer is interested in would incur a cost on all other producers instead of being scoped.

noahfalk commented 4 years ago

I'd like the new solution to be as much pay for play

No argument that this is a desirable property, in my mind the question is just that it takes a certain amount of design complexity, implementation costs, and testing costs to achieve it. On the one hand if Tarek thinks ActivitySource and filtering will be easy to restore then I'm not going to object, it feels like the long term good outcome. On the other hand I've been getting increasingly convinced that I was over-emphasizing that bit of performance and you haven't (yet) convinced me that the scenario would be common or overly impactful. Do you have an example of a performance critical scenario where the developer would want to add a 2nd listener to their solution that only listens to a small subset of the Activities? Assuming they did add a 2nd listener I'd guess that high perf web service scenarios have <= 20 Activity instrumention sites per request. If we speculate that the 2nd listener adds 10ns to each of those checks that is 200ns/request or ~0.3% perf regression in the desired Fortunes case. Is the scenario important enough that we need to optimize it down to 0.015% rather than 0.3%?

pakrym commented 4 years ago

It's not only about the 20x perf difference in the best case but also the way it scales:

Proposed solution has ActivityProducerCount X ListenerCount X 10ns of impact while the DiagnostiSource had ActivityProducerCount X 0.5ns so every additional listener adds an overhead to every callsite.

I don't think I have an exact performance scenario in mind just want to make diagnostic primitives as cheap as possible especially when they are turned off.

As for if it's worth it I'd like @davidfowl to chime in.

noahfalk commented 4 years ago

in the best case but also the way it scales

Agreed it certainly scales differently. So far I can't come up with a scenario where that scaling appears to matter? If there are scenarios that are creating 100s or 1000s of Activities per request but somehow the requests are also really fast it would be an issue - but I don't forsee that happening in practice. If we had a plausible hypothetical example I might see it differently.

tarekgh commented 4 years ago

Summary to the status of the discussions

SpanContext.Tracestate will keep it as string per @reyang recommendation
SpanContext.IsValid/IsRemote will not be added until we find a value for it. per @reyang recommendation. IsValid is a trivial check (context == default). IsRemote, was recommended not to added before by Liudmila.
Added public bool TryGetContext(out ActivityContext context) to Activity class to retrieve the Activity context.
Status/Event will not be added to the Activity for now.
I added a couple of overloads that allow passing Kind, attributes, start time, context and links.
IsRecording would not be needed per @noahfalk comment: https://github.com/dotnet/runtime/issues/31373#issuecomment-599165679. @SergeyKanzhelev please let's know if you have a comment regarding that.
I added a setter to OperationName (to address UpdateName). @noahfalk please advise if you have any concern allowing the setter.
Removed the ActivityListener and added public static IDisposable StartListening(Func<string, ActivityContext, IEnumerable<ActivityLink>?, bool> shouldCreateActivity, Action<Activity> onActivityStarted, Action<Activity> onActivityStopped); to the Activity class.
Will not add Tracer for the mentioned complexity. @SergeyKanzhelev let's know if you have a comment about that.
@pakrym concern about the perf which @noahfalk replied.
- @davidfowl could you please help to look and advise for the ASP.NET scenario?
- I don't mind returning ActivitySource back if the perf is really a concern here and also @SergeyKanzhelev need to clarify Luidmila's comments regarding ActivitySource which mentioned here https://github.com/dotnet/runtime/pull/32660#pullrequestreview-362888707.

noahfalk commented 4 years ago

Thanks @tarekgh!

I added a setter to OperationName (to address UpdateName). @noahfalk please advise if you have any concern allowing the setter.

I am concerned that if we add that setter people will use it in ways that break pre-existing code. All .NET code for the past decade has been able to rely on OperationName being an immutable value. I recommend we do not add the setter and get feedback from @SergeyKanzhelev or @reyang to understand what the use case is to search for an alternate solution.

IsRecording would not be needed per @noahfalk comment: #31373 (comment).

I had an open question (to @SergeyKanzhelev and @reyang) in that portion that needs to be answered before we can resolve IsRecording: "The spec isn't clear about what the value prop is to create Spans where IsRecording is false given that they won't be reported to SpanProcessors?"

I added a couple of overloads that allow passing Kind, attributes, start time, context and links.

It is hard to provide feedback on this before the code changes are visible. Ideally new work would should up as changes to the APIs in the design doc. Changes that people are proposing but we ultimately decide not to do can show up in the Q&A section to record the request and the rationale for why it didn't happen or what the alternative is.

Earlier I commented about needing to handle the hierarchial ID case. I think we need an overload ShouldCreateActivity which uses a string instead of ActivityContext.

reyang commented 4 years ago

I am concerned that if we add that setter people will use it in ways that break pre-existing code. All .NET code for the past decade has been able to rely on OperationName being an immutable value. I recommend we do not add the setter and get feedback from @SergeyKanzhelev or @reyang to understand what the use case is to search for an alternate solution.

FYI - this PR has some context about the span name semantic. I don't think users would be blocked if there is no UpdateName. It could be a bit inconvenient as I could imagine, as we have to give users some guidance which is not aligned with other OpenTelemetry SDKs.

I had an open question (to @SergeyKanzhelev and @reyang) in that portion that needs to be answered before we can resolve IsRecording: "The spec isn't clear about what the value prop is to create Spans where IsRecording is false given that they won't be reported to SpanProcessors?"

I think the spec does not ask SDK to create a new span. Most implementations would return a dummy span singleton when we know it is not being reported to SpanProcessor.

benaadams commented 4 years ago

in the best case but also the way it scales

Agreed it certainly scales differently. So far I can't come up with a scenario where that scaling appears to matter? If there are scenarios that are creating 100s or 1000s of Activities per request but somehow the requests are also really fast it would be an issue. If we had a plausible hypothetical example I might see it differently.

Isn't it that entire design though? This is tracing through a latency chain of many components not a single request; its N requests occurring sequentially so is directly N x additive to the latency.

tarekgh commented 4 years ago

@noahfalk I have changed the design doc to reflect the latest changes. we may need to add more bullets in the Q & A section to capture the updates in the discussion here.

noahfalk commented 4 years ago

[@reyang] FYI - this PR has some context about the span name semantic. I don't think users would be blocked if there is no UpdateName. It could be a bit inconvenient as I could imagine, as we have to give users some guidance which is not aligned with other OpenTelemetry SDKs.

Thanks! Reading thought the PR I still felt like I was grasping at the scenario where a user would need to call UpdateName(). The closest explanation I could find was this sentence at the end "Instrumentation MUST NOT default to using URI path as span name, but MAY provide hooks to allow custom logic to override the default span name." If I follow correctly the worry was that the code creating the Span either can't access the URI or doesn't know what portions of it are dynamic. Some code later on does have that information and wants to use it to improve descriptive text shown in the UI. If that is accurate it suggests there are two different name concepts being squeezed into one property, the initial one is for classification and sampling within the app, and the later improved one is only intended to aid end-user understanding. If we wanted to model that in .Net my first thought would be to have two properties, Activity.OperationName and Activity.DisplayName. The default value of DisplayName would be identical to OperationName, but it can be changed. However this feels like a place we'd want to engage on the broader spec rather than coming up with our scheme in isolation. I agree that the best path in the immediate future would be to leave it out.

[@reyang] I think the spec does not ask SDK to create a new span. Most implementations would return a dummy span singleton when we know it is not being reported to SpanProcessor.

Cool, then it sounds like there is no issue if .Net returns a null Activity to represent this case rather than a dummy.

[@benaadams ] Isn't it that entire design though? This is tracing through a latency chain of many components not a single request; its N requests occurring sequentially so is directly N x additive to the latency.

By single request I am referring to a single trace-id from the point it enters the .Net Core app until they point we send a response message that completes the request. Certainly it is possible the code that implements this request could have many sequential steps within it, including the creation of many Activities. However I don't think there is a scenario where the implementation both creates a large number of Activities and it is already extremely fast? For example do we expect a scenario where a request completes in 100us needs 100 Activities to instrument all of its dependencies? Being fast is usually mutually exclusive with doing a large amount of work.

@noahfalk I have changed the design doc to reflect the latest changes. we may need to add more bullets in the Q & A section to capture the updates in the discussion here.

Thanks!

SergeyKanzhelev commented 4 years ago

I don't know how to comment on this thread. I'll try. Pretty-please, can we switch to google doc?

UpdateName is absolutely needed. Asp.Net Core listener wouldn't work without it: https://github.com/open-telemetry/opentelemetry-dotnet/blob/633dda37bdb171f337596ba4b53021b5d35f117c/src/OpenTelemetry.Collector.AspNetCore/Implementation/HttpInListener.cs#L148
IsRecording vs. no activity allocation. I commented before that not creating activity breaks propagation of the context. Which will break user scenarios. So we need Activity to be created or at least context propagated. Thus it needs some indication on whether recording enabled or not.
Having specific callback for sampling instead of having the sampling decision made in creation callback is not aligned with OpenTelemetry spec. Some features like attributes returned by Sampler to be associated with the Span will simply not work.

noahfalk commented 4 years ago

Pretty-please, can we switch to google doc?

I agree a single inline thread is hard to follow - I am already trying to move discussion to the PR which does allow for inline annotations and replies to different threads of conversation. I think this is similar to properties we'd get from a google doc. I don't really want to move yet again to google doc because (a) it is different than the convention every other .NET design issue uses, (b) every time we move the discussion forum it breaks the flow and makes it that much harder to understand the history or accumulate issues in one place.

It is very tempting (I am guilty too) to keep replying back to this issue. To try and break that pattern I copied each of the issues you raised to comments in relevant portions of the PR and responded there instead.

SergeyKanzhelev commented 4 years ago

@noahfalk yes, PR in designs repo is OK.

tarekgh commented 4 years ago

Just FYI, I have updated the design proposal https://github.com/dotnet/designs/pull/98 to reflect the latest discussions and decisions.

chandramouleswaran commented 4 years ago

@chandramouleswaran thanks for the feedback too, I'll try to get back to you soon.

Any updates here?

tarekgh commented 4 years ago

@chandramouleswaran, sorry for the late reply I have missed this one unintentionally. thanks for the reminder.

ActivitySource.StartActivity overload APIs provide multiple options to specify the parent context.

        /// Creates a new Activity object if there is any listener to the Activity, returns null otherwise.
        public Activity? StartActivity(string name, ActivityKind kind = ActivityKind.Internal);

        /// Creates a new <see cref="Activity"/> object if there is any listener to the Activity events, returns null otherwise.
        public Activity? StartActivity(
                            string name,
                            ActivityKind kind,
                            ActivityContext parentContext, IEnumerable<KeyValuePair<string, string?>>? tags = null,
                            IEnumerable<ActivityLink>? links = null,
                            DateTimeOffset startTime = default);

        /// Creates a new <see cref="Activity"/> object if there is any listener to the Activity events, returns null otherwise.
        public Activity? StartActivity(
                            string name,
                            ActivityKind kind,
                            string parentId,
                            IEnumerable<KeyValuePair<string, string?>>? tags = null,
                            IEnumerable<ActivityLink>? links = null,
                            DateTimeOffset startTime = default);

Here are the options:

If you use the first overload which not providing ActivityContext nor parent Id, we'll create the Activity and have the parent pointing at the existing Activity.Current. This should address the async scenario as Activity.Current is defined as AsyncLocal which transfers with the context of the async call.
If you use the second overload, you can provide the parent ActivityContext (which includes the parent trace Id, span Id, flags and state). You have the control to set the parent context to the newly created Activity.
If you use the third overload, you can create the Activity object using the parent Id only.

Please let me know if you have more questions or anything unclear here.

tarekgh commented 4 years ago

https://github.com/dotnet/apireviews/pull/117

terrajobst commented 4 years ago

(marked as approved because that's what we decided in the notes but I didn't update the item during the review)

tarekgh commented 4 years ago

Use new Activity to Replace OT Span: https://github.com/open-telemetry/opentelemetry-dotnet/pull/660

dotnet / runtime