dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Activity: RootId based sampling #21120

Closed lmolkova closed 4 years ago

lmolkova commented 7 years ago

When an Activity starts without an internal 'Parent' or an external ParentId, it generates an Id in the following form: |MachineName-timestamp-number.

Where

As a result, Ids generated for consecutive requests from one instance look like:

|Service1-36ef1287-1ad6aa.
|Service1-36ef1287-1ad6ab.
|Service1-36ef1287-1ad6ac.
|Service1-36ef1287-1ad6ad.

They all have the same beginning; only the last few characters differ.

It is natural to sample requests based on Activity.RootId, since it ensures consistent sampling across all traces of an operation.

This could be a problem for the hashing algorithms used for sampling. E.g. djb2 (wiki, sample implementation) favors the first characters over the last ones. ApplicationInsights uses it to sample telemetry items, so the Ids of Activities from a given service instance almost always land in the same hash bucket.

In the case of a small cluster (e.g. 1-2 instances), it is likely that all Ids will share the same hash bucket. In the case of a large cluster, it is still likely that machine names are quite similar and differ only in the last few characters, so the distribution of hashes may still be far from uniform.
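To make the skew concrete, here is an illustrative Python sketch (not the ApplicationInsights implementation) of djb2 over the sample Ids above, truncated to 32 bits the way a C# int hash would be:

```python
def djb2(s: str) -> int:
    """Classic djb2 string hash: h = h * 33 + c, kept to 32 bits."""
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF
    return h

ids = [
    "|Service1-36ef1287-1ad6aa.",
    "|Service1-36ef1287-1ad6ab.",
    "|Service1-36ef1287-1ad6ac.",
    "|Service1-36ef1287-1ad6ad.",
]
hashes = [djb2(i) for i in ids]
# Ids that differ only in the trailing characters yield hashes that differ
# only by small amounts (here exactly 33 per step), so any sampling score
# derived from the hash's high bits barely moves between requests.
```

Since consecutive Ids land within a few dozen hash values of each other, a threshold comparison on hash / int.MaxValue puts them all on the same side of the cut.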

I'd like to highlight that this mostly affects small clusters, and only when clients (client apps or pages) do not generate the first Id; however, that is a common case for ApplicationInsights and perhaps other tracing systems.

So 'root' Id generation should be revised to make sure existing sampling/hashing algorithms are able to use it and achieve a close-to-uniform distribution. This implies that the root Id should be random, at least in its first and last parts.

/cc @vancem @cwe1ss @Dmitry-Matveev @SergeyKanzhelev

lmolkova commented 7 years ago

stephentoub added this to the Future milestone a day ago

@stephentoub It is critical for ApplicationInsights scenarios and possibly other tracing systems we are not aware of, so I consider it a bug that we should solve for 2.0.

benaadams commented 7 years ago

Currently it's a sortable set of Ids; the change would make it unsortable due to an implementation detail of one consumer. Can the hashing be changed instead?

lmolkova commented 7 years ago

@benaadams Do you mean the Ids are monotonically incremented in the sample? Consider that there are dozens of service instances, started at different times and having different random seeds for the third part (the number). Ids are generated per incoming request on a particular instance and stored (along with log records) in some distributed storage. What do you mean by sortable, and what is its value in this scenario?

Regarding the hashing: even though it is possible to change the hashing algorithm used in ApplicationInsights, we want to make Activity work well with other tracing systems, and we cannot ask everyone to change their hashing algorithm to one that is sensitive to the least significant bytes.

benaadams commented 7 years ago

What do you mean by sortable and what is the value of it considering this scenario?

If you have something go wrong, it is generally on a single instance and at a particular time. To get an idea of what was going on you would want to pull all the traces from that time period and that instance to get an idea of what was happening and what order it occurred in. Using a datestamp is too imprecise to get event order.

lmolkova commented 7 years ago

1) Let's consider requests independent: if something was broken by one request, in general it should not affect subsequent requests. Of course, there are cases where that's not true, but we did not intend to solve that problem.

2) All traces that belong to one request should share the Activity.Id generated for the request (or be logged in the scope of an Activity that is a descendant of the root Activity). So from the log storage, you will be able to query all logs that belong to this request (or the whole operation involving multiple services) based on the Activity.Id, like select * where Id.startsWith('|Service1-36ef1287-1ad6aa.').

benaadams commented 7 years ago

We consider requests independent: so if something was broken by one request, in general it should not affect subsequent requests

Assuming a single request caused a problem with an instance (rather than a badly interacting set) you may have an example of a request that was having a problem then want to work backwards looking at previous requests to do root cause analysis.

So the goal of Activity.Id is that, from the log storage, you will be able to query all logs that belong to this request based on the Activity.Id, like

And you can use that same query currently to look at requests that occurred around the same time period by removing a few characters. e.g. select * where Id.startsWith('|Service1-36ef1287-1ad6')

we did not intend to solve that problem

However, you already have solved it, even if you did not intend to. Changing this means you are undoing a problem you solved along the way. Even if it was an unintentional side effect, it is a good one.

Re: hash code

The System.Numerics HashHelpers Combine would likely be a better hashcode generator than the DJB one used by ApplicationInsights /cc @jamesqo @VSadov @dsyme

benaadams commented 7 years ago

The issue, though, isn't that the hash favours the high values rather than the low ones; it's the calculation of the sampling score, where you take that integer hash and then divide by int.MaxValue to get a double. So it's the significance that calculation places on the bits, rather than the hash calculation itself.

e.g.

samplingScore = (double)telemetry.Context.Operation.Id.GetSamplingHashCode() / int.MaxValue;

Dmitry-Matveev commented 7 years ago

@benaadams, here is the result of the same hash of the consecutive Activity IDs without the "double" magic: [attached image: table of hash values]

This still looks pretty close: the first 6 digits are often identical. The sampling decision (comparison to some sampling-score threshold) on such a set would be skewed regardless of the "double"/int.MaxValue magic. A simple modulo does not seem to help either, since this is not a uniform distribution but consecutive input leading to somewhat consecutive hash results.

Changing the sampling/hash in AI would mean changing it in every supported AI technology to date - JS/Java/.NET Core/.NET/Node.js etc. The AI customer experience will differ between those technologies until the change is propagated across all of them.

Another issue is that a service with a small number of instances (started at about the same timestamp) will have similar hashed Id values, so all requests from this service into an AI-monitored service may be sampled out if they fall within the same range.

benaadams commented 7 years ago

@Dmitri-Botcharnikov not sure how you use the sampling; but if you want to emphasise the low bits you can multiply by 0x01010101 so

(double)Math.Abs(hash * 0x01010101) / int.MaxValue * 100;

Which then gives you

Hash      | * 0x01010101 | samplingScore 
----------|--------------|-----------------
198414362 |  -1988253158 | 92.5852525479092
198414363 |  -1971410149 | 91.8009388222363
198414364 |  -1954567140 | 91.0166250965635
198414365 |  -1937724131 | 90.2323113708907
198414366 |  -1920881122 | 89.4479976452179
198414367 |  -1904038113 | 88.6636839195451
198414368 |  -1887195104 | 87.8793701938723
198414369 |  -1870352095 | 87.0950564681995
198414370 |  -1853509086 | 86.3107427425267
198414371 |  -1836666077 | 85.5264290168539
198414372 |  -1819823068 | 84.7421152911811
198414373 |  -1802980059 | 83.9578015655083
198414374 |  -1786137050 | 83.1734878398355
198414375 |  -1769294041 | 82.3891741141626
198414376 |  -1752451032 | 81.6048603884898
198414377 |  -1735608023 | 80.8205466628170
198414378 |  -1718765014 | 80.0362329371442
198414379 |  -1701922005 | 79.2519192114714
198414380 |  -1685078996 | 78.4676054857986
198414381 |  -1668235987 | 77.6832917601258
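The table's arithmetic can be reproduced with a short Python sketch (emulating C#'s unchecked 32-bit int multiply; the values below match the first table row):

```python
INT_MAX = 0x7FFFFFFF  # int.MaxValue

def to_int32(x: int) -> int:
    """Emulate C# unchecked 32-bit signed integer overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def sampling_score(h: int) -> float:
    mixed = to_int32(h * 0x01010101)  # smear low-bit changes into the high bits
    return abs(mixed) / INT_MAX * 100

mixed = to_int32(198414362 * 0x01010101)   # -1988253158, as in the table
score = sampling_score(198414362)          # ~92.585, as in the table
```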
lmolkova commented 7 years ago

@benaadams we cannot ask everyone who would like to use Activity to change their sampling score algorithm. We should rather change Activity to work with some popular algorithms.

benaadams commented 7 years ago

You would be throwing out the last fig leaf of serializability though as StartTimeUtc isn't accurate enough on coreclr and is very inaccurate on full framework.

lmolkova commented 7 years ago

Well, even if we keep the current Id generation, it's always subject to change, since it is an internal implementation detail. It is an accident that it is currently monotonically incremented. You can only rely on the fact that it is globally unique.

Moreover, your service gets requests from other services that may or may not pass a Request-Id header (which will be used as the ParentId on the Activity). So even with the current generation approach, you never know which request(s) preceded some other request.

I believe the right way is to query for errors/exceptions first and find out when the error started to happen. Also, if you know that RequestN is broken, you still don't know where the problem started: at RequestN-1 or RequestN-M. So it is neither a convenient nor a reliable way to determine when the problem started.

You would be throwing out the last fig leaf of serializability though as StartTimeUtc isn't accurate enough on coreclr and is very inaccurate on full framework.

Do you find that https://github.com/dotnet/coreclr/pull/9736 does not give enough precision?

We should find other ways to solve it rather than have a contract that Activity.Id increments monotonically under some conditions.

benaadams commented 7 years ago

it's always subject to change since it is internal implementation

This is the absurdity: the Id is used by Application Insights purely to generate a "random" number, which is then used to determine whether it's one of the N% of events that gets sent.

The issue is that the internal implementation of corefx isn't producing a varied enough id for an external choice of hashing algorithm and manipulation of that hash to then appear as random.

Application Insights doesn't even need any Id for this; it could just generate a local random number - nothing is particularly dependent on what the value of the Id is.

So even with the current generation approach you never know which request(s) preceded some other request.

Currently you can forensically determine which request preceded the current request locally on that instance, as each portion of the Id has its own RootId, which by default will contain the instance Id - whether that's in the middle of the Id because it originated somewhere else, or at the start because it "originated" locally or had no parent Id.

When trying to work out the root cause of an entire instance fault (rather than single request fault); you don't necessarily care what the distributed request chain was but want to know the local serial request events sequence prior to the first request going "wrong".

e.g. StackOverflow's outage on July 20, 2016 was caused by bad regular expressions burning all the cpu. At some point all requests on a machine suffering from this will start getting slower than expected and eventually start timing out; but this will include normal requests.

When the slower requests are identified, which are a symptom, you then want to track back and look at earlier requests to try to find a cause. If the Ids are completely random, that's not particularly helpful here.

Do you find that dotnet/coreclr#9736 does not give enough precision?

Accuracy, not precision. While the smallest difference shown can represent 3.3M events per second, the largest difference is down to 74k events per second; and if you are on full framework, you are down to 100 events per second. And as shown with the improved version in coreclr, there are still collisions, as it doesn't increase at a constant and granular rate. It might be enough for most cases, but at high throughput the buckets of what could have come before will be really big.

Anyway have made my points, will leave this issue alone and go with what is decided.

benaadams commented 7 years ago

Ah, you want to sample all related events together so they all hash to the same threshold, so the Id is sorta important. Though items will become unsampled as the Id chain gets longer.

lmolkova commented 7 years ago

Currently you can forensically determine which request preceded the current request locally on that instance as each portion of the Id has its own RootId which by default will contain the instance Id;

Only the first portion (the root Id) will have the instance name in it; other nodes do not. So in a real case you would see a mix of Ids whose root Ids were generated on the client side, on another service, or on this service. Some of them may come from a Node.js or Go app which generates Request-Id in a completely different way.

Even in the StackOverflow outage example, you'll find that request number 1234567890 was slow, but maybe the root cause was request number 345. You cannot examine them all even if you know the order, and my point is that you cannot rely on the order even now, except in the case where you are analyzing logs from the front end and nobody ever calls it with a Request-Id header.

Though items will become unsampled as the Id chain gets longer.

Sampling is done based on Activity.RootId (the root node), which is common to every trace related to the operation, so no matter how long the chain is, all traces will be sampled consistently.
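The consistency property is that the keep/drop decision is a pure function of the RootId. A hypothetical Python sketch (crc32 stands in for whatever hash the tracer actually uses; the RootId is assumed to be the segment between '|' and the first '.', as in the hierarchical format above):

```python
import zlib

def root_id(activity_id: str) -> str:
    # Segment between the leading '|' and the first '.'
    return activity_id.lstrip("|").split(".")[0]

def keep(root: str, sample_rate: float) -> bool:
    # Deterministic: every service that sees the same RootId decides the same way.
    h = zlib.crc32(root.encode("utf-8")) & 0xFFFFFFFF
    return h / 0xFFFFFFFF < sample_rate

ids = [
    "|Service1-36ef1287-1ad6aa.",        # root activity
    "|Service1-36ef1287-1ad6aa.1.",      # child span
    "|Service1-36ef1287-1ad6aa.1.a2.",   # grandchild span
]
decisions = {keep(root_id(i), 0.25) for i in ids}
# decisions has a single element: the whole operation is kept or dropped together
```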

cwe1ss commented 7 years ago

My opinion still follows the OpenTracing practices. Either let tracers create the ID(s) themselves (to let them use whatever they want/need) or - if you want/need to standardize - use separate IDs: a 64/128-bit random number for the trace id and a 64-bit one for the span/activity id. This seems to be the common practice among tracers who adopted OpenTracing. ulong trace ids would allow for simple "lower than" sampling checks. Zipkin recently moved to 128-bit trace ids, but I don't know how necessary this is in the .NET world, where there are few Twitters and Ubers.
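A minimal Python sketch of that scheme (the 128-bit random trace id and the "lower than" check are as described above; the threshold arithmetic is an assumption for illustration):

```python
import random

rng = random.Random(42)  # seeded only so this sketch is reproducible

def new_trace_id() -> int:
    return rng.getrandbits(128)  # fully random 128-bit trace id

def sampled(trace_id: int, rate: float) -> bool:
    # "Lower than" check: compare the low 64 bits against rate * 2**64.
    return (trace_id & ((1 << 64) - 1)) < int(rate * 2**64)

kept = sum(sampled(new_trace_id(), 0.10) for _ in range(10_000))
# kept is close to 1000 (10% of 10000), with no string hashing involved
```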

The currently implemented chaining is not necessary. Showing the relationship between activities is the responsibility of the tracing system.

benaadams commented 7 years ago

@lmolkova yes I see the problem šŸ˜¢

It's just a shame that a string hash that favours changes to the low bits (djb2, so modulus works) has been combined with a sampling method that favours changes to the high bits.

However, in spite of my protestations going more random shouldn't cause too much of an issue and will make people take less of a dependency on its behaviour (other than it being more random šŸ˜›)

vancem commented 7 years ago

It is not clear to me yet what the proposed fix is. I really LIKE the fact the root ID starts with identifiable things like the name of the machine, and something that represents the process on that machine. Also it is unclear what change is sufficient to fix common logging systems (which is all we are looking for).

For example, it would be trivial when generating IDs to use 0xB50001 as the number to add to get the 'next' id. This would have the effect of 'scrambling' the higher-order bits of the hash for algorithms like DJB, which do linear hashing, and should get you what is probably a fine distribution (we should of course test this). I also note that it preserves the property that Ben likes: you can tell something about the sequencing of events by looking at the ID's lower-order bits (up to 64K).
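A rough Python experiment along those lines (the id layout and the high-bit-sensitive score are assumptions mirroring the earlier discussion, not the actual implementation):

```python
def djb2(s: str) -> int:
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF
    return h

def score(s: str) -> float:
    # High-bit-sensitive sampling score in [0, 100), as discussed above
    return djb2(s) / 2**32 * 100

def scores(step: int, n: int = 16) -> list:
    # Hypothetical root ids whose counter part advances by `step`
    return [score(f"|Service1-36ef1287-{0x1AD6AA + i * step:x}.") for i in range(n)]

plain = scores(1)            # counter incremented by 1
strided = scores(0xB50001)   # counter incremented by 0xB50001
spread_plain = max(plain) - min(plain)        # scores barely move
spread_strided = max(strided) - min(strided)  # scores spread widely
```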

Comments? Is someone in a position to try this out (or refute my claim that this fixes things)?

cwe1ss commented 7 years ago

I really LIKE the fact the root ID starts with identifiable things like the name of the machine, and something that represents the process on that machine.

These things should be added as separate tags/metadata by the tracer. It's way too hard to query otherwise. Having them in the ID as well is just duplication.

cwe1ss commented 7 years ago

I also note that it preserves the property that Ben likes (you can tell something about the sequencing of events by looking at the ID's lower order bits (to 64K)).

Not really, IMO, because you probably wouldn't be able to do these kinds of queries in any tracer.

I think incremental request ids would have to be another tag which could easily be added by people who need it. It could then be a regular ulong that's easily queryable. As a separate field it also wouldn't need to have any uniqueness requirements (because it would be queried together with the process name/machine name).

SergeyKanzhelev commented 7 years ago

@cwe1ss's proposal to use a random value will work best, as it forces developers NOT to rely on internal implementation details. Carrying extra pieces of information like the machine name and a sequence number is not the ID's job.

Also, without randomness in the ID, Application Insights will generate and store in the context a sampling score for EVERY request, which is quite an overhead, given that most applications are not heavy enough to start sampling requests.

lmolkova commented 7 years ago

For example, it would be trivial when generating IDs to use 0xB50001 as the number to add to get the 'next' id.

With this change, Activity.Id looks like |DESKTOP-OJC80DI-c0059174-64e1202c.; only the last component increments, by 0xB50001 for each Activity created.

I played with it together with DJB and it gives a pretty good uniform distribution for any number of samples.

@Dmitry-Matveev @SergeyKanzhelev I will send you private DiagnosticSource.dll build so you can verify it.

I really LIKE the fact the root ID starts with identifiable things like the name of the machine, and something that represents the process on that machine.

I agree with @cwe1ss. If users are interested in such information, it's better to have it as a tag, especially considering that in a cluster the machine name may not be very useful. There are more useful tags, like the Azure role name/number. However, I'm fine leaving the machine name in the Id, since it is more useful than a random number anyway.

SergeyKanzhelev commented 7 years ago

@lmolkova I'm generally fine leaving the machine name, but if we can make the ID shorter by compromising on readability, I'd prefer that. A 128-bit random number can do it as a trace id, as @cwe1ss commented.

vancem commented 7 years ago

I agree with @cwe1ss. If users are interested in such information, it's better to have it as a tag.

To be clear I am not suggesting that anyone take a dependency on the format of the ID. But I will say that HUMANS doing DEBUGGING are likely to find the information handy for ad-hoc comparison.

Also to be clear, the machine name is not redundant, because it is part of what makes the ID unique. We are just doing it in a way that is more useful for ad-hoc debugging than a 128 bit random number.

cwe1ss commented 7 years ago

@lmolkova However I'm fine leaving machine name in the Id since it is more useful than random number anyway.

For what scenarios would it be more useful?

cwe1ss commented 7 years ago

But I will say that HUMANS doing DEBUGGING are likely to find the information handy for ad-hoc comparison.

Hmm... I can't see why. In distributed tracing, you can't do much with a single ID anyway, you'd have to look up the full trace in most cases and there you get this information and much more. And if you do local debugging with ETW/PerfView, you're on the machine anyway.

SergeyKanzhelev commented 7 years ago

@vancem for ad-hoc debugging you probably either RDP to the box or get a log file uploaded with the computer name as part of its name.

Worse, keeping the machine name makes this ID dangerous from a PII standpoint. When we collect telemetry from customer machines reported by our installer, we explicitly replace the machine name with an unreadable hash string. It is a compliance requirement. This unnecessary information exposure will force us to strip the machine-name substring out of the Activity ID.

So I do not understand the benefits and see many drawbacks: size, information exposure, and people taking a dependency on the format.

vancem commented 7 years ago

Hmm... I can't see why

Have you used systems (like Microsoft likes to produce) where the URLs are just one nasty long GUID? Then compare that to, say, GitHub, where the URLs are all sensible. It is the same issue: a URL at the end of the day is just a unique ID, but encoding the hierarchical data (this is machine X, process Y, request Z) is useful.

cwe1ss commented 7 years ago

I don't see how this relates. Trace IDs are just copy&paste strings that you put into some input box or use for a "where traceId = 'xxx'" query in the tracer. Also, the first Activity is in no way more special than child requests. A trace could involve 10 machines; why is it so important where it was started that it goes into the ID?

I hope I made my point so I'll stop arguing. I still care very deeply about this topic and I hope you don't see my frustration as personal criticism - it's just a result of my inability to understand most of your viewpoints regarding the whole activities-topic. I really wish for this to turn out well, but unfortunately, I still have a lot of doubts.

benaadams commented 7 years ago

Issue isn't limited to Activity.Id. Combining a hashcode that maximises entropy in the low order bits then sampling from the high order bits where entropy is lowest also causes issues with other things, like User.Id

https://github.com/Microsoft/ApplicationInsights-dotnet-logging/issues/82; but I do recognise there would be compat issues with changing it at this stage.

SergeyKanzhelev commented 7 years ago

@vancem @lmolkova what do you think? Does random instead of the current algorithm look reasonable? Or can you allow the consumer to define the algorithm/override the id?

@benaadams - yes, we are working on fixing this. Sampling by User.Id is also bad, as we mix up the notion of an anonymous user (which may not be the same across all the calls of a distributed transaction) with the sampling decision.

vancem commented 7 years ago

Here is my suggestion on how to move forward: There seems to be two issues

  1. Making the ID more friendly to bad (but common) hashing. This can be solved very easily by simply inverting the order used to make the root ID (basically, the thing that is incremented comes before the thing that represents the host/process). Thus the ID would look like this instead: |64e1202c-c0059174-DESKTOP-OJC80DI. This is a trivial change (1 line) and solves that problem.

  2. There is a concern that the name of the machine in the ID is actually problematic because it leaks private information. In the end, having the machine name in the ID does make it longer, so it really needs to prove its value. Thus I am open to removing it, but it still seems like it would be diagnostically useful (it marks the place where the top-level activity originated). Thus I would like not to remove it until it has had a chance to prove its value (or not) in practice. I am also skeptical of the PII requirement, because it was not part of the original issue, logging systems log all sorts of potentially private information, and machine names are not particularly private (every TCP/IP packet you send identifies the machine, after all). Thus I would like to treat this PII requirement as a separate issue. @SergeyKanzhelev, if you can show that there is a REAL compliance issue that your security people would force you to mask out the information to fix, then I will concede it is a problem. Create a new issue for it and we will basically turn the machine name into some hash/random number, which will solve that problem.

cwe1ss commented 7 years ago

Some other reasons why the machine name is useless/problematic:

Sometimes an easy fix is not the best one. If you want to properly support correlation, do what the .NET Framework should do and stay implementation-agnostic. Don't try to solve problems you can't possibly solve. Let the tracers decide the ID and headers, with proper Inject/Extract hooks; you can then make your own default tracer with whatever convention you like.

It's not yet too late and will make everyone happy and .NET an awesome language for distributed programming!

... https://xkcd.com/927/

SergeyKanzhelev commented 7 years ago

@vancem the computer name is considered PII. Exposing it without customer consent is not permitted. You can argue that every HTTP request already exposes an IP address, but that is most often obfuscated by using load balancers and DMZ servers, whereas the HTTP header will not be hidden. Also, the Activity ID will likely be used by all sorts of tracing mechanisms and storages, so everybody will need to implement Activity ID obfuscation, which defeats the purpose of having a single Activity ID generation scheme.

If the decision is to take a hash of the machine name, the ID becomes a gibberish string of symbols anyway. Can we take it one step further and solve the problem of the excessive ID size as well? Just use a random number? We have a fast pseudo-random generator that is faster than Math.Random: https://github.com/Microsoft/ApplicationInsights-dotnet/blob/151509eac5e23fcbd09f3a8989a35d1099703957/src/Core/Managed/Shared/Extensibility/Implementation/XorshiftRandomBatchGenerator.cs

cwe1ss commented 7 years ago

Here's some quick research about Trace IDs in existing tracers:

@SergeyKanzhelev can probably provide details about the current Application Insights ID.

lmolkova commented 7 years ago

@SergeyKanzhelev

the computer name is considered PII. Exposing it without customer consent is not permitted.

ApplicationInsights ALREADY reports the host name, and the machine name is just a part of it: cloud_RoleInstance = DESKTOP-OJC80DI.europe.corp.microsoft.com.

Another question, though, is that Activity.Id is passed to any endpoint, potentially over plain HTTP, which is not secure.

SergeyKanzhelev commented 7 years ago

@lmolkova it is one thing to store the machine name and show it to the telemetry owner, and a different thing to transmit it to the outside world with every HTTP/HTTPS query.

We are doing a similar thing with roleName for Service Fabric with Nizar's recent changes, and I'll review this use with our security advisor.

lmolkova commented 7 years ago

fixed with dotnet/corefx#18881