[DiagnosticsSource] http correlation protocol should not force to make Request-Id unique

SergeyKanzhelev commented 7 years ago

Http correlation protocol is designed defensively to protect from bad callers. Many tracing systems are used in enclosed systems with trusted callers. These systems may benefit having shorter identifiers.

Proposal is to

change protocol description and
provide a way to enable "trusted callers" mode in built-in .NET instrumentation

HTTP Correlation Protocol states:

When Request-Id is provided by upstream service, there is no guarantee that it is unique within the entire system. Implementation SHOULD make it unique by adding small suffix to incoming Request-Id to represent internal activity and use it for outgoing requests, see more details in Hierarchical Request-Id document.

Hierarchical request format states:

Implementation SHOULD make it unique by adding small suffix to incoming Request-Id to represent internal activity and use it for outgoing requests. If implementation does not trust incoming Request-Id in the least, suffix may be as long as Root Request Id. We recommend appending random string of 8 characters length (e.g. 32-bit hex-encoded random integer).

CC: @jacpull @lmolkova @vancem

vancem commented 7 years ago

Do you have a proposal for how we should update the standard?

My understanding of why this clause is in the standard is the following scenario

Machine A is processing Request with ID1. It uses this ID1 as the ID for two request to two different machines, Machine B and Machine C.

Thus both B and C know that their 'parent' is ID1. So far all is good. However each in turn need to create children IDs. Normally each would put ID1.1 (since this is the first child of this ID as far as both B and C are concerned). However this causes an ID collision (the sub-requests from B and C are NOT the same request).

Requiring B and C to each add a ransom suffix solves the problem. (But it does make the IDs bigger).

There are two 'obvious' ways of solving this.

To make it illegal to reuse an ID on two separate HTTP calls. (Basically it is easier for machine A to create the child nodes efficiently than it is for the children). What this means is that the receiver can assume that the ID in HTTP headers can be assumed to have NO other children in the system.
Have something in the ID that tells target machines that the ID has property (1) and a randomizing suffix is not needed (our ID standard we are creating can do this). In the spec we would simply state that you only need to add the randomizing suffix if you don't know caller is obeying the convention.

I am assuming you are suggesting (1) (it is simpler). I believe our current implementation does do this (we create an activity for just single HTTP outgoing request).

Thus the work is to:

Change the spec to this.
Remove any logic we currently have to add the uniqueness suffix.

Is this correct?

lmolkova commented 7 years ago

Randomization is not only needed because we do not trust callers but to provide unique Id in certain scenarios:

HTTP request headers are copied when request is redirected. So if service calls any dependencies before redirecting request (e.g. auth, caching, proper location lookup, etc), there will be Id collisions.
retries which are done by proxy

So making it illegal to have the same Id on two separate calls contradicts HTTP and existing proxies implementations.

jacpull commented 7 years ago

@vancem - What would be preferable is something in the Activity/Proc configuration. I'm not sure keeping this in the ID itself is a good idea as the value passed in by an untrusted caller.

@lmolkova - Agree. Another example is pub-sub mechanisms, where the same value is subscribed to by many components.

jacpull commented 7 years ago

Wanted to clarify, that we are not removing the restriction that Request Id should be unique. We are simply creating an option for components to maintain uniqueness along with enabling sort and more efficient wire cost under the right circumstances, with the tradeoff being a small performance cost and responsible configuration. This proposal and the one in dotnet/corefx#22339 provides this flexibility.

vancem commented 7 years ago

I am still on the fence about this. Basically the issue is whether IDs coming into a system need to have a random child ID appended to it to insure that the IDs on that system do not collide with children generated on other machines (which got the request from the same parent).

On the one hand:

Adding the random child ID is not that big of deal. It simply adds a modest size. Thus nothing is broken. There is the issue dotnet/corefx#22339 which points out that random child ID might cause grief with sorting the ID, but ONLY in the case where there would have been a collision. That is the sorting is 'wrong' only when creating the ordering was problematic already.

On the other hand:

The random child does not really serve a useful purpose in the common case where the 'callers' are doing what we want (and making an ID per HTTP request), which should be frankly all the time.

The fact that HTTP redirection and proxies cause the ID to be sent to more than one target is concerning. However it is not a 'slam dunk' for me because either

The redirector / proxy is not participating logging / activity propagation, in which case it will not emit events, and thus there is no issue OR
The redirector /proxy IS participating, and frankly it SHOULD update the ID In the HTTP header

Also note that if nothing is done the effect is not likely to be that bad. It does mean that activities in the proxy might be aliased to activities in the target system, but frankly it does not seem that bad.

So I could go either way right now.

My preference right now is to leave things as they are since the size issue does not seem that bad, and we can always mitigate that in the future by dropping the number of 'random' bits from 32 to say 16 (and using more compact formats like Base64 instead of Hex).

I also would not have a problem adding syntax to the ID that allows the sender to indicate that the ID being sent is a singleton (it will only be for that request) so that receivers can make more optimal decisions.

But it is not clear that any of this is a high priority (we could change our minds in the future without issue).

jacpull commented 7 years ago

The additional size is a concern for two reasons.

It adds up for large systems. Assuming we are adding 4 bytes on each incoming path, on a large system like Asimov where the cV based cooker can see upwards of 200 Billion events per day with an average cV having 5 stages, this translates to roughly 4 TB increase per day, most of which is unnecessary.
Increased wire cost on metered connections for a Part a field is a sensitive issue for OSG. The initial version of the cV was allowed only 16 bytes for the base for this very reason. The current 1024 byte max size for the Request ID is much higher than the 128 byte max limit for the cV and this will be another point of concern.

Additionally, making the suffix optional would make it simple and clean to convert the cV 2.0 into a valid Request Id. So it should be possible to pass in a cV (without generating random data on the fly) and back again. While we will rev the cV specification to support the random prefix for the next version, interop with the current version would help make this transition a lot smoother and therefore a significant support for convergence.

vancem commented 7 years ago

From a spec perspective, I do not have a problem making the appending of a suffix on incoming request ID optional. As mentioned above, if we did not add a suffix on incoming requests, the worst that would happen is that it would cause ID ambiguity if the same HTTP request was forwarded to multiple machines (without updating the ID) which both logged activities (since each machine would generate child nodes starting at 0 from the same ID and be unaware that another machine was doing the same). This seems unlikely, and can be fixed if HTTP forwarding would update the ID.

So I have no problem with simply stating the problem in the spec, and indicating a solution, but not mandating it.

The main issue for me, is what we ACTUALLY do. In the .NET Framework we do have incoming HTTP code, and we have to decide what we do with the IDs that come at us. We have the choice of either adding a random child ID or not. There are several options

Don't Add the ID. This is presumably what cV does.
Add the ID as we do today. This seems OK, but the IDs get bigger.
Make option 1 or 2 configurable. This is OK but there WILL be default, and typically 99% of people leave the default so this option is pretty similar to 1 or 2 (based on which is the default).
Do (2), but use a smaller random number (say 2 characters). Two characters will give you 12-16 bits of entropy (thus at least 4096/1 odds against a collision) Given that we think that collisions are already rare, they just got 4K rarer for low cost.

My inclination is to do 2 (keep the implementation the same), but entertain the other options if size becomes a problem (It turns out encoding the ID as Base64 (we use Hex today), and other optimzations are likely save more so we should do them first).

jacpull commented 7 years ago

For the .NET implementation, I would suggest doing dotnet/corefx#3 with the default configuration being dotnet/corefx#2. We can take the burden of configuration enforcement for internal MS components. This is a small price for a smoother transition.

vancem commented 7 years ago

I am not against option 3 (configurable), but I would want to understand who exactly sets it and that the benefit is great enough to overcome the pain of getting dealing with a non-default configuration. It seems like everything works with option 2 as well (just as sooth a transition), and if/when we care about the size, we can pursue option 3 (or 4, or maybe even 1).

jacpull commented 7 years ago

If this option were enabled, we would have guidance on the relevant onboarding wikis/docs for any developer instrumenting with a Common Schema compliant logger on adding the necessary configuration in code. Sample wikis like those for the Asimov cV and Geneva IFx in any case would need changes to leverage the Request Id and subsequent integration will require some messaging. So this will target internal service developers and owners of managed apps/client components.

We would also add relevant checks to our audit workflows (which look for violations) and if we found instances where the entropy is irrelevant, point them to the same documentation.

vancem commented 7 years ago

@jacpull, you can see that plumbing the configuration has non-trivial cost (you have to get a bunch of devs all doing the right thing). It is an ongoing cost and while possible, it is very easy to underestimate its cost (and it will be hard to get 100% compliance). This is why I don't like systems that have configuration (easy to say in the design 'we support both' hard in practice to get what you want where you want it).

If the only issue with option 2 is that header IDs are a bit larger, going with option 2 is probably a good tradeoff. This is especially true because the system can work perfectly well if only parts of the system opt out. Thus

Start with option 2 (do what we do today). (of course all existing Cv implantation do what they do today), the only the we are talking about is people using the ASP.NET hosting logic (which is only place we have instrumented incoming HTTP logic that sets an activity ID based on an incoming HTTTP request).
If size is an issue change the IDs by using Base64 instead of Hex, etc.
If that is still not enough, add the option to turn it off, and pick the most important (volumeous) clients to opt out of making the IDs unique.

jacpull commented 7 years ago

Quite see the additional cost and overhead. But we cannot reactively add the option to turn it off when it becomes a problem. That approach is a difficult one to sell to platform owners that own the common sinks, upload agents and cookers who are sensitive to getting end users to make code changes in a short time window and also sensitive to growth projections from near term to maybe 18 months. So I expect considerable pushback on just about every aspect where wire cost optimization has not been pursued and agreed upon. As much as we dislike the need for policing this, the tradeoff makes the convergence plan a lot easier to communicate and sell.

My hope is that with option dotnet/corefx#3, we demonstrate the flexibility to align internal stakeholders by retaining the dials needed to address their concern. At the same time, should developers not follow the recommended practice, we have time and processes in place to deal with this.

dotnet / runtime

[DiagnosticsSource] http correlation protocol should not force to make Request-Id unique #22777