dotnet / runtime

Add short and URL friendly string representation to `System.Guid` #55290

Closed prezaei closed 2 years ago

prezaei commented 3 years ago

Background and Motivation

The shortest string representation of System.Guid is 32 characters long (format = "N"). Although this is URL friendly, it is not the most concise URL-friendly representation possible. RFC 2396, Section 2.3 defines the URL-safe characters as follows:

Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper- and lower-case letters, decimal digits, and a limited set of punctuation marks and symbols.

unreserved  = alphanum | mark
mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

It would be worthwhile for System.Guid to support a new format specifier, perhaps U, that generates a shorter URL-friendly string representation of the Guid.
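
For reference, the lengths of the existing format specifiers can be checked with a small sketch like the one below. The class and method names (`GuidFormatLengths`, `LengthOf`) are made up for illustration; only `Guid.ToString(string)` is the real API.

```csharp
using System;

public static class GuidFormatLengths
{
    // Hypothetical helper: length of the string produced by an existing format specifier.
    public static int LengthOf(string format) => Guid.Empty.ToString(format).Length;
}
```

Calling `LengthOf` for "N", "D", "B", "P" and "X" gives 32, 36, 38, 38 and 68 characters respectively, so "N" is the shortest built-in form.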

Proposed API

The following changes will be required:

| API | Change |
|--|--|
| System.Guid.Parse(string input) | It will be able to parse the shorter string representation |
| System.Guid.Parse(ReadOnlySpan<char> input) | It will be able to parse the shorter string representation |
| System.Guid.ParseExact(string input, string format) | It will parse the shorter string representation when format is "U" |
| System.Guid.ParseExact(ReadOnlySpan<char> input, ReadOnlySpan<char> format) | It will parse the shorter string representation when format is "U" |
| System.Guid.TryParse([NotNullWhen(true)] string? input, out Guid result) | It will be able to parse the shorter string representation |
| System.Guid.TryParse(ReadOnlySpan<char> input, out Guid result) | It will be able to parse the shorter string representation |
| System.Guid.TryParseExact(ReadOnlySpan<char> input, ReadOnlySpan<char> format, out Guid result) | It will parse the shorter string representation when format is "U" |
| System.Guid.TryParseExact([NotNullWhen(true)] string? input, [NotNullWhen(true)] string? format, out Guid result) | It will parse the shorter string representation when format is "U" |
| System.Guid.ToString(string? format) | It will return the shorter string representation when format is "U" |
| System.Guid.TryFormat(Span<char> destination, out int charsWritten, ReadOnlySpan<char> format = default) | It will try to format the current Guid instance into the provided character span in its shorter string form when format is "U" |

Usage Examples

var str = Guid.NewGuid().ToString("U");
Console.WriteLine(str); // prints out something like: "abcdefgh123"

Alternative Designs

We could also add extension methods.

Risks

All I can think of is that TryParse(...) now requires an extra check on the length of the string to determine whether it should try to parse the string as a short representation of the Guid.
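
The length check could look roughly like the sketch below. It assumes the short form is unpadded base64url (22 characters), which the issue does not specify at this point; `ShortGuidParsing` and `TryParseAny` are hypothetical names, not part of System.Guid. None of the existing formats is 22 characters long, so the length alone is enough to pick the branch.

```csharp
using System;

public static class ShortGuidParsing
{
    // Hypothetical helper: try the assumed 22-character base64url form first,
    // then fall back to the built-in parser for every other length.
    public static bool TryParseAny(string input, out Guid result)
    {
        result = default;
        if (input is null) return false;

        if (input.Length == 22)
        {
            // Map the URL-safe alphabet back to standard base64 and restore the padding.
            string padded = input.Replace('-', '+').Replace('_', '/') + "==";
            Span<byte> bytes = stackalloc byte[16];
            if (Convert.TryFromBase64String(padded, bytes, out int written) && written == 16)
            {
                result = new Guid(bytes);
                return true;
            }
            return false;
        }

        return Guid.TryParse(input, out result);
    }
}
```

This sketch still allocates a temporary string for the padded input; a built-in implementation would presumably avoid that.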

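Notes

- Perhaps we might also want to consider not using all of the mark characters ("-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")") to keep the URLs even more readable.
- I'll be happy to send a PR if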

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Author: prezaei
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime`, `untriaged`
Milestone: -

GrabYourPitchforks commented 3 years ago

Your sample has the comment prints out something like: "abcdefgh123". Can you give a concrete example of the type of output you expect? For example, what would be the exact output of Guid.Parse("7a1e687f-a5a9-47e1-b5ec-fd71abf06303").ToString("U")?

svick commented 3 years ago

I think you won't be able to do much better than base64-encoding the bytes of the GUID. If you do that, you'll get a 24 character string (22 if you remove padding). Since you can already easily do that today (e.g. Convert.ToBase64String(guid.ToByteArray())), a built-in format wouldn't help that much, and any such encoding would be completely non-standard. I don't see much reason to add this directly to Guid.
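
As a rough illustration of those numbers, here is a minimal sketch using the GUID from the question above. `GuidBase64`, `ToBase64` and `ToBase64Url` are names invented for this example, and the URL-safe substitution is an assumption about what the proposed format would do.

```csharp
using System;

public static class GuidBase64
{
    // Standard base64 of the 16 GUID bytes: 24 characters, ending in "==".
    public static string ToBase64(Guid guid) => Convert.ToBase64String(guid.ToByteArray());

    // Hypothetical URL-safe variant: drop the padding and swap '+' and '/' for '-' and '_'.
    public static string ToBase64Url(Guid guid) =>
        ToBase64(guid).TrimEnd('=').Replace('+', '-').Replace('/', '_');
}
```

For 7a1e687f-a5a9-47e1-b5ec-fd71abf06303, `ToBase64` returns "f2geeqml4Ue17P1xq/BjAw==" and `ToBase64Url` returns "f2geeqml4Ue17P1xq_BjAw". Note that the first three GUID fields are written little-endian by ToByteArray(), so the output does not follow the hex digit order of the "N" format.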

Tornhoof commented 3 years ago

> I think you won't be able to do much better than base64-encoding the bytes of the GUID.

For the web, Base64Url encoding fits better, and there is a nice helper method for it in ASP.NET Core: https://docs.microsoft.com/en-us/dotnet/api/microsoft.aspnetcore.webutilities.webencoders.base64urlencode?view=aspnetcore-5.0

prezaei commented 3 years ago

You got it. Effectively, we want to base64url encode the Guid. Doing this outside of System.Guid forces a heap allocation if we use Guid.ToByteArray(). The only way around the heap allocation that I can think of is something like this:

var guid = Guid.NewGuid();
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);

// now convert the bytes to a string using a base64URL encoder...
var result = Base64UrlEncode(bytes);

This is messy and given how often we have all seen Guids in URLs of the sites that we visit, it seems like a common problem that we should have a solution for.

Thoughts?

GrabYourPitchforks commented 3 years ago

I don't see much appetite for adding a domain-specific method (base64url encoding) directly on the Guid type. Keep in mind also that GUIDs and other identifiers tend to be used as paths in URLs rather than as query string components, and base64 is a case-sensitive encoding. Most real-world applications stick to all-lowercase identifiers for things that appear in paths and do not expect to see mixed case-sensitive identifiers. This further restricts the range of applications which might get use out of such an API.

prezaei commented 3 years ago

@GrabYourPitchforks, totally agree that we might end up with Base45. I would not look at this as a domain specific thing here. The actual problem I am trying to solve right now is to pass shorter correlation id (x-correlation-id) headers between some of our Azure products. Today, we use the simple Guid.ToString("N"). That wastes bandwidth.

In fact, Guid.ToString("N") is widely used when serializing to JSON, YAML, gRPC and much more. Oh, and don't forget all the logs that go into Geneva with all these long identifiers. If only there were a shorter version of this, we would be helping with climate change! You think I am joking, but I am not. This really is not a niche scenario for service code.

GrabYourPitchforks commented 3 years ago

That last response kinda provides evidence for my point that this is domain-specific, no? :) The problem as originally stated is that you wanted something appropriate for placement in URLs; but https://github.com/dotnet/runtime/issues/55290#issuecomment-875939075 shows that you actually want the shortest ASCII computer-readable representation of arbitrary binary data (which doesn't need to be URL-safe), and that making something human-readable and URL-appropriate might require yet another format (like base45). But Guid.ToString is really meant to produce something that both follows a standard pattern and is human-readable, so it's really not the ideal place for putting this functionality.

I'm sympathetic to the problem, but since you want the shortest possible representation and you're willing to use a non-standard format to get it, what's wrong with defining your own extension method?

public static string ToMinimalRepresentation(this Guid guid)
{
    Span<byte> asBytes = stackalloc byte[16];
    guid.TryWriteBytes(asBytes); // instance method, called on the guid being formatted
    Span<char> asChars = stackalloc char[22];
    Base64UrlEncode(from: asBytes, to: asChars); // placeholder for your own base64url encoder
    return asChars.ToString(); // the one and only allocation
}
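
One way to fill in the encoder step is sketched below. It is an assumption about how such a helper might be written, not the commenter's implementation: it relies on Convert.TryToBase64Chars, which emits standard padded base64, so the buffer is 24 chars and the last two are trimmed. The class name `GuidExtensions` is made up.

```csharp
using System;

public static class GuidExtensions
{
    // Sketch of the extension method with the encoder step inlined.
    public static string ToMinimalRepresentation(this Guid guid)
    {
        Span<byte> asBytes = stackalloc byte[16];
        guid.TryWriteBytes(asBytes);

        // Convert emits standard base64 with "==" padding, so reserve 24 chars.
        Span<char> asChars = stackalloc char[24];
        Convert.TryToBase64Chars(asBytes, asChars, out _);

        // Drop the padding and switch to the URL-safe alphabet in place.
        Span<char> trimmed = asChars.Slice(0, 22);
        for (int i = 0; i < trimmed.Length; i++)
        {
            if (trimmed[i] == '+') trimmed[i] = '-';
            else if (trimmed[i] == '/') trimmed[i] = '_';
        }
        return trimmed.ToString(); // the only heap allocation
    }
}
```

Usage is then `guid.ToMinimalRepresentation()`, which returns a 22-character string.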

prezaei commented 3 years ago

@GrabYourPitchforks, I can certainly do this and in fact have done so. My point is this pattern is pretty common out there. From websites to HTTP headers, to logs, etc. One of the reasons is that frameworks just don't make it available/easy for all devs to use these. Open any of our logs in Kusto/Cosmos and you will be shocked that no-one has taken the time to use a shorter version for a correlation id. Why? Is it because they can't write the code? No. It is because we don't make it easy for them to use an out of the box formatter and they are busy with so many other things. A good framework is there to simplify these types of work.

Let me ask you this: Why do we have so many other format specifiers but feel hesitant to add one more that has serious and real use cases? For instance, have you ever seen a Guid in this format: {0x00000000,0x0000,0x0000,{0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00}} (that is format = X).

Another question: What is the downside of adding this format? I am totally with you that we need to have a high bar for corlib but I believe I have made a good case here with so many use cases.

svick commented 3 years ago

@prezaei

> The actual problem I am trying to solve right now is to pass shorter correlation id (x-correlation-id) headers between some of our Azure products. Today, we use the simple Guid.ToString("N"). That wastes bandwidth.

If you care about saving every byte of bandwidth, why are you using a globally unique identifier? Wouldn't an identifier that's unique just to your application serve you as well, while being much shorter?

On the other hand, I just googled "guid to short string" and it seems to be a relatively common problem (with base64 usually being the suggested solution).

> Why do we have so many other format specifiers but feel hesitant to add one more that has serious and real use cases?

Maybe there was a reason for the other formats when they were first added. Maybe there still is. Or maybe they were a mistake. In any case, I don't think that's really a justification to add one more format.

prezaei commented 3 years ago

@svick, still need something globally unique. This is not for a single application. It will potentially be used by all of Azure if I get my way. HTH

tannergooding commented 2 years ago

Agree with @GrabYourPitchforks that this seems like a very domain-specific API and not something we'd be interested in exposing on System.Guid directly.

Given that Guid can format to a Span<char>, Utf8Formatter can be used to format to a Span<byte>, and Base64Encoder likewise has APIs that can process a Span, you can already do this "allocation free", just potentially with an additional loop compared to what a custom implementation might provide/allow.
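
A span-only version along those lines might look like the sketch below. It assumes the encoder being referred to is System.Buffers.Text.Base64 and that the target is a UTF-8 buffer such as a header or log writer; `GuidUtf8` and `TryWriteShortUtf8` are hypothetical names.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

public static class GuidUtf8
{
    // Writes an assumed 22-byte base64url (ASCII/UTF-8) form of the GUID into destination.
    public static bool TryWriteShortUtf8(Guid guid, Span<byte> destination, out int bytesWritten)
    {
        bytesWritten = 0;
        if (destination.Length < 22) return false;

        Span<byte> raw = stackalloc byte[16];
        guid.TryWriteBytes(raw);

        Span<byte> encoded = stackalloc byte[24]; // 22 data bytes plus "==" padding
        OperationStatus status = Base64.EncodeToUtf8(raw, encoded, out _, out int written);
        if (status != OperationStatus.Done || written != 24) return false;

        // The additional loop: make the alphabet URL-safe while copying, dropping the padding.
        for (int i = 0; i < 22; i++)
        {
            byte b = encoded[i];
            destination[i] = b == (byte)'+' ? (byte)'-' : b == (byte)'/' ? (byte)'_' : b;
        }
        bytesWritten = 22;
        return true;
    }
}
```

Nothing here allocates on the heap; the cost relative to a dedicated encoder is the second pass over 22 bytes.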