dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

[API Proposal]: Uuid data type #86084

Closed vanbukin closed 1 year ago

vanbukin commented 1 year ago

Background and motivation

There are 2 ways to organize the binary representation of Uuid: 1) as 16 separate octets 2) mixed-endian, which .NET inherited from COM/OLE.

Currently, the .NET Base Class Library only includes System.Guid. This is a data structure that implements the second method of binary representation.

If you use System.Guid as a unique identifier for objects, it is necessary to be extremely careful. For example, if it is used as a parameter in a database query. Due to the implementation specifics of System.Guid, calling the method System.Guid::ToByteArray() and calling System.Guid::ToString() with subsequent conversion of the resulting hexadecimal string to a byte array will produce different results.

From this, it follows that calling the constructor with a byte array or with a string also produces different results. If the constructor that accepts a string was called, the result of calling System.Guid::ToString() will match the value of the string passed to the constructor, but the result of calling System.Guid::ToByteArray() will not match. If the constructor that accepts a byte array was called, the opposite situation arises - the result of calling System.Guid::ToByteArray() matches the constructor parameters, but System.Guid::ToString() does not match.

This can lead to situations where, for example, the log records the string representation, but the database stores the binary representation. And if you decide to find an object whose identifier you saw in the logs, you need to perform the same conversion that is done inside System.Guid.

The above examples demonstrate the difficulties in working with System.Guid that arise due to differences in string and binary representations.

That's why I suggest adding a data structure called System.Uuid with a simple API that will have the same string (hexadecimal) and binary representation. The algorithms for generating a sequence of 16 bytes to construct this structure can be left as a space for creativity for the .NET community. Adding such a data type to the base class library would provide a solid foundation for the .NET ecosystem to freely use the first option of the binary representation (as 16 separate octets), without worrying about how the data is serialized - whether by converting to a string or binary format.

API Proposal

namespace System
{
    [StructLayout(LayoutKind.Sequential)]
    public readonly struct Uuid
        : ISpanFormattable,
          IComparable,
          IComparable<Uuid>,
          IEquatable<Uuid>,
          ISpanParsable<Uuid>,
          IUtf8SpanFormattable
    {
        public static readonly Uuid Empty;
        public Uuid(byte[] b)
        public Uuid(ReadOnlySpan<byte> b)
        public Uuid(string u)
        public static Uuid Parse(string input)
        public static Uuid Parse(ReadOnlySpan<char> input)
        public static bool TryParse([NotNullWhen(true)] string? input, out Uuid result)
        public static bool TryParse(ReadOnlySpan<char> input, out Uuid result)
        public byte[] ToByteArray()
        public bool TryWriteBytes(Span<byte> destination)
    }
}

And also all interface methods, comparison operators, equality operators

API Usage


var input= new byte[]
{
    0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 
    0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF
};
var uuid = new Uuid(input);
var uuidBytes = uuid.ToByteArray(); // output: 0x00,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB,0xCC,0xDD,0xEE,0xFF
var uuidString = uuid.ToString(); // output: 00112233445566778899aabbccddeeff

var uuidStr = new Uuid("00112233445566778899aabbccddeeff");
var uuidStrBytes = uuidStr .ToByteArray(); // output: 0x00,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB,0xCC,0xDD,0xEE,0xFF
var uuidStrString = uuidStr.ToString(); // output: 00112233445566778899aabbccddeeff

Alternative Designs

No response

Risks

No response

vanbukin commented 1 year ago

@DaZombieKiller

ToString is not intended for binary serialization, it is intended for display and parsing. You should not expect the results of ToString and ToByteArray to be equivalent, just like you shouldn't expect int.ToString() and BitConverter.GetBytes(int) to be equivalent. They serve very different purposes.

I completely agree, but System.Guid has methods for both binary and string representation. And these methods take endianess of the machine they are running on into account. And from the very beginning, I took into account that methods for working with binary representation cannot be removed. That's why I didn't suggest doing it, neither in the first post nor in further discussion.

The same fundamental problem would still exist, if you are serializing to binary then endianness cannot be avoided. I agree it would be less confusing though, because it would likely have dedicated serialization methods on BitConverter or BinaryPrimitives instead of the ambiguous constructor and ToByteArray method.

Great constructive suggestion that deserves discussion. In this case, System.Uuid can be considered as a container, whose public API allows working only with string representation, and the work with binary representation can be moved to BinaryPrimitives, which may be the best place to implement an API that takes endianess into account.

vanbukin commented 1 year ago

@tannergooding

API review explicitly discusses alternatives and linked issues. We will not review the other proposal without bringing up the fact that users originally asked for this. Many API reviewers are also already familiar with the context, as per my callout of the internal checks before I closed this issue in favor of the new one.

Thank you for the clarification. It makes the processes more transparent.

That is not how any type across the BCL works. Binary layout and string layout are not consistent and explicitly do not match by default for the vast majority of hardware that exists.

But as correctly noted above, System.Guid is special. It takes care of endianess itself when its methods are called.

Most modern machines (x86, x64, Arm32, Arm64, RISC-V, etc) are exclusively or at least primarily little-endian. They all represent their values with the least significant byte first in memory. However, ToString, Parse, and most other APIs all must follow the same format that the current (or explictly requested) culture follows. For English and invariant languages, this is that text and numbers are read from left to right, top to bottom, with the most significant digit appearing first. Thus, 65534 is 0xFFFE and is ordered in memory on most machines as 0xFE, 0xFF. If you were to access the raw bytes whether from some ToByteArray() method, from using a BinaryWriter, or many other scenarios, you would find that it serialized inversely from how ToString prints it.

I understand that. The problem is that Guid takes into account the endianess of its components. If we keep the current implementation, then if its internal representation were implemented as, say, int128, or as 16 separate bytes, then there would be no issues with endianess. You simply reverse all the bytes if necessary and there are no issues. But it behaves like a mixed-endian structure, which also exposes a public API for working with its binary representation.

Guid here is really no different and in fact if we exposed some Uuid we could safely define its internal layout to likewise be identical to Guid. We could define it to be a Guid field itself. Because the internal layout of the type doesn't matter.

I completely agree.

What matters is what the API that reads/writes the byte sequence does and there are two valid behaviors with exactly which being correct depending on whether you are expecting a variant 1 or variant 2 UUID.

As discussed earlier, it may be worth considering Sysytem.Uuid as an alternative to Guid that will not have a public API for working with its binary representation and instead will have methods in BinaryPrimitives that could take endianess into account and possibly provide a better way of working with this kind of data structure. This way, it would not have any methods for working with its binary representation at all. Alternatively, we could implement all possible methods for working with the binary representation as a public API for Uuid, taking into account the way this is done in other languages and ecosystems, simply to ensure that developers transitioning from other languages to .NET do not encounter difficulties in this area.

That is not how API review in .NET works. It is the responsibility of the area owners, me in this case, to make an initial determination on whether something is even worth bringing to API review in the first place. We get literally thousands of API proposals, from all kinds of users, covering all ranges of scenarios. It would be impossible to truly review them all in depth.

Thank you very much to you personally, as well as to the API Review team, and everyone who is involved in improving the .NET ecosystem. This is a voluminous and complex job.

This was a case where I had my own initial feeling of how API review would react and given the number of users asking for it, I did an initial offline check to confirm my suspicions. The new proposal was then opened based on the feedback from the API review members that a new type is indeed something we would not be willing to do; particularly given the scenario involved, how .NET has handled similar scenarios up until this point, etc.

The problem with the proposed solution is that it can be done in two ways: 1) as a breaking change, which I tried to avoid 2) by adding new methods that will make the usage of System.Guid and the ambiguities around it even more complex.

If we were doing this today, with no concern of back-compat. We would likewise have 1 type. We would likely call it System.Uuid instead. It would then have a constructor that required specifying isBigEndian, rather than it being an additional overload. We would similarly reject a request for System.Guid or System.MsGuid under the same principles.

So, would it be worth designing this type as something that you would do today and systematically migrating the .NET ecosystem to its usage? I understand that this is not a quick process and it will definitely take years. The final result, in the best case, we will see in .NET 12 or .NET 14, but maybe it's worth starting to address this technical debt.

Raw sequences of bytes fundamentally must be told what order they are in to be read correctly.

I agree. The problem is that the raw sequence, which Guid works with, needs to take into account its internal layout.

If I have an array of bytes 0x00112233445566778899AABBCCDDEEFF in a file, then to get the string "00112233445566778899AABBCCDDEEFF" in the ToString call, I need to pass the array 0x33221100554477668899AABBCCDDEEFF to the Guid constructor. That is, I need to swap the first 4 bytes, the next 2 bytes, and the next 2 bytes. This is because of the internal structure of System.Guid which is taken into account by the ToString method. If the situation is the opposite, for example, I have an array of bytes 0xFFEEDDCCBBAA99887766554433221100 in a file, then to get the string "00112233445566778899AABBCCDDEEFF" again in the ToString call, I first need to reverse this array and then swap the bytes as described above. If you do this with separate methods, you won't be able to get rid of byte shuffling. It will still be there, only hidden in the implementation of the BinaryPrimitives methods.

The Guid constructor that takes in a byte array, as well as the ToByteArray method, have existed since the .NET Framework 1.0, which was introduced in 2002. I'm afraid to imagine how much code would break if they were removed. And if they are not removed, adding new methods to the public API of System.Guid will only make things worse than they are now.

If you don't want to work with raw bytes, use strings. If you do work with raw bytes, you must understand the endianness of them or you risk the wrong thing happening on mismatch.

It's difficult to control across hundreds of repositories within a company with thousands of employees to ensure that nobody, anywhere, ever calls the byte constructor or the ToByteArray method. Analyzers (which can be not installed or disabled, regardless of whether it's intentional or due to human error), deprecation warnings (which can break many codebases with TreatWarningsAsErrors, which can also be disabled with #pragma) - these are all half-measures that provide an illusion of certainty (фnd we still haven't addressed the question of what it would cost to deprecate an API that has been around for 20 years).

The same issue would be present for one application calling ToByteArray() on Uuid and another calling new Guid(byte[]) on the other end. This is something that will happen in a number of scenarios for a new Uuid type.

Within .NET, this could be resolved by byte shuffling, but in such cases, it might be worth considering adding two overloads with different parameters.

It will also impact users targeting .NET Framework or .NET Standard where such a Uuid type doesn't exist and where we typically do not ship polyfill packages due to .NET Framework no longer being versioned.

That didn't stop the addition of DateOnly and TimeOnly, though.

It introduces a magnitude of additional complexity, considerations, integration concerns, and general failure points above and beyond simply having overloads on Guid that handle the relatively minor difference which only exists as part of serialization.

Yes, but that's all because nobody did anything about this problem for 20 years. It might be a good idea to include a public API review to discuss whether this type should be added, what the cost of adding it would be, how to adapt existing codebases to this type of data if it were to be added, and whether it is even necessary to do so. It could be worth considering adding it simply as a 'data box' with no API that could allow end users to make mistakes. Alternatively, a decision may be made to convert all of .NET to use this data type, and to designate Guid as 'legacy'. Perhaps the best solution would be to add separate static methods and perform some minor reworking of Guid itself.

vanbukin commented 1 year ago

@tannergooding We've discussed a lot in this discussion, and it's clear that there is a problem. I'm glad I was able to convey that to you, and thank you very much for listening to me. However, perhaps it would be worth discussing this publicly with the entire API Review team? Many members of the .NET community would be grateful for an understanding of why the API Review team decided to go down one path or another.

tannergooding commented 1 year ago

But as correctly noted above, System.Guid is special. It takes care of endianess itself when its methods are called.

It doesn't necessarily "take care of endianess". It has an internal format and exposes a set of APIs for dealing with serialization. Those APIs are currently incomplete and only let you work with little endian data, even if you may have big endian data.

If we keep the current implementation, then if its internal representation were implemented as, say, int128, or as 16 separate bytes, then there would be no issues with endianess

The same issue still exists, regardless of the internal layout of the data. The fundamental problem is that users need to be able to serialize to and from byte sequences representing variant 1 as well as byte sequences representing variant 2 data. Even if we exposed Uuid as proposed with the layout described in the original post, the exact same but inverse concern would exist in that the APIs exposed on the type would default to big endian but some users would need to work with the struct in little endian format.

We would end up having exposed Uuid that behaves and operates identically to Guid, except it defaults to big endian serialization rather than little endian.

As discussed earlier, it may be worth considering Sysytem.Uuid as an alternative to Guid that will not have a public API for working with its binary representation and instead will have methods in BinaryPrimitives that could take endianess into account and possibly provide a better way of working with this kind of data structure.

This would be the same as simply marking the APIs on System.Guid as [Obsolete("Use the explicit BigEndian and LittleEndian APIs on System.Buffers.Binary.BinaryPrimitives", error: true)]

We would most likely take such a breaking change over introducing a new type if API review determines its as big of a concern as some users in this thread are surfacing.

Alternatively, we could implement all possible methods for working with the binary representation as a public API for Uuid, taking into account the way this is done in other languages and ecosystems, simply to ensure that developers transitioning from other languages to .NET do not encounter difficulties in this area.

This is effectively what https://github.com/dotnet/runtime/issues/86798 is proposing. It is exposing all the APIs for working with the two possible binary representations of a GUID/UUID; only it is doing it on the existing type.

The problem with the proposed solution is that it can be done in two ways:

  1. as a breaking change, which I tried to avoid
  2. by adding new methods that will make the usage of System.Guid and the ambiguities around it even more complex.

A new type, System.Uuid, is even more problematic and ambiguity causing than 2. If we exposed System.Uuid we would now have 2 types using frequently interchanged terms that do the same thing except for how they serialize bytes, which is inverted from each other. Users then have to rationalize the subtle differences between these two types, when it is appropriate to use one vs the other and the 20 years of existing code that uses Guid but maybe should've used Uuid instead. Such code that exposes Guid but should've used Uuid then has to also make the decision on whether they want to take a breaking change, to obsolete their existing Guid API, etc.

2 is only an issue in terms of making it visible that some users want to pass isBigEndian: true, it becomes no more ambiguous than it is today and allows devs to transition towards the "better" approach without taking breaking changes to their public API surface themselves.

1 is only really desirable if we believe that the existing serialization APIs on Guid are truly that problematic. Even if we decide not to do 1 now, we could do it in the future if additional feedback comes in. Obsoleting the confusing existing APIs would still be more desirable than a new type. This is because a new type effectively means that downstream users are themselves forced to consider taking a breaking change and start exposing overloads that take Uuid and/or having to interoperate between Guid and Uuid, even if they don't do binary serialization.

With 1 or 2, they only have to consider Guid and they only have to consider the binary serialization differences if they are doing binary serialization, which is no different from the considerations they already have today.

So, would it be worth designing this type as something that you would do today and systematically migrating the .NET ecosystem to its usage? I understand that this is not a quick process and it will definitely take years. The final result, in the best case, we will see in .NET 12 or .NET 14, but maybe it's worth starting to address this technical debt.

That is what https://github.com/dotnet/runtime/issues/86798 is doing. It is exposing the missing overloads and it matches what we would expose today, minus the name where it remains Guid rather than using the "more cross platform" Uuid. We are not willing to expose a new type and deprecate a very broadly used existing type just to "fix" the name.

The problem is that the raw sequence, which Guid works with, needs to take into account its internal layout.

The same is true for Uuid. There is no guarantee that Uuid would be 16 byte fields and in fact it likely would not be 16 byte fields because that makes it significantly less performant to work with and much harder to convert to/from Guid, which would be a base requirement if we exposed it.

I need to pass the array 0x33221100554477668899AABBCCDDEEFF to the Guid constructor

Yes, or with https://github.com/dotnet/runtime/issues/86798 you would pass in the array of bytes 0x00112233445566778899AABBCCDDEEFF and say isBigEndian: true. Which tells the constructor that the bytes are in big endian order and not the little endian order they are currently presumed to be. This matches how we handle this for other types.

If you do this with separate methods, you won't be able to get rid of byte shuffling. It will still be there, only hidden in the implementation of the BinaryPrimitives methods.

The byte shuffling would also exist for the proposed Uuid type, because it is more efficient. It is far more efficient to emit two 64-bit loads of an intended endianness and byte swap than it is to copy 16 individual bytes. This is because most modern hardware (certainly everything RyuJIT currently supports) provides dedicated functionality to do this.

What the internal implementation does to ensure correct behavior for a given method doesn't matter for the exposed public API surface. We are going to do whatever is most efficient, within the limits of the platforms we run on and general needs of the type.

And if they are not removed, adding new methods to the public API of System.Guid will only make things worse than they are now.

It will not make it worse than adding an entirely new type and having to explain all the subtle differences between Guid, Uuid, when to use which and how to interoperate between the two of them.

We do not have this problem for any of the other types we expose big and little endian serialization APIs for, several (but not all) of which are more broadly used than Guid.

That didn't stop the addition of DateOnly and TimeOnly, though.

It was a general consideration for the types and the impact it may have. The number of advantages of exposing them were directly weighed against the drawbacks and there was enough evidence to justify it.

That has not been the case for Uuid, particularly given the minor difference and the possible alternatives to provide the same end functionality.

Yes, but that's all because nobody did anything about this problem for 20 years. It might be a good idea to include a public API review to discuss whether this type should be added, what the cost of adding it would be, how to adapt existing codebases to this type of data if it were to be added, and whether it is even necessary to do so. It could be worth considering adding it simply as a 'data box' with no API that could allow end users to make mistakes. Alternatively, a decision may be made to convert all of .NET to use this data type, and to designate Guid as 'legacy'. Perhaps the best solution would be to add separate static methods and perform some minor reworking of Guid itself.

We've discussed a lot in this discussion, and it's clear that there is a problem. I'm glad I was able to convey that to you, and thank you very much for listening to me. However, perhaps it would be worth discussing this publicly with the entire API Review team? Many members of the .NET community would be grateful for an understanding of why the API Review team decided to go down one path or another.

It is my job, as area owner, to be the interface between the community and API review here. The community is allowed to engage directly with API review via YouTube chat if and when a given proposal goes to API review. That is what I am doing, that is what I am explaining. If you go back and look at the people who have chimed in or thumbs up'd my responses above, you'll find many managers, principal level engineers, and even the runtime architect.

I can tell you with absolute certainty that I am not misrepresenting the stance of API review here, many of which are included in those that have reacted to my posts above. I have attempted to, repeatedly, explain why API review is going this route. You'll just be trading off "Tanner is saying 'x'" for "Alternative API Review Member is also saying 'x'".

When API review does happen, I expect it will effectively boil down to:

  1. We aren't going to expose System.Uuid for the reasons that I have already elaborated on (which we will reiterate)
  2. We'll then cover the question of whether we should we even expose these APIs or should we just leave it to users to swap the bytes themselves and roll a helper method when actually required?
  3. If we do expose these APIs, is it worth deprecating or obsoleting the existing overloads to help make it clearer

1 is effectively a guarantee. This has already been covered in my offline checks and I have iterated the reasons above. We'll just formalize the decision on live stream and reiterate the reasons again there.

Given that the runtime architect asked 2, we'll likely spend a bit of time debating it. I'm currently pushing for us to do something given the sheer number of people that upvoted the ask for System.Uuid here. Given the number of upvotes, it's clear there is a portion of the community that needs to frequently interact with variant 1 UUIDs and which has encountered problems given the currently fixed behavior of System.Guid.

Those numbers would then indicate that we should do something and we would then opt for the approach I laid out in the new proposal given that it follows the Framework Design Guidelines, is consistent with how we have historically approached this topic for other types, is consistent with how we have recently approached this topic for other types (even as recently as .NET 7, the most recent release), in that it improves an existing type we already have, and solves the actual root problem that is being raised even if it doesn't necessarily solve the problem in exactly the way that was requested.

If API review determines it is not worth exposing the overloads after more in depth discussion, then we'll end up exactly where we are today and we'll recommend that users roll their own helper API. Such a helper API is trivial to write and is about 7 lines of code. Someone from the community could provide that functionality via a NuGet package (binary or source based).

I do expect that API review will say it is worth exposing the helpers, however. So we'll then discuss 3 to determine if we believe the concerns around users choosing the wrong overload are justified or not. I expect API review will say they aren't and that users that are impacted by this will quickly find out via testing and docs that things are wrong and that they should be passing in isBigEndian: true. We'll also end up saying that we can revisit that decision based on additional community feedback and that enough feedback may convince us to expose an analyzer or to actually obsolete the APIs.