dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Uuid data type #86084

Closed vanbukin closed 1 year ago

vanbukin commented 1 year ago

Background and motivation

There are 2 ways to organize the binary representation of Uuid: 1) as 16 separate octets 2) mixed-endian, which .NET inherited from COM/OLE.

Currently, the .NET Base Class Library only includes System.Guid. This is a data structure that implements the second method of binary representation.

If you use System.Guid as a unique identifier for objects, for example as a parameter in a database query, you need to be extremely careful. Due to the implementation specifics of System.Guid, calling System.Guid::ToByteArray() and calling System.Guid::ToString() followed by converting the resulting hexadecimal string to a byte array produce different results.
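The difference on the output side can be reproduced outside .NET. The following is a minimal sketch in Python rather than .NET code. It assumes that `uuid.UUID` can stand in for System.Guid, with `.bytes_le` modeling `Guid.ToByteArray()` and `str()` modeling `Guid.ToString()`.

```python
import uuid

# Assumption: uuid.UUID stands in for System.Guid; .bytes_le models
# Guid.ToByteArray() and str() models Guid.ToString().
guid = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# Binary output: mixed-endian layout, first three fields byte-swapped.
to_byte_array = guid.bytes_le

# String output converted back to bytes, hex pair by hex pair.
to_string_bytes = bytes.fromhex(str(guid).replace("-", ""))

print(to_byte_array.hex())    # 33221100554477668899aabbccddeeff
print(to_string_bytes.hex())  # 00112233445566778899aabbccddeeff
```

Only the first eight bytes differ; the trailing eight bytes are identical in both outputs.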

From this, it follows that calling the constructor with a byte array or with a string also produces different results. If the constructor that accepts a string was called, the result of calling System.Guid::ToString() will match the value of the string passed to the constructor, but the result of calling System.Guid::ToByteArray() will not match. If the constructor that accepts a byte array was called, the opposite situation arises - the result of calling System.Guid::ToByteArray() matches the constructor parameters, but System.Guid::ToString() does not match.

This can lead to situations where, for example, the log records the string representation, but the database stores the binary representation. And if you decide to find an object whose identifier you saw in the logs, you need to perform the same conversion that is done inside System.Guid.

The above examples demonstrate the difficulties in working with System.Guid that arise due to differences in string and binary representations.

That's why I suggest adding a data structure called System.Uuid with a simple API that will have the same string (hexadecimal) and binary representation. The algorithms for generating a sequence of 16 bytes to construct this structure can be left as a space for creativity for the .NET community. Adding such a data type to the base class library would provide a solid foundation for the .NET ecosystem to freely use the first option of the binary representation (as 16 separate octets), without worrying about how the data is serialized - whether by converting to a string or binary format.

API Proposal

namespace System
{
    [StructLayout(LayoutKind.Sequential)]
    public readonly struct Uuid
        : ISpanFormattable,
          IComparable,
          IComparable<Uuid>,
          IEquatable<Uuid>,
          ISpanParsable<Uuid>,
          IUtf8SpanFormattable
    {
        public static readonly Uuid Empty;
        public Uuid(byte[] b);
        public Uuid(ReadOnlySpan<byte> b);
        public Uuid(string u);
        public static Uuid Parse(string input);
        public static Uuid Parse(ReadOnlySpan<char> input);
        public static bool TryParse([NotNullWhen(true)] string? input, out Uuid result);
        public static bool TryParse(ReadOnlySpan<char> input, out Uuid result);
        public byte[] ToByteArray();
        public bool TryWriteBytes(Span<byte> destination);
    }
}

Plus all interface methods, comparison operators, and equality operators.

API Usage


var input = new byte[]
{
    0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 
    0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF
};
var uuid = new Uuid(input);
var uuidBytes = uuid.ToByteArray(); // output: 0x00,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB,0xCC,0xDD,0xEE,0xFF
var uuidString = uuid.ToString(); // output: 00112233445566778899aabbccddeeff

var uuidStr = new Uuid("00112233445566778899aabbccddeeff");
var uuidStrBytes = uuidStr.ToByteArray(); // output: 0x00,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB,0xCC,0xDD,0xEE,0xFF
var uuidStrString = uuidStr.ToString(); // output: 00112233445566778899aabbccddeeff

Alternative Designs

No response

Risks

No response

vanbukin commented 1 year ago

This would fully solve the problem described in issue #29523. Uuid could also become the foundation for solving issues #22721, #23868, #62090, and #65736, as the community could implement the necessary generators on their own, while the BCL would provide a container in the form of Uuid.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details
Author: vanbukin
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime`, `untriaged`, `needs-area-label`
Milestone: -

tannergooding commented 1 year ago

Why can't/shouldn't this just be an explicit set of APIs on Guid?

This feels like adding a nearly identical type for slightly different semantics that really only matter at the serialization/deserialization boundaries.

vanbukin commented 1 year ago

Great question! That's exactly where the problem lies! As a developer, I absolutely have to know the construction details of the serializer, because I am obligated to use different APIs.

For instance, the deserializer can parse a System.Guid from the string '00112233445566778899aabbccddeeff' in two different ways: 1) convert the hexadecimal string into a byte array and then call the Guid(byte[]) constructor; 2) call Guid.TryParse(ReadOnlySpan<char> input, out Guid result).

Depending on the way the value was deserialized, the contents of the Guid will be different. 1) If a Guid was constructed using the constructor that takes a pre-created byte array of 0x00,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB,0xCC,0xDD,0xEE,0xFF:
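The two deserialization paths can be sketched as follows. This is an illustration in Python, not the code of any real serializer. The function names are hypothetical, and the sketch assumes that `uuid.UUID(bytes_le=...)` models `Guid(byte[])` while `uuid.UUID(hex=...)` models `Guid.TryParse`.

```python
import uuid

HEX = "00112233445566778899aabbccddeeff"


def deserialize_via_bytes(hex_text: str) -> uuid.UUID:
    """Path 1 (hypothetical): hex -> byte array -> Guid(byte[])."""
    return uuid.UUID(bytes_le=bytes.fromhex(hex_text))


def deserialize_via_parse(hex_text: str) -> uuid.UUID:
    """Path 2 (hypothetical): parse the text directly, as Guid.TryParse would."""
    return uuid.UUID(hex=hex_text)


path1 = deserialize_via_bytes(HEX)
path2 = deserialize_via_parse(HEX)

print(path1)           # 33221100-5544-7766-8899-aabbccddeeff
print(path2)           # 00112233-4455-6677-8899-aabbccddeeff
print(path1 == path2)  # False
```

The same input text yields two unequal identifiers, so code downstream cannot tell which one is "right" without knowing which path the deserializer took.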

For instance, if I need to use such a value as a parameter in an SQL query, I must also know how the database driver converts a Guid into a value that will actually be sent to the database.

An example of this is the MySqlConnector, which has a setting in the ConnectionString that can take 7 (seven!) different values.

Depending on how serialization and deserialization are actually implemented, I as a developer am obliged to use different values of this parameter.

When using Guid as a primary key in a table in MySQL, the byte order in such a column is extremely important. The first 8 bytes must be reversed time-low and time-high parts of UUIDv1, which ensures the monotonic increase of the primary key. To achieve this, MySQL 8.0 even added a special function called UUID_TO_BIN.
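Why the byte order matters for index locality can be shown with a small sketch. It is written in Python and only imitates the swapped layout. The helper names are hypothetical, and the sketch assumes that UUID_TO_BIN with the swap flag set places the time-high and time-mid groups ahead of time-low.

```python
import uuid


def uuid_to_bin_swapped(text: str) -> bytes:
    """Hypothetical helper imitating MySQL's UUID_TO_BIN(text, 1)."""
    b = uuid.UUID(text).bytes  # order: time_low(4) time_mid(2) time_hi(2) rest(8)
    return b[6:8] + b[4:6] + b[0:4] + b[8:]


def v1_like(timestamp: int) -> uuid.UUID:
    """Builds a UUIDv1-shaped value from a 60-bit timestamp (fixed clock/node)."""
    return uuid.UUID(fields=(
        timestamp & 0xFFFFFFFF,                 # time_low
        (timestamp >> 32) & 0xFFFF,             # time_mid
        ((timestamp >> 48) & 0x0FFF) | 0x1000,  # time_hi_and_version (version 1)
        0x80, 0x00, 0x010203040506,             # clock_seq and node, arbitrary
    ))


# Two timestamps that straddle a time_low rollover.
earlier, later = v1_like(0x1_FFFF_FFFF), v1_like(0x2_0000_0000)

raw_sorted = earlier.bytes < later.bytes
swapped_sorted = uuid_to_bin_swapped(str(earlier)) < uuid_to_bin_swapped(str(later))

print(raw_sorted)      # False: plain byte order does not follow time
print(swapped_sorted)  # True: swapped layout sorts by timestamp
```

With the swapped layout, later keys compare greater than earlier ones, so new rows land at the end of the index instead of at scattered positions.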

If you work with a table containing tens of billions of records and start sending Guid bytes in the wrong order, it would be faster to buy a new server and restore a database backup onto it than to wait for the rebuild of a B-tree whose balance was disrupted because some serializer started deserializing Guid differently.

The situation becomes even worse if the service uses several different serializers that handle System.Guid differently. And everything becomes even worse if all of the above happens in a large company where dozens of development teams with different levels of skills work. There is no guaranteed way to create working conditions that would exclude the human factor.

By using the proposed data structure with the same string and binary representation, it is impossible to make a mistake because regardless of what is input - hexadecimal string or Span/byte array - the output will be the same.
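The "same string and binary representation" property can be illustrated with a minimal sketch. It is written in Python and is not the proposed .NET type. The class `Uuid` and its members below are hypothetical and only mirror the shape of the API proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Uuid:
    """Hypothetical 16-octet container; not an existing .NET or Python type."""
    octets: bytes

    def __post_init__(self) -> None:
        if len(self.octets) != 16:
            raise ValueError("Uuid requires exactly 16 bytes")

    @classmethod
    def parse(cls, text: str) -> "Uuid":
        # The hex text maps to bytes pair by pair, with no reordering.
        return cls(bytes.fromhex(text))

    def to_byte_array(self) -> bytes:
        return self.octets

    def __str__(self) -> str:
        return self.octets.hex()


raw = bytes.fromhex("00112233445566778899aabbccddeeff")
from_bytes = Uuid(raw)
from_text = Uuid.parse("00112233445566778899aabbccddeeff")

print(from_bytes == from_text)             # True
print(str(from_bytes))                     # 00112233445566778899aabbccddeeff
print(from_text.to_byte_array() == raw)    # True
```

Because neither construction path reorders anything, both values compare equal and both outputs match the original input.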

In the scenario described above, using UUID eliminates the possibility of doing something wrong.

hopperpl commented 1 year ago

How will Uuid solve any of the named issues? I work with a big game engine that has over 15 million assets (files), each having multiple versions, and objects... so give or take 100 million different Guids/Uuids I deal with to identify assets and objects.

On top of that we have like 100 different data readers and writers, 10-15 years old, and proprietary formats that read Guid either big-endian, or little-endian, or component-wise with each component big/little-endian ... in short it's a mess. So, I understand it can be problematic. I lost a lot of hair over it. Usually, I just treat it as a 128-bit number and don't even bother with its representation. It's a number, nothing else. Nowadays I usually use MemoryMarshal.Cast to directly pull the value as a 128-bit numeric value, and keep it there.

But, how will Uuid solve all this? The way Guid is serialized is fixed and defined in dotnet. It doesn't affect what other languages, frameworks, systems do. It also adjusts for system endianness to get the same result on big- and little-endian machines within the dotnet ecosystem. But overall, it is not uniquely defined how such a 128-bit number is represented as a string. At most, it is recommended.

For instance, if I need to use such a value as a parameter in an SQL query, I must also know how the database driver converts a Guid into a value that will actually be sent to the database.

Yes, that is correct. And if you use another database driver, the expected string representation will be different. Then you would need UuidEx and UuidExEx and Uuid3 and so on.

Maya uses the first dword as little-endian, the 2nd and 3rd word as big-endian, and the final 12 bytes as 3x dword little-endian. Houdini has the whole 128-bit value swapped (full big-endian). There are 100, if not 1000 different ways to present that 128-bit value as a string. And rest assured, there are at least a dozen software implementations out in the world for each combinatorial serialization.


Also, in many scenarios, a generated Guid/Uuid must be sent through a seeded hasher. It cannot simply contain a timestamp or (as Microsoft once did many decades ago) contain the MAC address of a physical network card or any other identifiable characteristics. It's a security and privacy risk.

huoyaoyuan commented 1 year ago

See also #53354

vanbukin commented 1 year ago

@hopperpl

How will Uuid solve any of the named issues?

The data structure described in this API Proposal is simply a container for data. Its API implies working in a 'what came in is what came out' format, regardless of whether binary data or strings were provided as input. In this case, you can be guaranteed that the original byte order will always be preserved, regardless of how this data structure was constructed.

On top of that we have like 100 different data readers and writers, 10-15 years old, and proprietary formats that read Guid either big-endian, or little-endian, or component-wise with each component big/little-endian ... in short it's a mess. So, I understand it can be problematic. I lost a lot of hair over it. Usually, I just treat it as a 128-bit number and don't even bother with its representation. It's a number, nothing else. Nowadays I usually use MemoryMarshal.Cast to directly pull the value as a 128-bit numeric value, and keep it there.

This is a great abstraction when it doesn't matter how exactly the content of Guid is represented. Or when the software is completely controlled by one team. If you alone own all the data throughout its lifecycle - then you control everything. But that's not always the case.

But, how will Uuid solve all this? The way Guid is serialized is fixed and defined in dotnet.

Yes, but the problem is that as a developer, I need to know exactly how (de)serialization occurred. Because depending on the chosen way of (de)serialization of Guid, the data in it may differ.

For instance, if I need to use such a value as a parameter in an SQL query, I must also know how the database driver converts a Guid into a value that will actually be sent to the database.

Yes, that is correct. And if you use another database driver, the expected string representation will be different. Then you would need UuidEx and UuidExEx and Uuid3 and so on.

No, I won't need another data type, since the proposed API, as I mentioned above, works on the principle of "what goes in, comes out." This provides the ultimate opportunity to construct such a data structure from both binary and string representations without worrying about what transformations are happening and where they are happening. Because they simply won't happen.

Maya uses the first dword as little-endian, the 2nd and 3rd word as big-endian, and the final 12 bytes as 3x dword little-endian. Houdini has the whole 128-bit value swapped (full big-endian). There are 100, if not 1000 different ways to present that 128-bit value as a string. And rest assured, there are at least a dozen software implementations out in the world for each combinatorial serialization.

Uuid implies the interpretation of input and output data as a sequence of 16 bytes. This is exactly what is written in RFC 4122. How specific software interprets the contents of these 16 bytes, whether as int/short/short/byte/byte..byte or in some other way, and whether as big-endian or little-endian, is not important for Uuid. It is simply a container with data: a black box that delivers data from the sender to the receiver intact and unaltered, without interfering with the contents. Guid is not such a black box. Guid is a box that must be packed and unpacked from the same side if you need to get back the same data that was originally packed into it.

Also, in many scenarios, a generated Guid/Uuid must be sent through a seeded hasher. It cannot simply contain a timestamp or (as Microsoft once did many decades ago) contain the MAC address of a physical network card or any other identifiable characteristics. It's a security and privacy risk.

I'm not suggesting any specific way to generate Uuid content. This is not in the API Proposal. I'm suggesting adding Uuid as a container for data that has the property of "what goes in, comes out". The .NET community is able to write generators on their own or use existing ones (which many probably already have).

vanbukin commented 1 year ago

@huoyaoyuan Thank you for the link! The current API Proposal may be a way to solve the problem described there.

hopperpl commented 1 year ago

@vanbukin .... I'm sorry, I still really don't understand...

For me, System.Guid is also just a data container that stores 16 bytes of data in some way that has no real relevance outside. A new Uuid in effect would use Int128 (or sixteen Int8), instead of one Int32, two Int16 and eight Int8. But today it's not possible to access these internal fields in Guid anyway. It's a black box. If there weren't the problem of legacy binary serialization that addresses internal fields by name, Guid would use an Int128 as its internal storage field today.

Uuid implies the interpretation of input and output data as a sequence of 16 bytes.

So does Guid. It is stored in internal fields that are of different size but that is internal logic not exposed to consumers. A round trip from a blob (binary large object) to Guid back to blob results in the same sequence of bytes. The issue is the Guid to String representation. But that is done by different systems and different frameworks in different ways, fully unrelated to dotnet.

The byte sequence 11 22 33 44 55 66 77 88 99 00 AA BB CC DD EE FF is displayed by Guid as "44332211-6655-8877-9900-AABBCCDDEEFF".

I just don't understand how this new Uuid struct would solve or address the different ways of presenting the byte sequence as string (make it human readable). Or how it would prevent human error. If any database is storing Guid/Uuid in its own mangled binary format, and the command issued is text based via SQL, how does the new Uuid class know how to format that string to prevent human error?

I do understand tho that Uuid would represent the string in a different way, but that could also be achieved by adding a different ToString() format specifier (same way DateTime formats time, RFC3339).

vanbukin commented 1 year ago

@hopperpl

For me, System.Guid is also just a data container that stores 16 bytes of data in some way that has no real relevance outside.

This is not just a container, it is a container where the same value can only be obtained if it is extracted in the same way it was placed there. For example, the result of the ToString call will be written to the logs. When deserializing JSON, the property containing the Guid will be parsed as a string value. In this scenario, we construct a Guid based on the string from JSON and serialize it back to a string representation when writing to logs. Therefore, the value we passed in JSON and in the logs will be the same.

But if we access the database, read 16 bytes from there, put them into the Guid constructor, and write such a Guid to the log, or take the Guid obtained from JSON and write it to the database in binary format, the values will start to differ. As a result, the logs contain a value, but the database holds no exactly matching value. This can be overcome with a round trip that rearranges the bytes inside the Guid so that they match either the string or the binary representation of the source. Earlier I mentioned how this situation becomes worse when you have both string and binary representations as your data source.

The byte sequence 11 22 33 44 55 66 77 88 99 00 AA BB CC DD EE FF is displayed by Guid as "44332211-6655-8877-9900-AABBCCDDEEFF".

That's where the problem lies. The string and binary representations differ from each other. And not only on output, but also on input. It would be helpful to have a built-in data type in .NET where these representations match. And that's exactly what I suggest adding to this API proposal.

Or how it would prevent human error.

Having such a data structure, you don't have to worry about which representation to take as the source of truth - the string or binary one.

I do understand tho that Uuid would represent the string in a different way, but that could also be achieved by adding a different ToString() format specifier (same way DateTime formats time, RFC3339).

The issue is not only with ToString, but also with TryParse.

vanbukin commented 1 year ago

Let's take a look at Java and invoke both constructors (from a string and from a byte array) of a similar data structure:

import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.HexFormat;
import java.util.UUID;

public final class Main {
    public static void main(String[] args) throws InvocationTargetException, InstantiationException, IllegalAccessException, NoSuchMethodException {
        Constructor<UUID> constructor = UUID.class.getDeclaredConstructor(byte[].class);
        constructor.setAccessible(true);
        UUID a = UUID.fromString("00112233-4455-6677-8899-AABBCCDDEEFF");
        UUID b = (UUID)constructor.newInstance(HexFormat.of().parseHex("00112233445566778899AABBCCDDEEFF"));
        System.out.println(a);
        System.out.println(b);
    }
}

To build and run

javac Main.java
java --add-opens=java.base/java.util=ALL-UNNAMED Main

And as a result, we will get the following output.

00112233-4455-6677-8899-aabbccddeeff
00112233-4455-6677-8899-aabbccddeeff

Regardless of whether the input is a byte array or a hexadecimal string representing those bytes, the output will always be the same. There is no built-in data structure in the .NET BCL that behaves in a similar way.

vanbukin commented 1 year ago

A similar situation exists in Go. Regardless of what was entered - bytes or a string with their hexadecimal representation - the output is the same. https://go.dev/play/p/tP0FLbosApu

vanbukin commented 1 year ago

Similarly, in Python 3.

import uuid
a = uuid.UUID('00112233445566778899AABBCCDDEEFF')
b = uuid.UUID(bytes=b'\x00\x11\x22\x33\x44\x55\x66\x77\x88\x99\xAA\xBB\xCC\xDD\xEE\xFF')
print(a)
print(b)

The console shows:

00112233-4455-6677-8899-aabbccddeeff
00112233-4455-6677-8899-aabbccddeeff

If we want to get behavior similar to how Guid behaves by default, then we must explicitly specify bytes_le.

import uuid
a = uuid.UUID('00112233445566778899AABBCCDDEEFF')
b = uuid.UUID(bytes_le=b'\x00\x11\x22\x33\x44\x55\x66\x77\x88\x99\xAA\xBB\xCC\xDD\xEE\xFF')
print(a)
print(b)

Then the console output will change to the following.

00112233-4455-6677-8899-aabbccddeeff
33221100-5544-7766-8899-aabbccddeeff

vanbukin commented 1 year ago

It looks like there is a common use of a data structure that works on the principle of "what goes in is what comes out" everywhere, but in .NET there is only Guid, which was originally a structure necessary for interop with COM/OLE/WinAPI, but due to the lack of alternatives, has become widely used.

tannergooding commented 1 year ago

There are multiple variants, multiple versions, and ultimately multiple layouts for the different UUID kinds and which is "correct" is dependent on the scenario in question. Likewise, while there is a technical spec indicating how a UUID should be interpreted, there are many things that simply use UUID as a 128-bit integer and don't truly follow any spec. They often do this for convenience, even if it's not necessarily "ideal".

The fundamental issue called out here is effectively in how ToByteArray/Guid(byte[]) operate and how that is "incorrect" in some scenarios due to the bytes being serialized/deserialized by something other than Guid or outside of .NET and the other components sometimes expecting the bytes in big-endian format. The actual underlying representation of the bytes in the underlying data structure as used by .NET is largely irrelevant aside from some minor nuance in how this "could" impact relational comparisons.

For Parse/ToString there isn't really an issue because the data will always roundtrip through a string because, regardless of the underlying format, the bytes are effectively displayed in big endian format the same as an integer would be. That is, "11223344" is parsed and formatted the same on big and little endian and regardless of the variant/encoding/layout being used.

The APIs that care about the difference in how the raw bytes are represented are largely things that are interoperating outside of .NET and thus are being used as a form of serialization/deserialization. Such APIs already must consider that the bytes might be interpreted as a different format on the consumer side and so must already take into account things like endianness. The same is true for all primitives, for complex structures where padding may differ, endianness, etc.

Given that, this would be solvable in the same way we already solve it for the primitive types or other types such as BigInteger. That is, we could expose a ReverseEndianness(System.Guid) API (either on Guid or on System.Buffers.Binary.BinaryPrimitives) and correspondingly explicit WriteGuidLittleEndian/WriteGuidBigEndian and ReadGuidLittleEndian/ReadGuidBigEndian APIs.
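The shape of such explicit-endianness helpers can be sketched as follows. This is an illustration in Python, not a .NET API. The function names are hypothetical and only echo the names suggested above, and "little-endian" here means the mixed-endian Guid layout in which the first three fields are byte-swapped.

```python
import uuid


def read_guid_big_endian(data: bytes) -> uuid.UUID:
    """Hypothetical: interpret 16 bytes in the order they appear in the string form."""
    return uuid.UUID(bytes=data)


def read_guid_little_endian(data: bytes) -> uuid.UUID:
    """Hypothetical: interpret 16 bytes in the mixed-endian Guid layout."""
    return uuid.UUID(bytes_le=data)


def write_guid_big_endian(value: uuid.UUID) -> bytes:
    return value.bytes


def write_guid_little_endian(value: uuid.UUID) -> bytes:
    return value.bytes_le


def reverse_endianness(value: uuid.UUID) -> uuid.UUID:
    """Hypothetical: swap the byte order of the first three fields."""
    return uuid.UUID(bytes=value.bytes_le)


raw = bytes.fromhex("00112233445566778899aabbccddeeff")
g = read_guid_big_endian(raw)

print(g)                                  # 00112233-4455-6677-8899-aabbccddeeff
print(write_guid_little_endian(g).hex())  # 33221100554477668899aabbccddeeff
print(reverse_endianness(g))              # 33221100-5544-7766-8899-aabbccddeeff
```

In this sketch the caller states the byte order at every serialization boundary, which is the trade-off under discussion: the type stays the same, but each call site has to choose.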

I do not see the need to introduce an entirely new type just to handle this minor difference in endianness that is only relevant when serializing/deserializing raw bytes. We do not do this for any other type and it is a non-issue for other types in general. I would be fine with simultaneously proposing the obsoletion of ToByteArray and Guid(byte[]) with a message indicating users should prefer the APIs that explicitly dictate the endianness being used instead.

vanbukin commented 1 year ago

@tannergooding

There are multiple variants, multiple versions, and ultimately multiple layouts for the different UUID kinds and which is "correct" is dependent on the scenario in question. Likewise, while there is a technical spec indicating how a UUID should be interpreted, there are many things that simply use UUID as a 128-bit integer and don't truly follow any spec. They often do this for convenience, even if it's not necessarily "ideal".

That is precisely why I am proposing a data structure that effectively serves as a container for data, functioning on the principle of "what goes in is what comes out." The interpretation of the values will remain on the side that accepts such values. It is for this very reason that this API Proposal does not include APIs specific to any of the Uuid variants described in any of the RFCs. And for the same reason, I am not proposing to add any algorithms for generating any of the variants or verifying that the data contained in the Uuid is Uuidv1, Uuidv4, or anything else.

The fundamental issue called out here is effectively in how ToByteArray/Guid(byte[]) operate and how that is "incorrect" in some scenarios due to the bytes being serialized/deserialized by something other than Guid or outside of .NET and the other components sometimes expecting the bytes in big-endian format. The actual underlying representation of the bytes in the underlying data structure as used by .NET is largely irrelevant aside from some minor nuance in how this "could" impact relational comparisons.

The APIs that care about the difference in how the raw bytes are represented are largely things that are interoperating outside of .NET and thus are being used as a form of serialization/deserialization. Such APIs already must consider that the bytes might be interpreted as a different format on the consumer side and so must already take into account things like endianness. The same is true for all primitives, for complex structures where padding may differ, endianness, etc.

However, modern development involves the integration of various components with each other. There are not many applications where absolutely everything is written only using .NET - without using databases, without using third-party native libraries (or wrappers around them), without integration with any third-party services, as well as without using RPC or something similar.

For Parse/ToString there isn't really an issue because the data will always roundtrip through a string because, regardless of the underlying format, the bytes are effectively displayed in big endian format the same as an integer would be. That is, "11223344" is parsed and formatted the same on big and little endian and regardless of the variant/encoding/layout being used.

Indeed, the data may exist in binary form - in files, databases, or base64 strings. The presence of a constructor that accepts an array of bytes as a parameter or a method for converting the content to an array of bytes is convenient, appropriate, and necessary in such scenarios. The data source is not always a string. Also, the data source is not always an array of bytes. That is why I suggest that regardless of whether the source was a set of bytes or a hexadecimal string representing their value, the resulting binary or string representation should match the original data regardless of the data source.

Given that, this would be solvable in the same way we already solve it for the primitive types or other types such as BigInteger. That is, we could expose a ReverseEndianness(System.Guid) API (either on Guid or on System.Buffers.Binary.BinaryPrimitives) and correspondingly explicit WriteGuidLittleEndian/WriteGuidBigEndian and ReadGuidLittleEndian/ReadGuidBigEndian APIs.

This does not negate the need for me as a developer to know exactly how the structure was constructed - whether through string parsing or through a constructor that accepts an array of bytes. I don't even want to think about it. But I am forced to do so because of the way the Guid API works. However, I understand that there is an enormous amount of software out there, and therefore breaking the API is not an option. That is why I suggest adding a new data structure.

I do not see the need to introduce an entirely new type just to handle this minor difference in endianness that is only relevant when serializing/deserializing raw bytes. We do not do this for any other type and it is a non-issue for other types in general. I would be fine with simultaneously proposing the obsoletion of ToByteArray and Guid(byte[]) with a message indicating users should prefer the APIs that explicitly dictate the endianness being used instead.

Any integration with an external component always involves (de)serialization. .NET does not exist in a vacuum - it integrates with countless other components in a huge number of different ways. Adding an API that explicitly specifies the byte order used implies that the authors of libraries in the .NET ecosystem will need to provide an API for controlling this behavior. However, whether they will provide it or choose a standard value without the ability to change it is entirely up to them. After all, this is a task that requires effort and time, and not everyone may be willing to do it. Having a data container that works on the principle of "what goes in is what comes out" does not require any alternative ways of working with it. If you need to shuffle the bytes, you can do so in a pre-allocated buffer. The method of creating such a container is not important - whether it is a hexadecimal string or an array of bytes - the result is always deterministic and the same.

aloraman commented 1 year ago

NB Similar situation in Rust

use uuid::Uuid;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = [
        0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee,
        0xff,
    ];
    let str = "00112233-4455-6677-8899-aabbccddeeff";
    let bts = Uuid::from_bytes_ref(&bytes);
    let stb = Uuid::parse_str(str)?;
    println!("string from bytes:");
    println!("{}", bts.hyphenated().to_string());
    println!("bytes from string:");
    println!("{:X?}", stb.as_bytes());
    return Ok(());
}

which produces

string from bytes:
00112233-4455-6677-8899-aabbccddeeff
bytes from string:
[0, 11, 22, 33, 44, 55, 66, 77, 88, 99, AA, BB, CC, DD, EE, FF]

P.S. Note that all other platforms use UUID designation. If someday .NET provides UUID v5/v6/v7 generation, it will be less surprising to find such methods in Uuid type, rather than in Guid type - because other platforms consistently use UUID, and v5/v6/v7 aren't actually versions of GUID.

hopperpl commented 1 year ago

I now understand what you want to achieve, you want a BigEndian version of Guid. I agree, that is useful.

That is why I suggest that regardless of whether the source was a set of bytes or a hexadecimal string representing their value, the resulting binary or string representation should match the original data regardless of the data source.

But this statement doesn't make sense to me, I don't understand what you want to say.

var A = new Guid("11223344-5566-7788-9900-AABBCCDDEEFF").ToString().ToUpperInvariant();
var B = new Guid(new byte[] { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0x00, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF }).ToByteArray();
var C = new Guid(0x11223344, 0x5566, 0x7788, 0x99, 0x00, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF).ToByteArray();

though... quote "the resulting binary or string representation ... match[es] the original data"

new byte[] { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0x00, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF } does not visually look the same as B, which is printed as 44332211-6655-8877-9900-AABBCCDDEEFF; A and B are different Guids, of course.

But that's not specific to Guid. That applies to unsigned integers as well

var A = (uint.TryParse("11223344", out var V) ? V : 0).ToString();
var B = BitConverter.GetBytes(BitConverter.ToUInt32(new byte[] { 0x11, 0x22, 0x33, 0x44 }));

Both round trips work. Still A and B are different but visually look the same. For numbers, the string representation of a binary data source is arbitrary. It can be decimal, hexadecimal, octal, binary or hexavigesimal (26, a-z). Or Base64. And if someone fails to specify the representation used, it can cause problems anywhere. Like with SQL statements when you generate a string this way: var Number = 1000; var NumberStr = $"0x{Number}"; because you forgot the ":X" specifier.

I think the misunderstanding I have is that you believe

is the same; but it just isn't. One is an array of characters, and the other an array of bytes. There are 2 mappings applied; the first is a position-independent byte to double-char mapping, and the second is a position to position mapping. The string length is 32+4 characters, the byte array length is 16.

You want a Guid implementation, named Uuid that has a 1-to-1 position mapping, and I agree that is very useful.

But, "need for me as a developer to know exactly how the structure was constructed" would not change with Uuid in any way. You always need to know what the definition in any platform is between string and binary. Uuid would also not prevent human error.

tannergooding commented 1 year ago

That is precisely why I am proposing a data structure that effectively serves as a container for data, functioning on the principle of "what goes in is what comes out." The interpretation of the values will remain on the side that accepts such values. It is for this very reason that this API Proposal does not include APIs specific to any of the Uuid variants described in any of the RFCs. And for the same reason, I am not proposing to add any algorithms for generating any of the variants or verifying that the data contained in the Uuid is Uuidv1, Uuidv4, or anything else.

This doesn't work because it fails variant 2 UUIDs on big endian systems. Such variants require a byte reordering to be correctly interpreted.

However, modern development involves the integration of various components with each other. There are not many applications where absolutely everything is written only using .NET - without using databases, without using third-party native libraries (or wrappers around them), without integration with any third-party services, as well as without using RPC or something similar.

Yes, and that's all the more reason to not bifurcate the ecosystem with an identical type that only differs in behavior at the serialization boundary.

This does not negate the need for me as a developer to know exactly how the structure was constructed - whether through string parsing or through a constructor that accepts an array of bytes. I don't even want to think about it. But I am forced to do so because of the way the Guid API works. However, I understand that there is an enormous amount of software out there, and therefore breaking the API is not an option. That is why I suggest adding a new data structure.

How the data is represented internally doesn't matter. Just as it does not matter for Int32, Int64, Int128, Double, Single, DateTime, etc.

What matters is that the producer/consumer contract is followed. If the database requires big endian ordered data then the only valid thing is to serialize the Guid using WriteGuidBigEndian. The data must then be read using ReadGuidBigEndian on the other end. This is how it works for all other types in the ecosystem.

Bytes are reversed to follow the data contract all throughout the computer. This is true for network packets, for interpreting file metadata (ZIP, PE, ELF, JPG, PNG, even UTF-16 text, etc). You must follow the contract at the boundaries, not doing so is a bug. Having the APIs to ensure the data is emitted in the desired format and read in the intended format, regardless of how the data is represented in the type system, is how everything else works.

vanbukin commented 1 year ago

@hopperpl

I now understand what you want to achieve, you want a BigEndian version of Guid. I agree, that is useful.

Great!

I suggest looking at this from a slightly different perspective. I propose considering constructing from a byte array and converting back to a byte array as binary serialization. And constructing from a hexadecimal string and converting back to a string as string serialization.

Now let's take a look at the situation through the eyes of a developer who is not familiar with the nuances of Guid behavior. For example, he needs a UUID because the UUID is used as a primary key in the database he is working with. He goes to Google and finds a recommendation that for working with UUIDs in .NET, use System.Guid. At this point, the association System.Guid == UUID arises in his mind. Okay, let's say he needs binary serialization - he looks into the API documentation and finds the constructor that takes bytes, and the ToByteArray method that returns them. Perfect.

Now let's imagine another developer who needs string serialization. He finds the constructor that takes a hexadecimal string and the ToString method that returns it. Wonderful.

And both of them don't have any problems until the first one needs to convert his data to a string, and the second one to a byte array.

But the reason is not that they are doing something wrong, but because Guid does not consider the data it works with as 16 separate octets described in RFC4122, section 4.1. It considers them with respect to its internal structure, which is not 16 separate octets, but rather an int, 2x short, and 8 bytes. As a result, Guid provides an API based on its internal structure.

In fact, the constructor that takes bytes expects the bytes to be pre-shuffled - that's exactly what it's designed for. But this is not obvious from the documentation. This can only be learned by looking at the source code of System.Guid, where it becomes clear that, for example, the constructor performs a reinterpret_cast-like operation on the input byte array using the MemoryMarshal.Read method (but with respect to endianness, with a fallback implementation for big-endian).

The ToByteArray method essentially allocates a result array of 16 bytes, performs a similar reinterpret_cast-like operation on it, interpreting it as a Guid, and then copies the value of the current Guid into it. The resulting array is then returned from the method. Essentially, this is a dump of the Guid structure (but with respect to endianness, with a fallback implementation for big-endian).

That means the calling code must take these nuances of Guid's operation into account. You cannot simply take 16 bytes passed from outside and work with them. If you use the constructor from a byte array or conversion to a byte array – you need to prepare the input and output binary data before passing them outside your application.
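To make the mismatch concrete, here is a minimal sketch (my own illustration, not code from System.Guid) that feeds the same 16 octets through the string path and the byte-array path. The sample value is arbitrary, and Convert.FromHexString/ToHexString assume .NET 5 or later.

```csharp
using System;

// The same 16 octets, once as canonical text and once as raw bytes.
byte[] octets = Convert.FromHexString("00112233445566778899AABBCCDDEEFF");

Guid fromString = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");
Guid fromBytes = new Guid(octets);

// Guid(byte[]) reads the first three fields as little-endian integers,
// so the two values are different Guids.
Console.WriteLine(fromString == fromBytes); // False
Console.WriteLine(fromBytes.ToString());    // 33221100-5544-7766-8899-aabbccddeeff

// ToByteArray() applies the same little-endian layout on the way out.
Console.WriteLine(Convert.ToHexString(fromString.ToByteArray()));
// 33221100554477668899AABBCCDDEEFF
```

Only the first eight bytes are affected; the trailing eight bytes come through in the same order on both paths.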

From all of this it follows that both developers in the example above are mistaken because Guid is not equivalent to Uuid. It cannot be used as a drop-in replacement. It is a data structure designed to solve specific tasks. And that's okay.

But what should they do to solve their own tasks? There are several options:

  1. Using Guid only in a certain way to emulate Uuid. For example, never using methods for working with binary representation. And in cases where it is necessary, you would need to take binary input data, manually create an input hexadecimal string from it, and then construct a Guid from the string. For reverse conversion, you would take the original Guid, convert it to an output hexadecimal string, and then convert that string into output binary data for subsequent transmission, recording, processing, or whatever else may be necessary.
  2. Create a wrapper structure over System.Guid. For example, as it is done in Npgsql.
  3. Do not use System.Guid at all, limiting yourself to either string or binary representation.
  4. Create your own data type.
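As an illustration of option 1, a rough sketch of the string round trip might look like this. FromUuidBytes and ToUuidBytes are hypothetical helper names, not BCL APIs, and the sketch assumes .NET 5 or later for the Convert hex methods.

```csharp
using System;

// Hypothetical helpers (not part of the BCL): treat the 16 bytes as the
// octets of the canonical string form by round-tripping through hex text.
// Guid.Parse throws FormatException if the input is not exactly 16 bytes.
static Guid FromUuidBytes(byte[] octets) => Guid.Parse(Convert.ToHexString(octets));
static byte[] ToUuidBytes(Guid value) => Convert.FromHexString(value.ToString("N"));

byte[] input = Convert.FromHexString("00112233445566778899AABBCCDDEEFF");
Guid id = FromUuidBytes(input);

Console.WriteLine(id);                                   // 00112233-4455-6677-8899-aabbccddeeff
Console.WriteLine(Convert.ToHexString(ToUuidBytes(id))); // 00112233445566778899AABBCCDDEEFF
```

Each conversion allocates an intermediate string, and nothing stops a caller from bypassing the helpers and calling new Guid(byte[]) directly.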

All of the listed options have drawbacks.

Option 1 is precisely the place where the human factor can come into play, when someone fails to perform the preliminary conversion and hands over the data 'as is'.

Option 2 will only work in small projects where everything is under your control. In large projects where dozens of teams work with one solution, we end up with a hybrid of options 1 and 2, with the drawbacks of the first option.

In option 3, instead of 16-byte structures in memory, we store strings of 32 characters or byte arrays that require comparators. In the case of string comparisons, for example, case sensitivity is important.

In option 4, our scalability is limited by the extensibility provided by the ecosystem, and the existence of corresponding APIs depends entirely on the willingness of the owners of these projects (for example, the API of the library may prevent the creation of an alternative implementation due to the presence of internal access modifiers on its classes). In this case, it is necessary to communicate with the author of the library, trying to prove the necessity of such an API, and if he agrees, prepare a PR. Alternatively, if time is pressing or the architecture of the library does not allow for implementing the required changes in a reasonable amount of time, a fork must be made, which needs to be built, released, updated, and maintained independently.

INTERNALINTERFERENCE commented 1 year ago

I'm not sure that it will be useful, but this is Haskell code that has the same behavior:

import Data.UUID
import Data.ByteString.Lazy (pack)

main :: IO ()
main = do
  print $ fromString "00112233-4455-6677-8899-aabbccddeeff"
  print $ fromByteString $ pack [0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF]

output:

Just 00112233-4455-6677-8899-aabbccddeeff
Just 00112233-4455-6677-8899-aabbccddeeff

vanbukin commented 1 year ago

@hopperpl

You want a Guid implementation, named Uuid that has a 1-to-1 position mapping, and I agree that is very useful.

Yes. Perhaps I poorly formulated my sentence about the changes.

These are popular languages that compete with .NET overall and C# in particular (although I have doubts about Haskell).

We have, without exaggeration, the best development tools in the world, a powerful ecosystem, and a large community.

However, in all these languages, there is a special data type for working with UUID in the standard library (or in the package that is de facto standard for this language), but we do not have one.

Think about a developer who decided to switch to the .NET ecosystem from any of these languages. Most likely, he will not immediately encounter the problems I am writing about, but they will become a very unpleasant surprise for him, leaving not the best impressions of the platform in which, for 21 years, such a data type could not be made part of the standard library.

tannergooding commented 1 year ago

That means the calling code must take these nuances of Guid's operation into account. You cannot simply take 16 bytes passed from outside and work with them. If you use the constructor from a byte array or conversion to a byte array – you need to prepare the input and output binary data before passing them outside your application.

This is true of all bytes for all types.

And both of them don't have any problems until the first one needs to convert his data to a string, and the second one to a byte array.

Yes, again true for all of our types. You cannot take raw byte data and simply construct a type out of it without accounting for where that byte data came from and the format it was in.

In fact, the constructor that takes bytes expects the bytes to be pre-shuffled - that's exactly what it's designed for. But this is not obvious from the documentation. This can only be learned by looking at the source code of System.Guid, where it becomes clear that, for example, the constructor performs a reinterpret_cast-like operation on the input byte array using the MemoryMarshal.Read method (but with respect to endianness, with a fallback implementation for big-endian).

Right. Guid(byte[]) and ToByteArray() are always little-endian, and it is non-obvious that they are always little-endian, so people may end up using them incorrectly.

From all of this it follows that both developers in the example above are mistaken because Guid is not equivalent to Uuid.

This is akin to saying that Int32 on an x86 machine is not the same as Int32 on a PowerPC BE machine. The type is the same, it's just that the raw byte layout differs and the API you must use to ensure the data is in the correct layout changes between the two systems.

Using Guid only in a certain way to emulate Uuid. For example, never using methods for working with binary representation. And in cases where it is necessary, you would need to take binary input data, manually create an input hexadecimal string from it, and then construct a Guid from the string. For reverse conversion, you would take the original Guid, convert it to an output hexadecimal string, and then convert that string into output binary data for subsequent transmission, recording, processing, or whatever else may be necessary.

Another option is the one I've proposed and which we already know works well for other primitive types.

You have 1 type, Guid. You then have explicit APIs for dealing with raw bytes as "big endian" vs "little endian". Any existing APIs which are confusing (ToByteArray and Guid(byte[])) should have their behaviors made more obvious. This could be via an analyzer, via obsoletion, or some other mechanism.

You then deal with Guid serialization just as you would for Int32. It is consistent with the ecosystem and the callsites make it very obvious how the data is interpreted.

tannergooding commented 1 year ago

Put in a slightly different perspective. Let's take this entire conversation and replace Guid (the existing type) with UInt128 (another existing type of the same size).

To me, the ask for Uuid which is guaranteed to be in big endian format is effectively asking for a UInt128BE to exist, where the data is always stored in big endian format and the constructor/ToByteArray methods always operate on the data as the same

vanbukin commented 1 year ago

@tannergooding

This doesn't work because it fails variant 2 UUIDs on big endian systems. Such variants require a byte reordering to be correctly interpreted.

This does not in any way hinder other programming languages. Why should this hinder us?

Yes, and that's all the more reason to not bifurcate the ecosystem with an identical type that only differs in behavior at the serialization boundary.

Any interaction with the external world always involves (de)serialization. .NET does not exist in a vacuum, it interacts with countless components in many different ways. Generally, this works the same way everywhere, except in .NET. We have widely adapted System.Guid to work with such data, which was inherited from COM, and several alternative solutions, none of which completely solves the problems that arise when using System.Guid as a replacement for Uuid.

How the data is represented internally doesn't matter. Just as it does not matter for Int32, Int64, Int128, Double, Single, DateTime, etc.

I completely agree. However, the problem is that Uuid is defined in the specification as 16 octets, so the public API for working with it should interpret input and output data as 16 octets. System.Guid is used in the .NET ecosystem as a replacement for Uuid, but not all methods of its public API interpret input and output data as 16 octets. Therefore, it cannot be used as a replacement for Uuid. That's why I suggest adding Uuid.

In other words, I suggest adding a data type whose public API behaves similarly to how it behaves in other languages.

What matters is that the producer/consumer contract is followed.

Yes.

If the database requires big endian ordered data then the only valid thing is to serialize the Guid using WriteGuidBigEndian. The data must then be read using ReadGuidBigEndian on the other end. This is how it works for all other types in the ecosystem.

If you look at the bigger picture, beyond just working with the database, what will help prevent the author of a serializer from using the binary constructor or ToByteArray in Guid? What will prevent suppressing a warning on obsolete? Nothing, except for removing the API altogether. And that is such a breaking change that it's frightening to imagine how much code would be affected if the binary methods were removed.

tannergooding commented 1 year ago

what will help prevent the author of a serializer from using the binary constructor or ToByteArray in Guid?

Analyzers or obsoletion.

What will prevent suppressing a warning on obsolete?

This is a non-argument. That's like asking what is there to guarantee that a user will know about and use System.Uuid?

Users who encounter problems will start investigating and ideally find docs, forum posts, warnings, analyzers, or other resources that help root cause and suggest the right way to fix it.

Users globally disabling warnings, ignoring advice or documentation, etc is entirely on them. It is only our responsibility to try and make sure things are visible and devs can generally stay on the golden path.

vanbukin commented 1 year ago

@tannergooding

Put in a slightly different perspective. Let's take this entire conversation and replace Guid (the existing type) with UInt128 (another existing type of the same size).

To me, the ask for Uuid which is guaranteed to be in big endian format is effectively asking for a UInt128BE to exist, where the data is always stored in big endian format and the constructor/ToByteArray methods always operate on the data as the same

Great analogy!

From that perspective, I can say that what's needed is not a structure that always stores data as big-endian, but a structure that stores data "somehow" inside, but has a public API that allows for the following:

What is actually inside doesn't matter - it could be int128, 2x int64, 16x byte, or something else entirely.

It's important that the public API of Guid doesn't have such properties. And if you try to extend it with something new, there's no guarantee that it will be used correctly, even if you mark existing methods as Obsolete. And it can't be removed either.

tannergooding commented 1 year ago

What is actually inside doesn't matter - it could be int128, 2x int64, 16x byte, or something else entirely.

👍. That's why I'm saying Guid remains fine and we don't need Uuid. Rather instead we just need a couple extra methods and to make it clear that users want those methods and not the existing methods.

In your case you'd simply want to always use ReadGuidBigEndian/WriteGuidBigEndian (or alternatively, new Guid(byte[], isBigEndian: true) and ToByteArray(isBigEndian: true) or APIs in a similar vein).

It's important that the public API of Guid doesn't have such properties. And if you try to extend it with something new, there's no guarantee that it will be used correctly, even if you mark existing methods as Obsolete. And it can't be removed either.

Right. The same can be said for System.Uuid. There is no guarantee that users will use it, know that it exists, or even use it correctly. They may in fact simply use Uuid for a variant 2 scenario and be in the exact inverse scenario that you're in with Guid.

Fixing or improving the existing type we have is almost better than exposing a new similar but not quite the same type/API. -- Within the limitations of what is allowed vs not. We do indeed not want to remove APIs and we want to avoid behavioral changes for existing consumers outside of major bugs. Even obsoletions we try to shy away from. Analyzers are the easiest to expose and typically our preferred mechanism for helping highlight newer/better practices.

vanbukin commented 1 year ago

@tannergooding

Users who encounter problems will start investigating and ideally find docs, forum posts, warnings, analyzers, or other resources that help root cause and suggest the right way to fix it.

Everything changes drastically when the user is not you, but the author of the library you are using. I have described what happens next above.

Users globally disabling warnings, ignoring advice or documentation, etc is entirely on them. It is only our responsibility to try and make sure things are visible and devs can generally stay on the golden path.

Then why not lay down a red carpet in the form of a separate data type, which exists in any other popular language? Even the API itself will not allow anything wrong to be done.

tannergooding commented 1 year ago

Then why not lay down a red carpet in the form of a separate data type, which exists in any other popular language? Even the API itself will not allow anything wrong to be done.

Because the path, to me at least, is not as gold or red carpet as it may appear to you.

To me, adding a new type (especially to the system namespace) is very "expensive" and already a hard sell to API review. Add on top that the type is basically just a minor semantic difference in two methods as to how Guid behaves and it gets even harder.

Then I have to consider how users (new and old) might interpret or misuse the type and that they may misuse it in the same way that people are experiencing bugs around Guid today. That it is a near identical term often used interchangeably and the nuance of the differences will be lost on many people. That Guid has been around for 20 years and is already used in many places that would have preferred a Uuid if it existed instead and for which those APIs can't change without taking similar breaks, etc.

To me, it looks shiny at a glance but quickly becomes fraught with much deeper issues/concerns that are hard to justify in favor of simply improving the relatively minor issues with our existing type.

vanbukin commented 1 year ago

To me, adding a new type (especially to the system namespace) is very "expensive" and already a hard sell to API review.

I understand that adding such an API and even fixing places where it would be much better than Guid is an extremely large, complex and time-consuming task.

Add on top that the type is basically just a minor semantic difference in two methods as to how Guid behaves and it gets even harder.

They are similar, but designed for different purposes. Guid is an excellent data type for interop with WinAPI / COM / OLE. It is the best for solving such problems.

But why was it created in the first place if, say, an array of bytes could be used instead? Because user experience matters.

And when working with Guids, the user experience is worse than in other languages. There, such a problem simply does not exist, everything just works.

Then I have to consider how users (new and old) might interpret or misuse the type and that they may misuse it in the same way that people are experiencing bugs around Guid today. That it is a near identical term often used interchangeably and the nuance of the differences will be lost on many people. That Guid has been around for 20 years and is already used in many places that would have preferred a Uuid if it existed instead and for which those APIs can't change without taking similar breaks, etc.

Is this perhaps a technical debt that we should start paying off before it's too late? If Uuid doesn't appear, nothing will change.

george-polevoy commented 1 year ago

Not only user experience, server performance matters too.

Guid in database is plain useless beyond hello world scale databases, as a table with guid primary key does not scale beyond 100k records. The reason is simple - read queries are mostly time based, and the lookups for primary keys for random (Guid.NewGuid()) values scatter all over the index, so there will be constant cache misses, every record in a time-based query will end up in cache miss, which leads to a buffer flush.

I would say - Guid should be deprecated by that reason alone. Arguably you can't build anything scalable with it.

tannergooding commented 1 year ago

I would say - Guid should be deprecated by that reason alone. Arguably you can't build anything scalable with it.

Guid will not be deprecated, there are still a plethora of valid scenarios for it both in and out of COM scenarios. Those scenarios will not go away. Many of those scenarios are scalable and are used in high perf situations (including games, etc). Likewise, not every scenario requires a random key, many scenarios use Guid effectively as a monotonic increasing integer and that is just as valid as any other usage.

Guid indeed exists because .NET is an Object Oriented platform and we have type safe classes/structs for many concepts. We do not, however, strictly have something for every concept and there are many places where types get reused because they are a general fit. Taking a byte[] is not type safe, doesn't perform any validation, doesn't make it easy to debug or diagnose, and would require an allocation.

In this case, the proposed difference between Uuid and Guid is simply one in how new Uuid(byte[])/Uuid.ToByteArray() vs new Guid(byte[])/Guid.ToByteArray() work. The detail of that difference is functionally that one does ReadInt32LittleEndian, ReadInt16LittleEndian, ReadInt16LittleEndian, Read 8 individual bytes. The other does ReadInt32BigEndian, ReadInt16BigEndian, ReadInt16BigEndian, Read 8 individual bytes.
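The two read sequences described above can be sketched with the existing BinaryPrimitives methods. ReadLittleEndianLayout and ReadBigEndianLayout are hypothetical names chosen for this illustration; they are not the proposed API shape.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helpers mirroring the field-by-field reads described above.
static Guid ReadLittleEndianLayout(ReadOnlySpan<byte> s) => new Guid(
    BinaryPrimitives.ReadInt32LittleEndian(s),
    BinaryPrimitives.ReadInt16LittleEndian(s.Slice(4)),
    BinaryPrimitives.ReadInt16LittleEndian(s.Slice(6)),
    s.Slice(8, 8).ToArray()); // the last 8 bytes are taken as-is

static Guid ReadBigEndianLayout(ReadOnlySpan<byte> s) => new Guid(
    BinaryPrimitives.ReadInt32BigEndian(s),
    BinaryPrimitives.ReadInt16BigEndian(s.Slice(4)),
    BinaryPrimitives.ReadInt16BigEndian(s.Slice(6)),
    s.Slice(8, 8).ToArray()); // the last 8 bytes are taken as-is

byte[] octets = Convert.FromHexString("00112233445566778899AABBCCDDEEFF");
Console.WriteLine(ReadLittleEndianLayout(octets)); // 33221100-5544-7766-8899-aabbccddeeff
Console.WriteLine(ReadBigEndianLayout(octets));    // 00112233-4455-6677-8899-aabbccddeeff
```

The little-endian variant yields the same value as new Guid(octets), while the big-endian variant matches the canonical string form of the octets.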

For other types in the BCL such a difference is handled via explicitly named helper APIs. For primitives such as Double, Int16, Int32, Int64, Int128, IntPtr, Single, UInt16, UInt32, UInt64, UInt128, UIntPtr, etc; we have APIs on BitConverter and BinaryPrimitives. We also expose explicit APIs off IBinaryInteger<T> for use in generic contexts. For BigInteger we also support IBinaryInteger<T> as well as a constructor and ToByteArray method that takes a bool isBigEndian parameter.
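For reference, this is roughly what the explicitly named helpers look like today for Int32 and BigInteger. It is a small sketch using APIs that already exist; the value 0x11223344 is an arbitrary sample.

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

// Int32: the method name at the call site states the byte order.
Span<byte> buffer = stackalloc byte[4];
BinaryPrimitives.WriteInt32BigEndian(buffer, 0x11223344);
Console.WriteLine(Convert.ToHexString(buffer)); // 11223344
BinaryPrimitives.WriteInt32LittleEndian(buffer, 0x11223344);
Console.WriteLine(Convert.ToHexString(buffer)); // 44332211

// BigInteger: the byte order is a parameter of ToByteArray.
var big = new BigInteger(0x11223344);
Console.WriteLine(Convert.ToHexString(big.ToByteArray(isBigEndian: true)));  // 11223344
Console.WriteLine(Convert.ToHexString(big.ToByteArray(isBigEndian: false))); // 44332211
```

In both cases the in-memory value is the same; only the serialized byte order changes, and the call site says which order is in use.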

If Uuid doesn't appear, nothing will change.

That's not true. Simply exposing the new APIs I suggested above will introduce the necessary stuff for people to do the right thing and should be overall more discoverable as it's on the existing surface. Such APIs would be polyfillable downlevel so the same solution could work on codebases targeting .NET Standard or .NET Framework.

I have seen no evidence given so far that updating Guid will not work. I've only seen some concerns raised about the discoverability and risk of people still doing the wrong thing, both of which equally exist with the alternative Uuid type (which also comes with many of the other concerns I raised above that versioning Guid does not).

vanbukin commented 1 year ago

@tannergooding

Guid will not be deprecated, there are still a plethora of valid scenarios for it both in and out of COM scenarios. Those scenarios will not go away. Many of those scenarios are scalable and are used in high perf situations (including games, etc). Likewise, not every scenario requires a random key, many scenarios use Guid effectively as a monotonic increasing integer and that is just as valid as any other usage.

That's why I suggest not touching it at all. There are code bases where it works perfectly fine and there is no reason to introduce any breaking changes there.

Taking a byte[] is not type safe, doesn't perform any validation, doesn't make it easy to debug or diagnose, and would require an allocation.

Absolutely agree.

That's not true. Simply exposing the new APIs I suggested above will introduce the necessary stuff for people to do the right thing and should be overall more discoverable as it's on the existing surface. Such APIs would be polyfillable downlevel so the same solution could work on codebases targeting .NET Standard or .NET Framework.

Okay, I suggest taking a look at a very similar situation in the world of .NET. Date and time. What prevented end-users from using DateTime without specifying the time and TimeSpan to emulate the functionality provided by recently added DateOnly and TimeOnly? After all, existing data structures and their APIs can be reused to solve the same tasks.

Because user experience matters.

Using existing DateTime and TimeSpan to solve a range of tasks is not convenient.

Let's check the announcement where DateOnly and TimeOnly were introduced.

While that still works, there are several advantages to using a DateOnly instead. These include:
- A DateOnly provides better type safety than a DateTime that is intended to represent just a date.

Type safety. That's what is mentioned as the first item in the list of advantages for DateOnly.

Let's see what is written about TimeOnly:

A TimeSpan is primarily intended for elapsed time, such as you would measure with a stopwatch. 
Its upper range is more than 29,000 years, and its values can also be negative to indicate moving backward in time. 
Conversely, a TimeOnly is intended for a time-of-day value, 
so its range is from 00:00:00.0000000 to 23:59:59.9999999, and is always positive. 
When a TimeSpan is used as a time of day, 
there is a risk that it could be manipulated such that it is out of an acceptable range. 
There is no such risk with a TimeOnly.

And again, the main reason is type safety: There is no such risk with a TimeOnly.

But it could have gone a different way, for example, by adding a couple of new APIs to TimeSpan and DateTime, so that people could continue to use them to solve tasks related to situations where only the date or only the time within a day is needed. But instead, separate data types were created. Why? For the sake of convenience.

.NET Standard and .NET Framework? Okay, let's take a look at apisof.net for DateOnly and TimeOnly. There is no support for .NET Standard and .NET Framework there. The official documentation shows the same situation. So, the lack of backport to gradually outdated target frameworks did not become an obstacle for adding these types to the standard library. So, there is no reason not to do the same for Uuid.

ForNeVeR commented 1 year ago

The current GUID type is, indeed, pretty messy w.r.t. (de)serialization.

By adding a new API based on the same GUID with scary words like isBigEndian: true you'll only make it worse: people will be even more confused by this (as they already are whenever they encounter anything endianness-related). I think they are already confused and mixing up endianness in this same thread! We'd just increase the possibility of misusing GUID: serializing it with one set of options (be it ToString() or ToByteArray() or ToLittleEndianByteArray()) and deserializing it with another (be it a special constructor flag, Parse or anything else).

The idea of a new UUID type as a pure wrapper around 16 bytes with no legacy burden, no scary words in the API and sane (bear with me here: by "sane" I mean portable between languages, databases, runtimes and various schools of thought) byte order is a must for a modern successful programming ecosystem (and I believe we all in this thread share the same opinion that .NET is and should be this kind of ecosystem in the future).

We are in a nice position because it is still possible to distance ourselves from the whole word "GUID", add a new nice UUID-based API and deprecate System.Guid (or just leave good ol' GUID in its current place and stop introducing new usages, moving to UUID instead).

UUIDs are pretty ubiquitous in modern software (and I truly believe anything becomes slightly better after it gets its own UUID), so it may be a good idea to start radically improving the corresponding .NET API for better everyday use in mixed-tool environments (pretty much every environment these days).

tannergooding commented 1 year ago

After further discussion with other API review members, the general consensus is that introducing a new type is undesirable; improving the existing type is the general way we approach these scenarios in .NET.

As such, https://github.com/dotnet/runtime/issues/86798 has been created and represents the general direction we'd like to move forward in this area.

This comes about for many reasons including that all of the concerns raised around confusion and chance for user-error are not exclusive to the proposed improvements for System.Guid. They also exist for System.Uuid but System.Uuid also brings in additional concerns and issues on top, many of which were called out above and which were touched on in the new proposal.

We believe that versioning System.Guid is then the best approach and will allow users to achieve everything that System.Uuid would have allowed, but in a manner that is more consistent with how .NET exposes and supports types in general.

vanbukin commented 1 year ago

@tannergooding Thank you for not listening to the community and deciding to create an inconvenient API.

tannergooding commented 1 year ago

As indicated, I discussed this in depth with several other API review members. The general consensus is that a new type is ultimately more inconvenient and introduces more problems than it solves.

Having a new type to represent the same thing, only differing in how bytes are serialized/deserialized on boundary conditions is not something we do in .NET.

vanbukin commented 1 year ago

@tannergooding

As indicated, I discussed this in depth with several other API review members. The general consensus is that a new type is ultimately more inconvenient and introduces more problems than it solves.

It is challenging to imagine something that is "ultimately more inconvenient" than the current API in System.Guid. Is there a recording of this discussion available somewhere? I want to make sure it really happened and that you did not make the decision on your own not to add a new data type. Also, I want to make sure that the decision was not opinionated, and that during the discussion there were arguments presented as to why introducing a new data type is worse than implementing new public APIs with scary-sounding formulations like BigEndian and LittleEndian. .NET is open-source, and decision-making should be transparent to the community.

DaZombieKiller commented 1 year ago

implementing new public APIs with scary-sounding formulations like BigEndian and LittleEndian.

If you are working with binary serialization, endianness is not a concept you can just ignore. I think it would be a bad idea to attempt to hide this from the API consumer -- that ambiguity is partially what caused the serialization of System.Guid to be as confusing as it is today. Documentation can and should solve any confusion here, it's no different to how one would serialize primitives such as short, int, etc.

tannergooding commented 1 year ago

Is there a recording of this discussion available somewhere?

There will be a publicly available live-streamed discussion where the community can join in via YouTube when the other proposal goes to API review. The offline discussion was simply to determine the general premise and whether it was worth marking this proposal as "ready-for-review", or whether we instead wanted to solve this the way I proposed above. That proposal is based on my many years of experience working on the libraries team and with API review around what API review will and will not approve, and what is and is not recommended by the Framework Design Guidelines.

I want to make sure it really happened and that you did not make the decision on your own not to add a new data type.

This comment is insulting and not being made in good faith. Not only have the general discussions on the pros/cons been made in public here, but various other API review members have given thumbs up on the comments I made above and the general summation of the two points was given in the new proposal, including linking to this thread to ensure context was not lost.

Also, I want to make sure that the decision was not opinionated, and that during the discussion there were arguments presented as to why introducing a new data type is worse than implementing new public APIs with scary-sounding formulations like BigEndian and LittleEndian.

Endianness is not "scary formulations". They are a basic concept that must be considered across a myriad of stacks and technologies.

This ultimately comes down to:

  1. The official UUID spec does not itself have a de-facto layout*. It defines and supports both variant 1 and variant 2.
  2. The difference between variant 1 and variant 2 comes in two parts. The primary difference being the endianness of the layout. The other is that in creation of the guid, there may be a specific pattern required for the 4-bit N specifier to differentiate which variant it is, but not all systems follow that.
  3. Given the above, any new System.Uuid type would itself need to support the exact new API surface being proposed for Guid in https://github.com/dotnet/runtime/issues/86798 such that it could be used for either variant 1 or variant 2 scenarios
  4. Given the above, we are down to a scenario where users are requesting a new type that only differs in behavior in how new Uuid(byte[]) and byte[] ToByteArray() behave. The difference is that one uses Read/WriteInt32BigEndian and the other uses Read/WriteInt32LittleEndian
  5. Introducing a new type simply to handle a minor behavioral difference on reading/writing raw byte sequences is generally undesirable. Not only is this not how we handle any other built-in type, but it introduces the risk of confusing users as to which type should be used and when.
  6. It introduces interchange and back-compat problems, particularly for existing APIs that are already using Guid because it's been around for 20 years and has been the thing to use for both variant 1 and variant 2 types. Such APIs now have to decide to support one, the other, or both and must determine how to interop between other systems that are already taking one, the other, or both.
  7. The general consideration of which to take in managed code doesn't matter. The only time it does matter is when you are converting to or from a raw byte sequence, such as for serialization purposes.

Edit: The spec does largely detail itself following variant 1 and describes it as "network order". With most of the callouts to variant 0/2 being noted as backwards-compatible, and variant 3 being reserved. But, that does not preclude the need to work with the other variants/versions nor the general descriptions/support that exists in the spec covering them
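
To make point 4 concrete, here is a minimal sketch of how the two byte layouts of one value relate. `ToBigEndianBytes` is a hypothetical helper written for this illustration, not a BCL API; it assumes only `Guid.ToByteArray()` and `System.Buffers.Binary.BinaryPrimitives`, and re-encodes the first three fields (one `int`, two `short`) while leaving the trailing 8 bytes untouched.

```csharp
using System;
using System.Buffers.Binary;

var id = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");

// Guid.ToByteArray() writes the first three fields little-endian.
byte[] le = id.ToByteArray();

// Hypothetical helper: same value, first three fields written big-endian.
byte[] be = ToBigEndianBytes(id);

Console.WriteLine(Convert.ToHexString(le)); // 33221100554477668899AABBCCDDEEFF
Console.WriteLine(Convert.ToHexString(be)); // 00112233445566778899AABBCCDDEEFF

static byte[] ToBigEndianBytes(Guid value)
{
    byte[] bytes = value.ToByteArray();

    // Read the three leading fields the way ToByteArray() wrote them...
    int a = BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(0, 4));
    short b = BinaryPrimitives.ReadInt16LittleEndian(bytes.AsSpan(4, 2));
    short c = BinaryPrimitives.ReadInt16LittleEndian(bytes.AsSpan(6, 2));

    // ...and write them back in big-endian order. Bytes 8..15 stay as they are.
    BinaryPrimitives.WriteInt32BigEndian(bytes.AsSpan(0, 4), a);
    BinaryPrimitives.WriteInt16BigEndian(bytes.AsSpan(4, 2), b);
    BinaryPrimitives.WriteInt16BigEndian(bytes.AsSpan(6, 2), c);
    return bytes;
}
```

Only the first 8 bytes differ between the two outputs, which is the "minor behavioral difference" the list refers to.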

tannergooding commented 1 year ago

To maybe give just a tiny bit more clarification on why this proposal was not considered the right approach

What you've effectively asked is that the .NET BCL expose:

public readonly struct Guid { }
public readonly struct Uuid { }

You could give these any number of names:

public readonly struct UuidLittleEndian { }
public readonly struct UuidBigEndian { }

public readonly struct UuidVariant1 { }
public readonly struct UuidVariant2 { }

etc

The Uuid spec also covers that it encodes "version" information (the 4 M bits) in addition to the "variant" information (the 4 N bits). This does not mean we would or should also expose UuidVersion5 just to handle that semantic. We would likewise not want to add or enforce validation that Uuid or Guid only allow their respective variants.

This is not how .NET exposes types in the BCL today, and it's not something that we want to do moving forward either. We want to grow and expand existing types to support new scenarios instead.

vanbukin commented 1 year ago

@tannergooding

This comment is insulting and not being made in good faith.

I apologize if my statements appeared to you as not being made in good faith.

There will be a publicly available live-streamed discussion where the community can join in via YouTube when the other proposal goes to API review.

However, it will be a different API Proposal, not the current one.

The offline discussion was simply to determine the general premise and whether it was worth marking this proposal as "ready-for-review", or whether we instead wanted to solve this the way I proposed above. That proposal is based on my many years of experience working on the libraries team and with API review around what API review will and will not approve, and what is and is not recommended by the Framework Design Guidelines.

Not only I, but also my team, my colleagues, many of my acquaintances and even strangers, have many years of experience using these libraries in production, created by the libraries team. That's why this API Proposal exists. As users, we want a more convenient API, which is available everywhere except for .NET.

Not only have the general discussions on the pros/cons been made in public here, but various other API review members have given thumbs up on the comments I made above and the general summation of the two points was given in the new proposal, including linking to this thread to ensure context was not lost.

Thank you for leaving a link to the current API Proposal in the new one, so that the context is not lost.

Endianness is not "scary formulations". They are a basic concept that must be considered across a myriad of stacks and technologies.

This may hold significance when working with native APIs. However, it is not something that an average developer would want to be concerned with in a managed environment while developing on .NET

  1. The official UUID spec does not itself have a de-facto layout*. It defines and supports both variant 1 and variant 2.

The UUID format is 16 octets; some bits of the eight octet variant field specified below determine finer structure - this is a direct quote. As we discussed before, if you try to use an API that works with bytes, which is provided by System.Guid, it must take into account the internal structure of the binary representation of System.Guid, which is a leaky abstraction in terms of API design.

  2. The difference between variant 1 and variant 2 comes in two parts. The primary difference being the endianness of the layout. The other is that in creation of the guid, there may be a specific pattern required for the 4-bit N specifier to differentiate which variant it is, but not all systems follow that.

As we discussed before, this structure must be implemented "somehow" internally, and the API should provide methods that give consistent string and binary representations. In the case of binary representation, it should be an array that corresponds to the string that was passed as input.

  3. Given the above, any new System.Uuid type would itself need to support the exact new API surface being proposed for Guid in https://github.com/dotnet/runtime/issues/86798 such that it could be used for either variant 1 or variant 2 scenarios
  4. Given the above, we are down to a scenario where users are requesting a new type that only differs in behavior in how new Uuid(byte[]) and byte[] ToByteArray() behave. The difference is that one uses Read/WriteInt32BigEndian and the other uses Read/WriteInt32LittleEndian

It is rather unfair to close an issue and instead open the one that you consider better (and immediately label it as "api-ready-for-review"), based on your own experience, without even allowing the current API Proposal to be reviewed in the public API Review, and afterward to appeal to the alternative API Proposal opened by you as an argument for why the current API Proposal should not even be considered.

  5. Introducing a new type simply to handle a minor behavioral difference on reading/writing raw byte sequences is generally undesirable. Not only is this not how we handle any other built-in type, but it introduces the risk of confusing users as to which type should be used and when.

Uuid is usually used as a primary key in databases, and in this particular case, it is really important because it is a data structure that serves as an identifier of an entity and can go through hundreds (!) of serialization and deserialization iterations during its lifecycle.

  6. It introduces interchange and back-compat problems, particularly for existing APIs that are already using Guid because it's been around for 20 years and has been the thing to use for both variant 1 and variant 2 types. Such APIs now have to decide to support one, the other, or both and must determine how to interop between other systems that are already taking one, the other, or both.

That is an indicator that such a data type should have been made 20 years ago, but it has not happened yet. However, if it were to appear, it would be a reason to move forward and start addressing the technical debt that has accumulated over these two decades.

  7. The general consideration of which to take in managed code doesn't matter. The only time it does matter is when you are converting to or from a raw byte sequence, such as for serialization purposes.

Yes, that is the primary use case of this data type

The spec does largely detail itself following variant 1 and describes it as "network order". With most of the callouts to variant 0/2 being noted as backwards-compatible, and variant 3 being reserved. But, that does not preclude the need to work with the other variants/versions nor the general descriptions/support that exists in the spec covering them

This would only be relevant if we are reading or writing the binary representation of this data type 'as is.' However, the specific implementation of the public API for this data type does not have to be implemented in that way, especially in a managed environment.

DaZombieKiller commented 1 year ago

Endianness is not "scary formulations". They are a basic concept that must be considered across a myriad of stacks and technologies.

This may hold significance when working with native APIs. However, it is not something that an average developer would want to be concerned with in a managed environment while developing on .NET

Whether you are working with native APIs or not doesn't change the fact that endianness is a fundamental part of binary serialization, so I'm not sure why native APIs are relevant here. Endianness is a well-documented concept that is exposed all over the .NET API surface for binary serialization.

vanbukin commented 1 year ago

@DaZombieKiller

If you are working with binary serialization, endianness is not a concept you can just ignore. I think it would be a bad idea to attempt to hide this from the API consumer -- that ambiguity is partially what caused the serialization of System.Guid to be as confusing as it is today. Documentation can and should solve any confusion here, it's no different to how one would serialize primitives such as short, int, etc.

Whether you are working with native APIs or not doesn't change the fact that endianness is a fundamental part of binary serialization, so I'm not sure why native APIs are relevant here. Endianness is a well-documented concept that is exposed all over the .NET API surface for binary serialization.

The managed environment tries to hide the nuances of dealing with endianness from us. This should and can be hidden in the implementation details of such a data type. Those who actually need to take care of endianness know what they are doing and can perform an equivalent of the reinterpret_cast operation on bytes to convert them to the desired data structure. The question of whether to create public APIs for a data type that allow such operations should be discussed after a decision has been made that the data type is truly necessary.

DaZombieKiller commented 1 year ago

The managed environment tries to hide the nuances of dealing with endianness from us.

This is not true. Endianness is not hidden from you in C# and .NET any more than it is in C, C++, Rust, etc. System.Guid.ToByteArray is one of the few exceptions where endianness is ambiguous, and that's the cause of this whole issue to begin with.

The proposed System.Uuid is exactly the same as System.Guid except it serializes in big endian instead of little endian. This is no better than what we have today with System.Guid because it's still confusing in the same way: the endianness to expect from binary serialization is not obvious.

If you are serializing to binary, you NEED to agree on the endianness on both sides: serialization and deserialization. Being explicit about the endianness is how you ensure that consistency is maintained here.

tannergooding commented 1 year ago

However, it will be a different API Proposal, not the current one.

API review explicitly discusses alternatives and linked issues. We will not review the other proposal without bringing up the fact that users originally asked for this. Many API reviewers are also already familiar with the context, as per my callout of the internal checks before I closed this issue in favor of the new one.

As we discussed before, this structure must be implemented "somehow" internally, and the API should provide methods that give consistent string and binary representations.

That is not how any type across the BCL works. Binary layout and string layout are not consistent and explicitly do not match by default for the vast majority of hardware that exists.

Most modern machines (x86, x64, Arm32, Arm64, RISC-V, etc) are exclusively or at least primarily little-endian. They all represent their values with the least significant byte first in memory. However, ToString, Parse, and most other APIs all must follow the same format that the current (or explicitly requested) culture follows. For English and invariant languages, this is that text and numbers are read from left to right, top to bottom, with the most significant digit appearing first. Thus, 65534 is 0xFFFE and is ordered in memory on most machines as 0xFE, 0xFF. If you were to access the raw bytes whether from some ToByteArray() method, from using a BinaryWriter, or many other scenarios, you would find that it serialized inversely from how ToString prints it.
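
The 65534 example can be reproduced with a short sketch. It uses `BinaryPrimitives` to write the bytes in an explicit order, so the output does not depend on the machine it runs on; the variable names are illustrative only.

```csharp
using System;
using System.Buffers.Binary;

ushort value = 65534;

byte[] littleEndian = new byte[2];
byte[] bigEndian = new byte[2];

// Write the same value with each byte order, independent of the host CPU.
BinaryPrimitives.WriteUInt16LittleEndian(littleEndian, value);
BinaryPrimitives.WriteUInt16BigEndian(bigEndian, value);

// The textual form puts the most significant digit first.
string text = value.ToString("X4");

Console.WriteLine(text);                              // FFFE
Console.WriteLine(Convert.ToHexString(littleEndian)); // FEFF
Console.WriteLine(Convert.ToHexString(bigEndian));    // FFFE
```

The little-endian bytes read "backwards" relative to the string, which is the same mismatch people notice between Guid.ToString() and Guid.ToByteArray().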

Guid here is really no different and in fact if we exposed some Uuid we could safely define its internal layout to likewise be identical to Guid. We could define it to be a Guid field itself. Because the internal layout of the type doesn't matter. What matters is what the API that reads/writes the byte sequence does and there are two valid behaviors with exactly which being correct depending on whether you are expecting a variant 1 or variant 2 UUID.

It is rather unfair to close an issue and instead open the one that you consider better (and immediately label it as "api-ready-for-review"), based on your own experience, without even allowing the current API Proposal to be reviewed in the public API Review.

That is not how API review in .NET works. It is the responsibility of the area owners, me in this case, to make an initial determination on whether something is even worth bringing to API review in the first place. We get literally thousands of API proposals, from all kinds of users, covering all ranges of scenarios. It would be impossible to truly review them all in depth.

This was a case where I had my own initial feeling of how API review would react and given the number of users asking for it, I did an initial offline check to confirm my suspicions. The new proposal was then opened based on the feedback from the API review members that a new type is indeed something we would not be willing to do; particularly given the scenario involved, how .NET has handled similar scenarios up until this point, etc.

Uuid is usually used as a primary key in databases, and in this particular case, it is really important because it is a data structure that serves as an identifier of an entity and can go through hundreds (!) of serialization and deserialization iterations during its lifecycle.

Yes, which also means that it could pass through many APIs in .NET. Some of which would take Guid and some of which would take Uuid. You would then have to convert at each of the boundaries, doing additional byte-swapping, fixups, and more things which are normally only a consideration at the actual serialization boundaries.

The proposed APIs to be exposed on Guid still handle the problem. They make it explicit which variant you are expecting and whether you need it to be BigEndian (variant 1) or LittleEndian (variant 2).
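
As a rough sketch of what that explicitness looks like in practice, the following assumes a runtime that includes the overloads that came out of the linked proposal, which to my understanding shipped in .NET 8 as `Guid.ToByteArray(bool bigEndian)` and `Guid(ReadOnlySpan<byte>, bool bigEndian)`. On older targets these calls will not compile, so treat the exact signatures as an assumption.

```csharp
using System;

var id = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");

// Writer side: explicitly ask for the big-endian layout.
byte[] wire = id.ToByteArray(bigEndian: true);

// Reader side, same byte order: the value round-trips.
var matched = new Guid(wire, bigEndian: true);

// Reader side, legacy constructor: the bytes are read as little-endian,
// so the first three fields come out byte-swapped.
var mismatched = new Guid(wire);

Console.WriteLine(matched);    // 00112233-4455-6677-8899-aabbccddeeff
Console.WriteLine(mismatched); // 33221100-5544-7766-8899-aabbccddeeff
```

The mismatched case is the failure mode either design has to guard against: it occurs whenever the writer and the reader disagree on byte order, regardless of which type they use.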

That is an indicator that such a data type should have been made 20 years ago, but it has not happened yet. However, if it were to appear, it would be a reason to move forward and start addressing the technical debt that has accumulated over these two decades.

If we were doing this today, with no concern of back-compat, we would likewise have 1 type. We would likely call it System.Uuid instead. It would then have a constructor that required specifying isBigEndian, rather than it being an additional overload. We would similarly reject a request for System.Guid or System.MsGuid under the same principles.

However, the specific implementation of the public API for this data type does not have to be implemented in that way, especially in a managed environment.

Raw sequences of bytes fundamentally must be told what order they are in to be read correctly.

If you don't want to work with raw bytes, use strings. If you do work with raw bytes, you must understand the endianness of them or you risk the wrong thing happening on mismatch.

The same issue would be present for one application calling ToByteArray() on Uuid and another calling new Guid(byte[]) on the other end. This is something that will happen in a number of scenarios for a new Uuid type. It will also impact users targeting .NET Framework or .NET Standard where such a Uuid type doesn't exist and where we typically do not ship polyfill packages due to .NET Framework no longer being versioned.

It introduces a magnitude of additional complexity, considerations, integration concerns, and general failure points above and beyond simply having overloads on Guid that handle the relatively minor difference which only exists as part of serialization.

vanbukin commented 1 year ago

@DaZombieKiller

This is not true. Endianness is not hidden from you in C# and .NET any more than it is in C, C++, Rust, etc. System.Guid.ToByteArray is one of the few exceptions where endianness is ambiguous, and that's the cause of this whole issue to begin with.

Yes, and the problem lies in how the binary and string representations work.

The proposed System.Uuid is exactly the same as System.Guid except it serializes in big endian instead of little endian. This is no better than what we have today with System.Guid because it's still confusing in the same way: the endianness to expect from binary serialization is not obvious.

The problem is that during binary (de)serialization, there is a "raw dump" of the internal representation of the Guid. In such a case, what happens when ToString is called does not correspond at all to what was passed in the constructor that takes an array of bytes. This is because the constructor that takes bytes expects the binary representation of the Guid, taking into account the details of its internal structure, and not the binary representation of the hex string, which is accepted by the constructor that takes a string.

If you are serializing to binary, you NEED to agree on the endianness on both sides: serialization and deserialization. Being explicit about the endianness is how you ensure that consistency is maintained here.

All this would not be a problem if System.Guid did not have methods to construct it from an array of bytes or to return its contents as an array of bytes.

DaZombieKiller commented 1 year ago

ToString is not intended for binary serialization, it is intended for display and parsing. You should not expect the results of ToString and ToByteArray to be equivalent, just like you shouldn't expect int.ToString() and BitConverter.GetBytes(int) to be equivalent. They serve very different purposes.

All this would not be a problem if System.Guid did not have methods to construct it from an array of bytes or to return its contents as an array of bytes.

The same fundamental problem would still exist, if you are serializing to binary then endianness cannot be avoided. I agree it would be less confusing though, because it would likely have dedicated serialization methods on BitConverter or BinaryPrimitives instead of the ambiguous constructor and ToByteArray method.

tannergooding commented 1 year ago

The problem is that during binary (de)serialization, there is a "raw dump" of the internal representation of the Guid.

This is not the problem nor is it a "raw dump". For example, when running on a machine such as an IBM System z9 (which is one of the few Big Endian machines), the raw byte sequences as read from memory will not match what is emitted by ToByteArray

In such a case, what happens when ToString is called does not correspond at all to what was passed in the constructor that takes an array of bytes.

ToString exactly corresponds to what was passed into the constructor. The disconnect is that users don't have an option to specify whether those bytes should be interpreted as being in big endian or little endian format and so when the format coming from some external source was in big endian format, it doesn't match what the user expected.

This is because the constructor that takes bytes expects the binary representation of the Guid, taking into account the details of its internal structure, and not the binary representation of the hex string, which is accepted by the constructor that takes a string.

It does not take the binary representation of the Guid. It takes them as little endian. This is why the actual implementation uses ReadInt32LittleEndian and not simply Unsafe.ReadUnaligned<Guid>(ref source[0]). This difference is extremely meaningful and shows up on real big endian systems, like the IBM System z9 I called out above.
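
A small sketch of that behavior, using only the long-standing `Guid(byte[])` constructor and `ToByteArray()`. The input bytes are arbitrary illustrative values; the point is that the first three fields are decoded as little-endian integers and the last 8 bytes are taken in order.

```csharp
using System;

byte[] bytes =
{
    0x00, 0x01, 0x02, 0x03, // int field, decoded as little-endian -> 03020100
    0x04, 0x05,             // short field, little-endian -> 0504
    0x06, 0x07,             // short field, little-endian -> 0706
    0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F // taken as-is
};

var fromBytes = new Guid(bytes);
string text = fromBytes.ToString();

// ToByteArray() applies the inverse mapping, so bytes round-trip exactly.
bool roundTrips = fromBytes.ToByteArray().AsSpan().SequenceEqual(bytes);

Console.WriteLine(text);       // 03020100-0504-0706-0809-0a0b0c0d0e0f
Console.WriteLine(roundTrips); // True
```

Because the decoding is defined in terms of an explicit byte order rather than a memory copy, this result should be the same on little-endian and big-endian hosts.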

All this would not be a problem if System.Guid did not have methods to construct it from an array of bytes or to return its contents as an array of bytes.

It still would exist and likely in a worse setup. Not only would users who need binary serialization try to do it themselves, they would be manually trying to read/write the bytes and so they would have to take a dependence on the internal layout rather than it being abstracted as it is today.