Read/Write Big/LittleEndian support for Guid #86798

Closed tannergooding closed 1 year ago

tannergooding commented 1 year ago

Rationale

System.Guid represents .NET's support for Globally Unique Identifiers or GUIDs (sometimes also referred to as Universally Unique Identifiers or UUIDs).

This type represents a 128-bit value in the general format of xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx where each x represents a hexadecimal digit, where the 4 bits of M represent the version and the 4 bits of N represent the variant number. This is sometimes referred to as the 8-4-4-4-12 format string. While the string representation is "well-defined", the actual underlying order of these bytes has a few different representations, and there are several variants of the general RFC 4122 that may require a specific ordering or may even limit specific bytes to a particular meaning.

.NET's System.Guid follows a field layout best matching variant 2, which is identical to variant 1 apart from endianness. In particular, variant 1 is "big endian" and variant 2 is "little endian". Variants 1 and 2 are used by the current UUID specification and are by far the most prominent, while variant 0 is largely considered deprecated. Apart from endianness, these variants are differentiated by minor bit pattern requirements.

Given this is largely an endianness difference and is otherwise just a minor difference in the bits used for M and N, we would prefer to not introduce a new type just to handle this and would instead prefer to introduce explicit APIs and overloads on Guid that help identify and handle these differences.
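To make the layout difference concrete, here is a minimal sketch that uses only long-standing APIs (`Guid.Parse`, `Guid.ToByteArray`, `BitConverter.ToString`). The class name `GuidLayoutDemo` and the sample value are made up for illustration and are not part of the proposal.

```csharp
using System;

public static class GuidLayoutDemo
{
    // Returns the bytes produced by the parameterless ToByteArray()
    // as an upper-case, dash-separated hex string.
    public static string DefaultBytesAsHex(string text)
    {
        Guid guid = Guid.Parse(text);
        return BitConverter.ToString(guid.ToByteArray());
    }
}
```

Calling `GuidLayoutDemo.DefaultBytesAsHex("00112233-4455-6677-8899-aabbccddeeff")` returns `33-22-11-00-55-44-77-66-88-99-AA-BB-CC-DD-EE-FF`: the first three groups come out byte-reversed relative to the string, while the last eight bytes keep their order.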

Proposed APIs

namespace System
{
    public partial struct Guid
    {
        // public Guid(byte[] value);
        // public Guid(ReadOnlySpan<byte> value);
        public Guid(ReadOnlySpan<byte> value, bool isBigEndian);

        // public byte[] ToByteArray();
        public byte[] ToByteArray(bool isBigEndian);

        // public bool TryWriteBytes(Span<byte> destination);
        // public bool TryWriteBytes(Span<byte> destination, out int bytesWritten); -- new in .NET 8
        public bool TryWriteBytes(Span<byte> destination, bool isBigEndian, out int bytesWritten);
    }
}

namespace System.Buffers.Binary
{
    public static partial class BinaryPrimitives
    {
        public static Guid ReadGuidBigEndian(ReadOnlySpan<byte> source);
        public static Guid ReadGuidLittleEndian(ReadOnlySpan<byte> source);

        public static bool TryReadGuidBigEndian(ReadOnlySpan<byte> source, out Guid value);
        public static bool TryReadGuidLittleEndian(ReadOnlySpan<byte> source, out Guid value);

        public static bool TryWriteGuidBigEndian(Span<byte> destination, Guid value);
        public static bool TryWriteGuidLittleEndian(Span<byte> destination, Guid value);

        public static void WriteGuidBigEndian(Span<byte> destination, Guid value);
        public static void WriteGuidLittleEndian(Span<byte> destination, Guid value);
    }
}

Drawbacks

As discussed on https://github.com/dotnet/runtime/issues/86084, there is a general concern that users may not be aware that these other overloads exist -or- may not be aware that the difference between variant 1 and variant 2 is endianness and that .NET defaults to variant 2.

However, the same general considerations exist when exposing a new type such as System.Uuid. A new type also brings additional concerns: it further bifurcates the type system, doesn't easily allow polyfilling the support downlevel without shipping a new OOB package, and may further confuse users due to the frequent interchange of the GUID and UUID terminology.

After discussion with a few other API review team members, the general consensus was that shipping a new type is undesirable and we should prefer fixing this via new APIs/overloads and potentially looking into additional ways to surface the difference to users (such as analyzers, API documentation, etc).

Additional Considerations

Given the above, we may want to consider how to help point users towards their desired APIs given the overloads on Guid that do not require specifying endianness.

We can clearly update the documentation, but an analyzer seems like a desired addition that can help point devs towards specifying the endianness explicitly. Obsoleting the existing overloads was also proposed, but may be undesirable since the current behavior isn't "wrong", it just may be the undesired behavior in some scenarios.

We may also want to consider whether a static Guid NewGuid() overload that allows conforming to Version 4, Variant 1 is desired. The docs only indicate that it is version 4 and that it calls into the underlying system APIs. They do not indicate whether it produces Variant 1, Variant 2, or truly random bits for N.
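One way to check what a given Guid carries in its M and N positions is to read the two nibbles from the dash-free "N" string format. This is a minimal sketch; `GuidBitsDemo` and its method names are hypothetical helpers, not existing or proposed APIs.

```csharp
using System;

public static class GuidBitsDemo
{
    // ToString("N") yields 32 hex digits without dashes, so the M nibble
    // sits at index 12 (after 8 + 4 digits) and the N nibble at index 16.
    public static int VersionNibble(Guid guid) =>
        Convert.ToInt32(guid.ToString("N").Substring(12, 1), 16);

    public static int VariantNibble(Guid guid) =>
        Convert.ToInt32(guid.ToString("N").Substring(16, 1), 16);
}
```

For the sample value f81d4fae-7dec-11d0-a765-00a0c91e6bf6, `VersionNibble` returns 1 and `VariantNibble` returns 10 (0xA, binary 1010). Applying the same helpers to `Guid.NewGuid()` shows what the current platform actually produces for N.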

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Author:	tannergooding
Assignees:	-
Labels:	`area-System.Runtime`, `api-ready-for-review`
Milestone:	-

danmoseley commented 1 year ago

I don't think I see any reasoning for why these are worth adding, other than that they are in use. When is this format needed? Is there any relationship with execution on big endian architecture, or not particularly?

tannergooding commented 1 year ago

When is this format needed?

The current UUID spec has 2 commonly used formats: variant 1 and variant 2. COM, and therefore most of Windows, uses variant 2. Many other domains use variant 1 instead, and many of them were called out on the other thread linked above.

The difference between the two, in terms of layout, is variant 1 is big endian and variant 2 is little endian.

Thus, these functions are needed for users to be able to correctly interact with such systems and to support the full UUID spec.

Is there any relationship with execution on big endian architecture, or not particularly?

The relationship is to how a sequence of raw bytes is interpreted. While machines may operate in big or little endian mode, endianness comes up in many contexts. Networking is a large one where bytes are almost exclusively sent in big endian format (so much so that a common description is "network order").
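As a small illustration of "how a sequence of raw bytes is interpreted", the sketch below writes the same 32-bit value in both byte orders using the existing `BinaryPrimitives` helpers. `EndiannessDemo` is a made-up name for this example.

```csharp
using System;
using System.Buffers.Binary;

public static class EndiannessDemo
{
    // Writes the value most significant byte first ("network order").
    public static string BigEndianHex(int value)
    {
        byte[] buffer = new byte[4];
        BinaryPrimitives.WriteInt32BigEndian(buffer, value);
        return BitConverter.ToString(buffer);
    }

    // Writes the value least significant byte first.
    public static string LittleEndianHex(int value)
    {
        byte[] buffer = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(buffer, value);
        return BitConverter.ToString(buffer);
    }
}
```

`EndiannessDemo.BigEndianHex(0x01020304)` returns `01-02-03-04` and `EndiannessDemo.LittleEndianHex(0x01020304)` returns `04-03-02-01`, regardless of the endianness of the machine running the code.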

Kirill-Maurin commented 1 year ago

It seems that the history of the worst feature of the .NET BCL (DateTime.Kind) has not taught anyone anything

tannergooding commented 1 year ago

This is not a Kind, it is an endianness concern.

The proposed Uuid from https://github.com/dotnet/runtime/issues/86084 would have the same consideration because the official UUID spec defines and supports both variant 1 and variant 2. Thus, you could not expose a type called Uuid and have it only support 1.

You could have some UuidVariant1, but that still comes with the same general considerations and problems. It still introduces additional confusion to end users on which to use and when and surfaces what is effectively a serialization concern into the exposed type system.

This ultimately comes down to:

The official UUID spec does not itself have a de-facto layout.* It defines and supports both variant 1 and variant 2.
The difference between variant 1 and variant 2 comes in two parts. The primary difference being the endianness of the layout. The other is that in creation of the guid, there may be a specific pattern required for the 4-bit N specifier to differentiate which variant it is, but not all systems follow that.
Given the above, any new System.Uuid type would itself need to support the exact new API surface being proposed for Guid in https://github.com/dotnet/runtime/issues/86798 such that it could be used for either variant 1 or variant 2 scenarios
Given the above, we are down to a scenario where users are requesting a new type that only differs in behavior in how new Uuid(byte[]) and byte[] ToByteArray() behave. The difference is that one uses Read/WriteInt32BigEndian and the other uses Read/WriteInt32LittleEndian
Introducing a new type simply to handle a minor behavioral difference on reading/writing raw byte sequences is generally undesirable. Not only is this not how we handle any other built-in type, but it introduces the risk of confusing users as to which type should be used and when.
It introduces interchange and back-compat problems, particularly for existing APIs that are already using Guid because it's been around for 20 years and has been the thing to use for both variant 1 and variant 2 types. Such APIs now have to decide to support one, the other, or both and must determine how to interop between other systems that are already taking one, the other, or both.
The general consideration of which to take in managed code doesn't matter. The only time it does matter is when you are converting to or from a raw byte sequence, such as for serialization purposes.

Edit: The spec does largely describe itself as following variant 1 and calls that "network order", with most of the callouts to variant 0/2 being noted as backwards-compatible and variant 3 being reserved. But that does not preclude the need to work with the other variants/versions, nor the general descriptions/support that exist in the spec covering them.

DaZombieKiller commented 1 year ago

It seems that the history of the worst feature of the .NET BCL (DateTime.Kind) has not taught anyone anything

That isn't quite comparable, DateTime.Kind represents information about an instance. These APIs are about the binary representation of a GUID at a serialization boundary only. The internal byte order of a System.Guid isn't something that can differ per instance.

tannergooding commented 1 year ago

This is not a Kind, it is an endianness concern.

We likewise do not expose the UUID versions or other information in the type system, nor would we.

aloraman commented 1 year ago

Several nitpicks:

1) Guids (Uuids) in the wild do support RFC-4122, but in practice they are just containers for 16 bytes worth of data, i.e., any xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-like value can be used, despite not conforming to the allowed values for the variant/version flags.

2) The internal structure of Guid (Uuid) is not just an implementation detail, because it is externally observable. And I'm not talking about the difference between byte array and string construction. There's another observable behavior: sorting. System.Guid implements IComparable and, therefore, has a natural order. That order is the same in a .NET application and in SqlServer, but doesn't match the order in every other place. Major PITA when trying to use it to implement bounding-box optimizations for containment checks.

3) All the "It's not the .NET way" is rather nonsensical. When I look at the proposed API, all I can think about is that it has the same eerie feel as the support for unsigned arithmetic operations in Java. Thankfully, we have System.UInt32, not a bunch of XXXUnsigned methods in System.Int32. And Int32/UInt32 don't have a minor difference in bit patterns - they are the same! Well, it was 20+ years ago, yes... But the same goes for System.DateOnly/System.TimeOnly, which are a new thing. And when DateOnly/TimeOnly were designed, it was stated that new types are preferable over a maze of obscurely named methods on System.DateTime.

So, IMHO, a separate Uuid type is preferable over the proposed API. I don't think System.Guid should be obsoleted; rather, both Guid and Uuid should be available, so one or the other can be used in a specific scenario. If only we could come up with better naming. GUID sounds so much nicer than UUID.

DaZombieKiller commented 1 year ago

System.Guid implements IComparable and, therefore, has a natural order.

I'm not sure how this is relevant to byte order (endianness). Can you clarify? The proposal is not to change System.Guid's layout, that will remain the same. This is about offering new APIs to use at the serialization boundary.

System.Uuid would be completely identical to System.Guid with the exception of the serialization methods, which is where some of the concerns regarding its usefulness come from. It has minimal benefit over just using System.Guid for anything but serialization (and that benefit only applies if you expect to serialize/deserialize in big endian.)

Internal structure of Guid (Uuid) is not just an implementation detail, because it is externally observable.

That said, this part is still accurate. System.Guid currently has the same layout as the Win32 GUID structure, which allows it to be used in blittable interop. That does not affect this proposal because it's not about changing that layout, but it is part of why System.Guid's layout will most likely not change.

And when DateOnly/TimeOnly were designed, it was stated that new types are preferrable over a maze of obscurely named methods on System.DateTime.

As mentioned earlier, the problems with DateTime aren't quite comparable to this situation. DateTime's issues stem from it representing more than one kind of data, while this is about serialization. System.Guid can still only represent one kind of data: a GUID/UUID.

tannergooding commented 1 year ago

Internal structure of Guid (Uuid) is not just an implementation detail, because it is externally observable

This has nothing to do with the internal structure, but rather has everything to do with the implementation of IComparable. We could choose to treat that as UInt128 or 2x UInt64 to be "more efficient" if we really wanted to.

We could choose to implement Guid as a byte[] or as 16x doubles each containing a byte between 0 and 255. Those are implementation details that don't matter to the consumer of the API and how it operates. We don't do these other approaches because they aren't efficient and limit the broader usability of the types.

Things like IComparable just have to be consistent and the behavior of IComparable on a Guid that is created using new Guid(byte[], isBigEndian: true) would be identical to the proposed Uuid type.

And Int32/UInt32 don't have a minor difference in bit patterns - they are the same!

Yes, but they have a strong semantic difference that appears as part of comparisons, addition, subtraction, multiplication, division, remainder, conversions, min, max, and almost every other operation you can think of.

Guid vs Uuid have a minor semantic difference in that when reading from or writing to a raw byte sequence, you swap the endianness. We correspondingly do not have UInt32LittleEndian and UInt32BigEndian.

But the same goes for System.DateOnly/System.TimeOnly, which are a new thing.

This also comes about from perf advantages and significantly reduced complexity that frequently comes up in common usages of the types.

What you've effectively asked is that the .NET BCL expose:

public readonly struct Guid { }
public readonly struct Uuid { }

You could give these any number of names:

public readonly struct UuidLittleEndian { }
public readonly struct UuidBigEndian { }

public readonly struct UuidVariant1 { }
public readonly struct UuidVariant2 { }

public readonly struct UuidMachineOrder { }
public readonly struct UuidNetworkOrder { }

etc

The Uuid spec also covers that it encodes "version" information (the 4 M bits) in addition to the "variant" information (the 4 N bits). This does not mean we would or should also expose UuidVersion5 just to handle that semantic. We would likewise not want to add or enforce validation that Uuid or Guid only allow in their respective variants.

This is not how .NET exposes types in the BCL today, and it's not something that we want to do moving forward either. We want to grow and expand existing types to support new scenarios instead.

If we were designing this today, without any prior concerns, we'd probably name it System.Uuid and it would have the exact same methods as Guid including those covered in this proposal. We would then similarly reject a proposal to expose some System.Guid or System.MsGuid type.

jkotas commented 1 year ago

Do you have examples in our own libraries or in the code out there that would use these APIs?

We do provide efficient allocation-free access to the underlying Guid bytes. If needed, anybody can write their own binary serializer/deserializer that fits the given format in just a few lines. I do not think we need to be providing all possible variants of binary serializers/deserializers for BCL types. For example, we do not provide helpers for different variants of RLE-encoding of integers either.

https://github.com/dotnet/runtime/issues/85891 is a very similar proposal; it is also asking for specific binary serializer/deserializer helpers.

tannergooding commented 1 year ago

Do you have examples in our own libraries or in the code out there that would use these APIs?

This was opened because we got an issue (linked in the top proposal) that had a massive influx of upvotes/support.

Our default behavior for new Guid/ToByteArray/TryWriteBytes is efficient, but can also be error prone and less efficient for users. Users working with variant 1 UUIDs need additional ReadInt32LittleEndian/WriteInt32BigEndian, slice, and ReadInt16LittleEndian/WriteInt16BigEndian calls to fix up the span prior to new Guid() or after ToByteArray/TryWriteBytes. They then need to pass the original span back through to their target API. This can often require additional copying or work beyond the proposed APIs above. This is different from, say, DateTime or TimeSpan, where the way to roundtrip is to use the same APIs over the Ticks property, so users can already do it efficiently and with no loss of perf.

These APIs cover the core need, efficiently, and help make it visible that the existing APIs may not do exactly what the user may be expecting.

tannergooding commented 1 year ago

It certainly would not be an end of the world scenario if these weren't exposed. But given the number of upvotes on the original issue asking for a System.Uuid type and the additional clarity this can bring, I do think it's worthwhile.

-- There is notably still a decent number of users from the System.Uuid proposal that don't want this new functionality, they only want System.Uuid, so the exact number of satisfied users won't quite be the same; but exposing Uuid is not something that would have passed API review at all and it would have resolved down to this proposal or do nothing anyways.

tannergooding commented 1 year ago

The user implemented code would effectively be:

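using System;
using System.Buffers.Binary;

public static byte[] ToBigEndianByteArray(Guid guid)
{
    // ToByteArray() writes the leading int and two shorts in little-endian order.
    byte[] tmp = guid.ToByteArray();
    Span<byte> span = tmp;

    // Byte-swap the 32-bit field at offset 0.
    BinaryPrimitives.WriteInt32BigEndian(span, BinaryPrimitives.ReadInt32LittleEndian(span));

    // Byte-swap the 16-bit field at offset 4.
    span = span.Slice(4);
    BinaryPrimitives.WriteInt16BigEndian(span, BinaryPrimitives.ReadInt16LittleEndian(span));

    // Byte-swap the 16-bit field at offset 6; the last 8 bytes are already in order.
    span = span.Slice(2);
    BinaryPrimitives.WriteInt16BigEndian(span, BinaryPrimitives.ReadInt16LittleEndian(span));

    return tmp;
}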

and similar for the inverse direction
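A minimal sketch of that inverse direction, under the same assumptions: it copies the big-endian input, swaps the first three fields back to little-endian, and hands the result to the existing `Guid(byte[])` constructor. The names `GuidBigEndianReader` and `FromBigEndianBytes` are hypothetical and not part of the proposal.

```csharp
using System;
using System.Buffers.Binary;

public static class GuidBigEndianReader
{
    // Reads 16 bytes laid out in big-endian (network) order into a Guid.
    public static Guid FromBigEndianBytes(ReadOnlySpan<byte> source)
    {
        if (source.Length != 16)
        {
            throw new ArgumentException("A Guid requires exactly 16 bytes.", nameof(source));
        }

        // Copy so the caller's buffer is left untouched.
        byte[] tmp = source.ToArray();
        Span<byte> span = tmp;

        // Swap the 32-bit field at offset 0 into little-endian order.
        BinaryPrimitives.WriteInt32LittleEndian(span, BinaryPrimitives.ReadInt32BigEndian(span));

        // Swap the 16-bit fields at offsets 4 and 6.
        span = span.Slice(4);
        BinaryPrimitives.WriteInt16LittleEndian(span, BinaryPrimitives.ReadInt16BigEndian(span));
        span = span.Slice(2);
        BinaryPrimitives.WriteInt16LittleEndian(span, BinaryPrimitives.ReadInt16BigEndian(span));

        // The existing constructor expects the little-endian field layout.
        return new Guid(tmp);
    }
}
```

With this helper, the bytes 00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF produce the Guid 00112233-4455-6677-8899-aabbccddeeff, so the byte sequence and the printed string read in the same order.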

aloraman commented 1 year ago

I'm not sure how this is relevant to byte order (endianness). Can you clarify?

It affects the implementation of sorting/serialization (see https://github.com/dotnet/runtime/pull/81650/files ) and corresponding performance cost.

System.Uuid would be completely identical to System.Guid with the exception of the serialization methods, which is where some of the concerns regarding its usefulness come from.

Well, maybe the author of https://github.com/dotnet/runtime/issues/86084 just wants consistency between string and byte serialization, but I want more. That is, if System.Uuid is to be added, let's make it as feature rich as it is in other programming languages, e.g., let it be a network-ordered set of 16 octets, with vectorizable initialization from string and byte array, and have it be transparently convertible to UInt128, so it will be easy to do arithmetic, sorting and so on. It is such a widely used primitive that it's shameful there's no similar primitive available out of the box in .NET. Let System.Guid continue to be GUID - that's fine, but let's not pretend it doesn't have a myriad of quirks and inconsistencies.

That isn't quite comparable, DateTime.Kind represents information about an instance.

But the trouble is just the same. It takes just a single element in a large chain of interacting code fragments with multiple serialization boundaries to mishandle or lose the DateTimeKind data and completely break the process. That results in garbage data and countless hours spent on bugfixing.

This is about offering new APIs to use at the serialization boundary.

But the introduction of these new APIs won't solve the problem. (The aforementioned problem being the inconsistency between variant-1 and variant-2 compatible treatments of binary and string serialization.) You can always implement such an API yourself (and people do implement similar APIs). But the 'ordinary' enterprise programmer rarely interacts with the serialization boundary directly - it's handled by 1st and 3rd party libraries for serialization/model binding/object-relational mapping. The problem comes from inconsistencies in handling Guid/Uuid data within these libraries, which is not observable without digging through their internals. Well, maybe some libraries will replace their implementations with these APIs, but even if they do, the problem will persist. Solving the problem will force them either to break backward compatibility, which isn't what good libraries often do, or to pass BE/LE switches up to the API surface, which, honestly, looks ugly and javaesque. A separate System.Uuid type will, however, allow a parallel construction process - treat System.Guid in a backwards compatible way, treat System.Uuid in a variant-1 native way.

This was opened because we got an issue (linked in the top proposal) that had a massive influx of upvotes/support.

There's not much of a surprise here. The issue touches on a major sore subject for modern enterprise programming. Back in the day you had the full MSFT stack, where GUID behaved consistently in the OS, the DB (SqlServer) and the codebase. Nowadays, in the cloud and in heterogeneous applications, inconsistencies with other programming stacks are a consistent source of PITA. A programmer can understand the difference between Uuid V1 vs V2, but it doesn't stop the stream of questions/bugs regularly raised by QA, PM/PO, Analysts and Power Users. "Why is the order different in-app and in-db?", "Why is the order of digits in Swagger and DbExplorer different?". The difference between uuid and uniqueidentifier is like the third largest problem with database primitives for migration between SqlServer and PostgreSql (the first two being the lack of DateTimeOffset and the different treatment of text-like types)

kolosovpetro commented 1 year ago

I have experienced the following case using GUIDs. Someone calls Guid.NewGuid and passes the result not as a GUID, but as a parameter in the request, forming it

1) through a call to ToByteArray 2) through a call to ToByteArray that called Convert.FromHex 3) through a call to ToString

instead of using a dedicated, correct generation algorithm. The problem then appears because the sequence of bits produced by the ToString() method differs from the one produced by ToByteArray(), etc.

Generally, GUID serves its purpose fine and there is no need to touch or modify it. As I see it, the new API methods will just clutter GUID's API. Also, the motivation for such an API extension is not clear at first glance.

Instead, a UUID type may be added to the BCL along with MSDN documentation that explicitly states the problem the UUID type solves. It is much simpler to just use the out-of-the-box methods ToString(), ToByteArray() etc. and get the same sequence of bits in both the byte array and the string representation of the 16-byte number.

Szer commented 1 year ago

5 cents regarding my recent experience with UUIDs on other platforms.

I needed UUID v7 support (sortable one). In JVM it just works with java.util.UUID because if I would parse some API response where UUID is serialized as a string and then store it in PGSQL, I know that property of UUID v7 won't be lost on an Application <-> DB boundary because of byte order.

With System.Guid and the proposed API I need to go into every single driver source code (ADO.Net, NpgSQL, DataStax, etc) to make sure that driver serializes my Guid correctly, otherwise, my data will be corrupted. Different byte order will make the sortable column non-sortable and the index magically becomes non-clustered.

Moreover, the UUID spec for the new UUID versions mentions network order https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#section-6.7-4

UUIDs created by this specification are crafted with big-ending byte order (network byte order) in mind. 
If Little-endian style is required a custom UUID format SHOULD be created using UUIDv8.

How can UUID v7 be implemented correctly with the existing System.Guid and the new proposed API?
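For reference, one possible shape of an answer is sketched below. It is not taken from this thread: it assumes the version 7 layout described in the linked draft (a 48-bit big-endian Unix millisecond timestamp, the version nibble, then random bits with the variant bits set to 10), and `UuidV7Sketch` and its methods are hypothetical names. The timestamp and random bytes are passed in so the result is reproducible.

```csharp
using System;
using System.Buffers.Binary;

public static class UuidV7Sketch
{
    // Builds the 16 network-order bytes from a Unix millisecond timestamp
    // and 10 caller-supplied random bytes.
    public static byte[] BuildNetworkOrderBytes(long unixMilliseconds, ReadOnlySpan<byte> random)
    {
        if (random.Length != 10)
        {
            throw new ArgumentException("Expected exactly 10 random bytes.", nameof(random));
        }

        byte[] bytes = new byte[16];

        // Place the low 48 bits of the timestamp, big-endian, in bytes 0..5.
        ulong timestamp = (ulong)unixMilliseconds & 0xFFFFFFFFFFFFUL;
        BinaryPrimitives.WriteUInt64BigEndian(bytes, timestamp << 16);

        // Fill bytes 6..15 with the random data, then stamp version and variant.
        random.CopyTo(bytes.AsSpan(6));
        bytes[6] = (byte)(0x70 | (bytes[6] & 0x0F)); // M nibble = 7
        bytes[8] = (byte)(0x80 | (bytes[8] & 0x3F)); // top two bits of N = 10
        return bytes;
    }

    // Converts network-order bytes to a Guid whose string form reads in the same order.
    public static Guid ToGuid(byte[] networkOrderBytes)
    {
        byte[] tmp = (byte[])networkOrderBytes.Clone();
        Array.Reverse(tmp, 0, 4); // 32-bit field
        Array.Reverse(tmp, 4, 2); // first 16-bit field
        Array.Reverse(tmp, 6, 2); // second 16-bit field
        return new Guid(tmp);
    }
}
```

With a timestamp of 0x0123456789AB and ten 0xFF random bytes, the resulting Guid prints as 01234567-89ab-7fff-bfff-ffffffffffff. The sortable property then holds for the network-order bytes and for the string form; whether it survives a database round trip still depends on which byte order the driver writes.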

Szer commented 1 year ago

The business case for UUIDv7 - cursor-based pagination which requires sorting.

Why UUID in the first place? I need a random, unique, opaque identifier. Cursor-based pagination requires sortability as well, which v4 doesn't have as a property. Workaround - introduce additional cursor tokens/columns/ids for such pagination, but why?

Snowflake is dead, UUID v7 is the industry standard now. So I would really like to hear how this proposal will help us build industry-grade applications with new UUID versions in mind

tannergooding commented 1 year ago

It affects the implementation of sorting/serialization (see https://github.com/dotnet/runtime/pull/81650/files ) and corresponding performance cost.

Endianness has a relatively minor impact on the implementation in that it determines which fields are compared first vs last and in a couple edge cases involving construction, it determines if any byte swapping is required.

That is, the difference between a big endian layout and little endian layout is in the worst case scenario no more than 1 additional instruction on any hardware which RyuJIT supports that has shipped in the last 17 years. We have slightly less efficient implementations in a few cases, but there's nothing stopping us from optimizing those if it is considered a core need.

For sorting, Guid today is already implemented such that 00000001-0000-0000-0000-000000000000 is less than 00000002-0000-0000-0000-000000000000 and so on. That is, from the printed string you could functionally remove the dashes and treat it as a single UInt128 integer literal. Uuid would function identically, as the desire is to not compare based on the field layout, but rather on the type as a whole. Regardless of the layout, the Guid is functionally a 128-bit unsigned integer and it compares itself from most significant byte to least significant byte.
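The following sketch contrasts `Guid.CompareTo` with a plain byte-by-byte comparison of the default `ToByteArray()` output, which is the ordering an external system sees if it sorts the raw little-endian bytes. `GuidOrderingDemo` is a hypothetical helper written for this illustration.

```csharp
using System;

public static class GuidOrderingDemo
{
    // Sign of the comparison as Guid itself defines it.
    public static int CompareAsGuid(Guid a, Guid b) => Math.Sign(a.CompareTo(b));

    // Sign of a byte-by-byte comparison of the default (little-endian) byte arrays.
    public static int CompareDefaultBytes(Guid a, Guid b)
    {
        ReadOnlySpan<byte> rawA = a.ToByteArray();
        ReadOnlySpan<byte> rawB = b.ToByteArray();
        return Math.Sign(rawA.SequenceCompareTo(rawB));
    }
}
```

For a = 00000100-0000-0000-0000-000000000000 and b = 00000002-0000-0000-0000-000000000000, `CompareAsGuid(a, b)` returns 1, matching the printed digits, while `CompareDefaultBytes(a, b)` returns -1 because the raw bytes start with 00 01 and 02 00 respectively. Serializing in big-endian order makes the two orderings agree for this pair.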

e.g., let it be a network-ordered set of 16 octets

This has no impact on the actual vectorizability of the code and in fact is the least efficient way to set up the bytes. The "ideal" layout performance-wise is 2x uint64 fields that are stored in the same endianness format as the native machine (typically little endian). This is true across all possible operations that the type could support except for serialization to or from a network order byte sequence, where it requires 1 additional instruction. Inversely, storing as a forced big endian layout requires 1 additional instruction as part of every other operation being done, including comparisons.
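As a rough illustration of the "2x uint64" view, the sketch below compares two 16-byte network-order sequences by reading each half as a big-endian `ulong`. On little-endian hardware each read amounts to a load plus a byte swap, which is the extra step referred to above. `NetworkOrderCompare` is a made-up name, and this is not how `Guid` is implemented.

```csharp
using System;
using System.Buffers.Binary;

public static class NetworkOrderCompare
{
    // Compares two 16-byte big-endian sequences as 128-bit unsigned integers.
    public static int Compare(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
    {
        // Most significant 64 bits first.
        ulong leftHigh = BinaryPrimitives.ReadUInt64BigEndian(left);
        ulong rightHigh = BinaryPrimitives.ReadUInt64BigEndian(right);
        if (leftHigh != rightHigh)
        {
            return leftHigh < rightHigh ? -1 : 1;
        }

        // Fall back to the least significant 64 bits.
        ulong leftLow = BinaryPrimitives.ReadUInt64BigEndian(left.Slice(8));
        ulong rightLow = BinaryPrimitives.ReadUInt64BigEndian(right.Slice(8));
        return leftLow.CompareTo(rightLow);
    }
}
```

The result matches a byte-by-byte comparison of the same sequences, but needs at most four 64-bit reads instead of up to sixteen byte comparisons.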

with vectorizable initialization from string and byte array,

We already can do this on Guid and are doing it in some of the locations today. We could optimize even more of the scenarios, but it hasn't bubbled up as a core need thus far. User input can help direct that to happen.

have it to be transparently convertible to UInt128

This is a non-starter, we do not transparently convert between non-equivalent types like that as it breaks type safety, introduces a range of ambiguities and risks of breaking change, etc. It goes against many of the core Framework Design Guidelines.

but let's not pretend it doesn't have a myriad of quirks and inconsistencies.

There are no real quirks or inconsistencies in Guid. It represents a Uuid and can represent any 128-bit sequence of data. It supports all the core operations one would expect a Uuid to support, it behaves correctly for equality, comparison, string formatting, hashing, and every other API we expose.

The one piece of functionality it is missing is the ability to easily serialize it as a big-endian sequence of bytes (that is, serialize it in a format equivalent to a variant 1 UUID). This proposal adds that one piece of functionality.

As has been stated many times, some proposed System.Uuid would be literally identical to System.Guid except the constructor would use ReadInt32BigEndian rather than ReadInt32LittleEndian and the ToByteArray method would use WriteInt32BigEndian rather than WriteInt32LittleEndian.
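To make the "same type, different read direction" point concrete, the sketch below feeds the same 16 bytes through both interpretations. It assumes Python's standard `uuid` module as an analogue: `bytes=` treats the input as big-endian (network order), while `bytes_le=` treats the first three fields as little-endian, which is the layout the thread attributes to `new Guid(byte[])`.

```python
import uuid

raw = bytes(range(16))  # 00 01 02 ... 0f

# Big-endian (network order): the bytes map straight onto the printed string.
big = uuid.UUID(bytes=raw)

# Little-endian: the 4-byte, 2-byte, and 2-byte leading fields are byte-swapped;
# the trailing 8 bytes are taken as-is.
little = uuid.UUID(bytes_le=raw)

print(big)     # 00010203-0405-0607-0809-0a0b0c0d0e0f
print(little)  # 03020100-0504-0706-0809-0a0b0c0d0e0f
```

Everything after construction (formatting, equality, comparison) behaves the same for both objects; only the mapping from the input bytes differs.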

There are some minor adjustments that could be made to field layout, but nothing that is actually meaningful to the implementation or to the user-observable semantics of the type in safe code and nothing that would meaningfully impact the performance or usability of the type.

It takes just a single element in a large chain of interacting code fragments with multiple serialization boundaries to mishandle

This is really a non-argument. The same "concern" exists for serializing any primitive type to the network (char, decimal, double, short, int, long, nint, float, ushort, uint, ulong, nuint, etc). The same "concern" exists every time someone uses BinaryReader to interact with an ELF/PE file, or when reading a UTF16-BE file on disk. The same "concern" would exist for interacting between code that takes Uuid and code that takes Guid, particularly for the 20 years of code that is already taking Guid and then manually doing the endianness fixups so that it serializes in network order (big endian).

It takes just a single element in a large chain of interacting code fragments with multiple serialization boundaries to mishandle

There is no data to lose here. The serialization process already includes all bytes, the only thing that a user can mess up is that they use isBigEndian: true on one side and isBigEndian: false on the other. The same issue would exist for someone using Uuid on one side and Guid on the other (which will happen). The same issue would exist when working with another language and picking the wrong one of their types/methods.

This is something that is trivially caught and handled by a single test for the APIs that perform serialization. The Guid 00010203-0405-0607-0809-0A0B0C0D0E0F is sufficient for validating that endianness remains correct/expected as you simply need to validate that input.ToString() == output.ToString(). If it does, then the bytes were successfully serialized using the same endianness on both sides regardless of what the underlying field layout happens to be.
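A minimal sketch of such a round-trip test follows, again using Python's `uuid` module as a stand-in. The `write`, `read`, and `round_trips` helpers and the `is_big_endian` parameter are hypothetical names that mirror the shape of the proposed overloads; they are not the .NET API.

```python
import uuid

SENTINEL = uuid.UUID("00010203-0405-0607-0809-0A0B0C0D0E0F")


def write(u: uuid.UUID, is_big_endian: bool) -> bytes:
    """Hypothetical serializer: pick the byte order explicitly."""
    return u.bytes if is_big_endian else u.bytes_le


def read(data: bytes, is_big_endian: bool) -> uuid.UUID:
    """Hypothetical deserializer: must use the same byte order as the writer."""
    return uuid.UUID(bytes=data) if is_big_endian else uuid.UUID(bytes_le=data)


def round_trips(writer_big_endian: bool, reader_big_endian: bool) -> bool:
    """True when the string form survives a write followed by a read."""
    output = read(write(SENTINEL, writer_big_endian), reader_big_endian)
    return str(SENTINEL) == str(output)
```

Because every byte of the sentinel is distinct, any disagreement between the two sides changes the printed string, so `round_trips(True, False)` and `round_trips(False, True)` both report failure while the matching combinations succeed.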

But introduction of these new APIs won't solve the problem. (The aforementioned problem being inconsistency between variant-1 and variant-2 compatible treatments of binary and string serialization).

There is no inconsistency with regards to string serialization. String serialization is deterministic regardless of field layout, regardless of endianness, etc.

There is user error today in the use of the binary serialization APIs because some users don't realize that ToByteArray() and new Guid(byte[]) exclusively expect the bytes to be in a little-endian format.

This is functionally no different from a developer using BinaryReader.ReadInt32 on the byte sequence of a network packet and now complaining that value.ToString() is "inconsistent". It isn't, the developer used the wrong API and didn't account for the fact that BinaryReader reads using LittleEndian but network packets are transmitted as BigEndian.

The fix is for the user to use the correct BinaryPrimitives API and to explicitly read the Int32 as BigEndian such that the read byte is now correct for the little endian machine they are running against.
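The same mistake can be reproduced with a plain 32-bit integer. This sketch uses Python's `struct` module rather than BinaryReader or BinaryPrimitives, and the four-byte `packet` is made-up example data.

```python
import struct

# Made-up payload: the value 256 encoded in network order (big-endian).
packet = bytes([0x00, 0x00, 0x01, 0x00])

# Reading with the wrong byte order silently yields a different number.
wrong = struct.unpack("<i", packet)[0]  # little-endian read -> 65536

# Reading with the byte order the sender actually used recovers the value.
right = struct.unpack(">i", packet)[0]  # big-endian read -> 256
```

Neither read fails; the only signal that the wrong API was chosen is that the resulting value is not the one the sender meant.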

But 'ordinary' enterprise programmer rarely interacts with serialization boundary directly - it's handled by 1st and 3rd party libraries for serialization/model binding/ object-relational mapping

Exposing Uuid is directly surfacing that serialization difference in the type system where instead it should in fact be hidden and only handled by the serialization boundaries by correctly passing isBigEndian based on the requirements of the specification they are interoperating with.

If the system requires a variant 1 UUID, they use isBigEndian: true. If the system requires a variant 2 UUID, they use isBigEndian: false

Ultimately which is used doesn't really matter provided that both the producer and consumer agree on which is being used. In the majority case this isn't something to surface to the user. In the case you have a general purpose serialization library, then it is something to surface much as it would be for int32

treat System.Guid in backwards compatible way, treat System.Uuid in variant-1 native way.

System.Guid is already being used, successfully, for both variant 1 and variant 2 scenarios. Users who are utilizing it for variant 1 currently have to roll their own equivalents to the methods proposed above.

Exposing a new System.Uuid type doesn't fix the general problem. It compounds the existing problem and there will be developers who take System.Uuid and utilize it for variant 2 as well. It will likely make the scenario even worse in practice due to UUID being the more cross platform/modern terminology. Developers will have to rationalize the subtle differences between these two types that use historically interchanged terms. They will have to rationalize that there is 20 years worth of code already utilizing System.Guid for both variant 1 and variant 2 scenarios. They will have to rationalize the interop and exchange between these two types. They will have to rationalize that the interchange of these types directly exposes a serialization concern to the type system. They will have to rationalize that converting between the types may introduce additional bugs.

A programmer can understand the difference between Uuid V1 vs V2, but it doesn't stop the stream of questions/bugs, regularly risen by QA, PM/PO, Analysts and Power Users. "Why the order is different in-app and in-db?"

The formatted string should never differ between the two systems. If it does, you have a bug. The formatted string would never differ between equivalent Guid and Uuid.

The byte sequence of a Uuid written using little-endian representation would be identical to the byte sequence of a Guid written using little-endian representation. The same is true for Uuid vs Guid written using big-endian representation.

The bugs occur because developers do not correctly account for the fact that they are reading a big-endian (network-order, variant 1, etc) ordered byte sequence. The exact same, but inverse, scenario would exist if we always serialized as big-endian. That is, there would be developers that pass in something that is in little-endian (variant 2) order and then be confused that the byte order as visualized by ToString is different from what they expected.

Part of this comes from us not having an API that allows them to trivially work with big-endian data. That is what this proposal covers exposing.

The other part comes from the existing APIs not clearly surfacing that there is potentially an endianness concern users should be aware of. That endianness concern will not be addressed by System.Uuid, it will likely only be compounded given the reasons above, particularly that UUID variant 2 is itself an endianness swapped byte sequence to UUID variant 1. It may be partially addressed by having the new overload but would likely only be truly addressed by obsoleting the current APIs and requiring that users always pass in the desired endianness instead.

A subtle but potential alternative would be to call these Guid.CreateFromVariant1(byte[]), Guid.CreateFromVariant2(byte[]), Guid.ToVariant1ByteArray(), and Guid.ToVariant2ByteArray(). That is, however, inconsistent with how we expose other APIs where the real concern is endianness and does raise potential user considerations around strictness and whether or not such APIs might validate it is a conforming variant 1/2 byte sequence.

tannergooding commented 1 year ago

As I see new API methods will just confuse GUID's API

Instead, the UUID type may be added to BCL along with MSDN documentation that explicitly states the problem UUID type solves.

This is not how we typically view such things in API review.

We do generally consider the concerns around additional overloads causing potential confusion. However, new overloads to existing APIs are most frequently considered significantly less confusing than a new but similar type.

We do also factor in the chance for user confusion. In the case of DateTime vs DateOnly and TimeOnly, the names make it fairly clear that one is a combination and the others are "only" the respective part.

In the case of Guid vs Uuid they are frequently and historically interchanged terms. We then have to account for the 20 years of history in which Guid has been used for both variant 1 and variant 2 purposes. We then also have to account for the fact that Uuid is itself not a term (generally or even in a domain specific context) that exclusively means variant 1. The official spec explicitly covers variant 0, variant 1, variant 2, and variant 3 UUIDs. The latest version of that spec largely covers variant 2 but it is not exclusive to it, and thus Uuid itself is equally as ambiguous as Guid and will take all the current concerns and then compound on them.

On the other hand, exposing ToByteArray(bool isBigEndian) and .ctor(byte[], bool isBigEndian) methods as proposed is not a new thing. We have successfully done this on a great number of types. We have successfully surfaced these endianness concerns for the various primitive types, we have surfaced them as part of the new generic math feature, we have surfaced them on types such as BigInteger and more.

So while some users are surfacing the potential concern about ambiguity, it does not at all match the concrete experience we have from other types that have already done this exact thing to great success. And it's worth noting that, according to API usage metrics (https://apisof.net/), such APIs are actually used by developers in a non-trivial number of projects, so it's not like we exposed them and simply no one uses them.

is much simpler just to fetch out of box methods ToString(), ToByteArray() etc. to produce similar sequence of bits in both, byte array and string representation of 16-bit number.

That is not how any other type provided by the BCL works and is not how types work in most other languages/ecosystems either.

Numeric strings are functionally always displayed in big endian format. On the other hand, most machines natively operate in little-endian format (there are relatively few exceptions such as IBM System z9) and thus the raw byte sequence in memory is swapped compared to the byte sequence displayed by string formatting for almost every type.
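A small sketch of that mismatch between the printed form and the stored bytes, using Python integers as a neutral example (the value `0x0A0B0C0D` is arbitrary):

```python
value = 0x0A0B0C0D

# Formatting prints the most significant digits first, i.e. "big endian" order.
text = format(value, "08x")

# The little-endian byte sequence stores the least significant byte first.
little = value.to_bytes(4, "little")

# Only the big-endian byte sequence reads the same as the printed string.
big = value.to_bytes(4, "big")

print(text)          # 0a0b0c0d
print(little.hex())  # 0d0c0b0a
print(big.hex())     # 0a0b0c0d
```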

tannergooding commented 1 year ago

How UUID v7 can be implemented correctly with the existing System.Guid and the new proposed API?

You pass isBigEndian: true. Network Order == Big Endian

As I've repeatedly detailed above, ToString() always displays data in big endian format with the most significant byte printed first. This is irrespective of the underlying field layout.

Likewise, System.Guid is always compared considering the most significant byte and thus 00000001-0000-0000-0000-000000000000 is less than 00000002-0000-0000-0000-000000000000 which is less than 10000000-0000-0000-0000-000000000000

The behavior of Guid vs the alternatively proposed Uuid would be identical for ToString, for Equals, for CompareTo, and every other API except for ToByteArray and new Uuid(byte[]) where the subtle difference is that Guid defaults to isBigEndian: false and Uuid defaults to isBigEndian: true.

We are not going to expose an entirely new type just to minorly differentiate a serialization/deserialization only concern. We would not ship a UInt32BigEndian just to guarantee a big endian field layout just because UInt32 is itself typically little-endian on most hardware.

So I would really like to hear how this proposal will help us build industry-graded applications with new UUID versions in mind

All the different variations of UUID are 16-byte/128-bit integers. They minorly differ in terms of how they should be serialized (that is what order the bytes should be emitted) and in some cases what values are expected for particular nibbles in the byte sequence to identify version and variant.
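For illustration, the version and variant nibbles can be read straight from the string form, without any reference to byte layout. This sketch uses Python's `uuid` module and the example UUID from RFC 4122; the index positions follow the 8-4-4-4-12 format, and the variable names are introduced only for this example.

```python
import uuid

u = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")  # example UUID from RFC 4122
text = str(u)

# "M" is the first hex digit of the third group: the version.
version_nibble = int(text[14], 16)

# "N" is the first hex digit of the fourth group: its top bits carry the variant.
variant_nibble = int(text[19], 16)

print(version_nibble)       # 1
print(bin(variant_nibble))  # 0b1010
print(u.version)            # 1
```

Python reports this value as `uuid.RFC_4122` because the top two bits of the N nibble are `10`. The nibbles are part of the 128-bit value itself, so they are the same however the value is later serialized.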

Today, Guid is already capable of supporting any different 16-byte sequence. The one real issue is that we don't have any APIs that make it trivial to serialize/deserialize the one alternative byte sequence that currently exists. That is, today we only make it easy to serialize/deserialize little-endian ordered data.

The new APIs make it easy to also serialize/deserialize big-endian ordered data (network order) and thus make it much simpler for developers to correctly handle UUIDv7/UUID variant 2/etc values at the serialization/deserialization boundary. They likewise make it trivial to continue working with the 20 years of existing types/APIs, many of which already use System.Guid for the same purpose.

The new APIs completely remove the ambiguity that a second type would introduce and the general interchange problems that would arise. They completely remove the consideration of whether existing APIs taking System.Guid for UUIDv7/variant 2/etc values would need to deprecate or obsolete their APIs.

Code that is already working and already doing the right thing continues to work and do the right thing. They can potentially simplify their own existing wrappers that are fixing up the endianness to simply use these new APIs. New or existing code that finds the byte order is not what they expected can now trivially use the new APIs to do the right thing.

aneteanetes commented 1 year ago

First of all: I urge you not to compare a data type that is unambiguous in many languages with an exaggerated representation of a particular case. By translating such comparisons, you divert our conversation from the point.

As for unambiguous behavior: "Except" is the opposite of "identical behavior." If you expose a new type, you can accurately describe its behavior. If you add flags, there will be additional cognitive load and ambiguity. And it's not just serialization. It's about compatibility with: data stores (not just RDBMS), other languages, interoperability with other languages, making language input easier for people with experience using the uuid type.

I just can't explain to a colleague (python) what a GUID is and how it differs from a uuid.

It seems to me that the current proposal can only distract us from creating applications with new versions of uuid, precisely because we will encounter type assignment ambiguities. Conversely, if we have two types guid and uuid, we can accurately separate them, understand the purpose of each separately and develop them in parallel, taking into account the needs of these particular types. Such an API does not make it easier for developers to process values, but only expands the field for decision-making, forcing each time to think about the ambiguity of design decisions 20 years ago.

I'm an average developer and I have no idea why guid supports any other byte sequence. If we talk about serialization, then there are a lot of ambiguities for me too, the absence of an API at all creates a situation where I have to think about things that I cannot know in advance. And by the way, as an average developer, I believe that the new API adds a few methods that can be added to a regular nuget package, I don't see the point of adding them to bcl like this.

If we talk about a new type as an alternative to this api (which is not correct), then creating a new type does not replace the old one in any way, but only adds new features. No one insists on marking the GUID as obsolete, it would be wrong, besides, there is a decision field in the GUID where it cannot be replaced. But the type is already overloaded with internal information, and the need to add new flags and methods only speaks to the need for a "new", unambiguous data type, which is in all other languages.

I would like to emphasize that the current interfaces work with the GUID type and do not require a transition to a new type. Scenarios are defined where the "correct" byte sequence needs to be used (and it's good that they are different, not specific, because that way we can see a whole group of needs, not just one): serialization, integration and disambiguation with other languages, and use of different types (guid for enterprise databases, uuid for small projects) for different databases.

In general, in any design, new flags and methods only add to the ambiguity. A type whose constructor needs to be passed multiple flags does not become unambiguous with the addition of a new flag. In addition to the fact that we separate the types of numbers into different bit depths to determine their size, they can also be stored in completely different ways!

P.S. I cannot but agree with the thesis about the working code. The code really works, and really does the right thing. The proposed API (which fits perfectly into the nuget package) will help solve the serialization problem, which has already been solved one way or another, but not because the problem is serialization, but because the problem is in the type. And a more elegant and simple solution seems to be replacing the type with a new one, rather than adding flags, methods, and nuget packages.

DaZombieKiller commented 1 year ago

I just can't explain to a colleague (python) what a GUID is and how it differs from a uuid.

There is no difference, that is one of the biggest reasons why introducing System.Uuid is a bad idea (imo). A GUID is a UUID, this is entirely about how you read/write it to a file or send/receive it over the network. Every other aspect is identical.

It's like introducing several different alternative types for int that cover all the different ways you can store one in a file, even though that's completely unrelated to how you actually use an int once you have one.

Szer commented 1 year ago

As I've repeatedly detailed above, ToString() always displays data in big endian format with the most significant byte printed first. This is irrespective of the underlying field layout.

If consistent behaviour could be achieved only through System.String traversal, that would leave .NET with suboptimal decisions in the core of the BCL.

Spec mentions that particular case as well

So, to get consistent UUID behavior do we need to always pass System.Guid as a string to DB even when the spec mentions that it really should be a byte sequence?

tannergooding commented 1 year ago

As for unambiguous behavior: "Except" is the opposite of "identical behavior." If you expose a new type, you can accurately describe its behavior. If you add flags, there will be additional cognitive load and ambiguity. And it's not just serialization. It's about compatibility with: data stores (not just RDBMS), other languages, interoperability with other languages, making language input easier for people with experience using the uuid type.

This is identical in consideration to UInt128. The behavior of an UInt128 is always the same, regardless of byte layout. 1 + 1 == 2, regardless of whether that is stored as 0x0000_0000_0000_0001, 0x0000_0000_0000_0000 or as 0x0000_0000_0000_0000, 0x0000_0000_0000_0001 or as some alternative sequence.

The only consideration is that when creating a UInt128 from a byte sequence, you have to account for whether the byte sequence is little-endian or big-endian. If you get it wrong, then you will have the inverse value from intended. That is, if you expected 1 you will instead get 18446744073709551616.
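A minimal sketch of what "getting it wrong" looks like for a 128-bit integer, using Python's arbitrary-precision `int` as a stand-in for UInt128. The figure quoted above, 18446744073709551616, is 2^64, which corresponds to swapping the two 64-bit halves as in the preceding example; reversing all 16 bytes gives 2^120 instead.

```python
# The value 1 serialized as 16 little-endian bytes: 01 00 00 ... 00
little_endian_one = (1).to_bytes(16, "little")

# Reading with the matching byte order recovers the value.
correct = int.from_bytes(little_endian_one, "little")

# Reading the same bytes as big-endian reverses all 16 bytes.
byte_reversed = int.from_bytes(little_endian_one, "big")

# Swapping only the upper and lower 64-bit halves of the value.
mask64 = (1 << 64) - 1
halves_swapped = ((correct & mask64) << 64) | (correct >> 64)

print(correct)         # 1
print(byte_reversed)   # 1329227995784915872903807060280344576
print(halves_swapped)  # 18446744073709551616
```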

We would never expose a UInt128BigEndian or a UInt128LittleEndian. This concept of creation from a byte sequence is a serialization concern best handled by relevant constructor or factory like methods.

In the same way, the proposal for Uuid is equivalent to an ask to expose GuidBigEndian because the current Guid acts as a GuidLittleEndian. It is a non-starter, it is not going to happen. It goes against the core Framework Design Guidelines and is inconsistent with the rest of the BCL. If you want this, you will need to roll your own third party package and deal with all the complex fallout that comes about as a result.

I just can't explain to a colleague (python) what a GUID is and how it differs from a uuid.

There is no difference. A GUID is a UUID and a UUID is a GUID, they are interchangeable terms. Literally from the RFC 4122 specification: "This specification defines a Uniform Resource Name namespace for UUIDs (Universally Unique IDentifier), also known as GUIDs (Globally Unique IDentifier)."

A UUID has several documented and well supported layouts. None of these layouts are exclusive to a particular operating system or platform. None of these layouts are the definitive source of truth or "one way" to do things.

The difference between the two primary layouts is that one is encoded in big endian format (variant 1) and the other is encoded as little endian format (variant 2).

The serialization APIs exposed by Guid today exclusively deal with variant 2. Some users need to deal with variant 1 layouts as well. This proposal exposes the APIs necessary so that users can trivially handle them.

I'm an average developer and I have no idea why guid supports any other byte sequence. If we talk about serialization, then there are a lot of ambiguities for me too, the absence of an API at all creates a situation where I have to think about things that I cannot know in advance

Handling endianness is a fundamental requirement of dealing with binary like serialization. You cannot perform networking, read binary encoded file formats, or other scenarios without taking it into account. This includes gifs, jpegs, other images, pdfs, PE, ELF, ZIP, TAR, GZ, ISO, and almost any other non ASCII/UTF8 based format. Even handling UTF-16 or UTF-32 based text requires taking endianness into account (that includes basic emoji processing).

And a more elegant and simple solution seems to be replacing the type with a new one

This is a fundamental disagreement between the alternative proposal and the relevant owners of these types in the BCL/API review members that have been consulted.

DaZombieKiller commented 1 year ago

So, to get consistent UUID behavior do we need to always pass System.Guid as string to DB even when spec mentions that it is really should be byte sequence?

The behavior of System.Guid is already consistent. ToByteArray is explicitly for serialization and will give you a binary UUID in little-endian format. This proposal adds the functionality you need to get it in big-endian instead.

ToString is explicitly intended for display and will give you a hexadecimal string with the values in big-endian order -- this is consistent with hexadecimal display of values in general. If you convert an int, long, etc to a hexadecimal string then you will get the same behavior.

System.Uuid would be exactly the same thing as System.Guid, except the default value for the isBigEndian parameter would be true instead of false. This just creates even more confusion because this is implicit behavior that should really be explicit.

tl;dr you seem to want big-endian bytes, so what you want is guid.ToByteArray(isBigEndian: true).

tannergooding commented 1 year ago

So, to get consistent UUID behavior do we need to always pass System.Guid as a string to DB even when the spec mentions that it really should be a byte sequence?

Using string is not "required" at all. Storing as a byte sequence simply requires agreement on the byte sequence layout between the producer and consumer. That's all.

The current issue presented is that Guid currently only allows trivially producing a little-endian formatted byte sequence and some workloads require instead a big-endian formatted byte sequence.

The alternative proposal was to introduce a new Uuid type to handle this serialization difference, which is not how we handle such scenarios for any other type in .NET

Even where other languages have been brought up, they typically only provide 1 behavior out of the box. For example, Java's UUID type behaves as big endian. They do not provide a little-endian type and simply expect users to manually fix their type up.

The fact that they only support big endian out of the box means that users who opened the Uuid proposal are happy, but there is an inverse camp of developers who need the little-endian format that have to fix things up themselves.

That is, Java has the "same problem", just inverted, and I strongly expect they would equally reject a proposal to expose a GUID or MSGUID type and would instead either say "fix up the values yourself" or expose similar helper APIs (as in this proposal) to produce a correctly ordered byte sequence that is compatible with the little-endian representation.

Szer commented 1 year ago

tl;dr you seem to want big-endian bytes, so what you want is guid.ToByteArray(isBigEndian: true).

I do want it in BE by default in a lot of cases. That creates a similar situation as with Tasks where 99% of the code should write .ConfigureAwait(false)

Simple language/BCL defaults should work for the majority of cases and lead to the pit of success. Adding such a flag in my very humble opinion leads to the pit of suffering (described below)

That is, Java has the "same problem", just inverted and I strongly expect they would equally reject a proposal to expose a GUID or MSGUID type and would instead either say "fix up the values yourself or they'd expose a similar helper APIs (as in this proposal) to produce a correctly ordered byte sequence that is compatible with the little-endian representation.

I'm definitely biased (I'm just a mere human), but in my world of BE services with a cloud tech stack, I only needed BE order (read: network order).

So as a consumer of .NET, I just want sane defaults which will keep me happy. What I would like to avoid - having to write some boilerplate all the time using System.Guid until the end of life (like .ConfigureAwait)

DaZombieKiller commented 1 year ago

That creates a similar situation as with Tasks where 99% of the code should write .ConfigureAwait(false)

.ConfigureAwait(false) is an extra method call though. Here you are just providing a bool to a method you are already calling, and that bool should always be provided regardless (because failure to use the right endianness will lead to all kinds of serious serialization problems -- this is something that ideally should be enforced or encouraged through obsoletion or an analyzer that warns when endianness is not specified).

tannergooding commented 1 year ago

I'm definitely biased (I'm just a mere human), but in my world of BE services with a cloud tech stack, I only needed BE order (read: network order).

I understand this, and it wouldn't be unreasonable to ask that we expose the API as new Guid(byte[], bool isLittleEndian) instead.

The issue and why that is unlikely is that no other type in the .NET BCL (that I'm aware of, certainly not the primitive and frequently serialized types) defaults to BE. Today, you must always be explicit when you want to read/write bytes as a specific endianness.

This comes about because we have two categories of APIs in .NET: those that operate assuming machine endianness (in which case it is typically little endian, unless you're in a specialized scenario targeting IBM System z9 or another BE system) and those that explicitly default to little endian (because most hardware itself is little endian). Guid falls into the latter camp today and assumes LE. We have resolved this issue on other such APIs by versioning them to allow passing in the appropriate endianness and having the default match what would happen if no parameter was passed in.

So as a consumer of .NET, I just want sane defaults which will keep me happy.

Right, and we do typically strive to achieve this. However, you must also consider that what is a sane default for you is not necessarily a sane default for another developer. You also need to consider that sometimes what is a good default can change over time.

ConfigureAwait(false) is a good example because you have cases where it is the correct default and you have just as many cases in other domains where it is the wrong default. When the type was initially designed, the input showed very heavy skew towards ConfigureAwait(true) being the better default. Over time, many developers found that the need to specify ConfigureAwait(false) was super prevalent in the domain they were operating in, potentially to the point that it would've been better to make it the default and require ConfigureAwait(true) to be what had to be explicitly specified instead.

Likewise for Guid, when it was initially designed 20 years ago it was very heavily oriented towards Windows and COM like scenarios, so much so that being little-endian made sense as the right default.

In today's more modern cross-platform world where UUID won out as the preferred of the two terms and where they are used in many more contexts than just COM like scenarios, there is also a heavy need for big endian serialization/deserialization support. It isn't clear if that is to the extent that defaulting to big endian would've been correct. Instead, I would guess we simply would've required the endianness always be specified instead as we do with Int32 or Int128.

As Zombie pointed out, there is a small additional difference in that ConfigureAwait(false) is something you have to specify while ConfigureAwait(true) is not. That is not the case for getting a byte[] out of a Guid. You must always call ToByteArray and the question is whether you specify true/false as the parameter to it. This is similar in that you must always call WriteInt32LittleEndian -or- WriteInt32BigEndian. -- In an ideal world, we'd remove the historical quirk that ToByteArray() where you don't specify a parameter exists at all. But, removing it is a non-starter. Marking it Obsolete might be possible but that is a source breaking change that will require additional discussion.

aneteanetes commented 1 year ago

And a more elegant and simple solution seems to be replacing the type with a new one, rather than adding flags, methods, and nuget packages.

A machine translation error crept in here!

I did not mean to replace the guid with the uuid, I meant instead of adding flags to the current type (expanding its responsibilities), add a new type that will solve other problems.

So, with this 'edit', I disagree with the 'fundamental' point: extending the current type like this is totally the wrong way. I think it's necessary to find another solution instead of flags and "extension" methods.

DaZombieKiller commented 1 year ago

expanding its responsibilities

The responsibilities of Guid are not being expanded. This is about the serialization of Guid, which is separate from Guid itself. That is why new APIs are being added on BinaryPrimitives instead -- the additional ToByteArray overloads are most likely for consistency because ToByteArray and TryWriteBytes already exist on Guid (even though they probably shouldn't).

sandersaares commented 1 year ago

To add some context about the value, in my experience 90%+ of Linux tools use "big endian" GUID serialization, so writing portable code can be a bit tricky. Some web/media standards such as anything DRM related use "key IDs" which are serialized in "big endian" format, which has caused many historical problems for online video and DRM service providers, especially as Microsoft's DRM server SDK is .NET based and so produces the "wrong" type of GUID if you are not careful.

Even more important than supporting it would be to even surface the fact that there are multiple serialization formats for GUIDs! I bet most people using the Guid class have no idea and are going to be in for a surprise when they finally have to integrate something that speaks Guid the other way around.

We have IPAddress.HostToNetworkOrder and we have BinaryPrimitives.XyzBigEndian for numbers because they have two serialization formats, and we should likewise have something similar for Guid as it also has two serialization formats.

tannergooding commented 1 year ago

To add some context about the value, in my experience 90%+ of Linux tools use "big endian" GUID serialization, so writing portable code can be a bit tricky. Some web/media standards such as anything DRM related use "key IDs" which are serialized in "big endian" format, which has caused many historical problems for online video and DRM service providers, especially as Microsoft's DRM server SDK is .NET based and so produces the "wrong" type of GUID if you are not careful.

Right. Since the current APIs only expose support for serializing/deserializing as "little endian", it is very easy to get wrong and can be non-obvious compared to other languages that opted for "big endian" to be the default instead. The same is true in those other languages when doing things such as COM interop (say working with Direct3D for games) since that expects little-endian by default.
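For reference, this is roughly what people have to hand-roll today to get the "big endian" byte sequence. It is only a sketch: ToBigEndianBytes is a hypothetical helper name, not a BCL API, and it assumes .NET 5 or later for Convert.ToHexString.

```csharp
using System;

// Hypothetical helper (not a BCL API): derive the "big endian" byte
// sequence from the little-endian one that Guid.ToByteArray() returns.
static byte[] ToBigEndianBytes(Guid value)
{
    byte[] bytes = value.ToByteArray();
    Array.Reverse(bytes, 0, 4); // first field, 32 bits
    Array.Reverse(bytes, 4, 2); // second field, 16 bits
    Array.Reverse(bytes, 6, 2); // third field, 16 bits
    return bytes;               // the last 8 bytes are identical in both orders
}

Guid id = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");
Console.WriteLine(Convert.ToHexString(id.ToByteArray()));     // 33221100554477668899AABBCCDDEEFF
Console.WriteLine(Convert.ToHexString(ToBigEndianBytes(id))); // 00112233445566778899AABBCCDDEEFF
```

Forgetting this step, or applying it twice, is exactly the kind of mistake that is easy to make and hard to spot.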

Exposing the functionality that allows the correct thing to happen is the first step. Updating the documentation and helping surface that users may want these other overloads or alternative ways of doing serialization is the next step.

From the discussions that I've had with others on the API review team, exposing a new type to handle this goes against the general design goals of the BCL and is inconsistent with how we in .NET handle this elsewhere. It is believed that exposing a new type in this case will simply compound the problem and introduce overall more confusion and bifurcation of the ecosystem.

We have had great success over the entire 20 year lifetime of .NET with the approach laid out in this proposal even as recently as new APIs in .NET 6/7 for core feature areas.

We have IPAddress.HostToNetworkOrder and we have BinaryPrimitives.XyzBigEndian for numbers because they have two serialization formats, and we should likewise have something similar for Guid as it also has two serialization formats.

Right, that's exactly what this proposal is doing. It is adding explicit Read/Write GuidBigEndian and GuidLittleEndian APIs on BinaryPrimitives and it is expanding the existing new Guid(byte[]), ToByteArray(), and TryWriteBytes() methods to take an explicit isBigEndian parameter. We could alternatively obsolete the APIs on Guid and push users towards explicitly using the BinaryPrimitives APIs, but that is a source breaking change and requires much more discussion. It also seems unnecessary to do given we have other APIs in .NET that are successfully used without issue that follow the same pattern and have ToByteArray(bool isBigEndian) and .ctor(byte[], bool isBigEndian) overloads.

As an additional piece of complexity, this also applies to the string form and should be equally supported there. ToBigEndianString() and ParseBigEndian() or similar, because the string form is just hex-encoded bytes - the same concern about the byte ordering in the serialization format applies in both cases.

Strings are always printed with the most significant byte first, there isn't an endianness concern here much as there isn't for Int32. The only way that the string would be "incorrect" is if you used the wrong API when reading the Guid from a byte sequence (e.g. you read as little endian when you should've read as big endian).
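The Int32 analogy can be sketched in a few lines (a minimal illustration using only existing BinaryPrimitives APIs; the four sample bytes are arbitrary). The text form of a value is fixed once the value is known; only the step that reads bytes has a byte order to get right or wrong:

```csharp
using System;
using System.Buffers.Binary;

byte[] wire = { 0x00, 0x11, 0x22, 0x33 }; // assume a big-endian producer wrote these

int correct = BinaryPrimitives.ReadInt32BigEndian(wire);
int wrong = BinaryPrimitives.ReadInt32LittleEndian(wire);

// ToString always prints the most significant digits first.
Console.WriteLine(correct.ToString("x8")); // 00112233
Console.WriteLine(wrong.ToString("x8"));   // 33221100
```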

Kirill-Maurin commented 1 year ago

I am confused by one thing. The critics refer to their practical experience of using Guid as a UUID, but the proponents, apart from @sandersaares, do not refer to similar experience of their own.

aloraman commented 1 year ago

This is really a non-argument. The same "concern" exists for serializing any primitive type to the network...

There is no data to lose here. The serialization process already includes all bytes...

Yes, the concern exists, no need to scare-quote it, but there's also a difference in magnitude. You need to know about source endianness to correctly handle binary data, that's true - but the textual representation (and therefore textual serialization) does not depend on it. However, there were issues with roundtripping floating point numbers through textual serialization - that was addressed in .NET Core 2.1 - with breaking changes galore. DateTime.Kind has issues with both binary and textual representation - so it is a constant source of roundtripping failures (just look at the EF Core repo and stackoverflow.com, there are plenty of "why was my DateTime written as Utc but read as Unspecified" questions). Guid does not lose bytes of data when serialized. But there's a chance of endianness flipping when switching between binary and textual representations, and that's a blessing and a curse - there's no way to preserve the initial endianness of the source data, but you are fine as long as you have an even number of endianness flips along the chain of transformations.

There are no real quirks or inconsistencies in Guid.

Yes, there are. For starters, there's a problem of terminology, both GUID and UUID can refer to the same thing, and yet they both can also refer to Microsoft-specific or Everywhere-Else*-specific implementation details.

Then, there's a question of compatibility with the current version of RFC-4122, and questionable compatibility with future drafts of RFC-4122. Despite the fact that .NET is said to default to Variant 2, and it does in fact have Variant 2 layout - Guid.NewGuid() produces values that claim to be Variant 1 Version 4 UUIDs!
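This is easy to check. A minimal sketch follows, where VersionDigit and VariantDigit are hypothetical helper names (not BCL APIs) that read the M and N digits straight from the text form, so no byte-order question arises. The result for Guid.NewGuid() is what I would expect on current .NET implementations, not something the type itself guarantees:

```csharp
using System;

// Hypothetical helpers: in the "D" format xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
// the version digit M is at index 14 and the variant digit N is at index 19.
static int VersionDigit(Guid g) => Convert.ToInt32(g.ToString("D").Substring(14, 1), 16);
static int VariantDigit(Guid g) => Convert.ToInt32(g.ToString("D").Substring(19, 1), 16);

Guid fresh = Guid.NewGuid();
bool isVersion4 = VersionDigit(fresh) == 4;
// Variant 1 means the top bits of N are 10x, i.e. N is 8, 9, a, or b.
bool isVariant10x = (VariantDigit(fresh) & 0b1100) == 0b1000;
Console.WriteLine($"version 4: {isVersion4}, variant 10x: {isVariant10x}");
```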

Then, there's all that "mixed-endianness debacle". Few enterprise programmers know about endianness at all, but even fewer know that Guids are actually little-endian. Why? Because even though actual elements are all little endian (33221100-5544-7766-88-99-aa-bb-cc-dd-ee-ff), the hyphenated format produces the illusion of mixed endianness (33221100-5544-7766-8899-aabbccddeeff). Then again, originally last six bytes were a MAC address, which is big-endian, so mixed-endian is not entirely incorrect. By the way, no uppercase letters in UUID, that's a GUID-only thing.

Then, there are real world usage scenarios. You can construct a Guid/Uuid from any compatible string or set of 16 bytes - ignoring all format limitations, making it just a 128-bit identifier with a specific text representation. If you can compress Airline Prefix, Flight Number, Airwaybill Prefix/Number and date and time of landing into 16 bytes - then you can use the Guid/Uuid format and call it a Guid/Uuid - nobody will care about the Variant/Version number anyway.

picking the wrong one of their types/methods... the developer used the wrong API... If you get it wrong... so produces the "wrong" type of GUID if you are not careful... it is very easy to get wrong... if you used the wrong API... If it does, you have a bug...

And that's the crux of the problem. It's very easy to "get it wrong". The current API is error prone. When you are limited to MSFT-only stacks, it's not a problem. But when you have a heterogeneous stack - the situation is different. You mostly never interact with the serialization boundary yourself; the application framework or 3rd party libraries do it for you. In some cases, the 3rd party libraries do a good job and hide the difference from you, so you're mostly fine - as long as nobody made a mistake in other parts of the solution. Some other libraries don't do so good a job, or provide multiple knobs for you to configure (and hope not to make a mistake) - that leads to frustration and bugs, and wrongs, and lions and tigers and bears. For example:

MySqlConnector requires you to provide a correct composition of OldGuids (two values) and GuidFormat (seven values) configuration knobs.
Azure CosmosDB for MongoDB also provides knobs for you to configure. Which almost always ends in "GuidRepresentation Standard is only valid with subType UuidStandard, not with subType UuidLegacy" error for a new user.

A major drawback of this proposal is that it adds more knobs to an existing API. And if we've learned anything from cryptographic APIs, it's that the more knobs you have, the more ways you can fail. Also, if some library currently has an endianness-flipping bug and that bug gets fixed, it can lead to more bug reports from end users, who will now have an odd number of endianness flips in their stack where they had an even number before. Also, adding newer APIs doesn't force anyone to use them - so it won't make it easy to get things right. Simple obsoletion won't suffice because people tend to ignore it (e.g., AppDomain.GetCurrentThreadId() was obsoleted around 2005, yet there are still 900+ usages on GitHub, even in MSFT libraries, even for .NET Core, where it has breaking changes and actually serves no purpose. I wish to propose at least a fix for the breaking change, but I'm just scared of the API review process). And real obsoletion, with AppContext switches and codebase deprecation, will either force a break of .NET Standard compatibility or require a full-blown backport of API changes - which will probably never happen.

In the case of Guid vs Uuid they are frequently and historically interchanged terms...

Yes, they are interchangeable in everyday speech, just as people still refer to Dictionaries as Hashtables or Maps. But in the code almost every other programming platform refers to them exclusively as UUIDs in the API and uses GUID/LE-variant to indicate a MSFT-compatible case. By the way, all that potential confusion between Guid and Uuid never stopped Microsoft from shipping both guidgen and uuidgen tools in the same SDK.

Yes, .NET never tended to provide essentially the same type under two different names with minor quirks. But "never" often turns out to become "eventually". .NET will never work natively on Linux, .NET will never support full AOT, JIT will never have interpreter-mode, JIT will never have tiered-compilation, et cetera... And, in a way, System.Int32 and System.UInt32 are the same residue class modulo 2^32, just with a different subset of operations exposed (more so at bytecode level). XmlDocument and XDocument are essentially the same thing. So having both UUID and GUID in the BCL will be a thing unheard of, but not that unheard of.

In conclusion, I'd be glad if API review team will seriously consider not only drawbacks of separate System.Uuid type, but also look at advantages:

Clean slate. It will be possible to design a new type for UUID without a burden of compatibility dragging it down.
Fewer knobs to implement - less error prone API
Compatibility with other programming languages - simplify a training curve for newcomers

Also, I would ask to look not only at the current implementation of Guid in .NET/Uuid in other programming languages/current iteration of a spec (RFC-4122), but also at newer drafts and implementation proposals

vanbukin commented 1 year ago

@tannergooding

We could alternatively obsolete the APIs on Guid and push users towards explicitly using the BinaryPrimitives APIs, but that is a source breaking change and requires much more discussion.

I would like to bring up a point for further discussion. If we divide users into two groups: 1 - those who use Guid with WinAPI and COM for which Guid was originally made, 2 - those who use it as a container for Uuid. Due to the care for group 2, this breaking change brings a negative user experience to group 1. This will make them unhappy, as they are legitimate users of this structure and API, which were created precisely for their use cases and have been working without breaking for 20 years. If this breaking change is not made, the user experience of group 2 will not improve.

A separate data type allows addressing the issues of group 2 without affecting the user experience of group 1.

tannergooding commented 1 year ago

Let's start by explicitly covering what is being asked for as a patch. To expose System.Uuid is simple:

Copy Guid.cs and name it Uuid.cs
Find/replace in the file Guid with Uuid and various case sensitive alternatives (GUID->UUID, guid->uuid, etc)

Apply what is functionally the following patch


@@ -16,12 +16,12 @@
-        public Guid(ReadOnlySpan<byte> b)
+        public Uuid(ReadOnlySpan<byte> b)
         {
             if (b.Length != 16)
             {
                 ThrowArgumentException();
             }

-            if (BitConverter.IsLittleEndian)
+            if (!BitConverter.IsLittleEndian)
             {
                 this = MemoryMarshal.Read<Guid>(b);
                 return;
             }

-            // slower path for BigEndian:
+            // slower path for LittleEndian:
             _k = b[15];  // hoist bounds checks
-            _a = BinaryPrimitives.ReadInt32LittleEndian(b);
-            _b = BinaryPrimitives.ReadInt16LittleEndian(b.Slice(4));
-            _c = BinaryPrimitives.ReadInt16LittleEndian(b.Slice(6));
+            _a = BinaryPrimitives.ReadInt32BigEndian(b);
+            _b = BinaryPrimitives.ReadInt16BigEndian(b.Slice(4));
+            _c = BinaryPrimitives.ReadInt16BigEndian(b.Slice(6));
@@ -844,13 +844,13 @@ private static ReadOnlySpan<char> EatAllWhitespace(ReadOnlySpan<char> str)
         public byte[] ToByteArray()
         {
             var g = new byte[16];
-            if (BitConverter.IsLittleEndian)
+            if (!BitConverter.IsLittleEndian)
             {
-                MemoryMarshal.TryWrite(g, ref Unsafe.AsRef(in this));
+                MemoryMarshal.TryWrite(g, ref Unsafe.AsRef(in this));
             }
@@ -861,7 +861,7 @@ public byte[] ToByteArray()
         // Returns whether bytes are successfully written to given span.
         public bool TryWriteBytes(Span<byte> destination)
         {
-            if (BitConverter.IsLittleEndian)
+            if (!BitConverter.IsLittleEndian)
             {
                 return MemoryMarshal.TryWrite(destination, ref Unsafe.AsRef(in this));
             }
@@ -871,9 +871,9 @@ public bool TryWriteBytes(Span<byte> destination)
-            BinaryPrimitives.WriteInt32LittleEndian(destination, _a);
-            BinaryPrimitives.WriteInt16LittleEndian(destination.Slice(4), _b);
-            BinaryPrimitives.WriteInt16LittleEndian(destination.Slice(6), _c);
+            BinaryPrimitives.WriteInt32BigEndian(destination, _a);
+            BinaryPrimitives.WriteInt16BigEndian(destination.Slice(4), _b);
+            BinaryPrimitives.WriteInt16BigEndian(destination.Slice(6), _c);

This is the minimum amount of changes required to achieve exactly what developers are asking for. Given the layout would be guaranteed to be "big endian", there are of course some other tweaks that could be made to the source code to "simplify" things, change the field layout to be more "efficient", etc. However, those are implementation details that the end user cannot observe without reflecting over the private field state of the type, and so they are unimportant/irrelevant to the discussion.

Thus, this 12 line change is the functionality difference between the existing System.Guid and the proposed System.Uuid type.
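To show how small that difference is from the outside, here is a sketch of the big-endian read written against the existing public surface. ReadGuidBigEndian is a hypothetical helper name, not a BCL API, and the sketch assumes .NET 5 or later:

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper (not a BCL API): interpret 16 bytes as a "big endian"
// GUID/UUID using the same three BinaryPrimitives reads as the patch above.
static Guid ReadGuidBigEndian(ReadOnlySpan<byte> source)
{
    if (source.Length != 16)
    {
        throw new ArgumentException("Expected exactly 16 bytes.", nameof(source));
    }

    int a = BinaryPrimitives.ReadInt32BigEndian(source);
    short b = BinaryPrimitives.ReadInt16BigEndian(source.Slice(4));
    short c = BinaryPrimitives.ReadInt16BigEndian(source.Slice(6));
    return new Guid(a, b, c, source.Slice(8, 8).ToArray()); // last 8 bytes copied as-is
}

byte[] wire = Convert.FromHexString("00112233445566778899AABBCCDDEEFF");
Console.WriteLine(ReadGuidBigEndian(wire)); // 00112233-4455-6677-8899-aabbccddeeff
Console.WriteLine(new Guid(wire));          // 33221100-5544-7766-8899-aabbccddeeff
```

Both calls produce an ordinary Guid; the only thing that differs is how the first 8 bytes were interpreted on the way in.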

Next, let's touch a little bit on the conceptual difference between this proposal and the proposal to expose System.Uuid and how the API review process works.

At a high level, both allow developers to achieve success. Each allows developers to work with any type of GUID/UUID. The only difference is in "how" developers interact with things:

In this proposal developers pass in a parameter indicating how the raw byte sequence should be interpreted.
In the System.Uuid proposal the interpretation of the raw byte sequence is functionally encoded into the type system as part of the name instead.

There are then of course other ways this could be supported and other names as well. We could expose static Guid ReadVariant1(byte[]) and static Guid ReadVariant2(byte[]), we could expose struct UuidVariant1 { } and struct UuidVariant2 { }, and so on. We're programmers and we have a nearly unlimited number of ways that this support could be achieved. We simply need to pick what we believe is the best way to achieve that. Not all ways are as good as others, not all ways will necessarily be consistent with other languages, or even necessarily with how .NET typically does things itself. There are tons of tradeoffs that have to be considered overall.

API review's responsibility is to take the proposals from users, to ensure that they meet the Framework Design Guidelines, are generally consistent with other APIs we expose in .NET or the area under which they're being exposed if they are domain specific, and that they will cause the minimum amount of friction across the entire ecosystem and userbase.

Much of this is explicitly laid out in our docs: https://github.com/dotnet/runtime/blob/main/docs/project/api-review-process.md. We're currently on step 4, where the owner (i.e. myself) has made a decision and is trying to explain to the community why that decision was made. I also, notably, felt that this scenario was special enough that I jumped ahead and did a preliminary offline check to ensure my gut feeling was correct. That is, as indicated on the other thread, I already made sure to get a second opinion on the problem and proposed solution from other members of API review, as well as from others outside the API review space, and the consensus was that exposing a new type was not the right direction and was altogether inconsistent with how .NET does things.

The hardest part of API review is honestly the part I'm doing right now: trying to explain why a proposal is functionally "won't fix as proposed". There are an innumerable number of opinions on how to do things and there are developers that follow every paradigm imaginable. Some developers think that everything should be immutable/functional, some developers think that everything should be extensible or mockable, some developers think that the way Java or Swift or Rust or Go does their BCL is the "one true way". It all comes down to preference and there is really no "correct" answer. As area owner and API review member, it's simply my job to find the closest thing to "correct" for .NET.

I've already iterated many points above throughout the thread, but, I'll try to summarize them again here on why a new type is considered undesirable.

Guid vs Uuid is different from many of the other examples that have been called out because the functionality supported by the type doesn't change between them. The only proposed difference is in how the two types are created from or converted to a sequence of 16-bytes.

If you consider UInt32 vs Int32, they are the same number of bits, but how those stored bits are interpreted differs for every single operation they expose. The actual implementation of them is substantially different and they have unique code paths for everything (it isn't simply copy/paste the file, find/replace a name, and apply a 12 line patch). We don't, however, have Int32LittleEndian and Int32BigEndian types because there is no functional difference between the two concepts. The difference comes about in how the value is created from or converted to a sequence of 4 bytes, and that is handled today by named APIs instead, since it's a matter of type creation, not a matter of type behavior.

If you consider DateTime and its DateTimeKind field, you'll find that it likewise differs in that the DateTimeKind is explicitly used to cause different functionality to happen and there is literally a different code path that is picked based on the value of DateTime.Kind. There is validation that happens based on DateTime.Kind and there is a functional difference in the stored fields if the Kind is changed.

Now, that being said, the Uuid proposal could have gone that route. What is defined by RFC-4122 does of course technically have Variant and Version fields. They are the 3 bits representing N (Variant) and the 4 bits representing M (Version) in the sequence xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx.

However, the other proposal isn't to expose a System.Uuid type that strictly conforms to RFC 4122 and which only allows variant 1 UUIDs. The other proposal is not one where we actually assert that N only ever equals 0b10x nor to validate/conform to the Version field such that we could extract the time/version/random data/etc.

If we did go down that route, it would open the doors to other proposals asking for UuidVariant0, UuidVariant2, and UuidVariant3. We'd then need to change the name in the original proposal from Uuid to UuidVariant1 and consider the implications of whether Version also needs to be part of the type system, which would in the worst case expand to UuidVariant1Version0 through UuidVariant1Version5 and so on.

This quickly becomes obviously unnecessary and a general pain to deal with. It also unnecessarily limits the exposed types because while RFC 4122 does define a general interpretation of the data, in practice many tools don't truly conform and they really just treat UUID as a sequence of 128-bits that are stored as either little-endian or big-endian. -- I'd actually speculate that if UInt128 had been a more common type 20 years ago, many systems may have opted to use it instead since that was simply a better fit and they didn't actually want a 121-bit integer with 7-bits of information encoding a variant/version.

Now let's pretend this is a new language/ecosystem and users are giving feedback on how to best support UUID in today's world. When we break it down what is essentially being proposed in the other issue is that this new language/ecosystem expose UuidLittleEndian (Guid) and UuidBigEndian (Uuid).

Such a proposal would simply not happen. It would be shot down for any type it was proposed for because endianness is not a matter of function, it is a matter of creation, and therefore is relegated to a named method or parameter.

Since such a proposal would not happen we'd then continue iterating on how best to support it. Given it's "day 1" and there are no back-compat concerns, we'd look at the context of other languages, we'd look at what is most commonly used across all platforms, and we'd likely determine that variant 1 is the most common and so it is a better default. We'd also look at what we're doing for other types and we'd likely have the discussion that having an implicit endianness behavior at all is confusing and will lead to downstream user error, so we should require it to always be explicitly specified. So we'd end up with Uuid(byte[], bool isBigEndian)/ToByteArray(bool isBigEndian) -or- ReadUuidLittleEndian/ReadUuidBigEndian and WriteUuidLittleEndian/WriteUuidBigEndian.

We might've had some discussion on whether the parameter should have been named isLittleEndian or isBigEndian. It would've come down to consistency with other types. Most machines are little-endian while many file formats and networking itself is mostly big-endian. There would likely be some heated debates on the "right way".

We might've even had some discussion on whether we should've used the terms Variant1 or Variant2 instead. We likely would've rejected this for the same reason that exposing UuidVariant1 and UuidVariant2 would've been rejected in that we aren't actually variant1/variant2. That is we aren't doing validation on the inputs and throwing on mismatch if N isn't exactly 0b10x (variant 1) or 0b110 (variant 2). That means using the terminology may cause unnecessary confusion or false expectations around how the type behaves/operates.

So, what we end up with is what is being proposed in this issue. We end up with a single type and named methods that handle the endianness difference that comes up only for serialization/deserialization of a sequence of bytes.

Now, we have to snap back and understand that this isn't day 1 in a new language/ecosystem and we do have 20 years of back-compat to consider. This means we can't end up with a "perfect" solution because breaking existing users is almost always worse.

There are many different types of breaking changes ranging from "binary" to "source" to "behavioral":

binary - attempting to run your existing binaries now fails, typically due to a removed type or API
source - attempting to compile your existing code now fails, typically due to new warnings or errors
behavioral - attempting to run your code (whether an existing or new binary) may result in different behavior than before

Each of these has different implications as to the general impact. binary is the worst and the one we try to avoid the most. source is dependent and, while we try to avoid it, it's often the easiest for developers to handle because it surfaces when you try to recompile. behavioral is the one we make the most, and it can come about from anything from exposing new APIs that cause a change in overload resolution to cases like the floating-point formatting changes that happened in .NET Core 2.1.

For the behavioral changes, we do try to carefully consider the implications of such a change and how it will impact existing code. Most often they happen due to fixing a bug and moving towards compliance for a given spec. For the floating-point changes, they were done because .NET was performing a lossy conversion and was losing data for some floating-point values such that they would not roundtrip back through Parse and produce the same value.

For Guid vs Uuid there is no loss in data to justify a behavioral break. There is no implementation difference for the core APIs on the type, nor is there a required layout change for the fields of the type. The change comes about because, as part of creating the type from a byte[] or creating a byte[] from the type, there is a need to swap the bytes, and thus it is a concern of endianness. It is not mixed endianness because UUID is not a single field. Even variant 1 is formally described as the following in section 4.1.2, Layout and Byte Order:

private uint time_low;
private ushort time_mid;
private ushort time_hi_and_version;
private byte clk_seq_hi_res;
private byte clk_seq_low;
private fixed byte node[6]; // this is technically described as a 48-bit integer

Thus, the difference between variant 1 (System.Uuid) and variant 2 (System.Guid) is simply whether time_low, time_mid, and time_hi_and_version are stored in big-endian (variant 1) or little-endian (variant 2) format. There is no mixed endianness involved, there is no mixed endianness considerations, etc.

This means that we don't have an argument under which to justify a break and we don't have an argument under which a new type should be exposed, as the difference is only one of creating a UUID from a raw sequence of bytes and how those bytes are interpreted. Specifically, it is the 12 line patch at the top of this reply using Read/WriteBigEndian rather than Read/WriteLittleEndian, and it is how we would apply the same logic to any other type in the entirety of the BCL.

The last point I want to touch on is that exposing Uuid is not a conflict free change.

Even if GUID is most often used to refer to "microsoft uuid" and UUID is most often used to refer to "network order uuid", that isn't exclusively the case and there is still quite a lot of overlap. Having both terms side by side will introduce confusion for users.

Second, System.Guid has been around for 20 years and has been used for both scenarios. There are users today that are using it to hold variant 1 and users using it to hold variant 2. There are even users that have been using it to simply hold arbitrary sequence of 128-bits.

If System.Uuid is exposed there will be APIs and scenarios where Uuid and Guid need to interact. There will be scenarios where the types need to be converted to each other. There will be users who get the endianness wrong here as well. There will be users who do the wrong number of conversions. There will be users who serialize as one and deserialize as the other. There will be users who get this wrong just as there are users who are already getting Guid wrong today.

Due to this, adding System.Uuid becomes a strictly worse option because:

It preserves every single issue and problematic scenario that users can already hit with System.Guid
It compounds on the issues by there now being two very similar types that only differ in how they read/write a sequence of bytes
It compounds on the issues by it now being said to be the thing to use for variant 1 scenarios when there is 20 years of code already successfully using Guid for the same
It compounds on the issues by requiring developers to now consider that two types exist and which is the right one to use in the total context of what users expect, what is correct, what they need to interoperate with, etc

System.Guid having a new overload to an existing API doesn't introduce these additional problems. It doesn't break existing code, it doesn't introduce the chance for more cases of invalid conversions, etc. It simply gives users the ability to make their own behavioral change if that is right for their scenario. Many of those users will have no behavioral change at all because they were already swapping the bytes to be correct themselves and will just be able to simplify their code instead.

The one scenario that this proposal to add new overloads to System.Guid introduces is that if a producer decides to switch to using isBigEndian: true where a consumer has not made the corresponding change, the consumer will get the wrong value. However, the same issue exists in the System.Uuid proposal where the producer switches to Uuid but the consumer is still using Guid (and vice versa for each; that is if the consumer switches before the producer switches). Thus, there are no points of failure that are unique to new overloads on System.Guid but there are many additional points of failure that are unique to introducing System.Uuid
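The producer/consumer mismatch can be sketched with existing APIs only. Flip is a hypothetical helper name standing in for "one side changed its byte order", whether that happens through a new overload or through a new type:

```csharp
using System;

// Hypothetical helper: one "endianness flip" of a 16-byte GUID/UUID sequence.
static byte[] Flip(byte[] bytes)
{
    byte[] copy = (byte[])bytes.Clone();
    Array.Reverse(copy, 0, 4); // first field, 32 bits
    Array.Reverse(copy, 4, 2); // second field, 16 bits
    Array.Reverse(copy, 6, 2); // third field, 16 bits
    return copy;
}

Guid original = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");

// The producer has switched to big-endian output; the consumer still reads little-endian.
byte[] onTheWire = Flip(original.ToByteArray());
Guid mismatched = new Guid(onTheWire);
Console.WriteLine(mismatched); // 33221100-5544-7766-8899-aabbccddeeff

// Once the consumer makes the corresponding change, the value roundtrips again.
Guid matched = new Guid(Flip(onTheWire));
Console.WriteLine(matched == original); // True
```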

We have effectively 2 options here:

We do nothing, we leave the world as it is
We do this proposal to make it easier for developers who need to read/write a byte sequence in big-endian format to do so

Exposing a System.Uuid type is effectively a non-starter and would need a massive amount of justification to prove that it is worth the compounding of issues given above and to actively refute the points given above in a way that justifies exposing what is effectively UuidLittleEndian and UuidBigEndian in the same framework.

If users truly feel that exposing those two such types (just with different names) is the "right choice" for them, then they should feel empowered to roll their own package, publish it to NuGet, and maintain it. It is entirely possible that API review is "wrong" here and it is entirely possible that exposing this alternative type is the "right choice" for some users. We are only human after all. If that happens, we can always revisit the decision later when there is more data showing that is the case. But until such a time, we will not be exposing such a type in the BCL. That decision is being influenced by 20 years of .NET history and many more years of general history in programming and various other languages/ecosystems spread across a number of very experienced engineers that do this as their daily job and have a massive amount of context and insight into the general problem space to help justify it.

rhuijben commented 1 year ago

If users truly feel that exposing those two such types (just with different names) is the "right choice" for them, then they should feel empowered to roll their own package, publish it to NuGet, and maintain it. It is entirely possible that API review is "wrong" here and it is entirely possible that exposing this alternative type is the "right choice" for some users. We are only human after all. If that happens, we can always revisit the decision later when there is more data showing that is the case. But until such a time, we will not be exposing such a type in the BCL. That decision is being influenced by 20 years of .NET history and many more years of general history in programming and various other languages/ecosystems spread across a number of very experienced engineers that do this as their daily job and have a massive amount of context and insight into the general problem space to help justify it.

Great explanation on how the BCL library design works!

Not sure if it would work, but this answer deserves converting into a blog post!

vanbukin commented 1 year ago

@tannergooding

At a high level, both allow developers to achieve success. Each allows developers to work with any type of GUID/UUID. The only difference is in "how" developers interact with things:

In this proposal developers pass in a parameter indicating how the raw byte sequence should be interpreted.

In the System.Uuid proposal the interpretation of the raw byte sequence is functionally encoded into the type system as part of the name instead.

Let's consider "how" developers interact with things. If you are going to use Guid as a container for Uuid (variant 2), then you should:

Remember that part of the public API cannot be used directly for your needs without special adapter methods that properly prepare binary representation.
Know about the internal workings of the libraries that you use
If it concerns a database, check the correctness of the parameter value set in the ConnectionString (in addition to developers, spread this information among DevOps/SysOps/SecOps/Data engineers, and control how they work with it).
Use special methods to work with binary representation (The current proposal partially solves this problem and instead of using custom methods or NuGet packages, it will be possible to use methods from the BCL. However, it does not eliminate the need to use them)
Constantly monitor all of the above during code reviews
Make all of this a part of the onboarding process for teams working with code, configuration, or data.

If you want to ensure a reliable way of working with the technical aspects of using Guid as a container for Uuid variant 2, you should establish processes that do not allow people to make mistakes.

The real problem that needs to be solved is to make the daily routine of ordinary developers who deal with UUIDs easier.

As mentioned above by @Szer

So as a consumer of .NET, I just want sane defaults which will keep me happy.

In my opinion, what was described above cannot be classified as "sane defaults".

vanbukin commented 1 year ago

Guid vs Uuid is different from many of the other examples that have been called out because the functionality supported by the type doesn't change between them. The only proposed difference is in how the two types are created from or converted to a sequence of 16-bytes. If you consider UInt32 vs Int32 they are the same number of bits, but how those stored bits are interpreted differ for every single operation they expose.

(De)serialization is the only reason why Guid exists at all. And it constitutes a large part of its public API. And it is precisely this process that causes difficulties.

If Guid did not have a built-in API for working with binary data, there would be no problem at all. But it has existed since .NET Framework 1.0.

The actual implementation of them is substantially different and they have unique code paths for everything (it isn't simply copy/paste the file, find/replace a name, and apply a 12 line patch).

A single if statement cost Knight Capital Group $460 million. The removal of 'left-pad', which consisted of only 11 lines of code, broke half of the internet.

The patch illustrates well what the problem is, but appealing to the number of lines of code is not a good argument.

vanbukin commented 1 year ago

However, the other proposal isn't to expose a System.Uuid type that strictly conforms to RFC 4122 and which only allows variant 1 UUIDs.

But this is exactly the way in which the problem of the existence of Guid and Uuid in the Linux kernel has been solved. They have both of them.

If we look at the very first commit, we can see that originally these types were called uuid_be and uuid_le, and today they are referred to as uuid and guid.

tannergooding commented 1 year ago

Remember that part of the public API cannot be used directly for your needs without special adapter methods that properly prepare binary representation.

Most of those have nothing to do with binary serialization and behave identically regardless of whether you consider Guid or Uuid. The only ones that differ are the ones that take a raw byte sequence (e.g. byte[] or Span<byte>).

public Guid (uint a, ushort b, ushort c, byte d, byte e, byte f, byte g, byte h, byte i, byte j, byte k) exposes the raw fields laid out by the UUID spec and since it is the raw parts, not the raw bytes, endianness does not come into account. That is, it matches 1-to-1 with what is printed by ToString and new Guid(0x00010203, 0x0405, 0x0607, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F) prints 00010203-0405-0607-0809-0a0b0c0d0e0f on all platforms and in all scenarios because there is no endianness involved.
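A minimal sketch of the point above, assuming only the existing System.Guid surface. The class and method names (GuidFieldsExample, FromFields, DefaultBytesAsHex) are made up for illustration. Constructing from fields gives the same string everywhere, while the existing parameterless ToByteArray() emits the first three fields in little-endian order:

```csharp
using System;

public static class GuidFieldsExample
{
    // Builds the value from its logical fields; no raw byte sequence is involved.
    public static Guid FromFields() =>
        new Guid(0x00010203, 0x0405, 0x0607, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F);

    // Hex dump of the existing parameterless ToByteArray() output.
    public static string DefaultBytesAsHex() =>
        BitConverter.ToString(FromFields().ToByteArray());
}
```

Here FromFields().ToString() yields 00010203-0405-0607-0809-0a0b0c0d0e0f, while DefaultBytesAsHex() yields 03-02-01-00-05-04-07-06-08-09-0A-0B-0C-0D-0E-0F. Only the byte sequence differs from the printed form, not the value.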

The endianness concerns (and therefore the proposed difference between Uuid and Guid) only comes about when dealing with raw byte sequences.

Know about the internal workings of the libraries that you use

You have no need to understand the internals of the library. A Guid is deterministic and if you want 00010203-0405-0607-0809-0a0b0c0d0e0f there is only one way to represent that for the type, whether you are using Guid or Uuid. There are of course many different possible ways to lay out the fields, but that is an implementation detail that has no impact on the actual value represented by the Guid/Uuid. If you pass this into the library then they will only see this Guid. The only consideration comes about when you are serializing or deserializing this as a raw byte sequence, in which case it is not an internal implementation detail of the library. It becomes part of the public contract of that library so that other libraries can consume that output.

If it concerns a database, check the correctness of the parameter value set in the ConnectionString (in addition to developers, spread this information among DevOps/SysOps/SecOps/Data engineers, and control how they work with it).

The same consideration exists for Uuid vs Guid. You must understand how the raw bytes are stored in the database and then know which of Uuid or Guid is the right thing to use for that database. You must likewise then handle interoperability with APIs that take the opposite of what the database exposed.

Use special methods to work with binary representation (The current proposal partially solves this problem and instead of using custom methods or NuGet packages, it will be possible to use methods from the BCL. However, it does not eliminate the need to use them)

Right, which is the same as working with binary representation of any type in the entirety of .NET. You must understand the binary representation to correctly work with binary serialized data because the machine or language that did the serialization may not be the same as the machine that does the deserialization.

A type that does not expose a way to get its binary representation is then not safe to use for such a purpose as it may be copied incorrectly on some platforms.

Spec'ing the format of your binary serialized data is a fundamental requirement to working with binary data.

Constantly monitor all of the above during code reviews

Again, same exists with Guid vs Uuid except now you need to understand the relationship between the two types rather than only auditing the points of binary serialization/deserialization.

If you want to ensure a reliable way of working with the technical aspects of using Guid as a container for Uuid variant 2, you should establish processes that do not allow people to make mistakes.

This is spec'ing your serialization format and having tests covering that data correctly roundtrips. This is extremely trivial to do.
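As a rough illustration of what such a round-trip check could look like, here is a sketch that works without the proposed overloads. GuidBigEndianSketch and its members are hypothetical helpers, not BCL APIs. It assumes the documented layout of ToByteArray() (first three fields little-endian, last 8 bytes unchanged):

```csharp
using System;

public static class GuidBigEndianSketch
{
    // Hypothetical helper: reverses the first three fields (4 + 2 + 2 bytes).
    // The trailing 8 bytes are identical in both layouts and are left alone.
    private static void SwapFieldOrder(byte[] bytes)
    {
        Array.Reverse(bytes, 0, 4);
        Array.Reverse(bytes, 4, 2);
        Array.Reverse(bytes, 6, 2);
    }

    public static byte[] ToBigEndianBytes(Guid value)
    {
        byte[] bytes = value.ToByteArray(); // little-endian field order
        SwapFieldOrder(bytes);
        return bytes;
    }

    public static Guid FromBigEndianBytes(byte[] source)
    {
        if (source is null || source.Length != 16)
            throw new ArgumentException("Expected exactly 16 bytes.", nameof(source));

        byte[] copy = (byte[])source.Clone(); // do not mutate the caller's buffer
        SwapFieldOrder(copy);
        return new Guid(copy);
    }

    // The property a serialization test would assert for representative values.
    public static bool RoundTrips(Guid value) =>
        FromBigEndianBytes(ToBigEndianBytes(value)) == value;
}
```

A test suite would typically pin both the round trip and the exact bytes for at least one known value, so that a producer and consumer disagreeing on byte order fails loudly.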

The real problem that needs to be solved is to make the daily routine of ordinary developers who deal with UUIDs easier.

Exposing a secondary Uuid type does not make this easier for all the reasons laid out above and repeatedly throughout this and the other issue. They only serve to make the issue worse and overall more complicated.

We have many years of experience in this space both in and outside of .NET. We have significant reason and evidence to believe that the explicit Big/LittleEndian APIs are the right approach for this scenario and that they work for users across a range of types and scenarios.

(De)serialization is the only reason why Guid exists at all. And it constitutes a large part of its public API. And it is precisely this process that causes difficulties.

That is not true. Guid serves many purposes including as a general way to work with and handle the type. Most of those APIs are not about serialization/deserialization and are instead about creation (most constructors), user display (ToString), or handling user input (parse). The APIs that are explicitly for serialization/deserialization are Guid(byte[]) and ToByteArray() (with the Span<byte> overloads allowing the same in an allocation free manner, consistent with how we've handled span in much of the rest of .NET). You may be able to use other types for use in some serialization scenarios, but that is not strictly their intended or primary purpose.

If Guid did not have a built-in API for working with binary data, there would be no problem at all

The problem would still exist and be compounded because users could not get the binary data out of the Guid without relying on an implementation detail. The explicit APIs allow getting the binary data out in a well-defined manner (variant 2, little-endian) regardless of internal field layout. The new overloads proposed in this allow users to get the binary data out in an alternative well-defined format (variant 1, big-endian) that is suitable for the other common use-case.

A single if statement cost Knight Capital Group $460 million. The removal of 'left-pad', which consisted of only 11 lines of code, broke half of the internet.

Yes, small changes can have big impact and have large cost. The same can be true of large changes.

Taking historical data and evidence into account, particularly in the domain you're operating in, can be very important. The .NET team has done this for the Uuid proposal and remains steadfast that exposing it would be incorrect for .NET.

But this is exactly the way in which the problem of the existence of Guid and Uuid in the Linux kernel has been solved. They have both of them.

And that may work for Linux. It was already covered above that different developers have different opinions on what is "correct". The Linux kernel was designed very differently from the Windows kernel; but neither is necessarily "better". Each has their own strengths and weaknesses. Each has areas where developers feel the API surface could be better.

For .NET, exposing a UuidLittleEndian and UuidBigEndian type (regardless of actual name exposed) would be entirely inconsistent and the wrong choice for .NET.

PJB3005 commented 1 year ago

I don't know why people are even bringing up the database argument? Any application/toolkit that can go byte <-> string for a GUID has a canonical byte order for the left side, and wouldn't have any "keeping connection strings correct" troubles anywhere.

As an example, MySQL was brought up earlier as having 7 (!) options for how to serialize Guid. Yet MySQL 8.0 has UUID_TO_BIN and such now, so clearly MySQL has a canonical way to represent GUIDs/UUIDs by default. There is only one correct way that your database connector should serialize Guid by default, and it's whatever makes the equivalent string representation of the GUID/UUID round-trip correctly across the database.

As I understand it, MySQL 5.7 and earlier did not have the UUID binary functions I mentioned. I assume that means that before 8.0, MySQL didn't acknowledge UUIDs stored as byte blobs in a special way, and as such did not have a canonical byte layout for them.

It sounds to me that the options on MysqlConnector are not, in fact, due to .NET's Guid being a bit unorthodox. I assume there are at least 7 ways that UUIDs are stored in various MySQL database schemas around the world, and MysqlConnector is providing these properties to make it easier for programmers to integrate with any one of those. After all, if those properties were needed because of this exact ticket, wouldn't you only have like... exactly two options at most?

But this is exactly the way in which the problem of the existence of Guid and Uuid in the Linux kernel has been solved. They have both of them.

I'm not an expert on Linux, but my understanding is that the kernel doesn't have a stable internal API. It is much easier to make API decisions like this when you aren't going to be bogged down by mistakes for the next 20 years.

aloraman commented 1 year ago

Well, it seems the discussion still continues. Even though it was mostly fait accompli from the start. So, I guess I'll finalize my thoughts on the matter. I hope someone will actually read it through, not just glance around or roll their eyes in disdain. Sounds harsh, I know, but the larger discussion about this issue wears everyone down from all these never-ending arguments, pointless divisions and general lack of shared context - so, in hopes to get some closure, I'll grit my teeth and ignore all that, and share my finalized thoughts. All this story of UUID vs GUID, #86084 vs #86798 thematically can be separated into three essential parts:

The proposition of System.Uuid. It was proposed as a solution in the original issue (#86084) for specific problems with System.Guid, but it also touched on problems of 3rd party library development, and role of BCL in that. That is the part I'm most interested about.
General shape of System.Guid API surface, its error-proneness and rigidity, and ways to fix it. And here lies the place of most of the disagreements, arguable assumptions and misunderstandings. That is the part I'm most aggravated by.
The proposition of enhancing System.Guid API. The simplest of the three. This part is the most boring, the only interesting thing is how it affects other two parts.

Now, let's address each part individually. Naturally, in reverse order - from most boring and simple to most difficult.

Current API Proposal

namespace System.Buffers.Binary
{
    public static partial class BinaryPrimitives
    {
        public static Guid ReadGuidBigEndian(ReadOnlySpan<byte> source);
        public static Guid ReadGuidLittleEndian(ReadOnlySpan<byte> source);

        public static bool TryReadGuidBigEndian(ReadOnlySpan<byte> source, out Guid value);
        public static bool TryReadGuidLittleEndian(ReadOnlySpan<byte> source, out Guid value);

        public static bool TryWriteGuidBigEndian(Span<byte> destination, Guid value);
        public static bool TryWriteGuidLittleEndian(Span<byte> destination, Guid value);

        public static void WriteGuidBigEndian(Span<byte> destination, Guid value);
        public static void WriteGuidLittleEndian(Span<byte> destination, Guid value);
    }
}

This is fine, doesn't break anything, doesn't interfere with anything, follows naming conventions for BinaryPrimitives.

namespace System
{
    public partial struct Guid
    {
        public Guid(ReadOnlySpan<byte> value, bool isBigEndian);

        public byte[] ToByteArray(bool isBigEndian);

        public bool TryWriteBytes(Span<byte> destination, bool isBigEndian, out int bytesWritten);
    }
}

The bool isBigEndian parameter is questionable:

It refers to endianness. Endianness is not a thing an ordinary programmer encounters every day (Despite the contrary opinions, see second part), so it will be a source of confusion.
It is a bool parameter, which are generally not recommended

I'd prefer something akin to that:

namespace System
{
    public enum GuidByteOrder
    {
        Default, //little-endian
        Rfc4122 //big-endian
    }

    public partial struct Guid
    {
        public Guid(ReadOnlySpan<byte> value, GuidByteOrder byteOrder);

        public byte[] ToByteArray(GuidByteOrder byteOrder);

        public bool TryWriteBytes(Span<byte> destination, GuidByteOrder byteOrder, out int bytesWritten);
    }
}

I doubt the expansion of the API will once and for all solve all the problems with System.Guid - but that's out of scope of this proposal. Nevertheless, having additional knobs is always a good thing - but then again, that's the reason to have a discussion in the first place.

Malebolge

As Dante Alighieri descended into the 8th Circle of Hell, to the Evil Pits of Malebolge - so have we descended into this discussion, to the endless Pits of Despair.

The first pit is the API of System.Guid itself, the inconsistency between binary and textual (de)serialization. You see.... No, disregard that, I don't want to start another round of restatements of opinions.
The second pit is the history of GUIDs and UUIDs, different variants and versions, differences between stated and observed behavior, inconsistencies between GUID-first and UUID-first worlds
The third is the pit of past experiences. Too many times we have been burnt by GUID vs UUID, so the desire for a fix is understandable.
The next is the pit of disappointment. To witness your proposal be abruptly closed only to observe alternative proposition to be placed instead, without much of an explanation (initially), with the reasoning hidden behind closed doors, and with dry formal tone at that - that's disappointing to say the least.

And there's even more pits, even more despair, but I should stop somewhere, so I'd follow it with some propositions for the review team and other participants as well:

Open up the proposal review process, so we won't end up with "that was decided behind closed doors" or "that was explained in a private email" - add some context, keynotes.
Always question your own assumptions. Even if some CS grad student in US is definitely aware of endianness - that doesn't hold true for every programmer all over the world - and people seem to be genuinely surprised by that. Designing APIs with wrong assumptions leads to more pits of despair.
Aim for the Pit of Success - make an API such that it's easy to do right and difficult (but not impossible) to do wrong. The System.Guid's API is easy to do wrong, so people make mistakes when using it. You can't just say "just don't create bugs".

Sine Qua Non

And now we get to System.Uuid. First, imagine we don't have System.Guid, no .NET, no C#, just a Dirac Sea of Abstractions. If we read RFC-4122, the current version and fresh drafts, we can see there is a reference in some form to an "underlying 128 bit binary value". Now, imagine, we have a type for that:

type UUID = struct
   let bytes ...
end

That's just that, no versions, no variants, just raw state. A primitive container type. But it allows us to construct new types for specific versions and variants, with reinterpreting these bytes as specific set of fields. A set of refinements.

type UuidVersion = 
  | V1
  | V2
  | V3
  | V4

type UUID<UuidVersion> = UUID

System.Guid is a poor replacement for such a container - it is already a refined type. What the author of #86084 asked, and what I'm interested in - is just the container type. Just the storage (two ulong fields, sixteen byte field, one Vector128<byte> - every option can actually be encountered in the wild inside NuGet packages), minimum to no behavior. So it won't be that strawman copy of System.Guid with 12 lines changed - you can drop majority of GUID-specific parsing and formatting as well, also no need for the same field layout - it won't be the same between UUID v6/v7/v8 anyway. Why the interest? Let's dig through various GUID/UUID libraries on NuGet! We'll see attempts aplenty to support different versions and variants of UUID, support database-specific scenarios and other RFC-4122 incompatible ones, where Guid/Uuid is treated as just a UInt128 number. The majority of these still build all their refinements atop System.Guid - because it is the only container type available from the BCL. Obviously, all these projects are forced to hack around internal details of System.Guid to construct their refinements. Yet there is a minority of them who implemented their own UUID type. It is that case:

they should feel empowered to roll their own package, publish it to NuGet, and maintain it. It is entirely possible that API review is "wrong" here and it is entirely possible that exposing this alternative type is the "right choice" for some users. We are only human after all. If that happens, we can always revisit the decision later when there is more data showing that is the case

But there is a problem, a very large one - integration with other first and third-party libraries. You can always roll your own extensions to integrate - but it's not always possible. For example, ORM framework will allow you to use such UUID type as a property type, but it won't allow to specify a key property of that type, defeating all the purpose of such custom UUID type. The only other way is for the other library to take a hard dependency on UUID library - which is a no go. People aren't into taking hard dependencies anyway, unless the dependency is that good (see Newtonsoft.Json, NodaTime, StackExchange.Redis and so on). That's the basic idea why System.Uuid as a primitive container type in BCL is so desired - so every library in ecosystem can have the same foundation primitive type to build upon. Maybe it will be better to add System.Uuid, maybe it will be better to enhance support from BCL tooling so System.Guid can be such type, without the need to hack into its internals - I don't know, that's to be discussed further. But it's not just System.Guid with 12 lines changed.

The End. Roll Credits.

tannergooding commented 1 year ago

The bool isBigEndian parameter is questionable:

It does, however, match an existing convention we've already done on other types (such as BigInteger which faced a similar problem several years ago).

In an ideal world we would indeed not have this API at all and we'd only have the explicit named APIs on BinaryPrimitives or equivalent named APIs directly on Guid (as might be done for a type that cannot live in the BCL).

But, we're not in an ideal world. The ToByteArray() API exists and it is most likely the thing most users will see, so ensuring there is a valid overload to do what users require becomes desirable.

It is a bool parameter, which are generally not recommended

Bool parameters exist and are used in a plethora of places, we've introduced several this release cycle and generally introduce a few every release cycle. The decision to use an enum vs bool generally comes down to if there will only ever be two states or if we may need to expand to more states in the future.

In this case, we only have 2 states. The named enum values do not provide additional clarity and may in fact cause additional confusion on top for some users. It is likewise inconsistent since it's not "strictly conforming" to RFC-4122 (which has restrictions on the value and interpretation of some bits)

the inconsistency between binary and textual (de)serialization. You see.... No, disregard that, I don't want to start another round of restatements of opinions.

There is no inconsistency. Binary representation does not directly equate to represented value and the default state for most machines is that it strictly does not match. A 32-bit integer 1 is represented on most hardware as the byte pattern 0x01, 0x00, 0x00, 0x00 after all. There are relatively few pieces of hardware that natively operate as big-endian.

The second pit is the history of GUIDs and UUIDs, different variants and versions, differences between stated and observed behavior, inconsistencies between GUID-first and UUID-first worlds

GUID and UUID are interchangeable terms, so much so that the RFC-4122 spec explicitly calls out GUID as an alternative naming for the same thing in its introductory sentence.

There is a historical split around whether such types are little-endian (primarily Microsoft stacks) or big-endian (most other stacks). But, that doesn't change how the GUID functions or the value it represents. It only impacts binary serialization/deserialization of the values

The next is the pit of disappointment. To witness your proposal be abruptly closed only to observe alternative proposition to be placed instead, without much of an explanation (initially), with the reasoning hidden behind closed doors, and with dry formal tone at that - that's disappointing to say the least.

There was significant explanation of the topic and reasoning before closing. It was in no way abrupt.

The new proposal included a cross link and restatement of the general reasonings when it was opened and the prior was closed.

There has likewise been a continued explanation and deep dive into the why's and why nots.

Open up the proposal review process, so we won't end up with "that was decided behind closed doors" or "that was explained in a private email" - add some context, keynotes.

This has already been touched on multiple times. Our API review process is extremely open and extremely well documented. Every API that officially goes through review is live streamed and allows general involvement by the community. API review itself typically happens twice a week and we have a public schedule and ordered list covering this. Links to the process docs and other relevant information have been provided a few times.

The other API did not officially go to review as it is up to the area owner (i.e. myself) to make an initial determination of whether it is worth bringing up to API review in the first place (that is, does it even have a chance of passing API review).

My own view was that the other proposal would not pass API review. This view came about from having been on the broader .NET team for nearly 10 years; of being on API review for over 5 years now, of being a Senior Engineer on the .NET team and owning many complex areas in the lowest levels of the stack, and with my general knowledge and experience of the BCL APIs and how we do things.

Despite that view, I still felt it pertinent to get a secondary opinion before closing the issue down entirely. And so, as indicated, I reached out to several other API review members and general members of the .NET team. The consensus was completely in alignment with my initial view.

Always question your own assumptions. Even if some CS grad student in US is definitely aware of endianness - that doesn't hold true for every programmer all over the world - and people seem to be genuinely surprised by that. Designing APIs with wrong assumptions leads to more pits of despair.

The assumption here is not whether a typical programmer understands endianness. The assumption is that a developer who is doing binary serialization must have some understanding of endianness.

A developer who is doing binary serialization without understanding endianness is writing bugs. Those bugs may not occur in most scenarios, but they will be trivially hittable under the right conditions. The simplest is typically when running the same code on a big endian machine such as the IBM System z9 which Mono supports.
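To make that failure mode concrete, here is a small sketch (the EndiannessSketch class and its method names are made up for illustration). BitConverter.GetBytes follows the byte order of the machine it runs on, whereas the BinaryPrimitives writers always produce the order named in the method:

```csharp
using System;
using System.Buffers.Binary;

public static class EndiannessSketch
{
    // Machine-dependent: the result follows BitConverter.IsLittleEndian.
    public static byte[] WriteNative(uint value) => BitConverter.GetBytes(value);

    // Explicit: always least significant byte first.
    public static byte[] WriteLittleEndian(uint value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(buffer, value);
        return buffer;
    }

    // Explicit: always most significant byte first.
    public static byte[] WriteBigEndian(uint value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteUInt32BigEndian(buffer, value);
        return buffer;
    }
}
```

Code that persists WriteNative output works as long as every reader and writer shares the same byte order, and silently produces different bytes when one of them does not.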

Aim for the Pit of Success - make an API such that it's easy to do right and difficult (but not impossible) to do wrong. The System.Guid's API is easy to do wrong, so people make mistakes when using it. You can't just say "just don't create bugs".

System.Guid is not perfect and this is working to address some of that imperfection so that it becomes less problematic. Namely by doing what we know to already work in other scenarios and avoiding what we know does not work and does not fit into the general .NET design. It is strongly believed that exposing System.Uuid is its own pit of failure, with many more potential pitfalls.

Now, imagine, we have a type for that:

Such a type that is only 16 sequential bytes is already not compatible with RFC-4122. RFC-4122 explicitly defines itself as a sequence of named fields:

private uint time_low;
private ushort time_mid;
private ushort time_hi_and_version;
private byte clk_seq_hi_res;
private byte clk_seq_low;
private fixed byte node[6]; // this is technically described as a 48-bit integer

with reinterpreting these bytes as specific set of fields

Interpreting bytes as a set of fields requires understanding endianness. You cannot extract a uint from a byte[] without understanding endianness because the value 1 can be represented as either 0x01, 0x00, 0x00, 0x00 -or- 0x00, 0x00, 0x00, 0x01 depending on the platform you're running against. On an x86, x64, or Arm64 machine it will be the former.

Likewise, you must understand the format that the byte[] actually carries the state in because it may mismatch from the machine default. In the case of RFC-4122 where the default is big endian then it will be inverted from typical hardware.
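As a sketch of what "interpreting bytes as a set of fields" involves for an RFC-4122 (big-endian) byte sequence, the following reads each multi-byte field with an explicit byte order and then builds the Guid from fields. Rfc4122FieldReader and ReadBigEndian are hypothetical names, and the local variable names simply mirror the field list above:

```csharp
using System;
using System.Buffers.Binary;

public static class Rfc4122FieldReader
{
    public static Guid ReadBigEndian(ReadOnlySpan<byte> source)
    {
        if (source.Length < 16)
            throw new ArgumentException("Expected at least 16 bytes.", nameof(source));

        // Multi-byte fields need an explicit byte order to be read correctly.
        uint timeLow = BinaryPrimitives.ReadUInt32BigEndian(source);
        ushort timeMid = BinaryPrimitives.ReadUInt16BigEndian(source.Slice(4));
        ushort timeHiAndVersion = BinaryPrimitives.ReadUInt16BigEndian(source.Slice(6));

        // The remaining fields are single bytes, so no byte order applies.
        return new Guid(timeLow, timeMid, timeHiAndVersion,
            source[8], source[9], source[10], source[11],
            source[12], source[13], source[14], source[15]);
    }
}
```

Reading the same three fields with the LittleEndian variants instead would give the interpretation that the existing Guid(byte[]) constructor uses.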

Just the storage (two ulong fields, sixteen byte field, one Vector128<byte> - every option can actually be encountered in the wild inside NuGet packages), minimum to no behavior

Not all of these are created equal nor are all of them trivially valid. Some of them (Vector128<byte>) may change the default packing of the struct and impact the layout of other types for example.

So it won't be that strawman copy of System.Guid with 12 lines changed

This wasn't a strawman, it was explicitly stated to be the minimum set of viable changes to achieve what was being asked for. It was then likewise stated that additional changes could of course be made, but since they aren't required it doesn't impact the discussion because the same types of changes could in fact be made to System.Guid.

can drop majority of GUID-specific parsing and formatting as well

We would then just get requests for that support to be added back.

also no need for the same field layout - it won't be the same between UUID v6/v7/v8 anyway.

Per the RFC, the field layout is the same between variant 1 (0b10x) and variant 2 (0b110). The difference is the endianness of the uint32 and uint16 fields.
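For reference, the variant bit patterns mentioned above can be checked with a few masks on the clk_seq_hi_res byte. This is only a sketch: UuidVariantSketch and DescribeVariant are made-up names, and the code relies on byte 8 being a single-byte field, so it sits at the same position in both byte orders:

```csharp
using System;

public static class UuidVariantSketch
{
    public static string DescribeVariant(Guid value)
    {
        // Byte 8 is clk_seq_hi_res; its top bits carry the variant.
        byte b = value.ToByteArray()[8];

        if ((b & 0x80) == 0x00) return "variant 0 (0b0xx)";
        if ((b & 0xC0) == 0x80) return "variant 1 (0b10x)";
        if ((b & 0xE0) == 0xC0) return "variant 2 (0b110)";
        return "reserved (0b111)";
    }
}
```

Note that this only inspects the bits. It says nothing about which byte order a particular producer used when serializing the value.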

That's the basic idea why System.Uuid as a primitive container type in BCL is so desired - so every library in ecosystem can have the same foundation primitive type to build upon.

That is also why such a type will not happen. It only compounds the user confusion, the risk of breaking changes a user must deal with, etc. The difference between Uuid and Guid is not one of functionality; it is one of binary serialization and binary deserialization. Every other aspect of the types remains identical.

vanbukin commented 1 year ago

Most of those have nothing to do with binary serialization and behave identically regardless of whether you consider Guid or Uuid. The only ones that differ are the ones that take a raw byte sequence (e.g. byte[] or Span<byte>).

The difference is not only in the methods that take bytes as input, but also in the methods that output bytes.

The endianness concerns (and therefore the proposed difference between Uuid and Guid) only comes about when dealing with raw byte sequences.

And this is what ToByteArray, TryWriteBytes, and the constructor that takes bytes force you to do.

You have no need to understand the internals of the library.

When the output artifact of a library's functionality is 16 bytes, it is important to know exactly how they were created, whether simply by calling ToByteArray or using a construction like this.

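using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

static byte[] ConvertToBytesTakingTheStringRepresentationAsThePrimarySource(Guid guid)
{
    var output = new byte[16];
    var dst = output.AsSpan();
    // View the Guid's in-memory fields as raw bytes (the little-endian reads below assume a little-endian host).
    var src = MemoryMarshal.Cast<Guid, byte>(new ReadOnlySpan<Guid>(in guid));
    // Byte-swap the 32-bit field and the two 16-bit fields into big-endian order.
    BinaryPrimitives.WriteInt32BigEndian(dst, BinaryPrimitives.ReadInt32LittleEndian(src));
    BinaryPrimitives.WriteInt16BigEndian(dst[4..], BinaryPrimitives.ReadInt16LittleEndian(src[4..]));
    BinaryPrimitives.WriteInt16BigEndian(dst[6..], BinaryPrimitives.ReadInt16LittleEndian(src[6..]));
    // The trailing 8 bytes keep their order (big-endian read, big-endian write).
    BinaryPrimitives.WriteInt64BigEndian(dst[8..], BinaryPrimitives.ReadInt64BigEndian(src[8..]));
    return output;
}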

Some projects do this by default.

In that case, developers should be aware that they cannot simply call the constructor that takes bytes or use ToByteArray/TryWriteBytes methods. They must prepare the byte sequence beforehand because the code in these libraries uses the string representation as the source value.

Some of them have customizable behavior.

And it all depends on the combinations of the database driver settings and what is written in your code.

Spec'ing the format of your binary serialized data is a fundamental requirement to working with binary data.

And this is something that the System.Guid API does not do.

Again, same exists with Guid vs Uuid except now you need to understand the relationship between the two types rather than only auditing the points of binary serialization/deserialization.

And this falls into the realm of explicit differences between two types, rather than remaining in the implicit realm of which serialization API was called.

This is spec'ing your serialization format and having tests covering that data correctly roundtrips. This is extremely trivial to do.

Fixing technicalities using processes is not an efficient approach.

tannergooding commented 1 year ago

The difference is not only in the methods that take bytes as input, but also in the methods that output bytes.

Yes, sorry. This is a place I forgot to reiterate both sides of serialization/deserialization.

And this is what ToByteArray, TryWriteBytes, and the constructor that takes bytes force you to do.

Yes. APIs that deal with raw byte sequences require you to think about endianness. Not thinking about endianness when dealing with raw byte sequences will only lead to bugs.

When the output artifact of a library's functionality is 16 bytes, it is important to know exactly how they were created, whether simply by calling ToByteArray or using a construction like this.

Yes, you must understand the endianness when dealing with raw byte sequences. The same would be true for Uuid.

In that case, developers should be aware that they cannot simply call the constructor that takes bytes or use ToByteArray/TryWriteBytes methods. They must prepare the byte sequence beforehand because the code in these libraries uses the string representation as the source value.

Yes, they must be aware that ToByteArray and Guid(byte[]) currently require the bytes to be little-endian. The overloads give them the option of saying the bytes are instead in big-endian format, removing the need for them to write or maintain their own custom logic.
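As a rough illustration of the difference, the sketch below reads the same 16 bytes both ways. It assumes the `bigEndian` overloads in the form they shipped in .NET 8 (`Guid(ReadOnlySpan<byte>, bool)` and `ToByteArray(bool)`); the sample bytes are made up for the example.

```csharp
using System;

// Sixteen bytes in the same order as the string form, e.g. as received
// from a big-endian wire format (sample data for this sketch).
byte[] wire = Convert.FromHexString("00112233445566778899AABBCCDDEEFF");

// Existing constructor: the first three fields are read as little-endian.
var littleEndianRead = new Guid(wire);

// Overload with an explicit byte order (assumed: .NET 8 or later).
var bigEndianRead = new Guid(wire, bigEndian: true);

Console.WriteLine(littleEndianRead); // 33221100-5544-7766-8899-aabbccddeeff
Console.WriteLine(bigEndianRead);    // 00112233-4455-6677-8899-aabbccddeeff

// Writing with the same flag reproduces the original bytes.
byte[] written = bigEndianRead.ToByteArray(bigEndian: true);
Console.WriteLine(Convert.ToHexString(written)); // 00112233445566778899AABBCCDDEEFF
```

On earlier runtimes the same result needs hand-written swapping of the first three fields.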

And this is something that the System.Guid API does not do.

It is explicitly something that System.Guid does. The APIs that deal with raw byte sequences are explicitly documented to be little-endian today.

From ToByteArray: https://learn.microsoft.com/en-us/dotnet/api/system.guid.tobytearray?view=net-7.0

You can use the byte array returned by this method to round-trip a Guid value by calling the Guid(Byte[]) constructor.

Note that the order of bytes in the returned byte array is different from the string representation of a Guid value. The order of the beginning four-byte group and the next two two-byte groups is reversed, whereas the order of the last two-byte group and the closing six-byte group is the same. The example provides an illustration.

The wording could be improved in some cases for the constructors.

The new APIs then allow developers to specify which format their raw byte sequence is in.

And this falls into the realm of explicit differences between two types, rather than remaining in the implicit realm of which serialization API was called.

There is no more difference between Guid and Uuid than there would be between a UInt128BigEndian and UInt128LittleEndian. There is only one way to represent a given value for the type.

Fixing technicalities using processes is not an efficient approach.

There is a fundamental requirement for developers doing binary serialization to use the same interpretation on both sides of the serialization/deserialization boundary. Having different types does not solve this problem.
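A small round-trip check shows what "the same interpretation on both sides" means in practice. This is only a sketch: `RoundTrips` is a hypothetical helper, and it assumes the `bigEndian` overloads available in .NET 8 and later.

```csharp
using System;

// Hypothetical check: write with one byte order, read with another,
// and report whether the original value comes back.
static bool RoundTrips(Guid value, bool writerBigEndian, bool readerBigEndian)
{
    byte[] bytes = value.ToByteArray(writerBigEndian);
    var restored = new Guid(bytes, readerBigEndian);
    return restored == value;
}

// Use a value whose leading fields are not byte-symmetric; a value such as
// Guid.Empty would round-trip under any combination and hide the bug.
var id = Guid.Parse("00112233-4455-6677-8899-aabbccddeeff");

Console.WriteLine(RoundTrips(id, writerBigEndian: true,  readerBigEndian: true));  // True
Console.WriteLine(RoundTrips(id, writerBigEndian: false, readerBigEndian: false)); // True
Console.WriteLine(RoundTrips(id, writerBigEndian: true,  readerBigEndian: false)); // False
```

The mismatch in the last case occurs regardless of which type holds the value; only agreeing on the byte order at the serialization boundary avoids it.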

jeffhandley commented 1 year ago

With the ongoing conversation here, I've removed the api-ready-for-review label and put this back into api-suggestion (after accidentally clicking api-needs-work at first).

PJB3005 commented 1 year ago

In that case, developers should be aware that they cannot simply call the constructor that takes bytes or use ToByteArray/TryWriteBytes methods. They must prepare the byte sequence beforehand because the code in these libraries uses the string representation as the source value.

If I am understanding correctly, this whole use case seems to rely on the developer getting a byte[16] representing a GUID/UUID from somewhere (such as a binary protocol or file format[^1]), and then wanting to consume that as a Guid in .NET. I would expect developers working with such things to be aware of what endianness is. If they do not, I do not think any of the proposals given so far would ever save them.

The use case of "writing it to a database" is certainly one that has been brought up a lot in this discussion, but there is nothing special about it. I have already explained in my previous comment how there is (in most cases) only one valid thing for your database layer to do once it encounters a GUID. Judging by the code links posted, all of them do exactly that. You would have the exact same issues with a use case as simple as converting a binary file format to a textual format. It just happens that databases sometimes have switches that allow you to make a right out of two wrongs here. The bug here happened the moment the developer passed the wrong endianness to new Guid(), and no combination of connection string madness is a valid way to fix it.

(This entire post can be inverted to go from reading a binary GUID to writing one)

[^1]: If you're getting a GUID from anywhere else like NewGuid() or Parse(), you wouldn't be running into this.