Extend System.Guid with a new creation API for v7

tannergooding commented 5 months ago

Rationale

The UUID specification (https://datatracker.ietf.org/doc/rfc9562) defines several different UUID versions which can be created and which allow developers to produce and consume UUIDs that have a particular structure.

As such, we should expose helpers to allow creating such UUIDs. For .NET 9, the set of potential versions is detailed below, but only UUIDv7 is proposed for the time being.

As per the UUID spec, GUID is a valid alternative name and so the name of the new APIs remains NewGuid* for consistency with our existing APIs:

This specification defines UUIDs (Universally Unique IDentifiers) -- also known as GUIDs (Globally Unique IDentifiers)

API Proposal

namespace System;

public partial struct Guid
{
    // Alternatively: MaxValue -or- AllBitsSet
    public static Guid Max { get; }

    public int Variant { get; }
    public int Version { get; }

    // v1
    //     60-bit timestamp 100ns ticks since 00:00:00.00, 15 October 1582 UTC
    //     14-bit clock sequence, intended to be random once in the lifetime of system
    //     48-bit node identifier, as in IEEE 802 Node IDs

    // v2
    //     DCE Security UUIDs

    // v3
    //     128-bit MD5 of namespaceId + name converted to canonical octet sequence
    //     6-bits then replaced with the version/variant fields

    // v4
    //     Already exposed as NewGuid()

    // v5
    //     160-bit SHA1 of namespaceId + name converted to canonical octet sequence, taking the most significant 128-bits
    //     6-bits then replaced with the version/variant fields

    // v6
    //     60-bit timestamp 100ns ticks since 00:00:00.00, 15 October 1582 UTC
    //     14-bit clock sequence, intended to be random for every new UUID, can be compatible with v1
    //     48-bit node identifier, intended to be random for every new UUID, can be compatible with v1

    // v7
    //     48-bit timestamp millisecond ticks since Unix Epoch
    //     12-bits of random data -or- submillisecond timestamp (M3)
    //         M3 - Scale remaining precision to fit into the 12-bits at even intervals
    //     62-bits of random data -or- carefully seeded counter (M1 or M2)
    //         M1 - Dedicated counter, simply increment per UUID created within a given tick stamp
    //         M2 - Monotonic random, seed random then increment by random amount within a given tick stamp
    public static Guid NewGuidv7(); // uses DateTime.UtcNow
    public static Guid NewGuidv7(DateTime timestamp);

    // v8
    //     48-bits of custom data
    //     12-bits of custom data
    //     62-bits of custom data
}
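For illustration, the values that the proposed Version and Variant properties would report can be read straight out of the RFC octet layout. This is a minimal sketch, not the proposed implementation: GetVersion and GetVariant are hypothetical helper names, the code assumes .NET 8 or later for ToByteArray(bigEndian: true), and reporting only the top two variant bits is an assumption of the sketch.

```csharp
using System;

// Hypothetical helpers (not part of the proposal): read the fields from the RFC 9562 octet order.
static int GetVersion(Guid value)
{
    byte[] octets = value.ToByteArray(bigEndian: true);
    return octets[6] >> 4; // version is the high nibble of octet 6
}

static int GetVariant(Guid value)
{
    byte[] octets = value.ToByteArray(bigEndian: true);
    return octets[8] >> 6; // assumption: report only the top two bits of octet 8
}

Guid v4 = Guid.NewGuid();
Console.WriteLine(GetVersion(v4)); // 4
Console.WriteLine(GetVariant(v4)); // 2 (binary 10)
```

Guid.NewGuid() already produces version 4 UUIDs, so it makes a convenient sanity check for both helpers.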

julealgon commented 5 months ago

@tannergooding shouldn't it be cased GuidV7 instead of Guidv7?

On another note, what about having the method take an enum parameter specifying the version (instead of having all the New method variations)?

Could then have something like:

var guid = Guid.NewGuid(GuidVersion.Version7);

Which would look much cleaner with:

https://github.com/dotnet/csharplang/issues/2926

of course:

Guid guid = NewGuid(Version7);

tannergooding commented 5 months ago

shouldn't it be cased GuidV7 instead of Guidv7?

I'll let API review decide if I got it "wrong" or not. I prefer the look of Guidv7 personally.

On another note, what about having the method take an enum parameter specifying the version (instead of having all the New method variations)?

This doesn't work because there are unique parameters/overloads per version. i.e. v7 takes a DateTime while v5 does not (it would presumably take a string namespace, string name, ignoring the name/keyword conflict)

Ilchert commented 5 months ago

Hello @tannergooding, a few questions:

How to handle Local and Unspecified date time conversion to Unix milliseconds?
Maybe add DateTimeOffset overload?
Please look at the code provided below: how should comparison of Guids be handled? Should BCL provide Guidv7Comparer for proper sorting/comparison?

var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);

Console.WriteLine(g1); // 03040506-0102-0000-0000-000000000000
Console.WriteLine(g2); // 03040505-0202-0000-0000-000000000000
Console.WriteLine(g1 < g2); // False
Console.ReadKey();

static Guid Create(DateTime dateTime)
{
    var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
    var ms = (ulong)dto.ToUnixTimeMilliseconds();
    ms &= (1ul << 48) - 1; // keep only the low 48 bits of the timestamp
    Span<byte> data = stackalloc byte[16];
    BinaryPrimitives.WriteUInt64LittleEndian(data, ms);
    return new Guid(data);
}

tannergooding commented 5 months ago

How to handle Local and Unspecified date time conversion to Unix milliseconds?

That's largely an implementation detail and is not really relevant to the API proposal. DateTime tracks enough information to know what point of time it represents, the DateTimeOffset constructor knows how to get a correct offset for local and utc. It treats unspecified the same as local, by design.

Should BCL provide Guidv7Comparer for proper sorting/comparison?

There is no such thing as "proper" sorting/comparison; that is, there is no formal definition of how to compare UUIDs. The way .NET does it is to treat it effectively as an unsigned 128-bit integer represented in hex form. The output string is already in big endian format.
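As a rough illustration of that ordering, the sketch below converts each Guid to an unsigned 128-bit integer from its big-endian bytes and shows that the integer comparison agrees with Guid.CompareTo. ToUInt128 is a hypothetical helper name, and the code assumes .NET 8 or later (UInt128, BinaryPrimitives.ReadUInt128BigEndian and ToByteArray(bigEndian: true)).

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: interpret the Guid's value as an unsigned 128-bit integer.
static UInt128 ToUInt128(Guid value) =>
    BinaryPrimitives.ReadUInt128BigEndian(value.ToByteArray(bigEndian: true));

var low = new Guid("7fffffff-0000-0000-0000-000000000000");
var high = new Guid("80000000-0000-0000-0000-000000000000");

// The first field is stored as a signed int, but the comparison is unsigned.
Console.WriteLine(low.CompareTo(high) < 0);         // True
Console.WriteLine(ToUInt128(low) < ToUInt128(high)); // True
```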

The algorithm you've used to create the Guid in the sample code is notably incorrect, however. The actual definition is as follows, where the RFC lists it with most significant bytes first and in terms of simple octets:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           unix_ts_ms                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          unix_ts_ms           |  ver  |       rand_a          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |var|                        rand_b                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            rand_b                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The internal layout used by Guid, which is allowed to differ as per the spec, is then:

int32 a
int16 b
int16 c
int8 d, e, f, g, h, i, j, k

Given a 48-bit Unix Timestamp of 0x010203040506, the Guid string should always appear as 01020304-0506-7???-8???-???????????? (where the 8??? is [8000, BFFF]). Thus, you need to store the 48-bit timestamp as:

_a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
_b = (short)(msSinceUnixEpoch); // store the least significant 2-bytes of the timestamp

// Fill c, d, e, f, g, h, i, j, and k with random bits

_c = (short)((_c & ~0xF000) | 0x7000); // Set the version to 7
_d = (byte)((_d & ~0xC0) | 0x80); // Set the variant to 2

This ensures that the Unix timestamps are already effectively sorted. The exception is for the random data that exists for two UUIDs created in the same "Unix tick". The spec allows for these random bits to contain better structured data, however that is extended functionality and can be provided at a later point in time if/when there is enough demand for it. I'd guess that we'd likely do that via an additional overload that looks something like Guid NewGuidv7(DateTime timestamp, bool trackSubmillisecond, long counter)
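Putting those steps together, a minimal sketch might look like the following. It is not the BCL implementation: CreateV7Sketch is a hypothetical name, the random bits come from RandomNumberGenerator as one possible source, and the millisecond count uses integer tick division as a choice made for this sketch.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: a sketch of the field layout described above, not the BCL implementation.
static Guid CreateV7Sketch(DateTime timestamp)
{
    // Integer tick division keeps the millisecond count exact in this sketch.
    long msSinceUnixEpoch = (timestamp.ToUniversalTime() - DateTime.UnixEpoch).Ticks / TimeSpan.TicksPerMillisecond;

    int a = unchecked((int)(msSinceUnixEpoch >> 16)); // most significant 4 bytes of the 48-bit timestamp
    short b = unchecked((short)msSinceUnixEpoch);     // least significant 2 bytes of the timestamp

    Span<byte> random = stackalloc byte[10];
    RandomNumberGenerator.Fill(random);

    short c = (short)((((random[0] << 8) | random[1]) & 0x0FFF) | 0x7000); // 12 random bits, version 7
    byte d = (byte)((random[2] & 0x3F) | 0x80);                            // 6 random bits, variant 2

    return new Guid(a, b, c, d, random[3], random[4], random[5], random[6], random[7], random[8], random[9]);
}

DateTime timestamp = DateTime.UnixEpoch.AddTicks(0x010203040506 * TimeSpan.TicksPerMillisecond);
Guid sample = CreateV7Sketch(timestamp);
Console.WriteLine(sample.ToString()[..15]); // 01020304-0506-7
```

Only the timestamp and version prefix are deterministic; the remaining digits differ on every call.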

Maybe add DateTimeOffset overload?

It can be discussed in API review, but isn't really necessary to get the core functionality supported. It is also likely a far stretch from what the typical use-case would be.

The actual implementation is likely going to be effectively: long msSinceUnixEpoch = (long)((timestamp.ToUniversalTime() - DateTime.UnixEpoch).TotalMilliseconds), which is about as cheap as you can get the actual computation work.

Ilchert commented 5 months ago

Given a 48-bit Unix Timestamp of 0x010203040506, the Guid string should always appear as 01020304-0506-7???-8???-???????????? (where the 8??? is [8000, BFFF]).

Thanks for the explanation, you will perform some additional operation to make string, sorting and binary big-endian representation consistent.

var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);

Console.WriteLine(g1); // 01020304-0506-7000-8000-000000000000
Console.WriteLine(g2); // 02020304-0505-7000-8000-000000000000
Console.WriteLine(g1 < g2); // true

Console.WriteLine(Convert.ToHexString(g1.ToByteArray(true))); // 01020304050670008000000000000000
Console.WriteLine(Convert.ToHexString(g2.ToByteArray(true))); // 02020304050570008000000000000000

Console.ReadKey();

static Guid Create(DateTime dateTime)
{
    var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
    var msSinceUnixEpoch = (ulong)dto.ToUnixTimeMilliseconds();

    var a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
    var b = unchecked((short)(msSinceUnixEpoch)); // store the least significant 2-bytes of the timestamp

    var c = (short)0x7000; // Set the version to 7
    var d = (byte)0x80; // Set the variant to 2

    return new Guid(a, b, c, d, 0, 0, 0, 0, 0, 0, 0);
}

tannergooding commented 5 months ago

you will perform some additional operation to make string, sorting and binary big-endian representation consistent.

There's not really anything "additional" to do here.

0x00, 0x00, 0x01, 0x23 (big endian) and 0x23, 0x01, 0x00, 0x00 (little endian) both represent the same value 0x123 which will always be less than 0x124 (which would be 0x00, 0x00, 0x01, 0x24 as big endian and 0x24, 0x01, 0x00, 0x00 as little endian)

Which is to say, the underlying storage format doesn't matter except for when it applies to serialization/deserialization. The actual value stored is what matters and is what is used in the context of doing operations such as ToString or CompareTo.

You can use any storage format you'd like, provided that the underlying operations understand how to interpret it as the actual value. Different platforms then use different formats typically based on what is the most efficient or convenient (hence why most CPUs natively use little-endian and why networking typically uses big-endian). .NET happens to use a format for Guid that is compatible with the Microsoft _GUID structure and which historically worked well for COM and other scenarios, but that doesn't make the actual value it represents any different.
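To make the distinction between value and storage concrete, here is a small sketch that serializes one Guid in both byte orders and reads each back with the matching flag. It assumes .NET 8 or later for the bigEndian overloads; the variable names are illustrative only.

```csharp
using System;

var value = new Guid("01020304-0506-7000-8000-000000000000");

byte[] little = value.ToByteArray();             // _GUID-compatible layout
byte[] big = value.ToByteArray(bigEndian: true); // RFC octet order

Console.WriteLine(Convert.ToHexString(little)); // 04030201060500708000000000000000
Console.WriteLine(Convert.ToHexString(big));    // 01020304050670008000000000000000

// Different bytes, same value, as long as each is read back the way it was written.
Console.WriteLine(new Guid(little) == value);               // True
Console.WriteLine(new Guid(big, bigEndian: true) == value); // True
```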

huoyaoyuan commented 5 months ago

How to handle Local and Unspecified date time conversion to Unix milliseconds?

That's largely an implementation detail and is not really relevant to the API proposal. DateTime tracks enough information to know what point of time it represents, the DateTimeOffset constructor knows how to get a correct offset for local and utc. It treats unspecified the same as local, by design.

DateTimeOffset is actually more "correct" as a timestamp. It's tolerant of changes in the local time zone, including DST changes. However, it also has more overhead.

vanbukin commented 5 months ago

There is no such thing as Guidv7, v2, v8 or any other v-something. There's Uuid and there's Guid, which Microsoft developed. They're literally different structures. Uuid is 16 consecutive bytes. For Uuid, there's only one way to roundtrip from binary to string representation and back. Guid is a structure of the same length, but with a specific layout (int, short, short, byte, byte, byte..byte) and an API that "masks" its layout. Because of this, there are 2 ways to roundtrip from binary to string representation and back. We all know this perfectly well, this topic was previously discussed in #86084 and #86798.

Guid does not exist outside of technologies related to Microsoft in one way or another, or technologies that it has had a hand in, to some extent. Uuid, on the other hand, is a universally accepted standard that has those very versions, variants, etc., which determine what exact values should be written at certain places within the Uuid.

Since there's no Uuid in BCL (and @tannergooding specifically insisted that no Uuid should be introduced, closing #86084), the existing ecosystem is forced to use Guid as a container for Uuid. The only safe way to do this is to rely solely on the string representation.

And now let's look at how this (doesn't) work in the real world.

RFC 9562, 6.13. DBMS and Database Considerations

For many applications, such as databases, storing UUIDs as text is unnecessarily verbose, requiring 288 bits to represent 128-bit UUID values. Thus, where feasible, UUIDs SHOULD be stored within database applications as the underlying 128-bit binary value. For other systems, UUIDs MAY be stored in binary form or as text, as appropriate. The trade-offs to both approaches are as follows:

Storing in binary form requires less space and may result in faster data access.

Storing as text requires more space but may require less translation if the resulting text form is to be used after retrieval, which may make it simpler to implement.

Using a specialized data type provided by a specific RDBMS in conjunction with a particular way of generating Uuid can increase data access speed. This is precisely the reason for Uuidv7's existence.

Uuidv7 stores the number of milliseconds that have passed since the start of Unix time in the first 48 bits (unix_ts_ms), followed by 4 bits for the version (ver), 12 bits for the first random part (rand_a), 2 bits for the variant (var), and 62 bits for rand_b. It's important that in Section 6.2 (Method 3) the use of part of rand_a to store the time-based part is allowed in order to increase time precision (from millisecond to sub-millisecond), thereby bringing the time-based part up to 60 bits. This allows for the use of unix_ts_ms and rand_a to store the number of 100-nanosecond intervals that have passed since the start of the unix-epoch, specifically - Ticks (overflow will occur on June 18, 5623 at 9:21 UTC - this is far enough in the future to directly use ticks in the time-based part when generating Uuidv7).
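The overflow date can be checked with a couple of lines. This is only a sanity check under the assumption stated above, namely that all 60 bits (unix_ts_ms plus rand_a) hold 100-nanosecond ticks counted from the Unix epoch.

```csharp
using System;

// Largest tick count that fits in 60 bits, counted from the Unix epoch.
long maxTicks = (1L << 60) - 1;
DateTime overflow = DateTime.UnixEpoch.AddTicks(maxTicks);

// Prints the last instant representable before the 60-bit field wraps around.
Console.WriteLine(overflow.ToString("u"));
```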

An important point is that the time-based part is stored in big-endian. Since RDBMS typically indexes binary data from left to right, this method of generation ensures monotonically increasing values, just like an integer counter. This allows for maintaining low levels of index fragmentation, fast search, and constant insertion time.

And now, having finished with the introductory part, let's dive into the peculiarities of how popular RDBMS and their .NET drivers work with Uuid and Guid (which is used as a lousy transport for Uuid).

Welcome to hell.

Uuidv7

Let's write a simple function for generating Uuidv7, which will use the fields unix_ts_ms and rand_a to store the number of ticks since the start of the Unix epoch.

static string GenerateUuidV7()
{
    Span<byte> uuidv7 = stackalloc byte[16];
    ulong unixTimeTicks = (ulong)DateTimeOffset.UtcNow.Subtract(DateTimeOffset.UnixEpoch).Ticks;
    ulong unixTsMs = (unixTimeTicks & 0x0FFFFFFFFFFFF000) << 4;
    ulong unixTsMsVer = unixTsMs | 0b0111UL << 12;
    ulong randA = unixTimeTicks & 0x0000000000000FFF;
    // merge "unix_ts_ms", "ver" and "rand_a"
    ulong hi = unixTsMsVer | randA;
    BinaryPrimitives.WriteUInt64BigEndian(uuidv7, hi);
    // fill "rand_b" and "var"
    RandomNumberGenerator.Fill(uuidv7[8..]);
    // set "var"
    byte varOctet = uuidv7[8];
    varOctet = (byte)(varOctet & 0b00111111); // clear the two variant bits
    varOctet = (byte)(varOctet | 0b10000000); // set the variant to 0b10, keeping the 6 random bits
    uuidv7[8] = varOctet;
    return Convert.ToHexString(uuidv7);
}

The hexadecimal representation is used here as the base because for Guid, only the string representation is considered valid. This string can be passed to the Guid constructor, and when calling ToString, we will get the same value.

PostgreSQL

God bless the developers of PostgreSQL and Npgsql. This is the only database and driver where everything works without any problems. We take the string representation of Uuidv7, pass it to the Guid constructor, write the Guid to the database (uuid column), and obtain the records in the same order in which they were written. Without modifying the connection string or any strange behavior. It just works.

MySQL

It can only use binary(16) or varbinary(16) because it does not have a dedicated data type for storing Uuid.

Okay, let's test how it works in practice.

docker run --name mysql -e MYSQL_ROOT_PASSWORD=root -p 3306:3306 --cpus=2 --memory=1G --rm -it mysql:latest

Somehow connect to the server and create a database and its schema there.

CREATE DATABASE `dotnet`;

And after selecting our newly created database:

CREATE TABLE `uuids`
(
    `uuid` BINARY(16) NOT NULL PRIMARY KEY,
    `order` BIGINT NOT NULL
);

Let's write a simple program that generates a Uuidv7 and inserts its value along with an ordinal number (to validate the sort order). Annnd... it doesn't work. Because out of the box, you can't write a Guid to binary(16).

This happens because the MySQL driver has a connection string parameter, GUID Format, with a default value of Char36. To make everything work correctly, you need to add the connection string parameter GUID Format=Binary16.

If we set the values to TimeSwapBinary16 or LittleEndianBinary16, everything will also work (for a while).

However, if you execute

SELECT * FROM uuids ORDER BY uuid ASC;

You will notice that for TimeSwapBinary16 and LittleEndianBinary16 values, the order in the order column is NOT sequential! Due to this, the data will become fragmented, resulting in degraded insertion time as the number of records in the table increases.
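The effect can be reproduced without a database. The sketch below assumes that Binary16 writes the bytes in big-endian order and that LittleEndianBinary16 writes them in the order Guid.ToByteArray() returns them; it then checks whether a byte-wise (BINARY(16)-style) comparison preserves the generation order. FromTimestamp, CompareBytes and IsSorted are hypothetical helper names, and the random parts are left at zero for clarity.

```csharp
using System;
using System.Linq;

// Lexicographic byte comparison, the way a binary column orders its keys.
static int CompareBytes(byte[] x, byte[] y)
{
    for (int i = 0; i < x.Length; i++)
    {
        int diff = x[i].CompareTo(y[i]);
        if (diff != 0) return diff;
    }
    return 0;
}

// Version 7 layout with only the 48-bit timestamp filled in.
static Guid FromTimestamp(long ms) =>
    new Guid(unchecked((int)(ms >> 16)), unchecked((short)ms), 0x7000, 0x80, 0, 0, 0, 0, 0, 0, 0);

// 1000 values with timestamps one millisecond apart, in generation order.
Guid[] generated = Enumerable.Range(0, 1000).Select(i => FromTimestamp(0x010203040506 + i)).ToArray();

bool IsSorted(Func<Guid, byte[]> serialize)
{
    byte[][] keys = generated.Select(serialize).ToArray();
    for (int i = 1; i < keys.Length; i++)
    {
        if (CompareBytes(keys[i - 1], keys[i]) > 0) return false;
    }
    return true;
}

Console.WriteLine(IsSorted(g => g.ToByteArray(bigEndian: true))); // True
Console.WriteLine(IsSorted(g => g.ToByteArray()));                // False
```

The little-endian layout breaks the order as soon as the low timestamp byte wraps from 0xFF to 0x00, because that byte is compared before the more significant ones.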

I would like to remind you that we are passing absolutely correct Uuidv7 (according to the specification) values as parameters, using Guid as a container for the value.

Okay, let's generate Uuidv7 and insert them into the database, using various GUID Format values, and construct graphs for visualizing the process. After all, everyone loves graphs.

For this, I wrote a pair of small programs:

The first one generates Uuidv7 and inserts its value (using Guid) along with its sequential order number (the order in which it was generated) into the database.
The second one reads all entries from the database using the SQL query: SELECT `uuid`, `order` FROM `uuids` ORDER BY `uuid` ASC; then it calculates the distribution statistics of records between the actual order number row (in which order it was read) and the one recorded during generation, grouping the deviations in buckets of 100k records.

And here are the results by insertion time:

[Figure: mysql-inserts-compare]

And here is the deviation:

[Figure: mysql-deviation-all]

The denser the points are to the left part and the higher the values there, the more such values resemble a monotonically increasing sequence.

Notably, when using Binary16, the maximum deviation of the order in which the record was read from the order number at recording is 1. And this is easy to explain.

I ran BenchmarkDotNet and saw the following picture.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-FTFTEX : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Server=True

| Method         | Mean     | Error    | StdDev   | Gen0   | Allocated |
|--------------- |---------:|---------:|---------:|-------:|----------:|
| GenerateUuidV7 | 63.96 ns | 0.654 ns | 0.612 ns | 0.0001 |      88 B |

This means that sometimes 2 Uuids are generated within a 100-nanosecond interval (one Tick), and the random part of the second one outpaces the first.

Thus, Uuidv7 + Guid as transport + GUID Format=Binary16 in the connection string parameter give us the expected behavior. Only in this combination do we have a truly monotonic increasing sequence, which behaves like an auto-incrementing integer counter.

However, such visualization does not allow for a clear assessment of the situation for LittleEndianBinary16 and TimeSwapBinary16. Let's remove Binary16 and look at the deviation graph again without it:

[Figure: mysql-deviation-timeswap-and-little-endian]

You can see that when using LittleEndianBinary16, values still resemble sequential ones (this is also noticeable in the insertion time), but nevertheless they have dispersion, and with a large amount of data, insertion performance will still degrade.

In the case of TimeSwapBinary16, everything is as bad as possible. Press F to pay respects to a table with such a primary key or unique index.

Microsoft SQL Server

The final boss of the next Doom will be the uniqueidentifier. This is the quintessence of obscure technologies multiplied by outright poor engineering decisions.

Let's start by generating a Uuidv7 with a correct text representation in the form of Guid, insert them into the database, and measure the insertion time.

We get the following graph:

[Figure: mssql-inserts-raw]

We see a significant slowdown in insertion over time. We observed a similar pattern with MySQL.

And here is the deviation:

[Figure: mssql-deviation-only-raw]

Let's check the state of our index:

SELECT * FROM sys.dm_db_index_physical_stats(db_id('dotnet'), object_id('uuids'), NULL, NULL, NULL);

and we see that avg_fragmentation_in_percent = 99.25588285567774. Our table is as bad as it can possibly be.

From here we dive into obscure technologies. This behavior occurs because the uniqueidentifier has its own sort order. But there is no documentation on this order. There is only one single description on the entire internet in an MSDN article from 2006, which has already been deleted. Fortunately, we have the internet archive and we can see what was written there:

More technically, we look at bytes {10 to 15} first, then {8-9}, then {6-7}, then {4-5}, and lastly {0 to 3}.
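The quoted ordering can be sketched as a comparison over the bytes returned by Guid.ToByteArray(). This is only an illustration of the description above: CompareLikeUniqueIdentifier is a hypothetical name, and the order of bytes inside each group is an assumption of the sketch.

```csharp
using System;

// Byte positions from most to least significant, following the quoted description:
// {10 to 15}, then {8-9}, then {6-7}, then {4-5}, and lastly {0 to 3}.
// Assumption: bytes inside each group are compared left to right.
int[] order = { 10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3 };

int CompareLikeUniqueIdentifier(Guid x, Guid y)
{
    byte[] xb = x.ToByteArray();
    byte[] yb = y.ToByteArray();
    foreach (int i in order)
    {
        int diff = xb[i].CompareTo(yb[i]);
        if (diff != 0) return diff;
    }
    return 0;
}

var earlier = new Guid("01020304-0506-7000-8000-000000000000"); // timestamp 0x010203040506
var later = new Guid("02020304-0505-7000-8000-000000000000");   // timestamp 0x020203040505

Console.WriteLine(earlier.CompareTo(later) < 0);                    // True
Console.WriteLine(CompareLikeUniqueIdentifier(earlier, later) < 0); // False
```

With the timestamp sitting in the least significant groups, two values that differ only in their timestamps can sort in the opposite order from the one in which they were generated.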

But this is insufficient.

This is where poor engineering decisions come into play, in the form of Guid's internal layout. This is because Microsoft.Data.SqlClient uses the ToByteArray() call on System.Guid. The ToByteArray() call simply dumps the internal contents of Guid into a byte array.

Now we multiply obscure technologies by poor engineering decisions. As a result, in order to write Uuidv7 to a table column of type uniqueidentifier and to get a monotonically increasing sequence in return, we need to make 2 permutations.

This is done as follows (the code is written as simply as possible so that anyone can understand what's going on):

string ReorderUuid(string uuid)
{
    var src = Convert.FromHexString(uuid);
    var dst = new byte[16];
    // reorder for SQL SERVER Sort order
    dst[0] = src[12];
    dst[1] = src[13];
    dst[2] = src[14];
    dst[3] = src[15];
    dst[4] = src[10];
    dst[5] = src[11];
    dst[6] = src[8];
    dst[7] = src[9];
    dst[8] = src[6];
    dst[9] = src[7];
    dst[10] = src[0];
    dst[11] = src[1];
    dst[12] = src[2];
    dst[13] = src[3];
    dst[14] = src[4];
    dst[15] = src[5];
    // reorder for guid internal layout
    var tmp0 = dst[0];
    var tmp1 = dst[1];
    var tmp2 = dst[2];
    var tmp3 = dst[3];
    dst[0] = tmp3;
    dst[1] = tmp2;
    dst[2] = tmp1;
    dst[3] = tmp0;
    var tmp4 = dst[4];
    var tmp5 = dst[5];
    dst[4] = tmp5;
    dst[5] = tmp4;
    var tmp6 = dst[6];
    var tmp7 = dst[7];
    dst[6] = tmp7;
    dst[7] = tmp6;
    return Convert.ToHexString(dst);
}

And if we construct a Guid in the following way new Guid(ReorderUuid(GenerateUuidV7())) and write it to the database by passing it as a parameter to the INSERT query, only in this case will we actually get a monotonically increasing sequence.

We get the following picture for insertion time:

[Figure: mssql-inserts-compare]

And for deviation:

[Figure: mssql-deviation-compare]

The maximum deviation is 1 (the reason is the same - generating 2 values in 1 Tick). The avg_fragmentation_in_percent value is 0.6696610861576153 (compared to 99.25588285567774 for insertion without reorder).

BINARY(16)

Although the specification explicitly recommends storing UUIDs in the database in binary form, ideally in a specialized data type, not everyone in the real world follows these recommendations. Or a project may simply have started when there were neither such recommendations nor .NET itself (not just Core, but even Framework). And databases may simply not have had a specialized type for working with UUIDs.

This is where binary(16) comes into play. The catch is that neither Npgsql nor Microsoft.Data.SqlClient allow the use of Guid as a parameter for values of this type.

Explicit conversion to a byte array on the calling side is required, or the parameter value from uuid / uniqueidentifier should be converted to binary(16) on the database side in the SQL query itself.

In case of binary(16), all three described RDBMS use the same sorting order - the value is interpreted as big endian as a whole. When it comes to a public API returning a Uuidv7 wrapped in Guid - to transform such a Guid into a byte array you will need to call only ToByteArray(bigEndian: true), and construct only through new Guid(bytes, bigEndian: true).

Only such combination of APIs will ensure a correct roundtrip of values and a monotonically increasing sequence of values at the database level.
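A short sketch of that roundtrip, and of what happens when the read side forgets the flag, assuming .NET 8 or later for the bigEndian overloads (the variable names are illustrative only):

```csharp
using System;

var original = new Guid("01020304-0506-7000-8000-000000000000");

// What would be written to a binary(16) column.
byte[] stored = original.ToByteArray(bigEndian: true);

Guid matched = new Guid(stored, bigEndian: true); // read back with the same byte order
Guid mismatched = new Guid(stored);               // read back as the little-endian layout

Console.WriteLine(matched == original); // True
Console.WriteLine(mismatched);          // 04030201-0605-0070-8000-000000000000
```

The mismatched read does not fail; it silently yields a different value, which is why both sides have to agree on the byte order.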

Summary

In the case of a Guid as a container for Uuidv7 and using a specialized type on the database side:

MySQL: it is mandatory to pass a specific value in the connection string (GUID Format=Binary16)
PostgreSQL: everything will work out of the box
MS SQL Server: manual rearrangement of content will be required to keep the database from breaking down

In the case of a Guid as a container for Uuidv7 and using binary(16) on the database side:

MySQL: either pass a specific value in the connection string (GUID Format=Binary16), or convert Guid to byte array and back only using ToByteArray(bigEndian: true) and new Guid(bytes, bigEndian: true).
PostgreSQL and MS SQL Server: convert Guid to byte array and back only using ToByteArray(bigEndian: true) and new Guid(bytes, bigEndian: true).

Do we need this?

@tannergooding, given everything listed above, I have a question

Whose problems will this API solve, and how exactly?

At the moment, it appears to be a feature only for PostgreSQL and MySQL users (under certain conditions). If we generate Guidv7 (Uuidv7) optimized for the use of uniqueidentifier in MS SQL Server, this will firstly violate the specification (because neither the string nor the binary representation will correspond to the uuidv7 described in the specification), and secondly, it will automatically lead to performance degradation in other databases. Reordering will be necessary in any case. The only question is in which scenarios: when working with Microsoft SQL Server or when working with PostgreSQL / MySQL.

IMHO: it should not be added. It's a minefield. This will definitely do more harm than good.

tannergooding commented 5 months ago

Whose problems will this API solve, and how exactly?

The multitude of users, both internal and external, that have asked for such an API to exist and which are currently using 3rd party packages that are doing effectively what this API will be providing.

https://github.com/dotnet/runtime/issues/88290 was opened last year explicitly asking for this on Guid, and users have continued to come back and ask for it to be supported. Interest was renewed last month when the new RFC 9562 was published/finalized. That discussion has also explicitly highlighted, including in new comments from the community, the problems that a custom Uuid type presents and how it hurts integration with the ecosystem.

There are a multitude of NuGet packages that provide this as well, most of which (particularly UUIDNext which has 482k downloads) are explicitly doing so over System.Guid: https://www.nuget.org/packages?q=uuidv7&includeComputedFrameworks=true&prerel=true&sortby=relevance

I am not interested in getting into another prolonged discussion around the pros vs cons of using System.Guid to represent this information.

GUIDs are not a Microsoft specific concept, they are an alternative name for UUIDs and that is explicitly covered in the official RFC 9562 in the very first sentence

This specification defines UUIDs (Universally Unique IDentifiers) -- also known as GUIDs (Globally Unique IDentifiers) -- and a Uniform Resource Name namespace for UUIDs.

The RFC further discusses layout and how a GUID's underlying value is represented, and that this is distinct from the concept of saving it to binary format. The basic quote is as below, but I have given consistent in-depth analysis and additional citations in replies on other threads where this has been asked:

Saving UUIDs to binary format is done by sequencing all fields in big-endian format. However, there is a known caveat that Microsoft's Component Object Model (COM) GUIDs leverage little-endian when saving GUIDs. The discussion of this (see [MS_COM_GUID]) is outside the scope of this specification.

System.Guid is not a binary format, it is a strong type that represents a UUID, otherwise known as a GUID. It has the capability to serialize to a binary format and for the developer to pick whether that binary format should be big endian or little endian as per the needs of the consuming context.

Citing worst case performance characteristics is likewise not the correct basis for deciding whether or not a feature is suitable. We do not provide worst case or naive implementations of core APIs. We provide optimized routines that efficiently handle the data and which try to do fewer operations where possible. Serializing to a big endian binary format requires up to 3x "byte swap" operations, which can be emitted as part of the general storage code. If it were found to be a significant performance bottleneck in real world code, we could optimize this further to be a single instruction handling the byte swap for all 8 bytes that need it simultaneously.
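For reference, the three byte swaps can be spelled out by hand. This is an illustrative sketch only, not how the runtime implements it: ToBigEndianSketch is a hypothetical name, and the result is meant to match ToByteArray(bigEndian: true) on .NET 8 or later.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: convert the little-endian layout to RFC octet order with three swaps.
static byte[] ToBigEndianSketch(Guid value)
{
    byte[] bytes = value.ToByteArray(); // a, b and c are little-endian; d through k are already in order
    Span<byte> span = bytes;

    BinaryPrimitives.WriteInt32BigEndian(span, BinaryPrimitives.ReadInt32LittleEndian(span));           // swap a
    BinaryPrimitives.WriteInt16BigEndian(span[4..], BinaryPrimitives.ReadInt16LittleEndian(span[4..])); // swap b
    BinaryPrimitives.WriteInt16BigEndian(span[6..], BinaryPrimitives.ReadInt16LittleEndian(span[6..])); // swap c

    return bytes;
}

var swapSample = new Guid("01020304-0506-7000-8000-000000000000");
Console.WriteLine(Convert.ToHexString(ToBigEndianSketch(swapSample))); // 01020304050670008000000000000000
```

Only the first 8 bytes are touched; the last 8 are identical in both formats.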

ImoutoChan commented 5 months ago

There are a multitude of NuGet packages that provide this as well, most of which (particularly UUIDNext which has 482k downloads) are explicitly doing so over System.Guid: https://www.nuget.org/packages?q=uuidv7&includeComputedFrameworks=true&prerel=true&sortby=relevance

This is a good example because this library actually creates "GUID-like" UUIDv7 by accepting the database where it would be used: Guid sequentialUuid = Uuid.NewDatabaseFriendly(Database.SQLite); // from readme

Why is that? It's because GUID fails to represent UUID in a single format, and each database needs its own representation of UUIDv7 within a GUID to function correctly.

vanbukin commented 5 months ago

@tannergooding

GUIDs are not a Microsoft specific concept

This is not true. Guid is a structure that appeared for COM/OLE. In its original form, with its layout and all subsequent advantages and disadvantages, it was invented by Microsoft and is used in the Microsoft technology stack. Literally all other languages use Uuid, which from the perspective of the public API is either 16 bytes in big endian, or a string, the format of which is defined in the specification.

We provide optimized routines that efficiently handle the data and which try to do a lesser number of operations where possible.

I am 100% sure that the .NET runtime team can implement the fastest and most optimized Uuidv7 generation algorithm in the world. But I did not raise the issue of generation performance. I highlighted the topic of what happens to such a Uuid AFTER generation. When it is used as an identifier in a database that will live there for 50 years.

What happens next - when it starts living its own life? That's the topic I was referring to. And I looked at this question through the prism of databases.

Uuidv7 is needed in order to NOT fragment indexes in databases. This is why it exists.

And from the perspective of PostgreSQL or MySQL (under certain conditions), a Uuidv7, packed into a Guid, will be written to the database in such a way that it won't cause index fragmentation.

In case of using Microsoft SQL Server and writing such a Uuidv7, packed into a Guid, into a column of type uniqueidentifier - index fragmentation will occur. For such a scenario, the Uuidv7 generation algorithm described in the RFC is not suitable (due to the specific sort order and the use by the SQL Server driver of an API that was originally intended for COM). A Uuidv7 with a changed byte order is required. Only in this case will the database get the benefits for which Uuidv7 was created. If the database does not receive them, then such a generation algorithm for Microsoft SQL Server makes no sense.

When changing the connection string in MySQL, data begins to be written to the database in a not optimal way. This leads to the loss of all benefits from such a generation algorithm. I would characterize the benefit of such an algorithm for MySQL as positive under a certain driver configuration.

For PostgreSQL, everything is fine.

So we have a situation, where the proposed API will generate Uuidv7, which when written to databases will have the following characteristics:

Microsoft SQL Server - no sense
PostgreSQL - makes sense
MySQL - makes sense under a certain configuration of the database driver.

As @ImoutoChan rightly noted, UUIDNext contains an API for generating Uuids, packed into Guids, that are specific to each database and its driver.

For the API to generate Uuidv7, packed into a Guid to make sense, it is necessary to specify for which database it is generated. But this is not something that can be added to the BCL.

So if the Microsoft platform can't make an algorithm that would work equally well for both the Microsoft-developed database and all other databases - then maybe it's not worth adding such a method at all and leave the implementations to the community?

It seems to have been doing pretty well all these years.

tannergooding commented 5 months ago

A UUID of f81d4fae-7dec-11d0-a765-00a0c91e6bf6 is always exactly f81d4fae-7dec-11d0-a765-00a0c91e6bf6 regardless of whether it is stored in big-endian format, little-endian format, or some arbitrary other format with say even/odd bytes swapped.

This is no different than the value 2 is always the value 2, regardless of whether it is an 8-bit integer 0x02, a little-endian 16-bit integer 0x02, 0x00, a big-endian 16-bit integer 0x00, 0x02, a 32-bit integer, a 64-bit integer, a 3-bit integer, etc.

Binary serialization and deserialization is fully independent of the value represented at runtime in the named type. If you have a destination that requires the data to be stored in a particular binary format, you should explicitly use the APIs which ensure that the data is serialized as that format and ensure the inverse APIs are used when loading from that format.

The underlying storage format used by the type at runtime is fully independent of the value represented by the UUID. It is not safe to rely on and is not something that is observable to the user outside of unsafe code. The RFC itself clearly dictates that GUID is an alternative name for UUID and that the spec lists it as an ordered set of 16 octets in big-endian format for simplicity and that it is the expected format for the variant 2 UUIDs discussed by the spec when saving to a binary format. It explicitly calls out the fact that the standard Microsoft format used by COM defaults to the inverse behavior (a little-endian storage format) and that it is out of scope of the spec to describe how to handle that. .NET explicitly states that it can be handled using the dedicated APIs to specify you would like to load or store the data as big-endian.

At this point, you appear to be explicitly ignoring how the code actually works, the considerations that actually exist, and what the official UUID specification actually calls out. That is not productive, it does not assist anyone, and it is blatantly pushing the discussion in a direction that makes it appear as though this is not a viable solution when there is no actual difference in how System.Guid is used by any database system provided you are correctly saving and loading using the appropriate endianness convention as required by the underlying database system. In all the examples you've listed so far, this is a requirement to save and restore as big-endian format, so if you decide to call NewGuidv7() and then do not appropriately call TryWriteBytes(destination, bigEndian: true) when storing and new Guid(source, bigEndian: true) when loading, it is a bug in your code and it is your responsibility to fix it. The same would be true if you were reading data from a PE Header file and did not appropriately use ReadInt32BigEndian when reading data from the linker header; or if you failed to do the same when processing or creating network packets, etc.

It is purely a consideration of serialization.
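To make the serialization point concrete, here is a minimal sketch that writes the same value in both byte orders and reads it back. It assumes .NET 8 or later (where the bigEndian overloads exist); the GuidEndiannessDemo class and its method names are made up for illustration and are not part of the proposal.

```csharp
using System;

// Hypothetical helper for illustration. Only Guid.TryWriteBytes(..., bigEndian, ...)
// and the Guid(ReadOnlySpan<byte>, bool) constructor are real BCL APIs (.NET 8+).
public static class GuidEndiannessDemo
{
    // Serializes the value in the requested byte order and returns it as upper-case hex.
    public static string ToHex(Guid value, bool bigEndian)
    {
        Span<byte> buffer = stackalloc byte[16];
        if (!value.TryWriteBytes(buffer, bigEndian, out int written) || written != 16)
        {
            throw new InvalidOperationException("Destination buffer was too small.");
        }
        return Convert.ToHexString(buffer);
    }

    // Restores the value. The caller must pass the same byte order that was used when writing.
    public static Guid FromHex(string hex, bool bigEndian)
    {
        byte[] bytes = Convert.FromHexString(hex);
        return new Guid(bytes, bigEndian);
    }
}
```

For f81d4fae-7dec-11d0-a765-00a0c91e6bf6 the big-endian form is F81D4FAE7DEC11D0A76500A0C91E6BF6 and the little-endian form is AE4F1DF8EC7DD011A76500A0C91E6BF6. Both round-trip to the same Guid as long as the same flag is used on both sides; mixing the flags yields a different value.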

vanbukin commented 5 months ago

Okay. Let's imagine a situation where such an API was added. Let's go through a User Story.

I, as an ordinary .NET developer, install the .NET 9 SDK, see that a new method NewGuidv7() has appeared in Guid. I go to read about what a Guidv7 is, find the specification. I start using it with Microsoft SQL Server. I use it as a primary key or in columns with a unique index. And I get avg_fragmentation_in_percent = 99.

And my friend, who uses this API for generating Guids and writing them to PostgreSQL, where values are stored in a column of the uuid type - everything is perfect. No fragmentation, excellent insertion.

At the same time, we both use specialized data types (as required by the specification) and pass the Guid as a query parameter without any preliminary conversion. Both of us have code like this:

await using var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO someTable (id, payload) VALUES (@id, @payload);";
cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());
cmd.Parameters.AddWithValue("payload", payload);
await cmd.ExecuteNonQueryAsync();

So it turns out that when I use Microsoft technology (.NET) with a recently added API in combination with a database driver developed by Microsoft and an RDBMS developed by Microsoft - I don't get the absence of fragmentation.

But when I use an OpenSource database (PostgreSQL) with an OpenSource driver (Npgsql) for this database, I do get it.

tannergooding commented 5 months ago

The problem would be caused by failure to properly serialize the data as big endian and therefore not storing the UUIDv7 that was generated, but rather a different GUID instead.

It is not an issue with the Guid type nor an issue with how the underlying v7 UUID is generated or stored at runtime. It is solely an issue of the developer who failed to serialize/deserialize the data in the format expected by PostgreSQL.

vanbukin commented 5 months ago

You seem like you're not reading what I'm writing. In PostgreSQL everything is just fine. Everything goes to hell in combination with Microsoft SQL Server.

vanbukin commented 5 months ago

If I had a separate data type for Uuid, I could oblige the Microsoft SQL Server driver team to do automatic binary representation conversion at the driver level when writing and reading, so that it would be written into uniqueidentifier with the understanding that the value in Uuid is in big-endian, while the database sorting order is different. And implement the transformation like:

dst[0] = src[12]; 
dst[1] = src[13]; 
dst[2] = src[14]; 
dst[3] = src[15]; 
dst[4] = src[10]; 
dst[5] = src[11]; 
dst[6] = src[8]; 
dst[7] = src[9]; 
dst[8] = src[6]; 
dst[9] = src[7]; 
dst[10] = src[0]; 
dst[11] = src[1]; 
dst[12] = src[2]; 
dst[13] = src[3]; 
dst[14] = src[4]; 
dst[15] = src[5];

at the driver level.
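As a rough sketch, the sixteen assignments above can be collapsed into a lookup table. This only restates the mapping from the comment, assuming src holds the 16 octets in RFC (big-endian) order as described there. SqlServerByteOrderSketch and Reshuffle are hypothetical names, and nothing here claims to match what any real driver does.

```csharp
using System;

// Hypothetical helper; it only wraps the index mapping listed above.
public static class SqlServerByteOrderSketch
{
    // dst[i] = src[Map[i]], copied one-to-one from the assignments above.
    private static readonly int[] Map =
        { 12, 13, 14, 15, 10, 11, 8, 9, 6, 7, 0, 1, 2, 3, 4, 5 };

    public static byte[] Reshuffle(ReadOnlySpan<byte> src)
    {
        if (src.Length != 16)
        {
            throw new ArgumentException("Expected exactly 16 bytes.", nameof(src));
        }

        var dst = new byte[16];
        for (int i = 0; i < dst.Length; i++)
        {
            dst[i] = src[Map[i]];
        }
        return dst;
    }
}
```

With this mapping the six timestamp octets (src[0] through src[5]) land in dst[10] through dst[15], the group the comment says the database treats as most significant.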

Likewise, the MySQL and PostgreSQL driver teams could easily adapt such a type. But instead, we're being suggested to introduce even more workarounds, like what happened with bigEndian: true.

tannergooding commented 5 months ago

The same statement holds true in the inverse, I had merely misunderstood which of the two had the problem.

It is fundamentally the fault of the developer for not serializing/deserializing in the format expected by the database. If the database expects little endian format, you must use TryWriteBytes(destination, bigEndian: false). If the database expects big endian format, you must use TryWriteBytes(destination, bigEndian: true).

The actual underlying storage format used by the Guid struct doesn't matter. We could choose to change it to be ulong _lower; ulong _upper, we could choose to change it to be fixed byte _data[16], we could choose to change it to be UInt128 _value, etc. It is never safe to assume the underlying data structure and there is already no guarantee the raw bytes are in little-endian order, as the data will be stored (internally) in big-endian order on a big-endian machine (such as the IBM z9). The internal storage format is how it is today because that was the most convenient throughout the history of .NET.

vanbukin commented 5 months ago

The problem isn't about the binary representation.

It's about how the existing ecosystem of RDBMS drivers works with the Guid data type. Because developers feed Guid into the driver. They do this either directly (through ADO) or indirectly (through Dapper, EF Core, or any other ORM, which in turn passes the value to ADO driver without changes).

This already exists and is already "somehow" working. And you can't change it without breaking a huge amount of code.

Due to how it works with Guid - different database drivers require different workarounds:

PostgreSQL does not need workarounds.
MySQL requires a certain parameter in the connection string.
Microsoft SQL Server, for uniqueidentifier, needs to generate a Guid and reshuffle its contents in such a way that it has the byte order in which the database itself sorts data internally.

Therefore, the presence of such an API might create a misconception in the minds of developers. That it's a silver bullet that allows you to generate IDs on the client-side and write them to the database without fragmenting indices. And this is indeed the case, but you can't just call the new API and feed the generated Guid into the driver. You need to know which database you're working with and what workarounds are needed for its driver.

With the current proposed implementation, this will be a feature for PostgreSQL and MySQL (remember about the parameter). But those who use Microsoft SQL Server need to know that their database and its driver require mandatory reshuffling to get a "Uuidv7 that doesn't ruin indexes" (remember it needs 2 reshuffles, one to compensate for ordering at the database level, another to compensate for using the COM-intended Guid API at the database driver level. Yes, they can be collapsed into one, but anyway). Because the Guid that the new API will generate can't be provided to the driver in its unchanged form - there will be index fragmentation.

vanbukin commented 5 months ago

@tannergooding

I expect synergy between Microsoft products. With the current state of affairs, there will be none.

If we go down the path of problem-solving, there are two ways. Either introduce a new data type or fix the Microsoft SQL Server driver. We will put aside the first option for reasons we both know. The second can be implemented in two ways - either through a breaking change in the driver, or a feature toggle. A breaking change is not an option, so let's consider the second one. It can be implemented in different ways - through environment variables or connection string parameters. And it seems that in this case, it becomes a problem of the database driver.

But if you add the proposed API BEFORE the Microsoft SQL Server driver has support for "alternative Guid handling mode", it turns out that you will roll out a feature that does not synergize with your own database.

As a developer, I expect that such a line of code:

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());

will generate identifiers for me that, when written to a database into a column of the uniqueidentifier type, will have an index that won't be fragmented (as stated in the specification). And it's the responsibility of the ADO driver to do something with the Guid to achieve such an effect.

Therefore, I suggest that you, as the author, create an issue in the Microsoft SQL Server driver repository, and discuss the possibility of adding support for "native Uuid" at the database driver level through some alternative mode toggle mechanism, or in some other way.

tannergooding commented 5 months ago

But if you add the proposed API BEFORE

Developers already can create a valid v7 UUID using System.Guid in many different ways, including parsing a string, passing in the individual fields to the constructor, or by reading a byte sequence (big or little endian) that some other piece of code created.

This API proposal changes nothing with that regard, it simply gives developers an easier way to generate a v7 UUID that will serialize as expected if stored using bigEndian: true or if converting to a string using ToString.

Therefore, I suggest that you, as the author, create an issue in the Microsoft SQL Server driver repository, and discuss the possibility of adding support for "native Uuid" at the database driver level through some alternative mode toggle mechanism, or in some other way.

.NET is producing correct UUIDs that serialize as expected when using the relevant APIs such as ToString (in which case there are multiple ways to separate the bytes but they are always in big-endian format) or TryWriteBytes (in which case you decide if they should be serialized in bigEndian or littleEndian format). It is up to downstream components to consume System.Guid correctly using these APIs, just as they would be required to consume int, long, or Int128 (all of which have the same general endianness considerations).

If there is a scenario that you believe is not covered by the downstream component, then you as the interested party should be the one to file the feature request and to correctly articulate the problem space you believe exists and to optionally provide input as to how you believe it should be resolved.

Wraith2 commented 5 months ago

@vanbukin Have you filed any issues for your complaints about the Microsoft.Data.SqlClient performance? I regularly contribute performance improvements but if no-one tells me about them how would I know what needs improving to help your codebase?

vanbukin commented 5 months ago

@Wraith2 There's no problem with the driver's performance itself. The driver simply takes Guid as input, does something with it, and forms bytes, which are sent to Microsoft SQL Server over the TDS protocol. The problem is that when a Guidv7 formed through the proposed API gets into a database in a column of the uniqueidentifier type - there will not be a monotonically increasing sequence. I conducted a small study, made measurements, plotted graphs. You can read more about it here.

To make Guid.NewGuidv7() produce a sequence that would be optimal for index building specifically when working with Microsoft SQL Server, it's necessary to reshuffle the bytes. And there are two reasons to do this. 1) The order in which Microsoft SQL Server sorts uniqueidentifier (and consequently builds indexes, which is clearly seen by the dispersion of values, the fragmentation index indication, and insertion slowdown) 2) The order in which bytes are located in Guid (because the driver uses ToByteArray() and absolutely disregards bigEndian: true and something else there). If we do not compensate for this with reshuffling, then when writing Guidv7 to uniqueidentifier, we will get complete index fragmentation. And Uuidv7, described in the RFC, was basically created for indexing optimization.

The irony of the situation is that Microsoft develops both .NET itself, the proposed API, the driver, and the database. However, a construct like

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());

will write data, the index of which will be completely fragmented right from the start.
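The ordering difference can be reproduced without a database. The sketch below compares two hand-written, v7-shaped values with System.Guid and with System.Data.SqlTypes.SqlGuid, which is documented as following SQL Server comparison semantics. Treating SqlGuid as a stand-in for how the server orders a uniqueidentifier index is an assumption, and SortOrderDemo is a hypothetical name.

```csharp
using System;
using System.Data.SqlTypes;

// Hypothetical demo type; Guid.CompareTo and SqlGuid.CompareTo are real BCL APIs.
public static class SortOrderDemo
{
    // Two v7-shaped values whose 48-bit timestamps are 0x000000000001 and 0x000000000100.
    public static readonly Guid Earlier = Guid.Parse("00000000-0001-7000-8000-000000000000");
    public static readonly Guid Later = Guid.Parse("00000000-0100-7000-8000-000000000000");

    // Guid compares in the same order as the canonical string form.
    public static int CompareAsGuid() => Math.Sign(Earlier.CompareTo(Later));

    // SqlGuid treats the last six bytes as most significant and works on the
    // little-endian byte layout, so the timestamp bytes are compared late and swapped.
    public static int CompareAsSqlGuid() =>
        Math.Sign(new SqlGuid(Earlier).CompareTo(new SqlGuid(Later)));
}
```

On a little-endian machine CompareAsGuid() returns -1 while CompareAsSqlGuid() returns 1, so the value with the earlier timestamp sorts after the later one under the SQL ordering.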

vanbukin commented 5 months ago

Developers already can create a valid v7 UUID using System.Guid in many different ways, including parsing a string, passing in the individual fields to the constructor, or by reading a byte sequence (big or little endian) that some other piece of code created.

That's right. And the BCL can only provide one single way to do this, in order to meet the RFC. Yet, this option will not work with your own products right now. It's absurd.

This API proposal changes nothing with that regard, it simply gives developers an easier way to generate a v7 UUID that will serialize as expected if stored using bigEndian: true or if converting to a string using ToString.

The driver developed by Microsoft for its own database is not doing this right now. And it doesn't even have any options to change this behavior. You propose to create an API that won't work as expected. The very generation of Uuidv7, without regard to how it will be indexed by the database, is devoid of meaning. Because Uuidv7 was created to be optimized when inserted into the database. It doesn't matter who's to blame - the API of Guid, the driver developers, the creators of the TDS protocol, or the database developers. As a consumer, what matters to me is that it doesn't work as it should.

And this will only happen when I use the combination of Microsoft .NET together with Microsoft SQL Server. However, if I take PostgreSQL and its driver, I will have no problems.

So what am I paying for?

vanbukin commented 5 months ago

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());

This line of code will give me proper indexes in PostgreSQL. With the correct setup through connection string parameters, it will provide me with normal indexes in MySQL. In Microsoft SQL Server, it will NEVER give me normal indexes.

I have to write my own function to rearrange the internals of the Guid and call it before each insertion.

Yes, you will be following the RFC.

But there is zero synergy between your products.

tannergooding commented 5 months ago

And the BCL can only provide one single way to do this, in order to meet the RFC

It's not just to meet the RFC, it's the only valid way to implement the functionality. Anything else would not produce a UUIDv7 value.

You propose to create an API that won't work as expected.

The API will work exactly as expected. It will produce a correct UUIDv7 value that correctly serializes, deserializes, compares, etc. It is no different than any other API we expose on Guid today and will continue having the same existing correct behavior and handling out of the box.

Some library x consuming a type t from another library y in a particular way does not make t correct or incorrect. t is correct on its own regard and it would be the downstream library with the issue if it were consuming it incorrectly. -- That's also a very big if because there are many reasons why a library may decide to consume a type in a different manner, including for historical backwards compatibility requirements.

And it doesn't even have any options to change this behavior.

This would then be something to raise with the SQL Server team, which is fully external to the .NET Libraries team and has their own management, customers (including beyond .NET), developers, back-compat bars, API review process, and other considerations.

There is nothing for the BCL to do here as our handling of Guid is entirely correct. There is then accordingly no need for a new type, again because our own existing type is entirely correct and well-behaved. We will not limit the growth of our own already correct types and we will continue exposing additional convenience APIs that also behave correctly. It is the responsibility of any downstream consumers to then continue handling it correctly and to provide documentation, analyzers, new overloads, or custom types if that is necessary for their own domain. -- This is ultimately what is best for the ecosystem as a whole, which extends far beyond just the consideration of how a type may be consumed by a single downstream library, regardless of who produces it.

For example, if a downstream consumer expects the bytes to be serialized in big-endian format, it is their responsibility to serialize them that way (ideally using the official APIs we provide to make that simpler). If they cannot serialize them that way such as due to a long-standing backwards compatibility requirement, then it should be documented at a minimum. There may then also be appropriate consideration of an analyzer to flag the situation to users and the consideration of new APIs that can help users achieve correct results. Regardless of whether they expect the data to be stored as LE or BE format, they should then ensure any deserialization handles it the same way (if you serialize as BE, you deserialize as BE; if you serialize as LE, you deserialize as LE). The same consideration exists for deserialization in that if they cannot handle it correctly, such as due to a need for backwards compatibility, then it should be documented with potential for them to expose new analyzers or APIs (methods, enums members, etc) that allow users to more easily get the correct behavior.

osexpert commented 5 months ago

This would then be something to raise with the SQL Server team

I agree. This is not the dotnet runtime's fault. Please suggest it here: https://aka.ms/sqlfeedback IMO: MS SQL should have supported a new uuid type (one that sorts UUIDv7 correctly) in addition to uniqueidentifier.

aloraman commented 5 months ago

Variant and Version properties

This proposal adds two properties to System.Guid struct: Variant and Version - what's the expected behavior?

Strictly speaking, the aforementioned v1, v4, v5, v7 versions are Uuid versions defined specifically for Uuid Variant 1; in other words, Version data is entangled with Variant data, so to speak. Also, there's always a possibility to initialize a Guid struct with 128 raw bits of random data, so Version and Variant could contain non-conforming values.

If the intention is to treat these values in a smart way - then contracts for underlying code should be formalized (i.e. what combinations of variant/version are supported, what return values signify an error, et cetera). If, on the contrary, the intention is to treat them in a simple way, i.e. just read raw data from the structure with bitmasking - then there's a chance to read outright garbage, or misleading data (e.g. get_Version returns 4, but it's not really version 4 - because get_Variant returns 3). Also, nil Guid.Empty and proposed Guid.Max will produce unexpected results.

IMHO, the second case describes an advanced-use scenario, akin to an extraction of mantissa and exponent from a floating-point number - and therefore should be accessed not from the structure itself, but from a satellite helper class, e.g.

public static partial class GuidUtilities // or part of BytePrimitives?
{
   public static int ReadVariant(in Guid value);
   public static int ReadVersion(in Guid value);
   // maybe even
   public static DateTime ReadV7Timestamp(in Guid value); 
}

Version 7

The entire purpose of UUIDv7's existence is to have natural time-based ordering (achieved by embedding Unix timestamp data) that is also in sync with byte ordering (as opposed to UUIDv1). I would, at least, expect natural time-ordering to be observed with values produced by Guid.NewGuidv7() as well (IComparer<Guid>.Default should sort these guids according to timestamp values).

Then, there is a separate issue of SqlServer. It's already The Lament Configuration of very specific kind of pleasure to deal with index degradation and RGI-ordering tricks. And with the addition of Guid.NewGuidv7() (and the possibility to do cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());), it will be even more broken, even hopeless without an escalation by a "first-party customer".

Wraith2 commented 5 months ago

How are you going to store the version information without increasing the size of the guid struct? You can't change the size without affecting a massive amount of software. You can't derive from them because they're structs and thus sealed.

Wraith2 commented 5 months ago

In SqlClient we could quite easily define our own uuidv7 type that contains a guid and then add support for that type into GetFieldValue etc doing reordering on the way in and out. That support could then be surfaced through ef if they wanted.

What you're not going to be able to do is disrupt the entire established .net ecosystem by changing how the current guid struct works.

tannergooding commented 5 months ago

This proposal adds two properties to System.Guid struct: Variant and Version - what's the expected behavior?

To read and return the values of the bits from the represented value that are documented to contain these fields, plain and simple. That's all the behavior one could ever do with these properties given an arbitrary UUID.

Yes, you can initialize with random data and yes you can fill it with data that is nonsensical in the face of the underlying RFC. But that is exactly the same experience you'd have if you ever tried to read these bits regardless.

The spec itself versions over time and includes explicit callouts that Nil and Max define what are currently otherwise reserved version/variant definitions. It is therefore intentional that it is only reading the raw bits and letting users determine how to handle things from there.

The version/variant do not change the value represented, they do not change the handling or general processing of the type for the purposes of sorting, comparison, or serialization/deserialization. They simply imply a potential way you can interpret and extract additional information out of the raw bytes once you have serialized them.

I would, at least, expect natural time-ordering to be observed with values, produced by Guid.NewGuidv7(), as well ( IComparer.Default should sort these guids according to timestamp values).

The use of DateTime.UtcNow and normalization to a Unix epoch based timestamp will already ensure that NewGuidv7 will be naturally ordered. The same applies to the general sorting of these values since they represent the most significant bits of the represented value, there is no additional handling required and it is by default correct.
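A small sketch of why the ordering falls out naturally: the Unix millisecond timestamp occupies the first 48 bits of the represented value, so the ordinary Guid comparison sees it first. This is not the proposed implementation, only an illustration under stated assumptions (.NET 8 or later for the bigEndian constructor, random fill for all remaining bits); Uuid7Sketch.Create is a hypothetical name.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical sketch, not the BCL implementation.
public static class Uuid7Sketch
{
    public static Guid Create(DateTimeOffset timestamp)
    {
        long ms = timestamp.ToUnixTimeMilliseconds();
        if (ms < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(timestamp), "Timestamp precedes the Unix epoch.");
        }

        // Start from 16 random bytes, then overwrite the structured fields.
        Span<byte> bytes = stackalloc byte[16];
        RandomNumberGenerator.Fill(bytes);

        // Octets 0-5: 48-bit big-endian Unix timestamp in milliseconds.
        bytes[0] = (byte)(ms >> 40);
        bytes[1] = (byte)(ms >> 32);
        bytes[2] = (byte)(ms >> 24);
        bytes[3] = (byte)(ms >> 16);
        bytes[4] = (byte)(ms >> 8);
        bytes[5] = (byte)ms;

        // Octet 6, high nibble: version 7. Octet 8, top two bits: variant 0b10.
        bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

        // Interpret the buffer as the big-endian (RFC) layout.
        return new Guid(bytes, bigEndian: true);
    }
}
```

Two values created for different milliseconds compare in timestamp order with Guid.CompareTo. Values created within the same millisecond differ only in their random bits here, so this sketch makes no ordering promise for them.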

There is no additional handling required and no unexpected behavior here, because the general handling of System.Guid is already RFC compliant; it already considers the value in terms of the underlying 128-bit value represented by the string on all platforms regardless of endianness. It is deterministic, just as comparisons, equality, and sorting are for regular integers.

Then, there is a separate issue of SqlServer.

This proposal changes nothing with regards to downstream handling of System.Guid. They remain in exactly the same state as they always have and that would have already been handling the data for a manually initialized Guid containing the same values. They have absolutely zero impact on the decision to version and improve our already correct implementation with more correct APIs.

Again, if SqlClient has a particular quirk in its handling of System.Guid then this changes nothing with that regard and you can continue applying the same workarounds or general fixups to ensure the data fits any SqlClient needs explicitly, there is no change to that. The same applies to any and all downstream consumers of the type.

System.Guid stands correct as implemented and there is nothing to change or fix on the end of the BCL.

bartonjs commented 5 months ago

Video

NewGuidv7(DateTime) => CreateVersion7(DateTimeOffset)
Guid.Max => Guid.AllBitsSet
We discussed adding a CreateRandom, but the general consensus was that the alias was unnecessary/confusing so long as we didn't have NewGuid() paired with NewGuidV7().
We discussed leaving out the parameterless overload (always requiring the timestamp), but since 99% of callers of V7 create want DateTimeOffset.UtcNow, we should just give that ease of use.

namespace System;

public partial struct Guid
{
    public static Guid AllBitsSet { get; }

    public int Variant { get; }
    public int Version { get; }

    public static Guid CreateVersion7();
    public static Guid CreateVersion7(DateTimeOffset timestamp);
}

terrajobst commented 5 months ago

I'd like to explicitly point out that we felt we didn't want to create a new API that shares the NewGuid prefix because we thought that people will think that NewGuidV7 sounds better than NewGuid which in fact isn't the case. That's the lesson we learned from SHA3.

It's unfortunate that the underlying RFC uses version numbers to describe different ways to construct the GUID/UUID. At the same time, we thought that coming up with names for those formats would do the community a disservice as the people that care likely knew the format under the designator from the spec, which uses version numbers.

The conclusion was that we leave NewGuid as the well-established pattern to create a unique GUID (v4, using a proper random function) and have a separate set of APIs for version specific creation. For example, we could create CreateVersion1-6, if they add value. So far the conclusion was that those would not.

LeaFrock commented 5 months ago

A weak suggestion, how about adding a new struct VersionalGuid.

    public partial struct VersionalGuid
    {
        private const int DefaultVersion = 7;

        public static VersionalGuid AllBitsSet { get; }

        public int Variant { get; }
        public int Version { get; }

        public static VersionalGuid NewVersionalGuid(int version = DefaultVersion);
        public static VersionalGuid NewVersionalGuid(DateTimeOffset timestamp, int version = DefaultVersion);

        public Guid ToGuid();
        // OR...
        // public static explicit operator Guid(VersionalGuid guid);
    }

You may argue that,

it would make more code duplication

Yes, but it avoids confusing and ambiguous concepts. I see that people want to use Guid with UUIDv7 but are also required to understand that it's not the same as the current one (v4). A new type naturally clears up the misunderstandings. More versions (maybe v8, v9... one day) would fall under the scope of the new type, leaving the current Guid as it is.

how to keep it compatible to `Guid`

Similar to the relation between DateTime and DateOnly/TimeOnly, first of all we can provide APIs which parse each other if possible.

Then, upper-level libraries like EF Core just need to adopt a new type instead of upgrading existing code, which avoids making things more complicated and raising the risk of bugs.

it grows the code size of BCL

Yes, it's an unsolvable side-effect.

tannergooding commented 5 months ago

System.Guid is not a v4 UUID, it is simply a UUID (one that uses the alternative name GUID, which is one of several and is explicitly referred to as an alternative name by the underlying UUID RFC). The "version" is determined by the value of the most significant 4 bits of octet 6 (bits 48 through 51, if bit 0 is the most significant bit of the represented value), much as the "variant" is determined by the value of the 4 most significant bits of octet 8 (bits 64 through 67).
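The bit positions described above can be read with nothing more than a shift. The sketch below assumes .NET 8 or later for ToByteArray(bigEndian: true); GuidBitsSketch, ReadVersion and ReadVariant are hypothetical names used only for illustration, and the code simply returns the raw bits without validating them.

```csharp
using System;

// Hypothetical helper for illustration; it reads raw bits and does no validation.
public static class GuidBitsSketch
{
    // Version: the 4 most significant bits of octet 6 in the big-endian layout.
    public static int ReadVersion(Guid value) => value.ToByteArray(bigEndian: true)[6] >> 4;

    // Variant: the 4 most significant bits of octet 8 in the big-endian layout.
    public static int ReadVariant(Guid value) => value.ToByteArray(bigEndian: true)[8] >> 4;
}
```

For f81d4fae-7dec-11d0-a765-00a0c91e6bf6 this yields version 1 and variant 0xA (binary 1010, i.e. the 0b10xx pattern). Guid.Empty yields 0 and 0, and an all-ones value yields 15 and 15, which is the "raw bits, caller decides" behavior described in this thread.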

It's a somewhat unfortunate piece of terminology from the underlying RFC. There will, under no circumstances, be a new type to represent a Guid or Uuid provided by the BCL, this had already been discussed and considered at length. The existing type is already correct and there is zero need.

The only point of discussion has been that some users will be confused by the notion of "Version" and will believe that a higher version equates to better functionality. Given that the existing NewGuid() (which itself already has an unfortunate name that is largely inconsistent with the naming the rest of .NET uses, and which users confuse with new Guid()) specifically generates a UUIDv4 and is used by developers wanting 122-bit random guids, generated using the crypto APIs to ensure it is robust/secure, we wanted to avoid any confusion that NewGuidVersion7() was somehow better than or a replacement for NewGuid().

Instead, we opted to make it named Create (following the more typical .NET pattern for static factory APIs) and then suffixed with the information relevant to the actual RFC (that it will set the version bits to 7, the variant bits to 0b10xx, and seed the remaining 122-bits as per the UUIDv7 requirements).

This exactly follows the intended behavior, removes the potential ambiguity, gives room for future growth, and continues expanding our already correct type with additional correct APIs that in no way change the existing semantics or meaning of the Guid type. It was RFC compliant prior to this proposal and it will remain RFC compliant after the proposal.

iSazonov commented 5 months ago

public int Variant { get; }

This hints that this type may in the future serve Variant-s other than 0b10. Yes? If yes, should the new names (Create*) reflect this fact? If no, should the property be removed?

osexpert commented 5 months ago


        public static Guid AllBitsSet { get; }

Since the existing one is called Empty, would it not be consistent if the opposite was called Full? :-)

terrajobst commented 5 months ago

@osexpert

Since the existing one is called Empty, would it not be consistent if the opposite was called Full? :-)

We already introduced AllBitsSet in other APIs. Seems more descriptive anyway.

@LeaFrock

A weak suggestion, how about adding a new struct VersionalGuid.

Types aren't free; there is a concept count. Conversion methods only go so far because you need to call them. If you use those types in code that needs to be understood by other systems (such as serializers, O/R mappers, etc.) you typically want the other side to handle those types directly because otherwise you need to create a mapping model just to change some types. That gets clunky fast.

And for Guid it doesn't seem warranted. It seems there is some disagreement on the name (i.e. that GUID is very Microsoft centric and that UUID is the industry term). However, that ship sailed in 2002 when .NET Framework first shipped. Naming alone isn't a strong enough reason to add new core types. The binary format of GUID and UUID is the same. In fact, the cited RFC states that they are the same concept.

What differs here is the way we construct the value, not the type for the value. To me, the best way to model is having new methods to create them.

KennethHoff commented 5 months ago

What about renaming NewGuid to CreateVersion4 (which, for backwards compatibility reasons, means obsoleting NewGuid, marking it EditorBrowsable(Never), etc.)?

Obsoleting methods that actually work as expected might be unorthodox, but if the unfortunate naming is a problem, then I think it should be considered, at least if v5, v6, etc. are ever added.

It's definitely problematic that it's used a lot though (the existing Guid.NewGuid API that is).

Wraith2 commented 5 months ago

If you watch the API review video you'll find that a lot of the alternatives were discussed, and you'll understand why they were rejected.

terrajobst commented 5 months ago

@KennethHoff

What about renaming NewGuid to CreateVersion4

I thought I answered that here.

danielmarbach commented 5 months ago

Any chance this would also get TimeProvider support or is it always the responsibility of the caller?

stephentoub commented 5 months ago

Any chance this would also get TimeProvider support or is it always the responsibility of the caller?

The caller would just do Guid.CreateVersion7(tp.GetUtcNow()) if that's what they wanted to use.
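As a rough sketch of that pattern, assuming .NET 9 (where `Guid.CreateVersion7(DateTimeOffset)` is available): the `FixedTimeProvider` and `Version7Factory` names below are hypothetical and exist only to show how a caller could route any `TimeProvider` into the new API, for example a fixed clock in tests.

```csharp
using System;

// Hypothetical fixed clock, e.g. for deterministic tests.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    // Always report the same instant instead of the system clock.
    public override DateTimeOffset GetUtcNow() => _now;
}

// Hypothetical wrapper: the caller supplies the clock, the BCL builds the value.
public static class Version7Factory
{
    public static Guid Create(TimeProvider timeProvider) =>
        Guid.CreateVersion7(timeProvider.GetUtcNow());
}
```

Passing `TimeProvider.System` gives the same timestamp source as the parameterless overload would normally use, while a fixed provider pins the leading 48 timestamp bits.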

Timovzl commented 4 months ago

I'm late to the conversation, but have done a lot of research on this topic over the last couple of years.

One of my packages features a V7 UUID implementation (docs, core implementation) that may be worth a peek. Certain properties could be worth mimicking:

A significant random portion (75 bits), to help avoid collisions.
Intra-millisecond monotonicity and unpredictability by using significant pseudorandom increments (58 bits) on "reused" milliseconds.
Protection against backward clock adjustments of up to 1000 ms, to account for NTP and the like.
Use of the legacy variant "0x", which (A) frees up an extra bit for randomness (since the 65th bit is then allowed to take any value) and (B) allows a value to be stored as 2x (signed) BIGINT in SQL Server without introducing any negative values (thanks to the 64th bit being 0).
A carefully selected epoch that ensures that numeric representations (such as UInt128 or 2x UInt64) have the same number of digits for any ID generated between now and the year 4000, making their lexicographic order identical to their numeric order.
A carefully selected epoch that makes DECIMAL(38,0) a feasible numeric representation until at least the year 4000.
Irrelevant to the current proposal but useful to bear in mind: Easily converted between not just binary and hexadecimal but also numeric and alphanumeric (base62) representations, each with identical lexicographic order.

Ensuring monotonicity, unpredictability, and a reliable representability in as many sensible formats as possible has created a nice pit of success.

Please note that the implementation predates the RFC and is mainly inspired by this draft. I have not yet checked if the RFC imposes any constraints that are now violated, such as perhaps on the variant.

glen-84 commented 4 months ago

The RFC mentions under "6.4. Distributed UUID Generation":

Likewise, utilization of either method is not required for implementing UUID generation in distributed environments.

Do we know how many collisions are likely as the number of ID-generating nodes grows?

It might be useful to have an API for UUIDv8, to allow for customization (specifying a node ID, etc.).

(PS. I also think that this should have been a new type [Uuid]. Naming is important – it should match the RFC, as databases and other systems use and will use the term UUID.)

tannergooding commented 4 months ago

As per the top post (and the RFC):

v1 is largely considered deprecated and should be replaced with v7 where possible
v2 is for DCE security purposes and is outside the normal specification
v3 is largely considered deprecated and should be replaced with v5 where possible
v4 is for creating random UUIDs, it is supported already via Guid.NewGuid
v5 is for creating UUIDs from a string input, however due to using SHA-1 it is largely considered deprecated as well due to the potential for security based attacks
v6 is simply v1 with an alternative ordering to the bits, it is also largely considered deprecated and should be replaced with v7 where possible
v7 is what is being supported by this proposal via the new CreateVersion7 APIs
- There is some optional extended functionality that is not supported but which we can expand to support in the future
v8 is explicitly for experimental and vendor-specific use, there is no definition to the bits it contains aside from the version and variant fields
- This is indirectly supported via the normal new Guid(...) APIs which allow you to specify the value of all underlying bits

Also as per the top post and various other attached discussions, this is never going to be a new type. The RFC itself starts with the sentence:

This specification defines UUIDs (Universally Unique IDentifiers) -- also known as GUIDs (Globally Unique IDentifiers)

A GUID is a UUID, full stop. It is simply an alternative name that some domains have used, whether that be for historical reasons, domain specific reasons, etc. .NET has a 20 year history of using the term GUID and it is not going to change simply because some people don't prefer the name, that is a worst case scenario for the ecosystem and would hurt everyone more in the long term.

terrajobst commented 4 months ago

A GUID is a UUID, full stop.

Agreed. Also, global usings allow people to have their own names if it makes them happy:

global using Uuid = System.Guid;

glen-84 commented 4 months ago

The type name message was a postscript, the main purpose of my comment was regarding distributed applications.

I was curious to know whether it was considered "generally safe" to use v7 IDs in such scenarios, and at what point (number of nodes) would one expect to see a non-trivial number of collisions.

terrajobst commented 4 months ago

I was curious to know whether it was considered "generally safe" to use v7 IDs in such scenarios,

The RFC basically says "weigh your risks". Ultimately, neither version can guarantee uniqueness. In the context of distributed applications, I see collisions as a transient error; if you get a duplicate error, try again.
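A minimal sketch of the "try again" approach, assuming the store can report a duplicate at insert time. Here an in-memory `ISet<Guid>` stands in for a database unique constraint, and `UniqueIdAllocator` is a hypothetical name used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: treat a duplicate as a transient error and retry.
public static class UniqueIdAllocator
{
    public static Guid Allocate(ISet<Guid> existing, Func<Guid> generate, int maxAttempts = 3)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            Guid candidate = generate();

            // ISet<T>.Add returns false when the value is already present.
            if (existing.Add(candidate))
            {
                return candidate;
            }
        }

        throw new InvalidOperationException(
            $"No unique identifier found after {maxAttempts} attempts.");
    }
}
```

A caller would pass `Guid.NewGuid` or `Guid.CreateVersion7` as the `generate` delegate. In a real system the uniqueness check would be whatever duplicate-key error the backing store raises.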

Maybe @tannergooding has more details here.

tannergooding commented 4 months ago

Right, the "uniqueness" factor here is basically dependent on how you're initializing your UUID and how frequently you're doing so.

For v4 for example, there is no guarantee of uniqueness. There are 122-bits of random data and it is possible (although extremely unlikely) for two sequential NewGuid calls to produce a bitwise identical value. It is also possible for any two random calls to NewGuid to produce bitwise identical values with the odds of it occurring being extremely minimal and minimally increasing each subsequent invocation.

For something like v7 where some bits are not random but rather seeded based on some input state, the chance of collision differs. Assuming that you aren't changing the source timestamp provider, then there is a guarantee of no conflicts provided that you are calling CreateVersion7 no more than once per millisecond (at least for the next 10k years or so, but that's not something that really needs to be accounted for in practice). If you call it more frequently then there are 74 bits of randomly seeded data that try to help ensure uniqueness, but it has the same fundamental consideration as v4, which is that there is a possibility (although an extremely minimal one) that you can end up with a conflict.
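To put rough numbers on "extremely minimal", the standard birthday approximation can be used: for n values drawn uniformly from a space of 2^b, the chance of at least one collision is about 1 - exp(-n(n-1) / 2^(b+1)). The sketch below is not from the proposal; `CollisionEstimate` is a hypothetical helper, and it assumes the random bits are independent and uniformly distributed.

```csharp
using System;

// Hypothetical helper: birthday-bound estimate of a collision among
// `count` identifiers that each carry `randomBits` uniformly random bits.
public static class CollisionEstimate
{
    public static double Probability(double count, int randomBits)
    {
        double space = Math.Pow(2, randomBits);
        double exponent = count * (count - 1) / (2 * space);

        // For tiny exponents, 1 - exp(-x) is approximately x; returning x
        // directly avoids losing the result to floating-point rounding.
        return exponent < 1e-8 ? exponent : 1 - Math.Exp(-exponent);
    }
}
```

Under these assumptions, `Probability(n, 74)` models n v7 values created within the same millisecond (across all nodes combined), and `Probability(n, 122)` models n v4 values overall. Even a million v7 values landing in a single millisecond gives an estimate on the order of 10^-11.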

UUIDv7 has some options to encode certain bits with structured rather than random data, but these are not currently exposed in .NET 9. We may consider exposing overloads to provide this optional functionality in the future, but users can always manually achieve the same using new Guid(...) for the time being. These options include using up to 12 additional bits to track a sub-millisecond timestamp fraction (this is [0, 4095], and since 1 ms / 4096 is about 244 ns, it theoretically lets you track time at roughly 244 nanosecond resolution). The remaining 62 and up to the full 74 bits can then be optionally used to represent a Fixed Bit-Length Dedicated Counter -or- a Monotonic Random counter. The former simply increments by 1 for each UUID created within a given timestamp tick. The latter first randomly seeds the data per timestamp tick and then increments within that timestamp tick.
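As an illustration of doing this manually, here is a sketch that places a caller-supplied 12-bit counter in the bits immediately after the version field and fills the remaining 62 bits randomly. It assumes .NET 8 or later for the `Guid(ReadOnlySpan<byte>, bool bigEndian)` constructor. The `Version7WithCounter` name and the choice of a 12-bit counter are assumptions made for the example, and tracking the counter per millisecond is left to the caller.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: a v7-shaped value with a 12-bit counter after the version.
public static class Version7WithCounter
{
    public static Guid Create(DateTimeOffset timestamp, int counter)
    {
        if ((uint)counter > 0xFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(counter), "Counter must fit in 12 bits.");
        }

        long ms = timestamp.ToUnixTimeMilliseconds();
        byte[] bytes = new byte[16];

        // Octets 8..15 start out fully random (64 bits, 62 after the variant is set).
        RandomNumberGenerator.Fill(bytes.AsSpan(8));

        // Octets 0..5: 48-bit Unix timestamp in milliseconds, big-endian.
        bytes[0] = (byte)(ms >> 40);
        bytes[1] = (byte)(ms >> 32);
        bytes[2] = (byte)(ms >> 24);
        bytes[3] = (byte)(ms >> 16);
        bytes[4] = (byte)(ms >> 8);
        bytes[5] = (byte)ms;

        // Octets 6..7: version 7 in the top 4 bits, then the 12-bit counter.
        bytes[6] = (byte)(0x70 | (counter >> 8));
        bytes[7] = (byte)counter;

        // Octet 8: force the variant bits to 0b10xx.
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

        return new Guid(bytes, bigEndian: true);
    }
}
```

Because the counter sits directly after the timestamp, values created within the same millisecond sort in counter order when compared by their canonical octet order.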

These options can help guarantee uniqueness for a given UUID generator, but they are more advanced and were considered out of scope for the BCL to provide in .NET 9 given the limited timeframe before we lock down for RC1.

jodydonetti commented 4 months ago

Hi @tannergooding , thanks for sharing so much juicy info.

Once all the details are settled (are they already?) I think a blog post on devblogs with a recap of all of this would be great!
