Closed dariogriffo closed 2 months ago
Please note that SQL Server needs a different ordering for unique identifier. Their sequential GUIDs work better then UUIDv7. For example: https://github.com/nhibernate/nhibernate-core/blob/master/src/NHibernate/Id/GuidCombGenerator.cs
A quick benchmark is 98% fragmentation for UUIDv7 vs 44% for sequential GUID for 100k inserted records.
Please note that SQL Server needs a different ordering for unique identifier. Their sequential GUIDs work better then UUIDv7. For example: https://github.com/nhibernate/nhibernate-core/blob/master/src/NHibernate/Id/GuidCombGenerator.cs
A quick benchmark is 98% fragmentation for UUIDv7 vs 44% for sequential GUID for 100k inserted records.
Hi Robbie, yes, but that's SQL server specific that's why I didn't mention, yet still in proposal v7 hopefully can land into an official format. I guess until is approved this will remain inactive or closed... Let's see I opened now because seems like Microsoft takes time to implement new things refer to SHA3
The container proposed in this API Proposal could have been an excellent data type for storing Uuid v5 and v7, as well as any other versions, but it was closed. The alternative proposed option, in my opinion, does not solve the problems described here.
Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.
Author: | dariogriffo |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Runtime`, `untriaged` |
Milestone: | - |
does not solve the problems described here.
UUID v5 and v7 have to do with how new values are constructed. It has little to do with the storage mechanism of the underlying bits.
Once a correct sequence of bits has been generated, it can then be serialized as required by the consumer. For some scenarios, this will be BigEndian
format and for others LittleEndian
format and which exactly is "correct" depends on how the consumer will interpret a raw sequence of 16-bytes.
API Proposal
It should probably be called NewGuid
, rather than NewUuid
, to match the existing naming and general framework design guidelines conventions. As per the UUID
specification, GUID
is simply an alternative name:
This specification defines a Uniform Resource Name namespace for UUIDs (Universally Unique IDentifier), also known as GUIDs (Globally Unique IDentifier).
And as per the existing NewGuid
docs:
The method creates a Version 4 Universally Unique Identifier (UUID) as described in RFC 4122, Sec. 4.4
There will probably be some discussion as to whether they should be called NewGuid5
, NewGuidV5
, or NewGuidVersion5
. There may also be some discussion on whether this should simply be NewGuid(int version)
or something like NewGuid(GuidVersion version)
for some enum GuidVersion
.
Exposing version 5
is likely to be non-controversial as its already been formally reserved and defined by the RFC 4122
specification. Exposing methods for versions 1, 2, and 3
would probably be uncontroversial for similar reasons.
Exposing version 7
may get rejected on the basis that its not been formally ratified yet. The same would be true for versions 6 and 8
. I expect that it would become uncontroversial once ratification is complete. -- This is namely because we would no longer have the flexibility to be compliant if anything changed between shipping such an API and ratification finalizing.
There may likewise need to be discussion on if the methods need any parameterization. NewGuid()
itself is version 4
which is "random data". Other versions have different inputs or parameterization used to influence the generation. This is roughly:
version 1
is time based using 60-bits to represent the number of 100ns
intervals in UTC since 15 OCT 1582
version 2
is for DCE Securityversion 3
is from an MD5 hash over a given namespace valueversion 4
is 128-bits of random data, replacing 6 bits with fixed valuesversion 5
is a SHA-1 hash over a given namespace valueversion 6
is just version 1
with the bits stored in different positionsversion 7
is similar to v1/6
but uses 48-bits representing the number of milliseconds
in UTC since 1 JAN 1970
(Unix Epoch) + optional and otherwise random data for the remaining unused bitsversion 8
is supposed to be set aside for vendor specific use-case and experimentationAlso noting, with regards to the parameterization, the UUID spec covers:
While Appendix A details a few interesting namespaces; implementations SHOULD provide the ability to input a custom namespace. For example, any other UUID MAY be generated and used as the desired namespace input for a given application context to ensure all names created are unique within the newly created namespace.
version 7 is similar to v1/6 but uses 48-bits representing the number of milliseconds in UTC since 1 JAN 1970 (Unix Epoch) + optional and otherwise random data for the remaining unused bits
Does this affect what it means to serialize as big vs little endian? The six bytes comprising that value won't be reversed on their own and will thus change where they are in the resulting byte order.
The actual definition of Version 7
is:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_ts_ms |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_ts_ms | ver | rand_a |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var| rand_b |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand_b |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 11: UUIDv7 Field and Bit Layout
unix_ts_ms:
48 bit big-endian unsigned number of Unix epoch timestamp in
milliseconds as per Section 6.1.
ver:
4 bit UUIDv7 version set as per Section 4.2
rand_a:
12 bits pseudo-random data to provide uniqueness as per
Section 6.8 and/or optional constructs to guarantee additional
monotonicity as per Section 6.2.
var:
The 2 bit variant defined by Section 4.1.
rand_b:
The final 62 bits of pseudo-random data to provide uniqueness as
per Section 6.8 and/or an optional counter to guarantee additional
monotonicity as per Section 6.2.
And the general layout description in the spec has been simplified to:
5. UUID Layouts
To minimize confusion about bit assignments within octets and among
differing versions, the UUID record definition is provided as a
grouping of fields within bit layout consisting four octets to a row.
The fields are presented with the most significant one first.
Thus, the latest draft spec defaults to Variant 1
(0b10x
) and what is effectively "BigEndian" layout.
The Variant 2
(0b110
) layout that System.Guid
defaults to is simply uint, ushort, ushort, byte[8]
and thus its only the first 8-bytes that are effected by endianness and which differ, the last 8 bytes are always identical to Variant 1
.
So we'd serialize it correctly if the user specifies ToByteArray(bigEndian: true)
and deserialize it correctly given new Guid(array, isBigEndian: true)
. That is, the two layouts are
Where:
a: unix_ts_ms 48-bits
b: ver 4-bits
c: rand_a 12-bits
d: var 2-bits
e: rand_b 62-bits
Variant 1 (Big Endian)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a0..7 | a8..15 | a16..23 | a24..31 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a32..39 | a40..48 | b0..3 | c0..3 | c4..11 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d | e0..5 | e6..13 | e14..21 | e 21..29 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| e30..37 | e38..45 | e46..53 | e 54..61 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Variant 2 (Little Endian)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a24..31 | a16..23 | a8..15 | a0..7 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a40..48 | a32..39 | c4..11 | b0..3 | c0..3 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d | e0..5 | e6..13 | e14..21 | e 21..29 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| e30..37 | e38..45 | e46..53 | e 54..61 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
UUID v5 and v7 have to do with how new values are constructed. It has little to do with the storage mechanism of the underlying bits.
I will quote the proposal of the new uuid versions to answer. You can read the proposal here
So is not me who says that, is the reason to have a pseudo sequential generate random IDs.
One area UUIDs have gained popularity is as database keys. This stems from the increasingly distributed nature of modern applications. In such cases, "auto increment" schemes often used by databases do not work well, as the effort required to coordinate unique numeric identifiers across a network can easily become a burden. The fact that UUIDs can be used to create unique, reasonably short values in distributed systems without requiring synchronization makes them a good alternative, but UUID versions 1-5 lack certain other desirable characteristics:
If you are using UUIDs for purpose of DB indexing, I would recommend https://github.com/mareek/UUIDNext. It uses v7 for most DBs and a custom v8 low endian format for MSSQL.
The only thing missing is Guid type support for formatting LE UUIDs, which we will hopefully get soon.
The container proposed in this API Proposal could have been an excellent data type for storing Uuid v5 and v7, as well as any other versions, but it was closed. The alternative proposed option, in my opinion, does not solve the problems described here.
Am I missing something? The proposal (with bigEndian boolean) completely solves the issue, it allow us to add APIs mentioned in the original post like Guid.NewUuid7()
.
So we'd serialize it correctly if the user specifies ToByteArray(bigEndian: true) and deserialize it correctly given new Guid(array, isBigEndian: true)
I'm not concerned about whether we'd roundtrip it well; we could output any order we wanted and as long as we used the same order for deserialization it would roundtrip :)
My question is really:
Or asked another way, do the new APIs we've approved make sense to use with Guids created from the APIs proposed in this issue?
Or asked another way, do the new APIs we've approved make sense to use with Guids created from the APIs proposed in this issue?
Yes.
For isBigEndian: true
, it exactly matches the spec for these types. For isBigEndian: false
, it exactly matches the only other widely used and well-defined layout.
That is, the NewGuid()
methods simply create a valid new GUID
/UUID
of the format f81d4fae-7dec-11d0-a765-00a0c91e6bf6
. This can be visualized as a binary number 11111000000111010100111110101110011111011110110000010001110100001010011101100101000000001010000011001001000111100110101111110110
or a decimal number 329800735698586629295641978511506172918
. This is irrespective of the underlying field layout or storage mechanism (it's effectively just generating a new 128-bit value and replacing very specific bits).
There are then two well-defined layouts:
variant 1
, which is what's defined by the RFC spec and is functionally big endian (in a big endian setup, the types of fields don't matter)variant 2
, which is what System.Guid
uses as its actual field layoutvariant 0
and variant 3
are both reserved for future usevariant 2
is functionally defined as uint32, uint16, uint16, uint8[8]
and is generally compatible with the original UUID field layout and is what has been historically called little-endian
because every individual field is in little-endian format.
It would not make sense for us to define a new format where we have some uint48, uint16, uint64
or uint48, uint4, uint12, , uint2, uin62
because that would be incompatible with the two layouts everything currently expects.
As for the performance of database index, just inserting does not give the full picture. Try reading in the insertion order in the same performance testing workload. For example, add an indexed timestamp column, and query some recently inserted records. This way timestamp index seek would retrieve the keys (cheap operation), and then key lookup will end up in reading a separate page for each key. This would bring the database to it's knees.
Here is a test harness designed specifically for this problem: https://github.com/george-polevoy/uuid-load-tests/
so, will other systems understand this schema of reversing portions of unix_ts_ms in this way?
UUIDv7 is a very novel concept, all currently existing [^1] implementations use OSF DCE-compatible layout (i.e. Big Endian), and specifically for v7
it is essential to have a natural order based on timestamp value. Reversing portions of timestamp in a binary representation will break that assumption, so it makes total sense only in isBigEndian: true
case.
Note that RFC-4122 and recent IETF drafts only describe Variant 1
with Big Endian layout. System.Guid
implementaion is different for compatibility reasons and is explicitly out of scope of RFC-4122. Although, the MSFT "flavor" has little endian layout, RFC-4122 states that Variant 2
is reserved for "MSFT backward compatibility", and recent IETF draft specifically state "there is a known caveat that Microsoft's Component Object Model (COM) GUIDs leverage little-endian when saving GUIDs" - it does not actually mean MSFT flavor
= Little Endian
= Variant 2
First of all, Variant 2
is reserved for MSFT backward compatibility, but not all MSFT Guids are Variant 2
. This variant actually refers to very old type of GUID, from OLE/COM/DCOM era, way before RFC-4122. For example:
{00000000-0000-0000-C000-000000000046} // IUnknown
{00000001-0000-0000-C000-000000000046} // IClassFactory
{00000112-0000-0000-C000-000000000046} // IOleObject
{0000031b-0000-0000-C000-000000000046} // CLSID_ErrorObject
Note that the variant
field here has a value of 'C'
= 1100
, which means Variant 2
, as expected. Also, there is no valid version
field here, because version
is a feature exclusive to Variant 1
only.
Later, MSFT moved away from these variant 2
GUIDs to Variant 1 Version 1
(time/mac based), e.g.:
{B196B28F-BAB4-101A-B69C-00AA00341D07} // IClassFactory2
{99FCFEC4-5260-101B-BBCB-00AA0021347A} // IID_IObjectExporter
{4D9F4AB8-7D1C-11CF-861E-0020AF6E7C57} // IID_IActivation
Note that the variant
field has a pattern of 10xx
, which means Variant 1
. Also, now there's a version
field, with a value of 0001
, which means Version 1
.
And then, MSFT moved to Variant 1 Version 4
(RNG based), e.g.:
{4193A62E-F128-410D-9746-E57B5485903D} // Windows.UI.Internal.Input.IMouseCapture
{C0EFA91A-EEB7-41C7-97FA-F0ED645EFB24} // MsRDP.MsRDP.10
{214A1A2F-232B-4FEF-93B7-2F7D9AF053AD} // produced by System.Guid.NewGuid()
Note that the variant
field still has a pattern of 10xx
, which means Variant 1
. And the version
field has a value of 0100
, which means Version 4
.
However, even after the switch from Variant 2
to Variant 1
, MSFT implementation stayed being little-endian based.
So, considering the existence of such "MSFT flavor", I think the input of Windows dev team is desirable here. If v5
or v7
to be implemented in .NET, it should have the same "flavor" as v4
has now, i.e., be compatible with the implementation Windows would have, in case there will be one in the future.
[^1]: NEWSEQUENTIALID in SqlServer serves the same purpose, but it's different implementation.
The internal field layout used by a given runtime (for it's GUID
/UUID
type) has very little to do with the actual data being tracked. It only matters for scenarios where you are unsafely accessing the raw bytes of the struct, such as via pointers, other forms of reinterpret like casting, or in the case you are implementing official serialization/deserialization APIs for the type.
.NET's internal field layout is, due to historical reasons, compatible with the Microsoft GUID
type defined by the Win32 header files. This is formally defined as the following, which in .NET is uint, ushort, ushort, fixed byte[8]
:
typedef struct _GUID {
unsigned long Data1;
unsigned short Data2;
unsigned short Data3;
unsigned char Data4[8];
} GUID;
This definition has not fundamentally changed since its original introduction and its not going to change in the future either, because doing so would be massively breaking to the entire Windows ecosystem. Since these are just regular fields, the byte order observed via unsafe code matches the endianness of the host system. Since Windows does not currently support any big endian systems, the layout in practice appears to users as always being "little endian" and discussions often simplify it down to this, even if its not "technically correct".
This also means that the layout of these bytes is "functionally fixed". Given a GUID
/UUID
of 6B29FC40-CA47-1067-B31D-00DD010662DA
, Data1
is always equal to 0x6B29FC40
, Data2
is always equal to 0xCA47
, Data3
is always equal to 0x1067
, and Data4
is always equal to { 0xB3, 0x1D, 0x00, 0xDD, 0x01, 0x06, 0x62, 0xDA }
, regardless of the endianness of the underlying system. This then translates over to every single other representable value and so the layout is always well-defined for such a struct.
The NewGuid()
API that exists today generates a "spec compliant" 128-bit GUID
/UUID
following the variant 1, version 4
mechanism described by RFC 4122
. Which is to say it generates 122-bits of random data setting the variant
field to 0b10x
and the version
field to 0b0100
. The general format of this is xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
where M
represents the version
, N
represents the variant
, and x
represents "random data". Since the encoding of variant 1
in the N
location is itself described as 0b10xx
, we then have 4 baseline masks:
xxxxxxxx-xxxx-4xxx-8xxx-xxxxxxxxxxxx
xxxxxxxx-xxxx-4xxx-9xxx-xxxxxxxxxxxx
xxxxxxxx-xxxx-4xxx-Axxx-xxxxxxxxxxxx
xxxxxxxx-xxxx-4xxx-Bxxx-xxxxxxxxxxxx
This proposal is simply asking for additional functions which would generate variant 1, version 5
and variant 1, version 7
mechanisms described by the latest RFC 4122
draft specification. This would be generally the same as the above but with the M
location being 5
-or- 7
, rather than 4
. The x
locations would likewise be generated as per the specification rather than being "random data".
it does not actually mean MSFT flavor = Little Endian = Variant 2
The general discussion was being somewhat simplified as it basically boils down to, in practice, that Windows uses "little endian" and many other (but not all) systems use "big endian". In both cases, this is regardless of the variant/version.
Ultimately, users who need to serialize a Guid
as a sequence of 16-bytes should use the relevant API and pass in the appropriate true
or false
value to isBigEndian
depending on which byte layout they need. This will typically be true
for developers interacting with various Unix APIs or systems, some types of networking, and some databases. This will typically be false
for developers interacting with COM and various Windows APIs or systems. The "correct" value will be scenario dependent.
There is no need to discuss hypothetical alternative layouts which are extremely unlikely to be encountered in practice and the APIs being exposed are sufficient for what is effectively the two de-facto layouts users will actually encounter.
This proposal is simply asking for additional functions which would generate variant 1, version 5 and variant 1, version 7
I'm interested specifically in Variant 1 Version 7
, specifically, whether the fundamental property of v7
- time-based ordering will be preserved with current layout, and will it be compatible with changes from #86798
I.e., Consider {t1, t2, ... tN} is a sequence of timestamps, and {g1, g2, ... gN} is a sequence of v7
Guids produced at these timestamps. Now, if t1 < t2 < ... tN, will it also be true that g1 < g2 < ... < gN. In code, in string serialized form, in little-endian binary serialized form (e.g. in SqlServer), and in big-endian serialized form (e.g. in PostgreSql)
Endianness makes no difference as to the value.
Just as the 32-bit unsigned integer 0x0000_0001
is always 1
regardless of big-endian ({ 0x00, 0x00, 0x00, 0x01 }
) or little-endian ({ 0x01, 0x00, 0x00, 0x00 }
) and therefore always compares less than 0x0000_0002
.
The same is true that 6B29FC40-CA47-1067-B31D-00DD010662DA
is always this exact GUID, regardless of how its stored. It is compared as if it were a 128-bit unsigned integer 0x6B29FC40_CA471067_B31D00DD_010662DA
and therefore will compare less than 0x6B29FC41_CA471067_B31D00DD_010662DA
(6B29FC41-CA47-1067-B31D-00DD010662DA
) and so on.
In the proposal: public static Guid NewUuid5() But v5 is namespace based Guid's. Signature for Guid v5 would be something like: public static Guid NewUuid5(Guid namespace, string name)
But v5 is namespace based...
Yes, you're right, it is namespace based - so the signature would be different. However, it's also the reason it probably won't be implemented in BCL. Namespace based UUIDs (v3 and v5) use MD5 and SHA-1 hash functions to construct an actual binary value. These hash functions are considered vulnerable, so MSFT has an outright ban on any new usage of MD5/SHA-1, even in non-cryptographic context.
It is compared as if it were a 128-bit unsigned integer
0x6B29FC40_CA471067_B31D00DD_010662DA
and therefore will compare less than0x6B29FC41_CA471067_B31D00DD_010662DA
(6B29FC41-CA47-1067-B31D-00DD010662DA
) and so on.
According to this there is no standard way to CompareTo\order a Guid: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=102450
That’s referring to the across all GUID types.
The .NET System.Guid type has long had a standardized sorting behavior and matches how I detailed it above.
The new UUIDs specification was published a few days ago with RFC 9562: https://datatracker.ietf.org/doc/rfc9562/
Any chance this could now be prioritized?
Looking forward to this as well!
In the meantime, my praise goes to @tannergooding for his thorough and insightful posts in this thread.
I expect this won't happen in .NET 9, it's a bit late in the cycle with about 1-2 months before the time we normally snap for RC1.
There's a lot of other work across the stack that is already the focus and a priority, but this can potentially be included for .NET 10. In order for that to happen (or to even attempt getting it in for .NET 9 if there's community members willing to do the work), the proposed surface needs to be updated to account for some of the feedback given here.
Namely we should ensure that signatures exist that make sense given the existing APIs on System.Guid
and which take in any necessary parameters. The proposal should likely include a link to the spec and a small explanation of the layout for the given version and how the parameters are used in constructing the value.
For example, UUIDv5
is generated using bits from the SHA-1
of a string namespaceId + string name
, but there may be considerations based on it being SHA-1 based as well, since SHA-1 is no longer recommended for use. While for UUIDv7
there likely needs to be consideration about how the Unix timestamp is seeded (is it automatic, can the user provide one explicitly, etc). There should likely also be a consideration on whether support for other stable UUID
versions should be provided simultaneously (noting that we have UUIDv4
already via NewGuid()
)
I have created a library for Ulid (890 stars) that has similar characteristics to UUIDv7. https://github.com/Cysharp/Ulid
While its characteristics are desirable, the fact that it is a custom type (struct Ulid) has caused significant frustration because the .NET ecosystem and databases require GUIDs. Therefore, I strongly believe it is very desirable for UUIDv7 to be provided as a standard and for its type to be GUID. I hope for its early adoption (in .NET 9!).
I opened #103658 which covers this for UUIDv7 and opens the path to also do it for UUIDv5 in the future if that's desired
Background and motivation
With the increase of distributed storage, and extensive use of uuid in databases, uuid v4 insert times grow linearly making inefficient its use for primary keys. New versions are proven to have less impact on indexes than v4. A first benchmark with Uuid7 in my 8 core machine with docker/postgres 15 shows: inserting 1 million uuid v4 with EFCore takes 8.5 seconds with an empty table, and 16 seconds with 12750000 rows (table is only id) inserting 1 million uuid v7 with EFCore takes 8.5 seconds with an empty table, and 9 seconds with 12750000 rows (table is only id)
API Proposal
API Usage
Alternative Designs
No response
Risks
No response