dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.72k stars 3.17k forks source link

Documentation is unclear for Guid primary keys #34260

Closed jscarle closed 2 months ago

jscarle commented 2 months ago

I am currently in the midst of trying to repair the data architecture of an application which fell victim to the original designer's poor decision to make the primary keys of each of the tables Guids stored as char(36) strings... sigh... 😔

Concidently, with the arrival of Guid.CreateVersion7, this has renewed discussions regarding the use of Guid in databases and surfaced the issues faced SQL Server's endianess conflicts when dealing with Guids. Fragmentation of clustered indexes when using Guids has been going on for years now.

I know that similar discussions have come up within this repository and I know that the documentation has been improved in the past, but honestly I'm now more confused than ever.

In the documentation, Explicitly configuring value generation, there is a mention that "on SQL Server, when a GUID property is configured as a primary key, the provider automatically performs value generation client-side, using an algorithm to generate optimal sequential GUID values", but it doesn't mention whether the generation is lexicographical, nor if it is stored lexicographically within the database by taking into account the endianness difference between Guids in .NET and uniqueidentifiers in SQL Server.

In SQL Server's documentation for NEWSEQUENTIALID, there is a mention of how it can be used to "generate GUIDs to reduce page splits and random IO at the leaf level of indexes".

The EF Core documentation for the SQL Server provider for Guids also mentions that "the provider automatically generates optimal sequential values, similar to SQL Server's NEWSEQUENTIALID function", but again without any mention of its impacts on clustered index fragmentation.

There's also the privacy concern that sequential uniqueidentifier values brings forward.

On the surface, it seems like Version 7 of the Guid would solve a lot of these issues, but with the lack of clarity regarding the way a Guid translate to the way it's finally stored in the database, and what EF Core does when storing it, makes me unsure on which direction to go with this.

I could write my own value converter that calls TryWriteBytes(destination, bigEndian: true) when storing and new Guid(source, bigEndian: true) when reading with a binary(16) instead of a uniqueidentifier, but I'd rather avoid going through all of the trouble to reimplement something that may already be done internally.

So any clarification would be greatly appreciated, and thanks for all your hard work @roji and @ajcvickers!

roji commented 2 months ago

Concidently, with the arrival of https://github.com/dotnet/runtime/issues/103658, this has renewed discussions regarding the use of https://github.com/dotnet/runtime/issues/86084 and surfaced the issues faced https://github.com/dotnet/runtime/issues/86798 when dealing with Guids. Fragmentation of clustered indexes when using Guids has been going on for years now.

I don't think anything in https://github.com/dotnet/runtime/issues/86084 is relevant to this discussion - that's purely about endianness in the binary serialization APIs around Guid. Getting the endianness right when serializaing Guid is an internal concern of the low-level database driver (SqlClient in this case), which is also why I think the impact on the end user in that conversation is greatly exaggerated (i.e. users generally shouldn't need to deal with binary serialization of GUIDs in database contexts, unless the GUID is stored as a binary field). In any case, all this has nothing really to do with index fragmentation.

The EF Core documentation for the SQL Server provider for Guids also mentions that "the provider automatically generates optimal sequential values, similar to SQL Server's NEWSEQUENTIALID function", but again without any mention of its impacts on clustered index fragmentation.

The SequentialGuidValueGenerator that's used by default with the EF SQL Server provider largely mimics the SQL Server NEWSEQUENTIALID function, as is clearly documented; the point there is precisely to reduce index fragmentation. A link to the NEWSEQUENTIALID page in the SQL Server docs is provided right there, so users can go deeper and understand exactly what that means. What additional documentation do you think is missing in the EF docs, beyond us replicating the entire SQL Server NEWSEQUENTIALID docs inside the EF docs, which I don't believe we should do?

If you're looking for guidance in the EF docs on how to generate your own binary representation of efficient GUID values for SQL Server, then I don't think that belongs in our docs. That's quite an anti-pattern, and doing this correctly has nothing to do with EF.

On the surface, it seems like Version 7 of the Guid would solve a lot of these issues, but with the lack of clarity regarding the way a Guid translate to the way it's finally stored in the database, and what EF Core does when storing it, makes me unsure on which direction to go with this.

That's most likely not the case: i haven't personally verified this, but SQL Server has its own sort order which UUIDv7 wouldn't be appropriate for (see https://github.com/dotnet/efcore/issues/33579#issuecomment-2225877527).

I could write my own value converter that calls TryWriteBytes(destination, bigEndian: true) when storing and new Guid(source, bigEndian: true) when reading with a binary(16) instead of a uniqueidentifier, but I'd rather avoid going through all of the trouble to reimplement something that may already be done internally.

I still am not clear on whether you're representing your GUIDs as binary or string in the database; at the top of your post you mention char(36), but below you speak of binary(16).

Timovzl commented 2 months ago

@jscarle A .NET Guid has the same ordering as its .ToString() counterpart. With .NET 9's UUIDv7 being monotonic in Guid form, that means it is monotonic in string form too. As such, the char(36) column should achieve the intended performance benefits from UUIDv7.

The endianness only comes in to play when serializing to binary, which you don't seem to be doing.

roji commented 2 months ago

Closing this as no answer was provided to my last comment, and I'm not seeing anything actionable on our side. @jscarle if you'd like to continue the conversation, please post back here and we can revisit.