Obsolete AddAsync and change HiLo to generate IDs at SaveChanges time

dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.

https://docs.microsoft.com/ef/

MIT License

13.81k stars 3.2k forks source link

Obsolete AddAsync and change HiLo to generate IDs at SaveChanges time #30957

Open roji opened 1 year ago

roji commented 1 year ago

We currently have DbSet.AddAsync alongside Add in order to support the HiLo value generator, which must occasionally contact the database to fetch more IDs. This has the following drawbacks:

Users are frequently confused about when to use AddAsync vs. Add. IDEs suggest converting Add to AddAsync, but that really is necessary only in the very small minority of cases, when HiLo is used.
There are various places in our code where the value generator needs to be invoked, but not all those places have an async context; as a result, we sometimes call the sync Next on the value generator instead of NextAsync, and the user has no way of avoiding sync I/O (#30936). Allowing async everywhere would require adding async overloads to many user APIs (e.g. there's no AttachAsync, DetectChangesAsync).

Instead of this, we can generate a temporary value during Add, and generate the permanent value during SaveChanges, allowing us to obsolete AddAsync.

The main drawback is that entity types will no longer have a (permanent) value after Add, but only after SaveChanges. Since this is already the state of affairs for the default database-based IDs (IDENTITY), and users haven't complained about it, we think that should be OK. Since the usage of HiLo is also generally low, we don't think the breaking change impact is significant.

Note #30753 which is related, proposing switching from client-side GUID generation to server-side. The main objection there is the moving of the (permanent) value generation from Add to SaveChanges time as well.

stevendarby commented 1 year ago

What would this mean for custom I/O value generators?

Currently we implement Next (because we have to) with sync I/O and NextAsync/NextValueAsync with async I/O, and we try to always use AddAsync so we use the async path. Would these changes lead to Next being called on Add and NextAsync being called on SaveChangesAsync, i.e. two I/O calls? Would you have to issue strong guidance that developers update their Next to only produce temporary values, like you plan to do with your Hi-Lo generator?

roji commented 1 year ago

@stevendarby we haven't yet worked out the details - but there will likely be a new type of value generator that can generate both temporary values (during Add, sync only) and permanent ones (later during SaveChanges, sync and async). The current style of value generator would likely continue working at least for a while.

Out of curiosity, can you share some details on your custom I/O value generaotr? We've seen very little of these from users.

stevendarby commented 1 year ago

We consider the DB design and use of value generator to support it as tech debt we essentially want to remove, so to be clear, I don't present this as good case for an I/O generator - I was just hoping there'd be a way to keep it working if you get to this issue before we get to ours 😄

But fwiw, we have 'partitioned' sequences on some tables - e.g. imagine a sequence on Posts that is incremented per Blog. We get the next value by querying Posts for the current max value, filtered by Blog, and adding one :neutral_face:

roji commented 1 year ago

I was just hoping there'd be a way to keep it working if you get to this issue before we get to ours

I wouldn't necessarily work too much about that :wink:

We get the next value by querying Posts for the current max value, filtered by Blog, and adding one

That sounds like it could have some concurrency issues... But thanks for the detail!

seanterry commented 5 months ago

Would you consider renaming it to be less confusing rather than obsoleting it and retiring it entirely?

I generate large object graphs using HiLo and AddAsync to get ID values client-side so they can all be saved in a single SaveChanges(Async) to avoid multiple round-tips and to ensure that they all get the same created/updated/deleted timestamps set by our interceptors.

HiLo+AddAsync has been a godsend in reducing the amount of code ceremony that goes into generating ID values on the client side where GUIDs cannot be used. The fact that it works in both SQL Server and PostgreSQL is fantastic. Even when working with databases created by other applications (like Rails) where HiLo can only claim 1 value at a time, it is still extremely helpful to always be able to use AddAsync by convention.

I can live without Add. I can't fathom getting by without AddAsync or some other conventional way to add an entity to a DbSet and run its id value generator asynchronously prior to SaveChangesAsync. The extension method I use to call AddAsync and return the entity (instead of its entry) is the hottest of my hot paths.

roji commented 5 months ago

I can live without Add. I can't fathom getting by without AddAsync or some other conventional way to add an entity to a DbSet and run its id value generator asynchronously prior to SaveChangesAsync. The extension method I use to call AddAsync and return the entity (instead of its entry) is the hottest of my hot paths.

@seanterry can you provide more details on why you care whether EF fetches the ID from the database at AddAsync-time (as today) or at SaveChangesAsync-time (as proposed here)?

seanterry commented 5 months ago

@roji We adopted it for the exact purpose described in DDD/micro-services architecture guide (emphasis mine):

The Hi/Lo algorithm is useful when you need unique keys before committing changes. As a summary, the Hi-Lo algorithm assigns unique identifiers to table rows while not depending on storing the row in the database immediately.

Our entities don't have navigation properties. We set the foreign key values on related records. We have interceptors for audit-trail logging and attaching created/updated/deleted timestamps that match for all of the entities in a unit of work a.k.a SaveChanges(Async), most of which are going to have foreign key relationships a few levels deep.

One specific case is handling a request to share documents with a student's parents/guardians. Given this pseudocode:

var (organization, sender, packet) = await GetDocumentInfo( request );

// note: AddEntityAsync is an extension method for  `(await AddAsync( entity )).Entity`
organization ??= await db.Organizations.AddEntityAsync( new(), { /* ... */ } );
// proposed change requires Savechanges here
sender ??= await db.Senders.AddEntityAsync( new() { OrganizationId = organization.Id, /* ... */ } );

// proposed change requires Savechanges here

var sigRequest = await db.SignatureRequests.AddEntityAsync( new() 
{
    PacketId = packet.Id,
    SenderId = sender.Id,
    /* ... */ 
} );

// proposed change requires Savechanges here

Dictionary<string,SignatureRequestCulture> sigCultures = new();

foreach ( var culture in request.Recipients.Select( r => r.Culture ).Distinct() )
{
    if ( !sigCultures.TryGetValue( culture, out _ ) )
        sigCultures[culture] = await db.SignatureRequestCultures.AddEntityAsync( new() { RequestId = sigRequest.Id,  /* ... */ } );        
}

// proposed change requires Savechanges here

Dictionary<string,SignatureRequestDocuments> sigDocuments = new();

foreach ( var document in request.Documents )
{
    sigDocuments[document.Id] = await db.SignatureRequestDocuments.AddEntityAsync( new() 
    {
        RequestCultureId = sigCultures[document.Culture].Id,
        /* ... */
    } );
}

// proposed change requires Savechanges here

List<SignatureRequestRecipient> sigRecipients = [];

foreach ( var recipient in request.Recipients )
{
    var sigRecipient = await db.SignatureRequestRecipients.AddEntityAsync( new() 
    { 
        RequestCultureId = sigCultures[recipient.Culture].Id,
        /* ... */ 
    } );

    sigRecipients.Add( sigRecipient );

    foreach ( var ( doc, field ) in recipient.FieldsToSign )
    {
        await db.SignatureRequestDocumentFields.AddAsync( new() 
        {
            RequestDocumentId = sigDocuments[doc].Id,
            RequestRecipientId = sigRecipient.Id,
            /* ... */
        } );
    }
}

await db.SaveChangesAsync();

// start out-of-process worker 

return Ok();

This isn't real code, but that's what it effectively does. I hope that shows an idea of how we're using the ID values populated by AddAsync. We use long for our primary keys, and they are all sourced from the same HiLo sequence.

That one SaveChangesAsync results in even more records created for the audit trail, but with HiLo+AddAsync, most requests like this result in 3 round-trips to the database total: one for GetDocumentInfo, one for getting the next block of IDs for the HiLo sequence (sometimes we're lucky and get to skip this), and one that batches all the inserts together, including the audit trail entries. 4 round trips on a large number of recipients or documents.

The interceptor ensures any of the timestamp values are populated on the records and that they all have the same value since they are part of the same unit of work.

I count no fewer than ~~four~~ five (sorry, I missed one) additional round-trips to get those ID values populated if SaveChanges(Async) is required to get the to populate. ~~And I don't know if value generation would run before or after the audit interceptor's SavingChanges(Async) is called~~ It runs before, based on a simple test I did, so my audit-trailing interceptor would at least work as expected.

The developer does not have to even think about how the value is generated... just call AddAsync or its equivalent everywhere and it Just Works(TM).

seanterry commented 5 months ago

@roji After some time to tinker with some test cases and looking at where this gets used, my main hangup I suppose is this:

Instead of this, we can generate a temporary value during Add, and generate the permanent value during SaveChanges, allowing us to obsolete AddAsync.

How is this envisioned to work? I'm hoping it isn't based on navigation properties or foreign key definitions. The reason for that is because I'm working against existing databases with very few FKs and expect Database.EnsureCreatedAsync to produce an exact replica of the existing database schema for automated tests. EF Core as it exists today is remarkably good at hitting that target and blowing the doors off of the Rails app that actually owns that database schema.

Note https://github.com/dotnet/efcore/issues/30753 which is related, proposing switching from client-side GUID generation to server-side. The main objection there is the moving of the (permanent) value generation from Add to SaveChanges time as well.

I never realized this was a thing, and now I want to use AddAsync with custom value generators to populate my GUID keys implicitly with db-friendly values from UUIDNext. 😛

roji commented 5 months ago

@seanterry thanks for the added details - this change would indeed be problematic for scenarios where EF isn't aware of the navigation. Ultimately it sounds like you want #15854: the ability to have EF navigations (so EF is aware of the relationships), but without actual FKs in the database. Presumably once we have that, you should be able to write your above code without SaveChangesAsync() in the middle just to fetch ID values from the database.

I never realized this was a thing, and now I want to use AddAsync with custom value generators to populate my GUID keys implicitly with db-friendly values from UUIDNext. 😛

AddAsync is only needed if the value generation happens at the server (i.e. requires network I/O). If you're generating your GUIDs locally you can just use Add() instead of AddAsync().

seanterry commented 5 months ago

@roji #15854 and #13146 would both be nice-to-have for some projects, and I have voted them. But that still requires rather extensive refactoring to add code to every entity configuration to make it reach parity with today's functionality, with no guarantee that it will handle edge cases like polymorphic tables (of which we have several, and they are not set up as TPH).

Value generation should be independent of relationships, IMO. AddAsync currently gives first-class support for immediate-use of ID values before they ever exist in the database. Magic glue like temporary values requires a database round-trip before an ID can be used with anything other than EF Core.

If AddAsync goes away, ID values don't generate until SaveChanges(Async), and it requires a relationship to handle populating foreign key references... that's hundreds of IEntityTypeConfiguration<> files and thousands of tests to touch. To say nothing of refactoring all the places we use the AddAsync generated IDs for non-EF things before SaveChanges is called.

Consider the following example:

public async Task<StudentAttachment> Handle( long studentId, PlanType planType, string path, string title, CancellationToken cancellationToken )
{
    var result = ( await db.StudentAttachments.AddAsync( new() 
    {
        StudentId = studentId,
        DomainId = domainAccessor.Current,
        PlanType = planType,
        Name = Path.GetFileName( path ),
        Title = title,
        StorageKey = "temporary"
    }, cancellationToken ) ).Entity;

    result.StorageKey = Path.Combine( "student_attachments", studentId.ToString(), planType.ToString(), result.Id.ToString(), result.Name );

    await storageService.Upload( result.StorageKey, path, cancellationToken );
    return result;
}

Concise and testable. It doesn't call SaveChangesAsync because that's not this method's job and we don't want it called if the upload method throws an exception. There are three other records added after this that we want to go in the same db round-trip and which need the exact same created/updated timestamps applied, which is currently handled in a SaveChangesInterceptor.

To make this work with Add, I need to start a transaction in the calling method (extra db round-trip without dotnet/SqlClient#1554), make an extra SaveChangesAsync call (extra db round-trip), and explicitly set the timestamps in the records that follow, or else come up with some other ambient timestamp scope that can be injected somewhere. And then commit the transaction (extra db round-trip).

So where one round-trip (two if HiLo is exhausted) would suffice, we now have four (or five for HiLo) to accomplish the same task (some other refactoring could maybe eliminate one). In addition to the workload of refactoring every chunk of code that uses this pattern to avoid round-trips to the database. And I penny-pinch round-trips to the database because a LOT of this code runs on an FTP server on the east coast, and the application database is on the west coast connected by point-to-point VPN. Should we change how the storage keys are determined and move the FTP server to be co-located with the database? Absolutely. Can I make that happen before .NET 20? 🤷

If there were something like a RunValueGeneratorsAsync method hanging off of DbSet and DbContext when this change goes through, then those of us that depend on this functionality could add AddAsync back in as an extension method without much effort.

AddAsync is only needed if the value generation happens at the server (i.e. requires network I/O). If you're generating your GUIDs locally you can just use Add() instead of AddAsync().

I learned a new thing. I'll have to try that out since that's another handy convention I can adopt.

FWIW, I love EF Core and how much code I don't have to write to get exactly the behaviors I want just by following a few conventions. HiLo, AddAsync, and Interceptors are the killer combo that got us to adopt it. Knowing how widespread patterns like the one above is in our code and how much more we have planned, I will lose sleep over this proposed change.

roji commented 5 months ago

Value generation should be independent of relationships, IMO.

That's definitely not the case - the reality is that the two can interact in various ways. For example, when using a server-generate ID (e.g. SQL Server IDENTITY), SaveChanges breaks a batch into two parts when it needs to set the foreign key on a dependent to point to the server-generated ID of something that's inserted earlier.

If AddAsync goes away, ID values don't generate until SaveChanges(Async), and it requires a relationship to handle populating foreign key references... that's hundreds of IEntityTypeConfiguration<> files and thousands of tests to touch. To say nothing of refactoring all the places we use the AddAsync generated IDs for non-EF things before SaveChanges is called.

First, there's frequently no need to do explicit configuration of navigations; EF is perfectly capable of automatically discovering relationships by convention for a large number of cases. I'd suggest thinking about this a bit more - is it really bad to have EF simply be aware of the navigations?

Beyond that, I understand and appreciate the problem, but at the end of the day, EF is designed to have an accurate model of your database, and relationships are part of that model (regardless of whether FKs are actually defined in the database). Requiring everything in EF to work optimally while at the same time refusing to fully model things is a bit much, especially given that AddAsync also has its share of issues (see the original post above).

seanterry commented 5 months ago

when using a server-generate ID (e.g. SQL Server IDENTITY), SaveChanges breaks a batch into two parts when it needs to set the foreign key on a dependent to point to the server-generated ID of something that's inserted earlier.

That's fine for identity values. But I don't want that behavior for HiLo. To quote @ajcvickers (emphasis mine):

One of the original motivations for Hi/Lo was that it generated real IDs as soon as the entity is tracked--same reason for client-side GUIDs. EF6 could never do this, and it was a frequent complaint. We could create a HiLo generator which has the characteristics that you like while not generating IDs until SaveChanges. (i.e. one extra round trip ever x number of calls to SaveChanges.) That would be much simpler in lots of places, but wouldn't match one of the main reasons for doing it in the first place.

It's fine for EF to be aware of navigations. And I use them where they make sense.

Requiring everything in EF to work optimally while at the same time refusing to fully model things is a bit much, especially given that AddAsync also has its share of issues (see the original post above).

EF Core already works optimally for my use cases, without modeling every possible relationship. As described, I would need to fully-model everything, refactor a lot of code, and accept sub-optimal performance simply to account for a breaking change. I won't do that. If this goes through, I'll probably just hold my nose and make value generator that does HiLo sync-over-async so it will work on Add, now that I know that's a thing.

roji commented 5 months ago

when using a server-generate ID (e.g. SQL Server IDENTITY), SaveChanges breaks a batch into two parts when it needs to set the foreign key on a dependent to point to the server-generated ID of something that's inserted earlier.

That's fine for identity values. But I don't want that behavior for HiLo.

My statement there was in response to you saying that "Value generation should be independent of relationships" - that's already not the case.

One of the original motivations for Hi/Lo was that it generated real IDs as soon as the entity is tracked [...]

So there's two things being slightly conflated here in my opinion:

Having non-temporary ID values available right after add (not possible with IDENTITY, currently the case with HiLo (hence AddAsync))
Not having EF aware of navigations

In other words, if you're OK with properly configuring EF for navigations (you indicated you're not above, "hundreds of IEntityTypeConfiguration<> files and thousands of tests to touch" - I'm still not sure why that would be the case), then you shouldn't need(1) above. Rather than manually setting foreign key values - as your code is currently doing because it's bypassing the navigation concept - you'd just connect entities together via properties, and just let SaveChanges figure everything out (that's it's job). Yes, before SaveChanges is invoked, the instances would have temporary IDs, but when you call SaveChanges, EF would populate the non-temporary IDs, possibly fetching more HiLo values from the databases (only if needed). That's really EF is meant to be used - just connect .NET objects in the normal .NET way via properties, rather than dealing directly with foreign key values.

Concern (1) above - having non-temporary ID values right after add - is for when you want to e.g. send an ID to some external system - like to a user browser via JSON - before calling SaveChanges(). You haven't indicated anywhere that this is what you need.

So nobody's saying you need to change your code and do tons of extra roundtrips - that would indeed be bad. The point is that if you use EF navigations properly, you shouldn't need any extra roundtrips and your code should stay the same (except of course for starting to use the navigations).

seanterry commented 5 months ago

@roji

The point is that if you use EF navigations properly

I didn't post here for an Ivory Tower opinion on what I should-or-should-not be doing with EF navigations and relationships.

I posted here to voice my support for keeping HiLo working as-is, as-intended, and as-documented.

I absolutely use EF in all its ideal navigational glory in newer projects. And I learned a thing or two from the side conversation that will save me a few lines of code, for which I am grateful.

But I also use EF to handle gnarly edge cases where other teams and their opinionated framework cesspools struggle. Many of those edge cases involve relationships EF cannot currently represent (see #15854, #13146, polymorphic tables, and a couple of unruly ones whose related table is defined by tenant-specific feature flags in a config file). I am perfectly aware of what should be done instead, but I'm playing the hand I'm dealt.

Concern (1) above - having non-temporary ID values right after add - is for when you want to e.g. send an ID to some external system - like to a user browser via JSON - before calling SaveChanges(). You haven't indicated anywhere that this is what you need.

Per my example above:


public async Task<StudentAttachment> Handle( long studentId, PlanType planType, string path, string title, CancellationToken cancellationToken )
{
var result = ( await db.StudentAttachments.AddAsync( new() 
{
StudentId = studentId,
DomainId = domainAccessor.Current,
PlanType = planType,
Name = Path.GetFileName( path ),
Title = title,
StorageKey = "temporary"
}, cancellationToken ) ).Entity;

result.StorageKey = Path.Combine( "student_attachments", studentId.ToString(), planType.ToString(), result.Id.ToString(), result.Name );

await storageService.Upload( result.StorageKey, path, cancellationToken );
return result;

}


In this example, we use the HiLo generated ID immediately to build keys for uploading objects to blob/S3 storage. The ID needs to be part of the path for it to be located by other applications. SaveChanges is not called unless/until the upload completes successfully. We have many such handlers where an ID is used immediately to enqueue a background job, build a link, include in an electronic signature envelope signed by an HSM... all of which defer calling SaveChanges(Async) until they complete successfully. This neatly avoids the BeginTransaction/Commit ceremony code required when using values generated by identity or default constraints.

This is what HiLo exists for.

> So nobody's saying you need to change your code and do tons of extra roundtrips - that would indeed be bad.

I literally gave examples of where extra round-trips would occur or code would have to change. And unless EF can magically detect relationships that I didn't define, then I at least have to add navigation properties and composite foreign keys that don't exist to hundreds of entities.

roji commented 5 months ago

This is what HiLo exists for.

That is definitely not its primary purpose in my mind; HiLo is about reducing roundtrips when inserting a principal and a dependent, not about making non-temporary keys available immediately after AddAsync. There are other opinions on this - and that's fine - but the question of whether non-temporary IDs should be available right after Add/AddAsync is ultimately orthogonal to HiLo. Once again, see #30753 which is very related: generating GUIDs on the client is nice because it makes them available immediately after Add, but the GUIDs generated on the server are generally better.

Note that a decision hasn't been reached here yet, and there's no consensus on the team. But ultimately, even if we move HiLo to work at SaveChanges, you can still do HiLo yourself without using the built-in value generator. In other words, you'd just implement your own HiLo cache and fetch values. Is it nicer that EF does that for you? Sure. But as already stated several times above, the current situation also has iots drawbacks, which is why we're considering changing this.

ajcvickers commented 5 months ago

Just a couple of cents from me. 😆

First, we have had lots of feedback over the years from people who want their keys available before SaveChanges is called. Most of this may have come from the EF6 days, when this was difficult/impossible. This was one of the reasons we implemented HiLo.

Recently, @roji has been asking the difficult question, "But why do the ID values need to be known?" I haven't been able to come up with any case where the ID values could not be obtained during SaveChanges, and any processing of the entities to use the values appropriately happens then.

Having said that, I think two things remain:

There is additional complexity of moving this processing later, especially for applications that have been written to assume that it happens on Add. I don't think we can change this and expect applications to "just work."
We don't know all the scenarios, and just because we haven't come up a compelling case doesn't mean it isn't out there, in use.

Finally, realistically, the chance that anything will change is, I think, minimal. We could add much more value changing other things. So I suspect most of this is academic.

voroninp commented 4 months ago

I'd rather throw sync versions of the methods and replace them with async counterparts. Sync closes the door for async extensions.

Users are frequently confused about when to use AddAsync

So you are telling devs are so lame that they are incapable of reading xmldocs?

This method is async only to allow special value generators, such as the one used by 'Microsoft.EntityFrameworkCore.Metadata.SqlServerValueGenerationStrategy.SequenceHiLo', to access the database asynchronously. For all other cases the non async method should be used.

roji commented 4 months ago

So you are telling devs are so lame that they are incapable of reading xmldocs?

No; async methods suggest that I/O may need to happen under the hood, although in 99% of applications this will never be the case - only applications using HiLo ever need this. This is a confusing state of affairs that goes against people's legitimate mental model of what it means to call Add/AddAsync.

voroninp commented 4 months ago

@roji ,

temporary value management is something that EF already does - when you use regular identity columns, where the identity value is generated in the database (the default/common modeling), the property already gets a negative, temporary value.

But this value is not mirrored to entity's property, is it? And one of the motivations to have ids before saving changes is to be able cross reference entities by ids only.

Speaking about mental models:

[Fact]
public async Task Check_id_value_after_add()
{
    var dbContainer = new PostgreSqlBuilder()
        .Build();

    await dbContainer.StartAsync();

    var dbContext = new AppContext(dbContainer.GetConnectionString());
    dbContext.Database.EnsureCreated();

    var entity = new Entity();
    var entry = dbContext.Add(entity);
    var idProperty = entry.Property(e => e.Id);

    idProperty.Metadata.GetValueGenerationStrategy()
        .Should()
        .Be(NpgsqlValueGenerationStrategy.IdentityByDefaultColumn);

    idProperty.CurrentValue.Should().BeLessThan(0, "Must be negative");
    idProperty.IsTemporary.Should().BeTrue();
    entity.Id.Should().Be(0);

    await dbContainer.StopAsync();
}

I am surprised to discover that CurrentValue differs from value of entity's property.

roji commented 4 months ago

@voroninp a lot of this is covered by our change tracking docs; in any case, none of this is directly relevant to this issue, since temporary values already exist in EF (e.g. for identity). This issue is purely about moving HiLo-style value generation from AddAsync to SaveChanges.

voroninp commented 4 months ago

This issue is purely about moving HiLo-style value generation from AddAsync to SaveChanges.

Do I understand correctly that for HiLo strategy you are going to generate temporary values but also expose them in properties of the entity, so I can do this:

ctx.Add(foo);
bar.FooId = foo.Id; // without HiLo Id will be default, that is 0
ctx.Add(bar);

Because currently HiLo generated does exactly that. As soon as I add the entity, I see its Id property has non-default value.

roji commented 4 months ago

@voroninp no. As with identity, the temporary value isn't visible on the entity type, only in the change tracker.

voroninp commented 4 months ago

This is going to be a big breaking change then.

roji commented 4 months ago

@voroninp this has all been discussed above. We're not sure whether we'd introduce a new API for this as a way to avoid the break, etc.