PomeloFoundation / Pomelo.EntityFrameworkCore.MySql

Entity Framework Core provider for MySQL and MariaDB built on top of MySqlConnector
MIT License
2.67k stars 379 forks source link

Please support Mariadb "ON DUPLICATE UPDATE" for saving results #1346

Open ramon-garcia opened 3 years ago

ramon-garcia commented 3 years ago

When one saves results to the database with dbcontext.SaveChanges(), an exception is raised if the table already contains a row with the same primary key value.

A very desirable property for a process of loading data from somewhere (or a process transforming it) is idempotency: repeating it (if part of it failed, for instance) should not fail, and should give the same result as running it once.

There should be an extension method, like DbContext.SaveChangesOnDupUpdate() that adds the option "ON DUPLICATE UPDATE" to the Sql statement.

I could implement this feature if it is welcome.

mguinness commented 3 years ago

See https://github.com/dotnet/efcore/issues/4526 and FlexLabs.Upsert for a possible solution.

lauxjpn commented 3 years ago

There should be an extension method, like DbContext.SaveChangesOnDupUpdate() that adds the option "ON DUPLICATE UPDATE" to the Sql statement.

I think it would be better to implement this on the DbSet level, so that it can be decided per entity, whether an INSERT or an UPSERT should be performed, instead of applying it to all entities that have been added to the context. Corresponding extension methods would be added to DbSet and DbContext in addition to the existing Add() methods, like AddOrUpdate() or AddOrModify(). This would currently not apply an UPSERT to any objects that get added to the context through the object that was specified using the AddOrUpdate()/AddOrModify() method (we will need to decided, whether or not we need an option for that scenario).

The implementation would need to subclass DbSetSource and register it via IDbSetSource and return an object that is subclassed of InternalDbSet. This InternalDbSet subclass could then save additional state (the fact that the Add is actually an UPSERT when the AddOrUpdate()/AddOrModify() method is being called).

We would also need to subclass InternalEntityEntryFactory or just implement IInternalEntityEntryFactory, and register it, to return our own InternalEntityEntry subclassed objects, that implement a custom interface like IAddOrUpdateEntry (similar to IUpdateEntry), that we can use later in the pipeline to extract the UPSERT information from the entity entry (so we know it's an UPSERT and not an INSERT).

We could then probably just alter the existing INSERT command pipeline to honor the UPSERT.

The UPSERT command would either insert all fields (including the PK) or update all fields except the PK. The PK needs to be first, so that MySQL uses it instead of another UNIQUE index for duplication comparison (MySQL quirk, though shouldn't matter too much either way). Since UPSERT works with LAST_INSERT_ID(), the post command query logic can stay unchanged.

The whole implementation should be done with as little impact on the base classes as possible, so maintenance is simple, as these base classes are all internal to EF Core and could change at a later point in time.

When EF Core implements this feature in the future, it is likely that we will be able to adapt our code accordingly with too much hassle for the community or for us.

ramon-garcia commented 3 years ago

Sorry for not answering before.

There are important use cases where specifying update on duplicate for all entities is useful: when one is loading data from text files, CSV, XML, web services for analysis. One wishes that the whose process is idempotent, so that in case of error or interruption one can repeat the process without errors or duplicated rows.

lauxjpn commented 3 years ago

Sorry for not answering before.

That is fine. These are more general thoughts of me, so I don't forget them late, when it comes to implementing a prototype for this.

There are important use cases where specifying update on duplicate for all entities is useful: when one is loading data from text files, CSV, XML, web services for analysis.

Yes, I can see that. However, allowing a feature like this to be controlled on an entity level would work fine in those scenarios. You would just call AddOrModify() instead of Add(), for every entity.

On the other hand, making this a feature that can only be used on a context-wide level would not allow scenarios, where only specific entities should be updated if already existing. And this would be a major issue.

One wishes that the whose process is idempotent, so that in case of error or interruption one can repeat the process without errors or duplicated rows.

This part is already supported through transactions.


BTW, while I was recently testing different scenarios to optimize multiple UPDATE statements for the MySQL related TechEmpower benchmarks, I found out that using ON DUPLICATE UPDATE in practice can lead easily to locking issues, with MySQL aborting the command because it detects a deadlock.

So people that use this feature should be aware, that with increasing concurrent database access, an increasing number of those ON DUPLICATE UPDATE commands would need to be retried (and therefore the retry policy should be carefully configured for those scenarios).