Open NickAb opened 8 years ago
I wasn't aware of that issue. I had assumed that the target table of a MERGE is locked like in an update. This is a test case with both INSERT and UPDATE clauses, though. With just one type of clause, I'd definitely expect the same behavior as normal DML. With inserts for sure, since there is no source table. For updates, it would need to be tested. I'd test that by using SQL Profiler to display the locks taken and compare an UPDATE with an equivalent MERGE. Or, I'd leave the transaction open and use the DMVs to look at active locks. That can paint the wrong picture, though, because locks might already have been released. This is common for update operations, where rows are tested under a U-lock which is then released if there is no match.
HOLDLOCK should cause the target table to be accessed equivalently to SERIALIZABLE... which is not an acceptable default. Or at least, it must be optional.
I wasn't aware of that issue. I had assumed that the target table of a MERGE is locked like in an update.
Same here.
I agree that the problem seems specific to UPSERT (this issue), and that our current usage of MERGE for insertion seems safe (in any case, it has been that way for a long time and we haven't seen complaints). Note that in EF Core we currently don't use MERGE for updates, though I already have a note on my list to look at that at some point.
Perhaps a suitable atomic alternative would be to send a query of two statements: an INSERT … ON DUPLICATE KEY UPDATE followed by an UPDATE (with the primary key as the match condition).
To avoid any shenanigans around unique secondary keys (where coincidentally-matching rows would get updated too), the INSERT should, for any row, always either insert or lock without modifying. This way, the first statement guarantees that (A) all desired rows now exist and (B) we have a lock on each of them.
To make the INSERT … ON DUPLICATE KEY UPDATE lock-but-not-modify existing rows, we can have its SET clause change the primary key to its own value (e.g. id = id). Despite making no modifications, this does indeed lock the row, at least on MySQL. Hopefully others can confirm the same behavior for other SQL flavors.
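To illustrate, here is a minimal sketch of that insert-or-lock step; the table and column names (t, id, val) are hypothetical, and the locking behavior of the no-op assignment is as observed on MySQL/InnoDB:

```sql
-- Sketch: insert each row, or lock-but-not-modify it if it already exists.
-- The no-op assignment `id = id` changes nothing, yet (at least on
-- MySQL/InnoDB) still locks each conflicting row.
INSERT INTO t (id, val)
VALUES (1, 'a'), (2, 'b'), (3, 'c')
ON DUPLICATE KEY UPDATE id = id;
-- After this statement, every target row exists and is locked by this
-- transaction, so follow-up UPDATEs keyed on the PK cannot race.
```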
An open challenge is how to set different rows to different values. While the INSERT achieves this with ease, it is not obvious for the UPDATE. (We could issue separate statements if that is the accepted practice, but it sure feels suboptimal.)
@Timovzl are you talking about SQL Server? Because I'm not aware of ON DUPLICATE KEY UPDATE being supported there (the discussion above about MERGE is SQL Server-specific).
@roji The joke’s on me. 😄 That renders my point moot for SQL Server.
I suppose if we’d start with an INSERT ignoring duplicates (thus ensuring that all rows exist) followed by an UPDATE, a competing DELETE (on the pre-existing, unlocked rows) could occur in between, leaving a final result that is expected by neither of the two competitors?
Similarly, I suppose an UPDATE followed by an INSERT could suffer from a competing INSERT or DELETE in between the two statements?
Presumably, the issue could be circumvented by both starting and ending with an UPDATE, i.e. performing three statements. That’s a lot of hoops to jump through…
Presumably, the issue could be circumvented by both starting and ending with an UPDATE, i.e. performing three statements. That’s a lot of hoops to jump through…
Not really; there's always the possibility that someone DELETEs your row just before the update. There really isn't a way to do a truly atomic UPSERT in a 100% reliable way.
But that's all beside the point: most databases (other than SQL Server) do have a true atomic UPSERT operation. For example, PostgreSQL and SQLite have INSERT ... ON CONFLICT DO UPDATE (see this document for an overview of the UPSERT situation across databases, and some design considerations for the PG feature). So we'd presumably implement the EF Core Upsert API via these database-specific atomic mechanisms, and on SQL Server we'd use the non-atomic MERGE because there's no other choice.
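For reference, the PostgreSQL/SQLite statement looks roughly like this (table and column names are illustrative):

```sql
-- Atomic single-statement upsert on PostgreSQL (9.5+) and SQLite.
-- `excluded` refers to the row that was proposed for insertion.
INSERT INTO t (id, val)
VALUES (1, 'a')
ON CONFLICT (id) DO UPDATE SET val = excluded.val;
```

Note that, unlike MySQL's ON DUPLICATE KEY UPDATE, the conflict target (here `(id)`) can be named explicitly.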
The usual way of imitating upsert is that you insert and then update, isn't it? The problem you can get with some databases is that the transaction gets rolled back when the insert fails, but I think it works if you can avoid that.
Imagine two transactions running concurrently. One is doing an upsert, and one is deleting the record. If upsert is truly atomic, at the end of the process, you will either have an updated record (the delete went first) or no record (the upsert went first).
If you now consider insert followed by update, you get those two possibilities, and you also have to consider what happens when the delete falls between the insert and the update.
If the insert succeeded, the transaction will be holding a lock on the new record, so the delete will be held up until the transaction completes. You therefore get no records at the end of the process, which is the same as you would get if the delete followed a truly atomic upsert.
If the insert failed, the transaction won't be holding any locks, so the delete can execute immediately. The update will be a no-op because there are no records for it to operate on, and again you will have no records at the end of the process.
@PeteX the point is that an atomic upsert doesn't fail (or cause a deadlock), and when it completes, the row is there and has the desired values. Of course, if the upsert is racing with a delete, the row may end up being deleted, but that's not relevant here.
This is a topic that has been discussed a lot in databases, I really suggest reading the PostgreSQL document I linked to above, it provides lots of context and design considerations.
Note that SQL Server MERGE is apparently not atomic - see here for various resources discussing this, especially this and this.
tl;dr: without HOLDLOCK, MERGE is vulnerable to concurrency issues, so if two upserts are racing, one of them may fail with an error (since the row didn't exist when it checked, but did by the time it attempted the insert). With HOLDLOCK, deadlocks may occur.
This doesn't in itself mean that MERGE can't be used to support UPSERT, but we need to think about exactly what we want to do here.
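For concreteness, the MERGE-based upsert under discussion typically has the following shape; this is a sketch with illustrative table and column names, not a vetted implementation:

```sql
-- Typical MERGE-based upsert on SQL Server. Without WITH (HOLDLOCK), two
-- racing executions can both see "not matched" and one then fails on the
-- insert; with the hint, deadlocks become possible instead.
MERGE t WITH (HOLDLOCK) AS target
USING (VALUES (1, 'a')) AS source (id, val)
    ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET val = source.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (source.id, source.val);
```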
The FlexLabs.Upsert library uses the MERGE with HOCKLOCK hint, and has an open issue that discusses some of the deadlock limitations and possible workarounds that might be useful context when trying to solve this for SQL Server artiomchi/FlexLabs.Upsert/issues/100
Alternatively I don't know how much influence the EF team have internally to push the SQL Server team to implement a true UPSERT statement in a future release or if this is already on their roadmap
If you now consider insert followed by update, you get those two possibilities, and you also have to consider what happens when the delete falls between the insert and the update.
@PeteX I believe this one "leaks" when multiple rows are involved. Between the INSERT and UPDATE, imagine that a competing DELETE removes some (but not all) of the pre-existing rows. We then have a hybrid result between the DELETE and the UPSERT. In other words, the result cannot be described as one query having gone first and the other last.
Presumably, the issue could be circumvented by both starting and ending with an UPDATE, i.e. performing three statements. That’s a lot of hoops to jump through…
Not really; there's always the possibility that someone DELETEs your row just before the update. There really isn't a way to do a truly atomic UPSERT in a 100% reliable way.
@roji If we do the {UPDATE, INSERT, UPDATE} I intended to describe, I believe the UPSERT is atomic in a 100% reliable way:
Analyzing the possible interference, I conclude the following:
Every possible result is indistinguishable from either the simulated UPSERT having entirely gone first or it having entirely gone last.
Note that this is comparable to what I originally described for MySQL, although that required only two steps: insert-new-or-lock-existing (now every relevant row exists and is locked by us, which SQL Server could achieve with (1) plus (2)), and then update (the same as (3) for SQL Server).
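The individual statements were not spelled out above, so the following is only one hedged reading of the {UPDATE, INSERT, UPDATE} idea; the table/column names and especially the locking hints are my assumptions, not a tested recipe:

```sql
BEGIN TRANSACTION;

-- (1) No-op UPDATE to lock the rows that already exist. The hint is an
--     assumption here: HOLDLOCK would also range-lock the gaps where the
--     missing keys would go, so competing inserts cannot slip in.
UPDATE t WITH (UPDLOCK, HOLDLOCK) SET id = id WHERE id IN (1, 2, 3);

-- (2) INSERT the rows that are still missing:
INSERT INTO t (id, val)
SELECT v.id, v.val
FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c')) AS v(id, val)
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = v.id);

-- (3) UPDATE every row to its final values:
UPDATE t
SET val = v.val
FROM t
JOIN (VALUES (1, 'a'), (2, 'b'), (3, 'c')) AS v(id, val) ON t.id = v.id;

COMMIT TRANSACTION;
```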
@Timovzl an "INSERT that ignores duplicates" is again something which does not exist on SQL Server, as far as I'm aware. If you're discussing a MySQL-specific solution, then there's already an atomic INSERT ... ON DUPLICATE KEY UPDATE syntax (link) which does this in a single statement, so I'm not sure in which situation UPDATE+INSERT+UPDATE is supposed to help...
(note that MySQL's INSERT ... ON DUPLICATE KEY UPDATE isn't perfect - it doesn't provide a way of specifying the unique index to use for the conflict check, which is problematic when there are several. Again see this link for more information).
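A sketch of that hazard, using a hypothetical table with PK `id` and a second unique key on `email`:

```sql
-- MySQL offers no way to say "treat only an id clash as the conflict":
-- a clash on *either* unique index triggers the UPDATE branch, so a row
-- that merely shares the email could be updated instead of a new row
-- being inserted.
INSERT INTO t (id, email, val)
VALUES (42, 'a@example.com', 'x')
ON DUPLICATE KEY UPDATE val = VALUES(val);
```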
@Timovzl an "INSERT that ignores duplicates" is again something which does not exist on SQL Server, as far as I'm aware.
@roji I stand corrected. I had to read up on this. No atomic UPSERT and no INSERT IGNORE... That is a serious issue for an RDBMS. 😞
(note that MySQL's INSERT ... ON DUPLICATE KEY UPDATE isn't perfect - it doesn't provide a way for specifying the unique index to use for the conflict check, so problematic for use when there are several. Again see this link for more information).
Agreed. I remember mentioning this earlier in this thread. It is actually the reason that I proposed the two-step approach for MySQL.
I'd really hoped it could be extended to SQL Server, but without a way to ignore duplicates, it's a pickle.
SQL Server does not have a per-query duplicate ignore; it has a per-index setting (IGNORE_DUP_KEY). That feature works by rewriting the query plan and inserting checks via joins. It is not suitable for use in a general setting or in an ORM. In my opinion, it should never be used and should not have been added to the product.
I wonder how other RDBMSes implement these duplicate ignore features. I can think of various ways, all of which involve trade-offs:
That being said, the SQL Server limitations around OUTPUT and other DML features do not follow from its design; these are just deficiencies which are hard to explain. The product has a history of refusing to implement some technically very simple convenience features. An example is the null-safe compare (a IS b): the engine already supports it and it shows up in query plans, but $&%$ it, it has failed to be exposed in SQL syntax for 20 years now, forcing us into nasty workarounds. Overall, it's an awesome RDBMS, though.
Locking/Concurrency: Achieving safety is always easy, but it must trade off concurrency or incur deadlock risk.
Great point. It made me contemplate this earlier remark:
HOLDLOCK should cause the target table to be accessed equivalently to SERIALIZABLE... which is not an acceptable default. Or at least, it must be optional.
Provided that we want atomic behavior, is it actually possible for us to do better, especially in a bulk scenario?
If A wants to atomically "INSERT IGNORE" 1, 2, 3, 4 and B similarly wants to do 6, 5, 4, 3, can we actually avoid a deadlock? I suspect that the only way to do that might be to align the orderings.
As long as all those operations originate from EF, I suppose we're doing them by PK, and we could pass them in ascending order. That would constrain potential conflicts to unrelated queries, such as custom queries that neglect to account for deadlocks.
If it is impossible to meet the requirement without the chance of deadlocks, then we might as well go with MERGE and HOLDLOCK.
Optimistic concurrency could be an alternative, although that can be a pain to deal with in batch scenarios.
If A wants to atomically "INSERT IGNORE" 1, 2, 3, 4 and B similarly wants to do 6, 5, 4, 3, can we actually avoid a deadlock? I suspect that the only way to do that might be to align the orderings.
I think several things are being conflated here. UPSERT is nothing more than making sure a row exists with the specified values, without the possibility of failure or deadlock. Deadlocks produced by inserting/updating multiple different rows with different orderings are an entirely different thing. EF does indeed provide deterministic ordering when inserting/updating, so deadlocks from inserting multiple rows should not occur; but that really is orthogonal to UPSERT.
At the end of the day, an atomic UPSERT is an operation that the database either provides or does not. If it does not, it's still possible to implement an EF API via a non-atomic UPSERT (e.g. with MERGE); that's obviously not ideal, but we can only do things which the database supports.
(optimistic concurrency is already well-supported by EF, including in batching scenarios)
I got the impression that MERGE with HOLDLOCK was atomic.
MERGE with HOLDLOCK isn't atomic - it's equivalent to an IF block with INSERT and UPDATE, wrapped in a lock (that's discussed in the various links as well).
I have done some digging and experimenting. We can atomically obtain all our desired locks and discover which rows already exist:
```sql
BEGIN TRANSACTION;

SELECT Id
FROM Table WITH (UPDLOCK, HOLDLOCK)
WHERE Id IN (1, 2, 3);
```
UPDLOCK causes update locks to be taken on rows that exist.
HOLDLOCK causes locks to be retained until the end of the transaction. What's more, it "is equivalent to SERIALIZABLE". As documented, SERIALIZABLE's "singleton fetch" places range locks on the gaps where the missing values could exist.
The above query locks both the existing rows and the gaps where the missing rows could exist. It returns the IDs of the existing ones.
Within the same transaction, we can now safely proceed to UPDATE the existing rows and INSERT the missing ones.
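Continuing the sketch within the same transaction; this assumes the SELECT reported that only Id 1 exists, and a hypothetical payload column Val (names mirror the sketch above):

```sql
-- Existing row: plain UPDATE, safe under the update lock we already hold.
UPDATE Table SET Val = 'a' WHERE Id = 1;

-- Missing rows: plain INSERTs into the key ranges we range-locked.
INSERT INTO Table (Id, Val) VALUES (2, 'b'), (3, 'c');

COMMIT TRANSACTION;
```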
@Timovzl I'm no expert, but is this any different from just doing MERGE with HOLDLOCK?
@roji I don’t know much about MERGE, but you said that MERGE with HOLDLOCK isn’t atomic. As such, I’ve provided an alternative that is not MERGE and that is atomic.
It helps that its guarantees are easier to reason about and test.
@Timovzl holding a lock and being atomic are not the same thing. Holding locks can cause deadlocks and block other transactions until the lock is released; these two things would not happen with an atomic UPSERT. I'd recommend reading up on these two concepts, and especially the link I posted multiple times above, which deals with this.
Again, I'm no expert, but the same HOLDLOCK which you propose on SELECT figures above in MERGE, and probably has the same effect in both statements.
@roji I'm curious what distinction you are pointing to, and what makes you say that MERGE with HOLDLOCK is not atomic.
Atomicity: "An atomic transaction is an indivisible and irreducible series of database operations such that either all occurs, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright."
Since for most queries we cannot achieve this property with a simple compare-and-swap, locks are held in order to achieve it. The referenced implementations very much do so: they hold locks and block other transactions until the lock releases (which is when the transaction completes).
PostgreSQL's beautiful ON CONFLICT and MySQL's not-quite-ideal ON DUPLICATE KEY both use locks in the following way to achieve an atomic UPSERT: traverse an index, locking and modifying matching rows as they are encountered. The locks are held until the end of the transaction - this is necessary to guarantee that, by that time, we can still commit the complete result as-is. It naturally follows that deadlocks may occur (and indeed they can, in each of these implementations - example 1, example 2, example 3). Deadlocks occur when concurrent modifications share at least two rows and work in a different order. The query author or the optimizer can avoid deadlocks by working in a consistent order.
The proposed SELECT ... WITH (UPDLOCK, HOLDLOCK) works in the exact same way (except that it needs separate INSERT/UPDATE statements to perform the actual modifications). We obtain all the locks in a single statement, just like ON CONFLICT does, and we retain them for the transaction's duration, just like ON CONFLICT does. Within the same transaction, whether we follow up with 0 or 100 subsequent UPDATEs to our locked rows (or INSERTs into our locked gaps) is irrelevant at that point, for either implementation. Furthermore, by operating in the same way, we have the exact same chance of deadlocks. (Luckily, most people simply don't encounter them, because UPSERTs tend to be based on the PK and thus generally have a consistent ordering.)
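The deadlock condition described here (shared rows, touched in different orders) is the classic one; it can be sketched with plain UPDATEs, independent of any upsert mechanism (names illustrative):

```sql
-- Session A:
BEGIN TRANSACTION;
UPDATE t SET val = 'a' WHERE id = 1;      -- A locks row 1
-- Session B, concurrently:
--   BEGIN TRANSACTION;
--   UPDATE t SET val = 'b' WHERE id = 2; -- B locks row 2
UPDATE t SET val = 'a' WHERE id = 2;      -- A now waits for B...
--   UPDATE t SET val = 'b' WHERE id = 1; -- ...and B waits for A: deadlock
```

Processing rows in a consistent order (e.g. PK ascending) removes this interleaving.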
Presumably, MERGE with HOLDLOCK works the same way as well. If it does, and assuming that its other quirks have been adequately fixed, then it satisfies the requirements entirely rather than partially, which is good news.
Although I'm reluctant to reason about MERGE due to my lack of experience with it, its greater complexity, and its many reported quirks, we can reason about SELECT ... WITH (UPDLOCK, HOLDLOCK) quite well. It's also fairly easy to test its behavior under concurrency, as I did before proposing it.
I think I've mentioned this several times above: PostgreSQL's ON CONFLICT never throws or causes deadlocks because of concurrent access, whereas MERGE does. Yes, there are indeed multiple definitions of "atomicity" out there: the classical all-or-nothing sense quoted above, and a second, stronger sense in which the operation also cannot fail or deadlock because of concurrent operations.
To fully understand this distinction, compare Interlocked.Increment with a more classical locking approach:
```csharp
// Lock-free atomic primitive:
Interlocked.Increment(ref a);

// Classical locking:
lock (something)
{
    a++;
}
```
The first approach is inherently safer, as it cannot ever cause a deadlock (and does not block other threads).
PostgreSQL's beautiful ON CONFLICT and MySQL's not-quite-ideal ON DUPLICATE KEY both use locks in the following way to achieve an atomic UPSERT [...] The locks are held until the end of the transaction [...] It naturally follows that deadlocks may occur [...]
I think you're conflating different types of deadlocks (FWIW I tried to explain this above).
Any sort of row-modifying operation takes a row-level write lock in PG - e.g. any UPDATE, regardless of whether it has ON CONFLICT or not. AFAICT, the links you pointed to above are cases of deadlocks resulting from other operations coming after the update, i.e. just the generic kind of deadlock where rows are updated in reverse order. In other words, the ON CONFLICT doesn't seem relevant there, and those people would have the same deadlocks if they used just a regular UPDATE. But I haven't read too deeply.
The problem with MERGE is that it can produce deadlocks without other operations, simply by doing two MERGEs concurrently; this does not happen with PG's ON CONFLICT. In that sense, MERGE and ON CONFLICT are very different, and ON CONFLICT can be said to be atomic in the 2nd sense I used above.
Finally, regarding your proposal to use SELECT with UPDLOCK, HOLDLOCK... This is a variation on what is proposed in this post:
```sql
BEGIN TRANSACTION;

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE) SET val = @val WHERE [key] = @key;

IF @@ROWCOUNT = 0
BEGIN
    INSERT dbo.t([key], val) VALUES(@key, @val);
END

COMMIT TRANSACTION;
```
Note that this doesn't require multiple roundtrips, which I think you were suggesting above (first SELECT in one roundtrip, then UPDATE/INSERT in another). This becomes more complicated if you want to upsert multiple rows - the same post shows a way of doing it, but that relies on a TVP, which is generally less appropriate in the EF context.
I'm also not enough of an expert to know what potential negative impact the SERIALIZABLE hint can have here; in general, higher isolation levels have a cost in terms of performance etc. Note again that PG's ON CONFLICT doesn't require this complexity, and also doesn't require any sort of higher isolation level or alternative locking strategy - because it's atomic (in the 2nd sense).
But regardless, I do agree that something like this could definitely be used to implement UPSERT for SQL Server in EF Core.
@roji Ah, thanks for the clarification!
Can you elaborate on how PostgreSQL avoids these deadlocks? I can imagine precisely one way, and that is by internally forcing a consistent order (most likely PK ASC). If that's the case, then that is, of course, easy enough to replicate by the way the query is implemented, in the context of EF - which seems to make it a non-issue.
Out of sheer curiosity, are you 100% sure about ON CONFLICT never causing deadlocks? There is no way to force it to use a malicious order and make it deadlock with another instance of ON CONFLICT? I haven't had the time to test it yet.
Note that this doesn't require multiple roundtrips, which I think you were suggesting above (first SELECT in one roundtrip, then UPDATE/INSERT in another). This becomes more complicated if you want to upsert multiple rows - the same post shows a way of doing it, but that relies on a TVP, which is generally less appropriate in the EF context.
I was thinking that the solution based on SELECT ... WITH (UPDLOCK, HOLDLOCK) could be done in a single roundtrip if the SELECT part was done as a subquery or CTE. Regrettably, the subquery or CTE would have to be repeated between the INSERT and the UPDATE, unless an explicit temporary table were used.
I agree that the @@ROWCOUNT solution neatly avoids this. I chose SELECT ... WITH (UPDLOCK, HOLDLOCK) specifically because it is not restricted to a single row. 😄
I'm also not enough of an expert to know what potential negative impact the SERIALIZABLE hint can have here; in general, higher isolation levels have a cost in terms of performance etc. Note again that PG's ON CONFLICT doesn't require this complexity, and also doesn't require any sort of higher isolation level or alternative locking strategy - because it's atomic (in the 2nd sense).
You're right, that is important to know. The only drawback I have been able to find is that its "singleton" gap locks (which I linked to earlier) may prevent certain unrelated rows from being inserted in the meantime, if we're unlucky. For example, when locking the nonexistent PK value "Bill", if the predecessor of that would-be row is "Ben", then every potential row greater than "Ben" and no greater than "Bill" will be blocked by that lock, or something to that effect. For SQL Server, I expect this will be unavoidable, for lack of a more accurate feature. I think this highlights the primary difference with ON CONFLICT: that will simply insert the missing row and lock that, avoiding the need for gap locks.
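The "Ben"/"Bill" scenario can be sketched as follows (hypothetical table `t` with string PK `Name`):

```sql
-- Session A: range-lock the gap where the missing key 'Bill' would live.
BEGIN TRANSACTION;

SELECT Name
FROM t WITH (UPDLOCK, HOLDLOCK)
WHERE Name = 'Bill';

-- With 'Ben' as the closest preceding existing key, the range lock covers
-- all keys after 'Ben' up to and including 'Bill'. Until session A commits,
-- a concurrent session's unrelated insert will block, e.g.:
--   INSERT INTO t (Name) VALUES ('Bex');
```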
Can you elaborate on how PostgreSQL avoids these deadlocks? I can imagine precisely one way, and that is by internally forcing a consistent order (most likely PK ASC). If that's the case, then that is, of course, easy enough to replicate by the way the query is implemented, in the context of EF - which seems to make it a non-issue.
I have very little knowledge of PostgreSQL's implementation details. It could be that it applies a consistent ordering, and also takes an exclusive lock from the start to avoid deadlocks related to promotion from shared to exclusive. BTW, AFAICT this is one of the issues with MERGE: when it checks whether the row exists it acquires a shared lock, which it then must promote in order to update (and two concurrent MERGEs can deadlock each other because of this).
But the important thing is that all this is irrelevant/hidden to us; PG provides us with a primitive that takes care of it for us. This is similar to the atomic operations provided by CPUs (which you use e.g. via Interlocked.Increment) - I don't know what goes on in the CPU, but I know what guarantees it provides as a primitive.
Out of sheer curiosity, are you 100% sure about ON CONFLICT never causing deadlocks? There is no way to force it to use a malicious order and make it deadlock with another instance of ON CONFLICT?
Not to my knowledge, and avoiding "unprincipled deadlocks" was one of the primary design goals of PG's ON CONFLICT, so I'm assuming that's not possible (see again this link).
I was thinking that the solution based on SELECT ... WITH (UPDLOCK, HOLDLOCK) could be done in a single roundtrip if the SELECT part was done as a subquery or CTE. Regrettably, the subquery or CTE would have to be repeated between the INSERT and the UPDATE, though, unless an explicit temporary table were used.
Yeah, repeating the query is definitely not ideal. Note that batching together multiple single-row UPSERTs is probably sufficient (so still a single roundtrip). But in any case, we can revisit and dig deeper when we get to actually implementing this feature.
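A sketch of such batching, concatenating the single-row UPDATE/IF/INSERT pattern from the post quoted earlier; values are inlined here for illustration, whereas real code would use parameters:

```sql
BEGIN TRANSACTION;

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE) SET val = 'a' WHERE [key] = 1;
IF @@ROWCOUNT = 0
    INSERT dbo.t([key], val) VALUES (1, 'a');

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE) SET val = 'b' WHERE [key] = 2;
IF @@ROWCOUNT = 0
    INSERT dbo.t([key], val) VALUES (2, 'b');

COMMIT TRANSACTION;
```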
You're right, that is important to know. [...]
Yes - this is indeed the important bit: the range locks can cause lots of blocking for other transactions. Note the following warning in the SQL Server docs:
Because concurrency is lower, use this option only when necessary.
I have essentially the same vexing questions that @Timovzl has asked. How can it be that Postgres appears to be so much better? I have described some of the trade-offs that exist here, and I don't know how Postgres can straddle this so well. There must be some downside.
One explanation could be that SQL Server needlessly provides a worse locking model for MERGE than for UPDATE. It seems that this was the conclusion in this thread. In particular, it was claimed/proposed that MERGE fails to acquire U-locks the same way UPDATE does. This would be a satisfactory explanation for the difference in exposed capabilities.
And it's not so easy to restore those U-locks, because then those locks would persist past the end of the statement, reducing concurrency. I think...
That Postgres doc says something interesting:
As outlined below, SQL MERGE seemingly doesn't meet this standard (principally because it lacks "the essential property of UPSERT", which appears to be in tension with supporting a fully flexible join).
This sounds like they recognized they couldn't deliver those capabilities with MERGE, so they instead built something specialized. That fits the picture. And if this capability is enough for EF's needs, then that's a reasonable approach to take.
@roji I again see what a tough battlefield you are on 😄. This space of trade-offs is amazingly complex. And it's so difficult to obtain reliable information. I'm really rooting for the team to find the right path.
It could also turn out that developers need a way to pick different strategies. Maybe it is not possible to pick a universally appropriate strategy.
One explanation could be that SQL Server needlessly provides a worse locking model for MERGE than for UPDATE.
I think this goes in the right direction - though I wouldn't necessarily say "needlessly". MERGE is a complex and flexible statement that was intended for synchronizing data between tables; it is capable of (and was conceived for) doing far more than upserting. Concretely, MERGE first does a check (the WHEN clause(s)), and only after that performs the actual merge action (update/insert/delete). AFAICT this is why it first acquires a shared lock (for the check), and only converts/promotes the lock to exclusive for the merge action; that is a major reason why it can deadlock with itself and is unsuitable as an UPSERT for concurrent scenarios. MERGE could have acquired an exclusive lock immediately for the check, but that would make it needlessly lock rows which wouldn't necessarily have to be modified later.
PG's ON CONFLICT, in contrast, is much more restricted/targeted, and does not need to perform any lock promotion. Also, since it's a single, self-contained "primitive", there's no need to take range locks which extend to the end of the transaction (since we know they're not needed after the statement completes).
To summarize, I think it's a case of comparing a very simple and targeted feature (ON CONFLICT), which can be geared to solving one exact problem (and solving it well), vs. a much more general/flexible feature (MERGE) - possibly too general/flexible for its own good. I think it could also be argued that MERGE really never was intended for a concurrency-safe UPSERT (even if the MERGE docs do show UPSERT as a possible use), but rather for merging/syncing data between tables.
Interestingly enough, merging data between tables in PG is typically done by building on top of ON CONFLICT (e.g. by doing INSERT ... SELECT to move data from one table to another), so ON CONFLICT is a building block for the same workload (syncing/merging) that MERGE performs on SQL Server.
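That table-to-table pattern looks roughly like this (table and column names illustrative):

```sql
-- PostgreSQL: sync rows from a staging table into a target table,
-- using ON CONFLICT as the merge primitive.
INSERT INTO target (id, val)
SELECT id, val FROM staging
ON CONFLICT (id) DO UPDATE SET val = excluded.val;
```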
@roji I again see what a tough battlefield you are on 😄. This space of trade-offs is amazingly complex. And it's so difficult to obtain reliable information. I'm really rooting for the team to find the right path.
Thanks :) Yeah, it's really tricky stuff - and thanks both to you @GSPP and to @Timovzl for the conversation: it helps research and get an accurate picture of the situation.
Just keep in mind that we're not currently planning to implement UPSERT for EF Core 7.0: even beyond these complicated issues, there's quite a bit of work in adding the concept to EF's change tracking, API surface etc. But it's definitely high on our wanted features list.
Note: when designing this, consider the case of using non-PK unique properties as the uniqueness "arbiter" (see #28263).
At the moment this can be achieved with the authored lib EFCore.BulkExtensions, using the method:
context.BulkInsertOrUpdate(entities);
(it has a config option for custom unique column(s))
Free for most users; paid only for companies with $1 million+ annual revenue (small fee for a yearly license).
@borisdj if you promote your library, be transparent and put a disclaimer in all your posts saying it's not free, plus a link to your pricing https://github.com/borisdj/EFCore.BulkExtensions/issues/1079 or website https://www.codis.tech/efcorebulk/. I wasted my time trying your solution... I had to undo my code because our company doesn't want to pay $1000 for your license!!!
If it does not affect performance, I hope so ^_^ .
@borisdj if you promote your library, be transparent and put a disclaimer in all your posts saying it's not free, plus a link to your pricing borisdj/EFCore.BulkExtensions#1079 or website https://www.codis.tech/efcorebulk/. I wasted my time trying your solution... I had to undo my code because our company doesn't want to pay $1000 for your license!!!
That's why the license is the first thing I check when evaluating a third-party dependency. Both free and commercial products may have license terms not suitable for your project.
Hey there. Any traction on this issue? Any chance it's considered for v8?
This issue is in the Backlog milestone. This means that it is not planned for the next release (EF Core 8.0). We will re-assess the backlog following this release and consider this item at that time. However, keep in mind that there are many other high priority features with which it will be competing for resources. Make sure to vote (👍) for this issue if it is important to you.
Well, this very issue is already the most upvoted one. 😉
@0xced Every vote counts!
At the moment this can be achieved with the authored lib EFCore.BulkExtensions, using the method: context.BulkInsertOrUpdate(entities); (it has a config option for custom unique column(s)) Free for most users; paid only for companies with $1 million+ annual revenue (small fee for a yearly license).
@markusschaber @allesandrogallo
Shameless plug:
I created an MIT fork (from the point when it was still MIT, 18.1.2023) of the repository above, where I also solved multiple issues not fixed in the original:
https://www.nuget.org/packages/EFCore.BulkExtensions.MIT/
So give it a try.
Over 8 years later and still not implemented, despite consistently being one of the top three highest-voted issues in this repo... @roji @ajcvickers can't you get the SQL Server team to add an equivalent construct to Postgres's on conflict?
@IanKemp we don't control what features SQL Server implements, but they're definitely aware of the need. In any case, this issue isn't about adding anything to SQL Server, but implementing whatever does make sense within EF (and a MERGE-based implementation may make sense, we'd need to study this more).
I do hope we get around to tackling this in EF 10.
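For context, this is the Postgres construct being referred to above. A minimal sketch (the table and column names are made up for illustration):

```sql
-- Hypothetical table for illustration.
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);

-- PostgreSQL upsert: insert the row, or atomically update it if a
-- row with the same key already exists. No separate existence check
-- is needed, so there is no race window between check and insert.
INSERT INTO users (id, name) VALUES (1, 'Alice')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
```

SQL Server has no single-statement equivalent; the closest option is MERGE, which is why a MERGE-based implementation keeps coming up in this thread.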
I was using EFCore.BulkExtensions, and a couple of months ago it started jumbling my IDs during bulk imports. I spent over 10 hours debugging the issue. The https://github.com/videokojot/EFCore.BulkExtensions.MIT package dropped right in and cleared everything up. A big thank you to @videokojot. I agree with the rest of the folks here, though: this should be out-of-the-box functionality from Microsoft in EF.
While waiting for the feature, I've created a custom DbContextOptionsExtension that replaces the IUpdateSqlGenerator with one that overrides the AppendInsertCommandHeader method to write "REPLACE INTO" instead of "INSERT INTO" for MySQL, and "INSERT OR REPLACE INTO" for SQLite, when needed. Since these classes are singletons, I've created an accessor (similar to HttpContextAccessor) to be able to get my save options. The end result looks like this:
var options = new DbContextOptionsBuilder<AppDbContext>()
    .UseSqlite()
    .UseSqliteSaveOptions()
    .Options;

var dbContext = new AppDbContext(options);
dbContext.Entities.Add(new Entity { Id = Guid.NewGuid(), Name = "An entity" });
await dbContext.SaveChangesAsync(options => options.UpsertOnInsert = true);
I would have preferred to be able to do something like
dbContext.Entities.Upsert(new Entity { Id = Guid.NewGuid(), Name = "An entity" });
but the only possible way I found to extend it that way was to create a temporary shadow property to flag the EntityEntry. It felt very dirty. With extension types it would probably have been possible, but sadly they're not in the next release of .NET.
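One caveat worth noting about the approach above (a hedged sketch; the table and column names are invented for illustration): SQLite's INSERT OR REPLACE is not a true upsert. It deletes the conflicting row and inserts a fresh one, which fires delete triggers and can cascade foreign keys, whereas ON CONFLICT DO UPDATE modifies the row in place:

```sql
-- Hypothetical table for illustration.
CREATE TABLE entities (id TEXT PRIMARY KEY, name TEXT);

-- REPLACE semantics: the conflicting row is DELETED and a new row
-- inserted, so delete triggers and ON DELETE CASCADE constraints fire,
-- and any columns omitted here are reset to their defaults.
INSERT OR REPLACE INTO entities (id, name) VALUES ('1', 'An entity');

-- Upsert semantics (SQLite 3.24+): the existing row is UPDATED in
-- place; no delete occurs and unlisted columns are preserved.
INSERT INTO entities (id, name) VALUES ('1', 'An entity')
ON CONFLICT (id) DO UPDATE SET name = excluded.name;
```

Depending on triggers and foreign-key setup, the two statements can behave quite differently.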
Can you write a little snippet showing how you would have used it?
Revisiting this question in light of the Execute APIs added in EF Core 7, it would be nice to have a method similar to this for the standardised MERGE statement.
static Task MergeIntoAsync<TSource, TTarget, TKey>(
    this DbSet<TTarget> target,
    IQueryable<TSource> source,
    Expression<Func<TTarget, TKey>> targetKeySelector,
    Expression<Func<TSource, TKey>> sourceKeySelector,
    Expression<Func<SetPropertyCalls<TSource>, SetPropertyCalls<TSource>>>? whenMatchedThenUpdate = null,
    Expression<Func<TTarget, TSource, TTarget>>? whenNotMatchedThenInsert = null)
    where TTarget : class
Easier said than done, of course.
Useful post on concurrency-safe usage of MERGE for upsert.
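For reference, a minimal sketch of the concurrency-safe pattern that post describes (the table and column names are illustrative, not from any EF-generated SQL): on SQL Server, a bare MERGE leaves a race window between the match check and the insert, so two concurrent upserts for the same key can both take the NOT MATCHED branch and one then fails with a key violation. The usual fix is the HOLDLOCK (SERIALIZABLE) hint on the target:

```sql
-- Hypothetical schema for illustration. WITH (HOLDLOCK) makes the
-- target range lock persist until the end of the transaction, so a
-- concurrent MERGE for the same key blocks instead of also inserting.
MERGE INTO dbo.Users WITH (HOLDLOCK) AS target
USING (VALUES (1, N'Alice')) AS source (Id, Name)
    ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET Name = source.Name
WHEN NOT MATCHED THEN
    INSERT (Id, Name) VALUES (source.Id, source.Name);
```

As discussed earlier in the thread, HOLDLOCK effectively runs the statement under SERIALIZABLE semantics for the target table, which is why it probably shouldn't be an unconditional default.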
Just chiming in to mention this would be nice to have in light of data seeding improvements in EF Core 9. In tests and other scenarios, I find myself often wanting to do something like:
public void SeedData(ApplicationDbContext db)
{
    db.SomeTable.AddOrUpdate(new SomeValue
    {
        Id = 1,
        Value = "One",
        SomeChildren = [new Child { /* ... */ }]
    });
    // ...
    db.AllowIdentityInsert();
    db.SaveChanges();
}
In some cases, HasData may accomplish what's needed. But the main use case I'm thinking of is integration tests and local development. In these scenarios, we have a bunch of data we want to populate into the dev and test dbs, but shouldn't really go to production. As it stands, our options are:
@danroot I agree that the new UseSeeding() API would be a good place to use upsert. I'll just note that you can simply use EF to check whether certain data is already in the table, and only do the seeding in that case; in other words, there's no need to delete and recreate unconditionally every time. That's the good thing about the new UseSeeding(): you can use arbitrary code and EF operations to determine what you need to seed.
In fact, even if EF had upsert support, doing this would probably still be a good idea: if you have a large amount of seed data, you may not want to upsert it every time, since that still isn't great for perf; checking what's in the database in advance is easy and can really help with performance.
@roji I do use a "check if the seed data exists" approach in the makeshift seeding code I use today. The problem is there are cases where the check condition passes, but we still want to reseed. For example:
if (!db.Users.Any())
db.Users.Add(...);
Works until you decide to add a new user to the seed. Then you have to delete all data to force the check to fail, or change the condition:
if (db.Users.Count() != 2)
{
    db.Users.Add(...);
    db.Users.Add(...);
}
Which works until you need to add a field. Then you again have to delete all data to force the check to fail, or change the condition:
if (db.Users.Count() != 2 || db.Users.Any(x => x.Foo == null))
{
    db.Users.Add(new User { /* ... */ Foo = "Bar" });
    db.Users.Add(new User { /* ... */ Foo = "Blah" });
}
In fact, this will logically always be the case: either the check condition covers some subset of the seed data, or it covers all of it, in which case you might as well use MERGE. In our case, for now, we just drop the database and force a re-seed in those situations. We've also put seed data in migrations, which mitigates that issue, since it is essentially:
IF NOT EXISTS (SELECT Id FROM __MigrationHistory WHERE Name = 'Seed1')
BEGIN
    INSERT INTO UserTypes ...
END
This works well for things like tables with a row that corresponds to each field in an enum. But it doesn't work well for dev/test data we don't want applied in the production db.
In my project I need to insert an entity if it does not exist yet, or update it otherwise (somewhat like UPSERT in MongoDB). As I understand it, in SQL this can be accomplished using MERGE, but I was not able to find MERGE in EF Core.