dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.79k stars 3.19k forks source link

TPH: SQL for persisting property that references collection of derived types is performed alphabetically instead of respecting the collection order #33918

Closed luisabreu closed 5 months ago

luisabreu commented 5 months ago

Hello.

I've noticed a somewhat strange behavior while using EF Core to persist an object which holds a list of TPH types. When EF Core generates the SQL for insertion, it will only generate them according to the list's order; instead, it seems to rely on the derived types alphabetical order. This breaks things when the list that is being persisted is used as an event log and you'tr trying to persist several events at once.

I have already written about this here (with a somewhat thorough explanation of the problem). Am I missing something?

Thanks

roji commented 5 months ago

@luisabreu there are no ordering guarantees in SQL, unless an ORDER BY clause is present. In other words, if you do INSERT INTO x (y) VALUES (1), (2), (3), there's no guarantee that 1 gets inserted first, then 2, then 3 - the database is free to do this in any ordering. For example, it may choose to parallelize the insertion, resulting in random ordering (this can be a valuable optimization). So you generally should not be relying on the order in which rows are specified in an INSERT statement.

luisabreu commented 5 months ago

Hello.

That's really not the issue. The problem is that you have a list that is ordered and, for instance, looks like this: error1

And when EF persists data to the database, it will generate several insert instructions according to the ahlpabetical order of the types instead of respecting the order of the list:

-- insert for  type IntervencaoDadosGerais
INSERT INTO [IntervencaoPATs] ([CodUnidadeOrganica], [EstadoPedido], [IdFuncionario], [IdLocalTrabalho], [IdPATs], [ObservacaoUtente], [Observacoes], [OrigemPedido], [Prioridade], [TipoPedido], [Versao], [Data], [Login])
OUTPUT INSERTED.[IdIntervencaoPATs]
VALUES (@p8, @p9, @p10, @p11, @p12, @p13, @p14, @p15, @p16, @p17, @p18, @p19, @p20);

-- insert for type IntervencaoDocumentoAdicionado
INSERT INTO [IntervencaoPATs] ([EstadoPedido], [IdPATs], [ObservacaoUtente], [Observacoes], [Prioridade], [TipoPedido], [Versao], [Data], [Login])
OUTPUT INSERTED.[IdIntervencaoPATs]
VALUES (@p21, @p22, @p23, @p24, @p25, @p26, @p27, @p28, @p29);

-- insert for type IntervencaoEquipamentos
INSERT INTO [IntervencaoPATs] ([EstadoPedido], [IdPATs], [ObservacaoUtente], [Prioridade], [TipoPedido], [Versao], [Data], [Login])
OUTPUT INSERTED.[IdIntervencaoPATs]
VALUES (@p30, @p31, @p32, @p33, @p34, @p35, @p36, @p37);

-- insert for type IntervencaoGenerica
INSERT INTO [IntervencaoPATs] ([DataFim], [DataInicio], [Observacoes], [EstadoPedido], [IdPATs], [ObservacaoUtente], [Prioridade], [TipoAreaIntervencao], [TipoPedido], [TipoIntervencaoAssistencia], [Versao], [Data], [Login])
OUTPUT INSERTED.[IdIntervencaoPATs]
VALUES (@p38, @p39, @p40, @p41, @p42, @p43, @p44, @p45, @p46, @p47, @p48, @p49, @p50);

-- insert for type IntervencaoTecnicos
INSERT INTO [IntervencaoPATs] ([EstadoPedido], [IdPATs], [ObservacaoUtente], [Prioridade], [LoginResponsavel], [TipoPedido], [Versao], [Data], [Login])
OUTPUT INSERTED.[IdIntervencaoPATs]
VALUES (@p51, @p52, @p53, @p54, @p55, @p56, @p57, @p58, @p59);

These SQL instructions are generated when I try to save the top pat object (ie, ´db.Pedidos.Add(pat);await db.SaveChangesAsync();`).

As you can see , EF is generating several inserts statements (one for each type of Intervencao). The problem is that this will completely break apps that have aggregates that need to save batches of objetcs where order is important.

Is there anything that can be done to solve this? Am I missing something on the mappings? And why does EF use such a curious approach while generating the INSERT statements?

Thanks

roji commented 5 months ago

I think my comment above does apply to what you're saying.

Keep this firmly in mind: in the database, there's no ordering of rows at all, unless you have some sort of explicit column that you use to convey that ordering. In other words, the ordering of INSERT statements should never matter, and does not guarantee anything. For example, if you select back your rows from IntervencaoPATs, they may come back at any random order, regardless of how you (or EF) inserted them into the database.

I'm still not sure what exact problem this is causing for you... If the order of a given PedidoAssistencia's Intervecoes is important in any way, then you need to add a numeric column to your Intervecoes to express that order.

The main reason that EF sorts rows alphabetically is to prevent various deadlock situations where e.g. two threads are concurrently doing SaveChanges(), updating the same rows but in different order. Alphabetical sorting guarantees that the execution ordering is always deterministic and avoids deadlocks.

luisabreu commented 5 months ago

Hello again.

I must confess that I'm still having a hard time understanding your explanation. I'm trying, I really am...please bear with me.

So, can we please go through this once more, step by step to see if I can make sense of what you're saying. I have an object (pat in the previous image) which has a property called Intervencoes. It looks something like this (sorry, I'm writing this from off the top of my head, so syntax might not be perfect):

public class PedidoAssistencia: ...{

  private IList<Intervencao> _intervencoes = new List<Intervencao>();  
  public IEnumerable<Intervencao> Intervencoes {
    get => _intervencoes.ToList();
    set => _intervencoes = new List<Intervencao>(value);
  }
}

Intervencao is an abstract base type and there are several derived types. The types are all mapped into the same table using TPH (not sure if it's important, but the column discriminator is an enum).

What I'm trying to do is save an existing PedidoAssistencia object graph to the db (everything is wrapped in a transaction):

var pat = new PedidoAssistencia();
// call methods to add intervencoes
// then save it to db

await db.SaveChangesAsync();

I expected to see a SQL batch INSERT statement (ie, several INSERT INTO...VALUES(); INSERT INTO...VALUES(); ...) generated by EF: one forpat (table PedidosAssistencias) and one for each of the items on the list of Intervencoes (table Intervencoes). In other words, I was expecting to see the following:

INSERT INTO PedidosAssistencias (...)
values(...);
-- and here the inserts for the related entities
-- going back to the image, this would look like this

INSERT INTO Intervencoes(columns_for_first_intervencao_dados_genericos)
VALUES(values_for_first_intervencao_dados_genericos);

INSERT INTO Intervencoes(columngs_for_first_intervencao_dados_tecnicos)
VALUES(values_for_first_intervencao_dados_tecnicos);
-- other inserts that would respect the order of the items on the list

Unfortunately for me, that's not what's happening. Indeed, I'm seeing the generated batch with several INSERT instructions, but the order of the INSERT statements is not the one I expected. In practice, it's as if the INSERT statements are being generated after "sorting" the list of existing Intervencao according to each object's typename.

I must stress that when I mention the INSERT batch statements I'm talking about the SQL that is generated by EF, ie, I'm not talking about the order of execution of the 'INSERT' statements on the DB engine.

Going back to the image before saving, it's as if the list would be "ordered" into something that looks like this before EF generates the INSERT batch:

pat
 |
 |-IntervencaoDadosGerais (already on position 0)
 |-IntervencaoDocumentoAdicionado (jumps from last to position 1)
 |-IntervencaoEquipamentos (from position 3 to 2)
 |-IntervencaoGenerica (jumps from position 2 to 4)
 |-IntervencaoTecnicos (from position 1 to  last)

In your previous answer, you said that "The main reason that EF sorts rows alphabetically is to prevent various deadlock situations". What do you mean by sort rows alphabetically? Are you saying that whenever EF needs to save a collection of polymorphic types (accessed through a property of an object graph) it will automatically "reorder" collection list according to each item's typename before generating the INSERT batch statements and that by doing this it's trying to prevent deadlocks?

I'm sorry for insisting on this topic, but I'd really love to get a grip on how/why EF behaves like this. Coming from NH world, this behavior is not something I was expecting. If I'd had to write the SQL code by hand, I'd also follow the order in the list for generating the final SQL batch statement...

It seems like I'm missing something. I'd really appreciate your help in trying to understand what.

BTW, sequence order is important because each PedidoAssistencia can be seen as some sort of event long where each Intervencao represents an operation performed on that PedidoAssistencia so it's essential that it gets persisted in the correct order (btw, and please correct me if I'm wrong, but I'm under the impression that explicit batches like the ones we're talking about are executed sequentially - assuming that each one runs succesfully) .

thanks again.

roji commented 5 months ago

I'm not sure how to explain further without repeating what I already wrote above... I'd urge you try carefully think why you care about the order of INSERT statements generated by EF - that is the crux of the matter. I think you have some other place (your event log) which you think depends on this insertion order; and what I'm telling you, is that it cannot/should not do so.

Once again... if you send an INSERT statement with multiple values to the database (INSERT INTO Intervencoes(columngs_for_first_intervencao_dados_tecnicos) VALUES(values_for_first_intervencao_dados_tecnicos)), the database is free to do the actual insertion (writing to disk) in any order it wishes (e.g. for efficiency), which means that how EF orders things in the SQL is completely irrelevant, and should never be important to the application.

Now, you're reporting that you're seeing multiple INSERTS into the Intervencoes table rather than a single INSERT with multiple rows (like what I discussed in the paragraph just above this one). I'm not sure why that's happening - usually EF does use a single INSERT statement, but let's disregard that for now. You're right for such a series of multiple INSERT statements, the database cannot "reorder" them, since the later INSERTs are only processed once the earlier once have been completed. But here's the crucial point: EF could use a single INSERT with multiple values (from the previous paragraph) - that would be more efficient. And at that point, once again, the database would be free to reorder and/or parallelize the inserts, once again meaning that the SQL ordering is irrelevant/unimportant.

BTW, sequence order is important because each PedidoAssistencia can be seen as some sort of event long where each Intervencao represents an operation performed on that PedidoAssistencia so it's essential that it gets persisted in the correct order

I don't know what this "event log" refers to, or how you're concretely trying to observe the insertion order. In any case, it's very likely that you have a design problem in your application. If the ordering of your Intervecoes is important to your application in any way, you must add a column containing the index of that intervention with its pedida, and not rely on insertion ordering.

To summarize, relational database contain unordered sets, and insertion order does not matter/isn't preserved in any case - not in the general case. Therefore, when you call SaveChanges in EF, EF may change the ordering for various reasons (such as deadlock avoidance).

luisabreu commented 5 months ago

Hello again.

Thanks for the thorough explanation.

Just a couple of extra points to clear things up.

I think that EF is not doing a singe INSERT INTO...VALUES(...),(...) because the property references a collection of polymorphic types, so it might require different number of columns for each object instance that belongs to the collection.

Regarding the event log concept, what I was trying to say was that PedidoAssistencia is an aggregate obejct which behaves like an event log where each entry is represented by an Intervencao derived type. The implicit list order should generate the INSERT INTO...; INSERT INTO...; batch statement sequence in the "right" order and the autonumber ID generated by the DB for of each of those items would automatically allow me to retrieve them in the correct order.

As I said, I didn't expect to see this reordering and I never saw NH do it and I'm pretty sure that if anyone had to write the code by hand, they would simply generate the INSERT statements following the collection order.

You make a good point when you say that relational database contain unordered sets. However, a good ORM should also minimize the changes you have to make to your code in order to solve the impedance mismatch that exists between OO and relational databases. My question probably seems dumb from a relational database point of view (though I am write because the batch with several INSERT INTO...; INSERT INTO... does respect the sequence). However, from an OO point of view, that extra index I'll have to add does not make any sense because my list is an ordered sequence of elements.

Once again, thanks for your help.

jrcpereira commented 5 months ago

This is interesting... @roji, I apologize upfront if this is a naive comment, but I believe there is a language misunderstanding...

When we execute multiple operations (e.g. INSERT) against a database on the same roundtrip (using Dapper as example), we can retrieve the inserted id's from the ordered result set. The DB engine returns the datasets in the same ordered way as the input. From an overall view, this is what the OP expected from EF when he execute the Save command.

What you are saying is that EF reorders alphabetically by table name the commands before executing, changing the order of the multiple insert command that are supposed to be executed in the same roundtrip in the same connection?

From my POV this is not a default database engine behavior, but more a EF lib behavior and may (or may not) have to do with how the EF processes and executes the request.

Does EF span this Save request into multiple parallel requests to the database server using different connections? Isn't this being executed on a single roundtrip request to the DB using the same connection?

I'm just trying to understand the why of the ordering here and imagine possible scenarios where this could be useful.

Thanks!

roji commented 5 months ago

I think that EF is not doing a singe INSERT INTO...VALUES(...),(...) because the property references a collection of polymorphic types, so it might require different number of columns for each object instance that belongs to the collection.

Possibly, I'd have to look into it; but the point is that it could, since those inserts are into the same table. For example, there's a good chance that it's more efficient to do such a single INSERT, and explicitly insert NULLs where relevant because of the hierarchy. But we'd have to look into it.

The implicit list order should generate the INSERT INTO...; INSERT INTO...; batch statement sequence in the "right" order and the autonumber ID generated by the DB for of each of those items would automatically allow me to retrieve them in the correct order.

Please re-read what I wrote above: relying on the autonumber ID to represent order is wrong. In principle, there isn't even a necessary guarantee that autonumber IDs are increasing; a database could in theory find that there's an unused gap from before, and assign a number from that gap, causing the numbers to not be increasing. I don't think any database does that, but it could. And in any case, as I explained, EF could decide to use a single INSERT for multiple rows (as it already does in many cases!), at which point the database really can assign IDs in random order because of parallelization.

Once again: avoid assigning any meaning or making any assumptions for autonumbers.

roji commented 5 months ago

Regarding the deadlocks, imagine you have two UPDATEs as follows:

UPDATE x SET y=3 WHERE Id=1;
UPDATE x SET y=3 WHERE Id=2;

And concurrently you have another session doing the reverse:

UPDATE x SET y=4 WHERE Id=2;
UPDATE x SET y=4 WHERE Id=1;

This is a deadlock, as each transaction locks the same two rows in reverse order. EF orders these updates to sort by the ID, so that the deadlock is avoided.

The thing to realize, is that SaveChanges() does not execute an ordered set of SQL statements on behalf of the user. It figures out which changes have occured in a complex object graph, and then figures out the SQL statements to persist those changes, and the best ordering for that. It may decide to coalesce multiple updates into the same INSERT statement, and in some cases it may even use a MERGE to do it.

SaveChanges() is a high-level ORM API. It is very different from executing a series of SQL statements with e.g. Dapper; Dapper's job is to simply transmit SQL statements you give it to the database, as you've provided them. SaveChanges() does something completely different.

Hopefully this clarifies things.

luisabreu commented 5 months ago

Hello again.

@roji, first of all, thanks for taking the time to help us understand theses issues.

What you say makes perfect sense and I'm not disputing any of it. The update examples are clear.

Going back to my initial problem, if EF was generating a single INSERT INTO ...values (...), (...), (...) and then these were getting inserted in an "unexpected" order, then I'd say you're completely right.

However, I really can't think of any good example that justifies generating the INSERT strategy EF is using in my scenario. Surely locking isn't an issue here, so I'm still don't understand the rationale on using the type's name alphabetical order for generating the INSERT statement for each item that exists on the list.

I haven't found anything on the docs that alert you for this behavior (though I haven't also found anything that said otherwise). So, it's like you said before: I'll have to redesign my objects in order to make sure that the list order is preserved after re-hidration (it's not hard, it's just something I believe shouldn't be required in order to get the what I need).

Instead of comparing with Dapper, it might be more interesting in comparing this behavior with the one used by NH, where collection mappings are mapped as "simple" collections or "indexed collections".

ajcvickers commented 5 months ago

@luisabreu Indexed collections are tracked by #9067.

roji commented 5 months ago

However, I really can't think of any good example that justifies generating the INSERT strategy EF is using in my scenario. Surely locking isn't an issue here, so I'm still don't understand the rationale on using the type's name alphabetical order for generating the INSERT statement for each item that exists on the list.

I wrote about this above, a single INSERT with multiple VALUES is very likely more efficient. The alphabetical order is likely just a way to make things deterministic, to avoid deadlocks; the specific ordering there doesn't really matter, as long as it's the same ordering for all updates generated by EF. Alphabetical is just a way to do that.

So, it's like you said before: I'll have to redesign my objects in order to make sure that the list order is preserved after re-hidration (it's not hard, it's just something I believe shouldn't be required in order to get the what I need).

If you still believe that an extra "ordering" column isn't required to preserve the list order after rehydration, then you need to read all of the above again. If you're saying you'd rather EF took care of managing that extra column for you, then as @ajcvickers wrote, we have #9067 for indexed collections in the backlog for that; I'm assuming that's what NHibernate does - because there's really no other way (nothing around ordering of INSERTs would guarantee it). But indexed collections aren't yet implemented, so for now you'll have to take care of it yourself.

I think the conversation has run its course and I've done the best I could to explain everything. Of course, feel free to continue asking and I'll continue answering, but I really do think all of the information is already contained in the above comments.