dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.63k stars 3.15k forks source link

Ensure that split queries with non-deterministic Skip/Take don't produce wrong results (by adding deterministic orderings) #26808

Open roji opened 2 years ago

roji commented 2 years ago

When doing split query, if there's any non-determinism in the query that may produce wrong results, since the multiple queries may return different/inconsistent data.

In some cases we can be sure the query is non-deterministic, and can throw (#21202 tracks that). In other cases (https://github.com/dotnet/EntityFramework.Docs/issues/3242) we may not be sure (e.g. depends on the uniqueness of the last column being ordered by), so throwing may be too much (and block legitimate, non-dangerous use). A warning may be more appropriate here (probably needs to be discussed).

roji commented 1 year ago

Took a look at this again because of #31657, and we review what we're doing around split queries with non-deterministic ordering that potentially yields wrong results. To recap, NorthwindSplitIncludeQuerySqlServerTest.Include_collection_take_no_order_by generates the following two queries:

SELECT TOP(@__p_0) [c].[CustomerID], [c].[Address], [c].[City], [c].[CompanyName], [c].[ContactName], [c].[ContactTitle], [c].[Country], [c].[Fax], [c].[Phone], [c].[PostalCode], [c].[Region]
FROM [Customers] AS [c]
ORDER BY [c].[CustomerID]

SELECT [o].[OrderID], [o].[CustomerID], [o].[EmployeeID], [o].[OrderDate], [t].[CustomerID]
FROM (
    SELECT TOP(@__p_0) [c].[CustomerID]
    FROM [Customers] AS [c]
) AS [t]
INNER JOIN [Orders] AS [o] ON [t].[CustomerID] = [o].[CustomerID]
ORDER BY [t].[CustomerID]

The 1st query orders by CustomerID, but the subquery in the second does not; that means arbitrary rows come out of the subquery, potentially resulting in non-matching rows with the first -> data corruption.

ajcvickers commented 12 months ago

Skip without ordering (not Take),

Note that it is reasonable, in general, to do Take without ordering when you just want some number of items, without caring which ones. This is also why it is fine to do First without ordering.

roji commented 12 months ago

@ajcvickers absolutely. The point is that the SQL generated above in my previous comment - which is exactly the case of Take without ordering - produces potentially incorrect results. It's a completely legitimate query (unlike Skip), but if you use split query you may get thewrong Posts on your Blogs.

stevendarby commented 12 months ago

@roji

  • Do we remember why we decided to throw, and not inject the correct ordering in the subquery of the second query? I remember discussions around this but couldn't find an issue. This indeed affects only queries which produce non-deterministic results as a whole (Skip/Take without ordering), but that shouldn't make us produce wrong results.

Although not exactly the same, this might be some of the discussion you remember?

https://github.com/dotnet/EntityFramework.Docs/issues/3242

Slightly different in that here an OrderBy is provided, just not a deterministic one. One of the suggestions is to append the stable ordering, but that's rejected. If I've followed it correctly, seems the augment is that adding the stable ordering is only done for correct materialisation when streaming, and never as an attempt to fix a user's non-deterministic query, hence not applying to the sub query.

Personally just think it should add the stable ordering. Indeterminate results across multiple executions of the same query (in single mode) is one thing- as pointed out above, could even be intended. But wrong results when running the query once in split mode is another thing!

roji commented 12 months ago

Thanks @stevendarby, yeah - that's the discussion I had in mind (neglected to search for it in the docs repo). We also discussed this offline.

And yeah, I agree that EF should not return incorrect results for AsSplitQuery with non-deterministic ordering, and that we should consider having the ordering in the subquery to ensure matching results between the multiple split queries. Even if not, we should at the very least be warning about all cases where such data corruption may occur (we seem to only be doing it for Skip for now).

Note that it's not enough to just include the same ORDER BY inside the subquery: that's still non-deterministic when the the column isn't unique. In this case we'd need to add ordering on the primary key properties to ensure uniqueness as well.

I'll bring this discussion to a design meeting.

roji commented 11 months ago

Design discussion: we indeed think that EF should ensure correct results here by injecting orderings as necessary.

kev-andrews commented 5 months ago

I created a gist to showcase the problem with Npgsql.EntityFrameworkCore.PostgreSQL:

https://gist.github.com/kev-andrews/9cb4cc4954ed9cff177beaaeada59a5a

the weird thing is...the exact same Code, which produces the same SQL for Postgres and SQL Server, works fine when using MS SQL Server Local Db and Microsoft.EntityFrameworkCore.SqlServer. Generated SQL for the Split Query:

Postgres:

SELECT p."Title", t."Id"
FROM (
         SELECT a."Id", a."AlwaysSameValue"
         FROM "Authors" AS a
         ORDER BY a."AlwaysSameValue" DESC
         LIMIT @__p_1 OFFSET @__p_0
     ) AS t
         INNER JOIN "Posts" AS p ON t."Id" = p."AuthorId"
ORDER BY t."AlwaysSameValue" DESC, t."Id"

SQL Server:

SELECT [p].[Title], [t].[Id]
FROM (
         SELECT [a].[Id], [a].[AlwaysSameValue]
         FROM [Authors] AS [a]
         ORDER BY [a].[AlwaysSameValue] DESC
         OFFSET @__p_0 ROWS FETCH NEXT @__p_0 ROWS ONLY
     ) AS [t]
         INNER JOIN [Posts] AS [p] ON [t].[Id] = [p].[AuthorId]
ORDER BY [t].[AlwaysSameValue] DESC, [t].[Id]

I also tried the gist with Pomelo MySql, it fails the same way Postgres does when querying/merging the split query includes with the following SQL:

MYSQL:

SELECT `p`.`Title`, `t`.`Id`
      FROM (
          SELECT `a`.`Id`, `a`.`AlwaysSameValue`
          FROM `Authors` AS `a`
          ORDER BY `a`.`AlwaysSameValue` DESC
          LIMIT @__p_1 OFFSET @__p_0
      ) AS `t`
      INNER JOIN `Posts` AS `p` ON `t`.`Id` = `p`.`AuthorId`
      ORDER BY `t`.`AlwaysSameValue` DESC, `t`.`Id`

I have tried numerous times to cause the same errors on SQL Server, but it seems to always work...I do not know why.

roji commented 5 months ago

@kev-andrews I haven't looked in depth, by assuming AlwaysSameValue indeed always has the same value (i.e. is a non-unique column), then actual returned ordering is non-deterministic, and so it makes sense for different databases to behave differently; the ordering your get is basically totally up to the database's internal implementation details.

kev-andrews commented 5 months ago

Yes, I overexaggerated the problem by sorting on that column which has all identical values.

Our real world problem was an order by on a "LastName" column, with skip/take pagination and more identical last names than the page size, which caused projected relations to not resolve properly when using split queries, as the implicit ordering by "Id" is only added on the outer query by EF core.

I would assume that SQL Server just does something similar to primary key ordering internally so it seems to "just work" there.

roji commented 5 months ago

I would assume that SQL Server just does something similar to primary key ordering internally so it seems to "just work" there.

That really, really depends - creating a clustered index on another column would probably cause that to change.

At the end of the day, if you don't have explicit orderings which result in a fully unique set (mostly unique like LastName vs. fully non-unique like "AlwaysSameValue" makes no difference), your resultset is non-deterministic and anything may happen.

ascott18 commented 1 month ago

The core issue here is worse than this issue originally describes. Entity Framework is actively applying different, mismatched orderings to different parts of the split query.

Would any EF Core team members have a chance to check out my PR for this? https://github.com/dotnet/efcore/pull/34097. Apologies for not asking for advanced permission to open a PR per the contribution guidelines, but this started out as an exploratory exercise for me and I just stumbled upon what seems to me like an elegant, simple solution. All feedback is very welcome!