Generate table joins instead of subquery joins

roji commented 5 years ago

For queries with includes, we currently generate joins with a subquery:

SELECT [b].[Id], [b].[Name], [b].[UserId], [t].[Id], [t].[BlogId], [t].[Description], [t].[UserId], [t].[Id0], [t].[Created], [t].[Hash], [t].[IsDeleted], [t].[Modified], [t].[PostId]
FROM [Blog] AS [b]
LEFT JOIN (
    SELECT [p].[Id], [p].[BlogId], [p].[Description], [p].[UserId], [p0].[Id] AS [Id0], [p0].[Created], [p0].[Hash], [p0].[IsDeleted], [p0].[Modified], [p0].[PostId]
    FROM [Post] AS [p]
    LEFT JOIN [PostInstance] AS [p0] ON [p].[Id] = [p0].[PostId]
) AS [t] ON [b].[Id] = [t].[BlogId]
ORDER BY [b].[Id], [t].[Id], [t].[Id0]

We could simplify this to:

SELECT [b].[Id], [b].[Name], [b].[UserId], [t].[Id], [t].[BlogId], [t].[Description], [t].[UserId], [t].[Id0], [t].[Created], [t].[Hash], [t].[IsDeleted], [t].[Modified], [t].[PostId]
FROM [Blog] AS [b]
LEFT JOIN [Post] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [PostInstance] AS [p0] ON [p].[Id] = [p0].[PostId]
ORDER BY [b].[Id], [p].[Id], [t].[Id0]

We should measure the execution perf difference between the above two. Even if there is no (significant) difference, we could still decide to do this for SQL simplicity.

Originally raised in https://github.com/aspnet/EntityFrameworkCore/issues/17455.

yuriy-syrota commented 2 years ago

Oh, I thought it was something I could do in my project. I'll try to put together a minimal project to share with you.

jzabroski commented 2 years ago

Hi @roji - this is some initial thoughts on this issue. Hope to follow up later with some more thorough analysis. Was brought here by a coworker who was stumped by slow queries in both EFCore and EF6, and blamed this issue.

For queries with includes, we currently generate joins with a subquery: -- @roji , original post in thread

and

Stepping back, his issue was https://github.com/dotnet/efcore/issues/17622#issue-489433134 opened purely to eliminate subqueries which are unnecessary, replacing them with mathematically equivalent SQL that performs JOINs directly without the subquery: [...example removed from quote...] The various subsequent discussions seem to widen this scope and bring in various other questions (at least in some cases) - single query vs. split query performance, or changes to the SQL (INNER instead of LEFT join) which would cause different data to be returned, and therefore would be incorrect. -- @roji

and

https://github.com/dotnet/efcore/issues/17622#issuecomment-966789915 -- @smitpatel

@roji If this issue is for cases where there are equivalent mathematical transforms, then I think I have a regression test to add to the test suite that I discovered today:

For example, if I have a schema like:

1 Transaction ... Has Many Account Allocations 1 Transaction ... Has 1 Transaction Type 1 Transaction ... Has 1 Warehouse 1 Transaction ... Has 1 Currency 1 Account Allocation ... Has 1 Account 1 Account ... Has 1 Group

return Set<AccountAllocation>()
                .Where(a => !a.Deleted)
                .Where(a => a.TransactionType.AffectsPosition)
                .Where(a => a.Account.IsActive)
                .Where(a => a.Account.Group.Id == accountGroupId)
                .Where(a => a.Transaction.TransactDate > cutoffDate)
                .Where(a => a.Transaction.Currency.Id == currencyCode)
                .Where(a => !a.Transaction.Warehouse.IsTransferHub)
                .Take(1) // new command inbetween - this forces a TOP (1) which alleviates some table scan pressure on the query
                .Any();

CLICK ME FOR SQL

```sql exec sp_executesql N'SELECT CASE WHEN ( EXISTS ( SELECT 1 AS [C1] FROM ( SELECT TOP (1) [Extent3].[AccountGroupId] AS [AccountGroupId], [Extent4].[TransactDate] AS [TransactDate], [Extent4].[WarehouseId] AS [WarehouseId], [Extent4].[CurrencyId] AS [CurrencyId] FROM [dbo].[AccountAllocations] AS [Extent1] INNER JOIN [dbo].[TransactionTypes] AS [Extent2] ON [Extent1].[TransactionTypeId] = [Extent2].[TransactionTypeId] INNER JOIN [dbo].[Accounts] AS [Extent3] ON [Extent1].[AccountId] = [Extent3].[AccountId] INNER JOIN [dbo].[Transactions] AS [Extent4] ON [Extent1].[TransactionID] = [Extent4].[TransactionID] WHERE ([Extent1].[Deleted] <> 1) AND ([Extent2].[AffectsPosition] = 1) AND ([Extent3].[IsActive] = 1) ) AS [Filter1] LEFT OUTER JOIN [dbo].[Warehouses] AS [Extent5] ON [Filter1].[WarehouseId] = [Extent5].[WarehouseId] WHERE ([Filter1].[AccountGroupId] = @p__linq__0) AND ([Filter1].[TransactDate] > @p__linq__1) AND (([Filter1].[CurrencyId] = @p__linq__2) OR (1 = 0)) AND ([Extent5].[IsTransferHub] <> 1) ) ) THEN cast(1 as bit) ELSE cast(0 as bit) END AS [C1] FROM ( SELECT 1 AS X ) AS [SingleRowTable1]',N'@p__linq__0 int,@p__linq__1 datetime2(7),@p__linq__2 nvarchar(4000)',@p__linq__0=88,@p__linq__1='2021-11-03 00:00:00',@p__linq__2=N'USD' GO ```

then, in EF6, I get a query where the .Where statements without free variables are generated as table joins, but the .Where statements with free variables (accountGroupId, cutoffDate and currencyCode) are left outer joined to all those table joins. This results in a really large number of read operations, because the EXISTS check is after the left outer join. Thus, I have to cheat and add Take(1) to simulate a TOP 1 to get this query to not be terrible on SQL Server. Without Take(1), on a table with > 1.5 million transaction entities in the cut off period, it will scan every one of the pages containing those transaction entities, and takes about 6 seconds to run. With Take(1), it has an initial SQL Server query plan compilation overhead for the first run (on the db server side), and after that, runs in 1 ms flat.

If you see my point, there seems to be broader pattern of how transitive navigation paths are handled. It looks like the previous merges by Smit linked to this issue deal with other scenarios, like OrderBy (query) and HasQueryFilter (model builder configuration) and Includes.

If the broader pattern is transitive navigation paths handling, then I see why @msmolka added a comment here as well. It's also not particularly clear to me why placement of free variables should alter the query, but it seems like it may. There's probably a good reason I do not see why this is happening.

smitpatel commented 2 years ago

@jzabroski - You are posting a SQL generated by EF6. We have no plans to make any changes to that. Which brings us to question - What is the generated query in EF Core. In the absence of any of those navigation being an owned navigation, all of them would expand on same level without causing subquery as EF6 did. So please run your query in EF Core and if that is not satisfactory to you, file new issue with detailed repro steps.

CharlieDigital commented 1 year ago

@smitpatel Should this be fixed in EFCore 7?

I'm using EFCore7 with Npgsql 7.0.3

I have a reproduction repo here: https://github.com/CharlieDigital/efcore-m2m

docker compose up -d
dotnet run

# Omit the | jq. if you don't want pretty print.
curl http://localhost:5256/init & curl http://localhost:5256/products | jq .

The generated query is as follows:

SELECT p.id, p.created_utc, p.name, t.product_id, t.media_id, t.id, t.created_utc, t.name
FROM products AS p
LEFT JOIN (
    SELECT p0.product_id, p0.media_id, m.id, m.created_utc, m.name
    FROM product_media AS p0
    INNER JOIN media AS m ON p0.media_id = m.id
) AS t ON p.id = t.product_id
ORDER BY p.id, t.product_id, t.media_id

There's a branch with no explicit ProductMedia relationship table: https://github.com/CharlieDigital/efcore-m2m/tree/variant/no-explicit-join

Result is the same.

roji commented 1 year ago

This issue is in the Backlog milestone. This means that it is not planned for the next release (EF Core 8.0). We will re-assess the backlog following the this release and consider this item at that time. However, keep in mind that there are many other high priority features with which it will be competing for resources. Make sure to vote (👍) for this issue if it is important to you.

CharlieDigital commented 1 year ago

@roji That's a shame; this issue has serious performance implications beyond toy datasets and it may not be obvious to users unless they are dumping their SQL that this is the cause of the performance issues.

roji commented 1 year ago

@CharlieDigital can you provide a clear, simple comparison and query plan showing the performance issue? That would help bump the priority of this. Note that I'm not saying we shouldn't do this - it would just help to have a clear, side-by-side comparison showing an exact difference.

inf9144 commented 1 year ago

Would also like to see this fixed. Besides the performance - it makes reading the generated sqls really hard. :-/

CharlieDigital commented 1 year ago

@roji on simple queries, the performance is as expected. We've observed the issue when performing deep ThenInclude and ordered sub-queries. For example, an Order that has multiple LineItems which reference a Product which has a set of Media that are ordered.

The workaround that we ended up using is to carry a discriminator field down as deep as practical and then performing a filter on the ThenInclude using the discriminator. For example, consider a case where we have many Stores in the DB.

We would add a StoreId field to Media so that we can filter the Media so we can filter the Media by the StoreId in the ThenInclude.

I've moved off of the project so I'll see if I can find some time to produce a sample project that demonstrates this issue, but we saw queries go from 16s down to sub-20ms on our dataset.

roji commented 1 year ago

We've observed the issue when performing deep ThenInclude and ordered sub-queries. For example, an Order that has multiple LineItems which reference a Product which has a set of Media that are ordered.

That sounds like it could intersect with #29171, which is about removing the orderings from the SQL altogether.

I definitely think there's a potential perf issue here (simple joins instead of subquery joins), but one of the problems is that we haven't yet seen a clear, minimal sample backed by server plans and/or benchmarked query runtime that prove the difference. I definitely intend to investigate this at some point, but such a clear repro would certainly help prioritize this.

RelativelyRandom commented 1 week ago

If you need a demonstration, I just ran an SQLite test, if anyone is interested. I simplified a structure we commonly use in production so it's not an adversarial example or anything like that, just fake data. I can simplify even more, formalize the test and pack it up if someone can provide me with requirements.

We very often have a three-level hierarchy of one-to-many mappings: each Lot has many Parts, and each of those Parts has a few Inspections. If you want to fetch a lot, you could just naively use context.Lots.Include(l=>l.Parts).ThenInclude(r=>r.Inspections).Where(l=>l.Id = 15). Unfortunately, this will generate a query which reads through the entire Parts and Inspections tables.

I populated my test DB with 100 lots totaling about 500 MB and that took over 3 seconds on my laptop. When I refactor the SQL to flat joins, it's done in less than 100 ms. The outputs are identical.

On SSDs this is a nuisance, but on HDDs it's completely unusable. In the past, to work around this, we resorted to hacks like manually querying just the lot, then the min and max IDs of parts inside the lot, then fetching all the parts between this ID range which belong to the lot.

Original SQL with SQLite's query plan:

SELECT "t"."Id", "t"."Barcode", "t0"."Id", "t0"."EndTime", "t0"."LotId", "t0"."StartTime", "t0"."Id0", "t0"."CaptureTime", "t0"."Message", "t0"."PartId", "t0"."Passed"
FROM (
      SELECT "l"."Id", "l"."Barcode"
      FROM "Lots" AS "l"
      WHERE "l"."Id" = 33
      LIMIT 1
) AS "t"
LEFT JOIN (
      SELECT "p"."Id", "p"."EndTime", "p"."LotId", "p"."StartTime", "i"."Id" AS "Id0", "i"."CaptureTime", "i"."Message", "i"."PartId", "i"."Passed"
      FROM "Parts" AS "p"
      LEFT JOIN "Inspections" AS "i" ON "p"."Id" = "i"."PartId"
) AS "t0" ON "t"."Id" = "t0"."LotId"
ORDER BY "t"."Id", "t0"."Id"

QUERY PLAN
|--CO-ROUTINE t
|  `--SEARCH l USING INTEGER PRIMARY KEY (rowid=?)
|--MATERIALIZE t0
|  |--SCAN p
|  `--SEARCH i USING INDEX IX_Inspections_PartId (PartId=?) LEFT-JOIN
|--SCAN t
|--SCAN t0 LEFT-JOIN
`--USE TEMP B-TREE FOR ORDER BY

Refactored SQL with SQLite's query plan:

SELECT "t"."Id", "t"."Barcode", "p"."Id", "p"."EndTime", "p"."LotId", "p"."StartTime", "i"."Id", "i"."CaptureTime", "i"."Message", "i"."PartId", "i"."Passed"
FROM (
    SELECT "l"."Id", "l"."Barcode"
    FROM "Lots" AS "l"
    WHERE "l"."Id" = 33
    LIMIT 1
) AS "t"
LEFT JOIN "Parts" AS "p" ON "t"."Id" = "p"."LotId"
LEFT JOIN "Inspections" AS "i" on "p"."Id" = "i"."PartId"
ORDER BY "t"."Id", "p"."Id"

QUERY PLAN
|--CO-ROUTINE t
|  `--SEARCH l USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN t
|--SEARCH p USING INDEX IX_Parts_LotId (LotId=?) LEFT-JOIN
|--SEARCH i USING INDEX IX_Inspections_PartId (PartId=?) LEFT-JOIN
`--USE TEMP B-TREE FOR ORDER BY

dotnet / efcore

Generate table joins instead of subquery joins #17622