Open roji opened 5 years ago
Is there any way for a provider writer to override this?
I'm developing an EF 2.2 provider for an older database, and it doesn't support subqueries in a join clause at all. So currently the generated SQL is invalid.
In my case, I'm just executing the BuiltInDataTypesBase test from the git repo:
var entity = context
.Set<StringKeyDataType>()
.Include(e => e.Dependents)
.Where(e => e.Id == "Gumball!")
.ToList().Single();
That generates this SQL statement:
SELECT "e.Dependents"."Id", "e.Dependents"."StringKeyDataTypeId"
FROM "StringForeignKeyDataType" "e.Dependents"
INNER JOIN (
SELECT "e0"."Id"
FROM "StringKeyDataType" "e0"
WHERE "e0"."Id" = N'Gumball!'
) AS "t" ON "e.Dependents"."StringKeyDataTypeId" = "t"."Id"
ORDER BY "t"."Id"
However it is invalid for the particular DB vendor, and it must instead be:
SELECT "e.Dependents"."Id", "e.Dependents"."StringKeyDataTypeId"
FROM "StringForeignKeyDataType" "e.Dependents"
INNER JOIN ("StringKeyDataType" "t")
ON "e.Dependents"."StringKeyDataTypeId"= "t"."Id"
WHERE "t"."Id"=N'Gumball!'
ORDER BY "t"."Id"
I've been digging into the code, and it's hard to find much information on how to change the query generation engine at that level.
Should I open a separate question for this?
@Gwindalmir your LINQ query doesn't produce a subquery for me, either on 2.2 or on 3.1.
Out of curiosity, which database are you trying to develop for? This issue is about removing a subquery join in a very particular case, but there are quite a few others where doing so isn't possible. Subquery joins are a standard SQL feature, and a database which doesn't support them is likely to have many issues as an EF Core relational provider...
Finally, note that EF Core 2.2 is no longer supported - 2.1 and 3.1 are the current long-term support versions. Any new development should probably happen against 3.1.
Just to answer how to do it: add a custom implementation of IQueryTranslationPostprocessor deriving from RelationalQueryTranslationPostprocessor, and replace ShapedQueryExpression.QueryExpression (which would be a SelectExpression) with a different SelectExpression that generates the same result without subquery joins. If you find any APIs lacking to make the required change, then another option is to provide a custom IQuerySqlGenerator which will just simplify the subquery join to a table join when printing it out to the DbCommand text.
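A rough sketch of the plumbing involved (class and factory names are mine, the exact constructor signatures vary between EF Core versions, and the actual SelectExpression rewriting is elided since it depends on EF internals):

using System.Linq.Expressions;
using Microsoft.EntityFrameworkCore.Query;

// Hypothetical postprocessor that would flatten subquery joins back into plain table joins.
// The SelectExpression rewriting itself is left out; that part is provider-specific.
public class NoSubqueryJoinQueryTranslationPostprocessor : RelationalQueryTranslationPostprocessor
{
    public NoSubqueryJoinQueryTranslationPostprocessor(
        QueryTranslationPostprocessorDependencies dependencies,
        RelationalQueryTranslationPostprocessorDependencies relationalDependencies,
        QueryCompilationContext queryCompilationContext)
        : base(dependencies, relationalDependencies, queryCompilationContext)
    {
    }

    public override Expression Process(Expression query)
    {
        // Let the default relational postprocessing run first.
        query = base.Process(query);

        // 'query' is a ShapedQueryExpression whose QueryExpression is a SelectExpression.
        // Here you would visit that SelectExpression, find joins whose table is itself a
        // SelectExpression with no extra operations, and rebuild them as joins on the
        // underlying table expression.
        return query;
    }
}

public class NoSubqueryJoinQueryTranslationPostprocessorFactory : IQueryTranslationPostprocessorFactory
{
    private readonly QueryTranslationPostprocessorDependencies _dependencies;
    private readonly RelationalQueryTranslationPostprocessorDependencies _relationalDependencies;

    public NoSubqueryJoinQueryTranslationPostprocessorFactory(
        QueryTranslationPostprocessorDependencies dependencies,
        RelationalQueryTranslationPostprocessorDependencies relationalDependencies)
    {
        _dependencies = dependencies;
        _relationalDependencies = relationalDependencies;
    }

    public QueryTranslationPostprocessor Create(QueryCompilationContext queryCompilationContext)
        => new NoSubqueryJoinQueryTranslationPostprocessor(_dependencies, _relationalDependencies, queryCompilationContext);
}

A provider would register the factory as part of its EF service registration; for experimentation it can typically also be swapped in with optionsBuilder.ReplaceService&lt;IQueryTranslationPostprocessorFactory, NoSubqueryJoinQueryTranslationPostprocessorFactory&gt;().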
Thanks, at the time I started, 3.1 wasn't released, and supporting .NET Framework is a requirement, so I went with 2.2.
I'm not sure why you don't see it, as the SQLite driver included in this source constructs the same query. I downloaded the release/2.2 tag as my reference point.
As for the DB in question, I'm not sure I should reference it, as I work for the company that makes it. I will say it supports primarily SQL-92 standard, with a few SQL-99 additions.
@Gwindalmir the best way would be to open a new issue and include a short, runnable code sample with SQLite that shows it happening.
If you're still in development, I'd strongly recommend considering switching to 3.1 - it's the LTS version for years to come, whereas 2.2 is already out of support.
@Gwindalmir I don't see the subquery with a single include or ThenInclude, using SQLite. It took two ThenIncludes for me to generate the subquery (see the example in issue #19418 linked above). That was with .Net Core 3.1.
I'm going to migrate to 3.1, and test again. If the issue is resolved there, then that's great. If not, I'll open a new issue here. Thanks for the help everyone!
Just as a follow-up, in case anyone else has the same problem: Upgrading to EF 3.1 solved the issue!
Good to hear, thanks @Gwindalmir.
Any updates on an ETA for the original issue in this thread to be resolved? :)
@Webreaper no update at the moment - this issue is "consider-for-next-release", which means it's a stretch goal for 5.0. While it's considered important, we don't think it's as important as the other issues that have been triaged into the 5.0 milestone (but it may still get done).
Totally understand. Thanks for the update! Looking forward to .Net 5!
Errm, looking forward to this in .Net 6? ;)
This is a 6-monthly reminder - my queries are taking 950ms when they could be taking under 200ms due to having to work around this bug. Any chance of a fix in .Net 6 previews 6-10?
@Webreaper this is still in the plan for EF Core 6.0, I do hope we'll manage to get it in.
Great! Thanks!
Will this cover https://github.com/dotnet/efcore/issues/20758#issue-607360242 or is this a more specific case? (Note: the title of that issue is a bit wrong; the top-level projection still selects the right columns, the subqueries just make it appear as if more is selected.)
Will this cover #20758 (comment)
I doubt it. That looks like something very different.
I disagree that they're very different. Both are fundamentally about the potential to generate table joins instead of subquery joins. I would happily accept that they have different root causes requiring different fixes; however, I still think it's a valid question for an EF Core team member.
Edit: this item https://github.com/dotnet/efcore/issues/21082 seems like a dupe of the one I linked to, and was identified as a possible dupe of this one.
Oh, I see, I thought it was just about the extra columns, but I see that the sub-query is doing a full select and then a join, rather than a table join, so yes, it could be similar/related.
When the subquery has a where predicate applied, as with global filters, the two kinds of joins are not the same and can have different characteristics. If the filter reduces the size of the right source by a large amount, then the subquery join may end up working faster. The perf of such queries needs to be studied separately from this issue.
Regardless of joins, issue #20758 also captures the global query filter aspect of it. If all your entities have a soft-delete filter, and they form a consistent graph with respect to the filter values, then the filter only needs to be applied to one of the tables; the rest will automatically be filtered out by the joins.
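For context, this is the kind of soft-delete setup being described - a minimal illustration, with Blog/Post and IsDeleted as placeholder names:

// Illustrative global query filters: both entities carry the same soft-delete filter.
// If deletion is consistent across the graph (a deleted Blog implies its Posts are
// deleted too), then once the join to Blogs is applied, filtering Blogs already
// excludes those Posts, so in principle only one side needs the filter in the SQL.
modelBuilder.Entity<Blog>().HasQueryFilter(b => !b.IsDeleted);
modelBuilder.Entity<Post>().HasQueryFilter(p => !p.IsDeleted);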
Based on the tweets and RC1 issue, I'm guessing this is gonna roll over to 7 now?
@Webreaper this has already been punted from the 6.0 release (see the milestone).
If you could post actual perf numbers showing the difference between the two queries, that could help push forward the priority of this - we haven't yet had time to properly investigate that.
Ah, missed that, thanks.
I posted some comparative numbers in the issue I raised about this before I found it was a dupe of this one: https://github.com/dotnet/efcore/issues/19418
With 500,000 entity rows and a ThenInclude onto a table with 1.2m entries, the unfiltered select pulls in all 1.2 million rows before filtering them, so the query takes around 3 seconds (on my M1 MacBook Pro). Manually adjusting the SQL to use a proper table join drops the query time to less than 300ms. So it's an order of magnitude faster when table joins are used.
Since I raised that issue, the DB I'm using now has over 2 million rows in the ImageTags table, which means the query is likely to take 3-5s to run. With the table-joins issue fixed, this should remain constant at around 300-350ms or less, depending on indexes. So it's quite a big difference.
Please let me know if you need any more details.
@Webreaper looking at #19418 again, you seem to be comparing the EF-generated single query to split query (which isn't relevant here), as well as to a query where there's no join with the Tags table. So in the context of this issue, we don't have the exact same query, once written as a join of a subquery and once as a table join... Ideally we'd have the two side-by-side as runnable code samples, with clearly different runtime numbers.
The important thing is to understand whether this is something that really does affect perf, or whether (most) databases can optimize this in any case, in which case it's just a SQL simplification issue (and the priority is lower).
Okay, so here's the query that gets generated by EF today:
SELECT "b"."ImageId", "b"."DateAdded", "i"."ImageId", "f"."FolderId"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "Folders" AS "f" ON "i"."FolderId" = "f"."FolderId"
LEFT JOIN (
SELECT "i0"."ImageId", "i0"."TagId", "t"."TagId" AS "TagId0", "t"."Keyword"
FROM "ImageTags" AS "i0"
INNER JOIN "Tags" AS "t" ON "i0"."TagId" = "t"."TagId"
) AS "t0" ON "i"."ImageId" = "t0"."ImageId"
ORDER BY "b"."ImageId", "i"."ImageId", "f"."FolderId", "t0"."ImageId", "t0"."TagId", "t0"."TagId0"
The query below is what should, IMO, be generated by EFCore (this one includes the 'Tags' join that was missing previously):
SELECT "b"."ImageId", "b"."DateAdded", "i"."ImageId", "f"."FolderId"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "Folders" AS "f" ON "i"."FolderId" = "f"."FolderId"
LEFT JOIN "ImageTags" AS "i0" ON "i"."ImageId" = "i0"."ImageId"
LEFT JOIN "Tags" as "t" on "i0".tagId = t.TagID
ORDER BY "b"."ImageId", "i"."ImageId", "f"."FolderId", "i0"."ImageId", "i0"."TagId"
If I run the first query, it takes over 10 seconds to execute. The second query returns instantly (i.e., significantly sub-second). They're doing the same thing, and using the same set of data, and return identical results. That seems like a pretty significant performance issue to me.
It seems pretty clear that if you left-join on an unfiltered select statement that returns all rows from another table with millions of rows, only to then immediately filter out 99.999% of those rows with the ON clause of the join, it's going to be massively slower than it needs to be.
I'm currently using the split-query approach to work around this bug, but that's still slower than the 'proper' table-join query (i.e., it requires 2 x 200ms instead of 1 x 200ms, because there are two queries being run). I'd really like to remove that hack, but I can't, because a user searching for an image in the app would have to wait over 10s for search results to come back.
Hopefully the performance numbers I've quoted will convince you. I'd hope you guys can knock up a sample and repro this pretty trivially, and the left-join-subselect SQL above seems glaringly wrong. But if the only way it's going to get prioritised is for me to put together a sample app, I'll try and find the time (but I really hope you can just fix it without me needing to do that :)).
@Webreaper note that the first query has an inner join on Tags, whereas the latter has a left join. I'm not saying that makes a huge difference, but it's better to be sure about these things (BTW this is Sqlite right? Any similar experience with another database, just in case this is a Sqlite thing?). A database schema with data for this would be ideal to reproduce.
But I generally agree, and yeah, at some point for EF 7 I'll definitely find the time to investigate this properly across databases...
I haven't tried with another DB; I started adding support for Postgres to play with that, but haven't got it working properly yet. I guess I don't really care if another DB is smart enough to optimise this out, as I'm using Sqlite. ;)
If it helps to understand what's going on, here's my DbContext: https://github.com/Webreaper/Damselfly/blob/master/Damselfly.Core/Models/ImageContext.cs
I will try and see if I can pull out the salient points from the data model, knock up some test data, and stick a test app into github at some point, but it might not be for a couple of weeks. I know that would make it much easier for you guys to delve into it....
Thanks for your efforts @Webreaper, much appreciated!
Okay, finally put this repro together. https://github.com/Webreaper/EFCore6TableJoinBug
I've tried to document it as clearly as possible in the README, but if you have any questions at all about what my code is doing or why, please comment here (or raise an issue on the repo, or email me at mark@otway.com).
This is still an apples-to-oranges comparison. The two SQL queries generate different results. The absence of data that would produce different results is not validation that the two are equivalent SQL. We cannot make the suggested faulty optimization no matter what perf gain it represents.
@smitpatel I'm not sure I get the point you're making. The two queries produce identical results. Are you saying that it's not correct because it may be possible for the SQL to produce different results in some circumstances? In which case you might be right, but I think the presence of the FK constraints on both fields in ImageTags means that it can't.
Even if we agree that a) it could be too risky to produce SQL with the left joins as I've described, because there might be edge cases that don't produce the same resultset, or b) it's too complicated to identify the cases where you can use this syntax, then EFCore should produce the following SQL, which is correct in all cases AFAICT and is still an order of magnitude faster than using the unfiltered subquery join (by my calculations it runs in about 400ms).
SELECT "b"."ImageId", "i"."ImageId"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "ImageTags" AS "i0" ON "i"."ImageId" = "i0"."ImageId"
INNER JOIN "Tags" as "t" on "i0".tagId = t.TagID
ORDER BY "b"."ImageId", "i"."ImageId", "i0"."ImageId", "i0"."TagId"
If you believe this will produce different results to the original EFCore-generated query, I'd love to know exactly how? Perhaps my SQL understanding is missing something.
What isn't in dispute is that EFCore's current approach of using a completely unfiltered sub-query on the ImageTags table, before joining back to the Images table which then filters out all but 5 results, means that the query produced is extremely inefficient - because a million rows are unnecessarily brought into the query plan and then discarded. It feels like it should be fixable, and could yield a huge performance gain even for tables where the sub-query doesn't have hundreds of thousands of results.
So the only question that remains is how you determine when that sub-query is actually necessary, and when table joins can be used directly. Perhaps - if it's too complex for EFCore to figure out automatically - there's a way we can enhance the LINQ with some sort of hint? But it feels to me that using inner joins should be the default, and the hugely inefficient version with the sub-query should be the edge case.
The two queries produce identical results.
Because you haven't seeded data in the table that would produce different results for the two queries. We generate queries following the configured constraints, which require an inner join; you are producing a left join, which is plain wrong. You should look into how inner joins and left joins work to understand what kind of data you need to produce different results.
The query you are suggesting doesn't generate correct results in all cases. We are not going to make that change no matter how fast it performs.
It would be productive for the discussion if we all took the charitable interpretation of what we read; quoting the first disagreeable thing while leaving the other 90% unanswered just breeds frustration.
Regardless of that being said, @Webreaper what's the guarantee here that there is actually always an ImageTag available, such that for any image queried via BasketEntries an inner join on ImageTags would always return images that do exist? The converse being: if an image does not have a tag, a query with inner joins would now start returning 0 results due to the inner join.
@NinoFloris - I understand that part, though the same thing has been explained multiple times in the posts above. The discussion is going on without a new point being added, so it is repeating the same thing. There are 2 points written above which I didn't answer explicitly.
@Webreaper If you want to continue this discussion, file a new issue. The incorrect optimization you are suggesting is not the same as the legitimate optimization being suggested in this issue by the OP. And at this point this has become noise.
I agree a separate issue seems in order, and thank you for summarizing.
Regardless of that being said, @Webreaper what's the guarantee here that there is actually always an ImageTag available, such that for any image queried via BasketEntries an inner join on ImageTags would always return images that do exist? The converse being: if an image does not have a tag, a query with inner joins would now start returning 0 results due to the inner join.
That's a good point. That's why earlier in the post I showed the query with the left joins, which covers that scenario and, from everything I can tell, produces the correct resultset every time. I should have seeded the data with some images without tags as part of the test. Sorry, I don't have much time to spend on this, as it falls outside the remit of my day job, so it might have been a bit rushed.
I'm not sure what would be gained by filing a new issue, as it would be raising the same inefficiency. I actually raised this as a separate issue in the first place - it was MSFT devs who said it was a dupe of this one. The point I've made is that the "select 1m rows to return 30" approach seems very slow and inefficient, and can clearly be optimised, right? I'm not a SQL or EF expert - I believe that's your job. 🤣
Is there really nothing that can be done to mitigate this scenario? I mean, as I've shown, you can mitigate it by splitting it into two linq queries and get 30ms response times, but it feels like something could be done by EF itself. If it can't, that's fine.
Bear in mind, I only ever raise issues to try and help EF be improved. @smitpatel seems to take it as if this is a personal attack. That's not how it's intended.... :)
@Webreaper it's definitely too late to do anything here for 6.0, which is now pretty much locked down.
Stepping back, this issue was originally opened purely to eliminate subqueries which are unnecessary, replacing them with mathematically equivalent SQL that performs JOINs directly without the subquery:
LEFT JOIN (
SELECT [p].[Id], [p].[BlogId], [p].[Description], [p].[UserId], [p0].[Id] AS [Id0], [p0].[Created], [p0].[Hash], [p0].[IsDeleted], [p0].[Modified], [p0].[PostId]
FROM [Post] AS [p]
LEFT JOIN [PostInstance] AS [p0] ON [p].[Id] = [p0].[PostId]
) AS [t] ON [b].[Id] = [t].[BlogId]
We could simplify this to:
LEFT JOIN [Post] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [PostInstance] AS [p0] ON [p].[Id] = [p0].[PostId]
The various subsequent discussions seem to widen this scope and bring in various other questions (at least in some cases) - single query vs. split query performance, or changes to the SQL (INNER instead of LEFT join) which would cause different data to be returned, and therefore would be incorrect.
I agree with @smitpatel and @NinoFloris that anything beyond the pure removal of a mathematically redundant subquery (as above) would be better off split out to a different issue; this doesn't mean we're against it or anything - but it would be a different proposal, that's all. If you're not trying to propose something else but are only trying to support this subquery elimination, then a very clear repro showing concrete perf differences between the simplified and non-simplified version could be helpful (but again, this would need to be precise).
The point I've made is that the "select 1m rows to return 30" seems very slow and inefficient, and can clearly be optimised, right? [...] Is there really nothing that can be done to mitigate this scenario? I mean, as I've shown, you can mitigate it by splitting it into two linq queries and get 30ms response times, but it feels like something could be done by EF itself. If it can't, that's fine.
There are indeed many scenarios where the same LINQ query can be very inefficient as a single SQL query (because of the so-called "cartesian explosion" problem), and run very efficiently as split query. This is a result of how SQL works, and is extensively explained in our docs. We intentionally don't decide for users when to use single vs. split query; the two are very different querying strategies and we believe users should make an informed decision on which one to use. In any case, this question is quite orthogonal to the other question discussed above.
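For example, since EF Core 5.0 that choice is made explicitly per query via AsSplitQuery (entity names here are only illustrative):

// Single query (the default): one SQL statement with joins, at the risk of a
// cartesian explosion when several collections are included.
var blogsSingle = context.Blogs
    .Include(b => b.Posts)
    .Include(b => b.Contributors)
    .ToList();

// Split query: EF issues a separate SQL statement per included collection instead.
var blogsSplit = context.Blogs
    .Include(b => b.Posts)
    .Include(b => b.Contributors)
    .AsSplitQuery()
    .ToList();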
To summarize, your interest and help on this is definitely appreciated, and can help us prioritize the issue etc. But in these subtle performance issues, it's important to focus on one thing at a time: subquery elimination (this issue), split vs. single query, or INNER vs. LEFT join, otherwise we may end up conflating different things.
Thanks. Totally agree with all this, very eloquently put. From my entirely personal (and somewhat selfish perspective) I'm not too worried about this, because I've had a workaround (the double/split linq query) for over 12 months now, and it works.
If it's possible to factor out the unfiltered join, or do something smarter with it, great. If it's more complicated than that, or I'm over-simplifying and hence missing potential edge-cases, that's also good. I'll continue to fiddle with my test app (I'm trying a modified version to include seeded data where some images don't have tags, per @NinoFloris's suggestion) and see if anything of interest pops up, and may write a slightly more concise and specific issue which doesn't have the various different flavours of query which have muddied the waters of this discussion. Thanks for the discussion though, chaps, it's been interesting and useful!
Sounds great, thanks for understanding @Webreaper. Removing the unnecessary subquery is definitely something we want to do.
Not sure if it belongs here, but I have a case that reminds me of this resulting SQL. If these joins (from Include) are followed by a Where clause, the navigation properties are not using the previously fetched include data (e.g. filtered includes), but rather making a separate select in the Where clause.
Is this intentional or somewhat related to this issue?
Trying to simplify our business case, we have something like this:
dbcontext.Set<Article>()
.Include(article => article.Comments.Where(comment => comment.IsDeleted == false))
.Where(article => article.Comments.Any())
.Select(article => new { articleId = article.Id, numberOfComments = article.Comments.Count() });
The way I read this is that Article has a navigation collection property to Comments. I want to fetch only articles that have any undeleted comments. Next I want to project this into a custom anonymous object with the articleId and the number of comments. However, in this case the count would also include navigation properties that were filtered out in the Include; e.g., let's say an article has 2 comments, one which is deleted and another which is not. I'd expect the Count in the Select to return 1, not 2. Is this intentional? I find nothing about this behavior in the docs.
I believe this is intentional. What you include and what you filter for may have nothing to do with each other e.g.
dbcontext.Set<Article>()
.Include(a => a.Comments.Where(comment => comment.Created >= lastActivity))
.Select(a => new { a.Id, Total = a.Comments.Count(), Unread = a.Comments.Count(c => c.Created >= lastActivity) });
This would be impossible if the include filter is propagated to the projection.
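Applied to the example from the question above, the filter has to be repeated wherever it matters - a sketch reusing the same names:

dbcontext.Set<Article>()
    // Only articles that have at least one undeleted comment.
    .Where(article => article.Comments.Any(comment => !comment.IsDeleted))
    .Select(article => new
    {
        articleId = article.Id,
        // The filtered Include does not flow into Count(), so the predicate is stated again.
        numberOfComments = article.Comments.Count(comment => !comment.IsDeleted)
    })
    .ToList();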
@ggjersund what @bachratyg wrote is correct, the two things are orthogonal and therefore must be specified separately. Hiding as off-topic.
Note from triage: putting this in 7.0 to consider doing it in some cases. It is unlikely we will implement this in all cases due to complexity.
The basic case, where the subquery doesn't have any additional operations (including joins), is converted to a table join in 7.0. This is the case where the transformation is mathematically correct. Leaving it up to @roji to determine if there are additional cases which are equivalent in all cases.
For reference, @smitpatel's PR for the above is #26476. At some point I'll take a look and see if I can think of other cases.
I have another use case where the sub-query is wrong:
SELECT [t0].[RoleId], [t0].[RoleCustomerId], [t0].[RoleName], [t0].[CustomerId]
FROM (
SELECT [r].[RoleId], [r].[RoleCustomerId], [r].[RoleName], [t].[CustomerId]
FROM [Roles] AS [r]
LEFT JOIN (
SELECT [c].[CustomerId], [c].[CustomerDeletionDate]
FROM [Customers] AS [c]
WHERE ([c].[CustomerDeletionDate] IS NULL) AND (@__ef_filter__p_2 = CAST(1 AS bit))
) AS [t] ON [r].[RoleCustomerId] = [t].[CustomerId]
WHERE ([t].[CustomerDeletionDate] IS NULL) AND (@__ef_filter__p_0 = CAST(1 AS bit))
ORDER BY [r].[RoleName]
OFFSET @__p_0 ROWS FETCH NEXT @__p_1 ROWS ONLY
) as [t0]
Entity<Customer>().HasQueryFilter(c => c.DeletionDate == null);
Entity<Role>().HasQueryFilter(c => c.CustomerId == null || c.Customer.DeletionDate == null);
There is an optional navigation property with a global filter applied. The sub-query generates a wrong result that cannot be properly filtered outside the subquery.
@msmolka please open a new issue with a runnable code sample reproducing the problem (the above isn't sufficient for us to reproduce).
Just to answer how to do it: add a custom implementation of IQueryTranslationPostprocessor deriving from RelationalQueryTranslationPostprocessor, and replace ShapedQueryExpression.QueryExpression (which would be a SelectExpression) with a different SelectExpression that generates the same result without subquery joins. If you find any APIs lacking to make the required change, then another option is to provide a custom IQuerySqlGenerator which will just simplify the subquery join to a table join when printing it out to the DbCommand text.
Could you please provide more details on how to do this? I'm working on a project with a complex DB structure. There are queries that join two dozen tables using Include().ThenInclude(). It also uses global query filters. One of these things is causing each table in a join to be replaced with a subquery that does a "select *". Even though this is a small DB (none of the tables has more than 1000 records), each query takes extremely long. I experimented with one query: it's translated into two pages of SQL and takes 900ms to execute. When I added IgnoreQueryFilters() to it, the SQL produced was 4 lines long and took 2ms to execute.
Please help.
Could you please provide more details on how to do this?
Just to make it clear, the instructions you quoted above are about modifying EF itself to implement this, not something to be done in your own application. Unless you intend to contribute this to EF (and it isn't a trivial change to make), this wouldn't be relevant for you.
Regardless, can you please put together a minimal database and code sample which shows the significant perf difference with and without query filters? This issue is still lacking a minimal repro which clearly shows the subquery join as being a significant perf issue. Also, in many cases where users encounter such a perf difference, the actual cause lies elsewhere; so it would be good to be sure what's going on.
For queries with includes, we currently generate joins with a subquery:
We could simplify this to:
We should measure the execution perf difference between the above two. Even if there is no (significant) difference, we could still decide to do this for SQL simplicity.
Originally raised in https://github.com/aspnet/EntityFrameworkCore/issues/17455.