GroupBy over entity type: stop including all columns

roji commented 3 weeks ago

Our current implementation of GroupBy over entity type populates all of the entity type's columns into the GROUP BY clause:

_ = await context.Posts
    .GroupBy(p => p.Blog)
    .Select(g => new { g.Key.Id, Count = g.Count() })
    .ToListAsync();

SQL:

SELECT [b].[Id], COUNT(*) AS [Count]
FROM [Posts] AS [p]
INNER JOIN [Blogs] AS [b] ON [p].[BlogId] = [b].[Id]
GROUP BY [b].[Id], [b].[Name]

Note that the LINQ query doesn't project the Blog's Name out, and there's no reason for us to group by Name (grouping by the primary key should be enough - just like in the final GroupBy case). But our SQL tells the database to do the (considerable!) extra work of grouping by Name.

This issue also affects final GroupBy, where the grouping key is converted to orderings, so the final query has ORDER BY for all of the entity type's columns, instead of just for the primary key's columns. Aside from being problematic perf-wise, this is also the cause of https://github.com/npgsql/efcore.pg/issues/3202, where the entity type has a non-sortable column type (xid) which causes the query to fail.

Note previous issue #34895 for final GroupBy, and unmerged PR #34896 which specifically removed the unneeded columns post-hoc for final GroupBy; but the more correct fix would be to not have them in the groupby clause in the first place.

roji commented 3 weeks ago

One possible approach here would be to solve this via pruning, similar to how we prune unneeded subquery projections. However, this is tricky; it means that for a grouping key, we need to be able to figure out if a subset of the columns constitute a unique set (e.g. a primary key), and at that point we can prune any other column - assuming it's not referenced (e.g. projected). This isn't easy, x.Id is unique, x.Id + 1 is also unique, but x.Id % 2 is not, etc. I'm not sure we should go in this direction...

More philosophically, this problem generally stems from our design, where we want to be able to just bind a property and generate a ColumnExpression, without that affecting the SelectExpression; or more generally, it's our (flawed) design where RelationalSqlTranslatingEV is expected to be able to accept any lambda and return its translation, without any state or modification to the current SelectExpression. In other words, I think the right design here is for our translation of GroupBy() to only popualate the actual columns specified in the operator (e.g. if it's an entity type, just put the primary key columns). Then, if a non-PK column needs to be projected (or otherwise accessed), we should update SelectExpression.GroupBy at that point, as part of binding the property (again, this requires modifying the SelectExpression from within RelationalSqlTranslatingEV/ProjectionBindingEV, which is the thing that's against our current design).

This is a bit similar to our approach with nav expansion. Instead of adding JOINs at the point where a property is bound via the navigation, we preprocess the tree and transform it, so that when we reach translation, the tree has been remade in a way that JOINs have already been added and properties can just be bound without adding JOINs. Removing the nav expansion preprocessing step would mean adding the JOIN during translation, at the moment where a property is bound - similar to how we'd add the GROUP BY column when the corresponding property is bound.

Back to the concretr implementation, adding columns lazily to the GROUP BY clause assumes that there can't be a pushdown between the initial GroupBy() translation and the binding to its output; if there is one, the GROUP BY would already be nested within a subquery, and we'd have to go down and update it there, which probably isn't feasible. Think if such a case can actually exist.

ajcvickers commented 1 day ago

@roji Agreed we should change this, but will the change be breaking for no-tracking queries where multiple objects with the same primary key value can be returned?

dotnet / efcore

GroupBy over entity type: stop including all columns #34921