dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/

Duplicate subquery when doing base type/derived type projection #30635

Open aradalvand opened 1 year ago

aradalvand commented 1 year ago

Given this model:

public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options)
    {
    }

    protected override void OnModelCreating(ModelBuilder builder)
    {
        builder.Entity<Lesson>()
            .HasDiscriminator<string>("Type")
                .HasValue<VideoLesson>("video")
                .HasValue<ArticleLesson>("article");
    }

    public DbSet<Course> Courses => Set<Course>();
    public DbSet<Rating> Ratings => Set<Rating>();
    public DbSet<Lesson> Lessons => Set<Lesson>();
}

public class Course
{
    public int Id { get; set; }
    public required string Title { get; set; }

    public ICollection<Lesson> Lessons { get; set; } = default!;
    public ICollection<Rating> Ratings { get; set; } = default!;
}

public class Rating
{
    public int Id { get; set; }
    public required int CourseId { get; set; }
    public required byte Stars { get; set; }

    public Course Course { get; set; } = default!;
}

public abstract class Lesson
{
    public int Id { get; set; }
    public required int CourseId { get; set; }
    public required string Title { get; set; }

    public Course Course { get; set; } = default!;
}

public class VideoLesson : Lesson
{
    public required Uri Url { get; set; }
}
public class ArticleLesson : Lesson
{
    public required string Text { get; set; }
}

If you were to run a query like this, where you check the type of the lesson and then project into the corresponding DTO type:

db.Lessons.Select<Lesson, LessonDto?>(l =>
    l is ArticleLesson
    ? new ArticleLessonDto
    {
        Title = l.Title,
        CourseRating = l.Course.Ratings.Average(r => r.Stars),
        Text = ((ArticleLesson)l).Text,
    }
    : l is VideoLesson
    ? new VideoLessonDto
    {
        Title = l.Title,
        CourseRating = l.Course.Ratings.Average(r => r.Stars),
        Url = ((VideoLesson)l).Url,
    }
    : null)
    .ToList();

This would yield the following SQL:

SELECT l."Type" = 'article', l."Title", l."Text", (
    SELECT avg(r."Stars"::int::double precision)
    FROM "Ratings" AS r
    WHERE c."Id" = r."CourseId"), l."Type" = 'video', l."Url", (
    SELECT avg(r0."Stars"::int::double precision)
    FROM "Ratings" AS r0
    WHERE c."Id" = r0."CourseId")
FROM "Lessons" AS l
INNER JOIN "Courses" AS c ON l."CourseId" = c."Id"

The problem, as you might be able to immediately tell, is that the subquery for CourseRating is duplicated.

It may be worth noting that if you selected the Title property on the Course instead, this wouldn't happen:

db.Lessons.Select<Lesson, LessonDto?>(l =>
    l is ArticleLesson
    ? new ArticleLessonDto
    {
        Title = l.Title,
        CourseTitle = l.Course.Title,
        Text = ((ArticleLesson)l).Text,
    }
    : l is VideoLesson
    ? new VideoLessonDto
    {
        Title = l.Title,
        CourseTitle = l.Course.Title,
        Url = ((VideoLesson)l).Url,
    }
    : null)
    .ToList();

There's going to be only a single occurrence of c."Title" in the query, as expected:

SELECT l."Type" = 'article', l."Title", l."Text", c."Title", l."Type" = 'video', l."Url"
FROM "Lessons" AS l
INNER JOIN "Courses" AS c ON l."CourseId" = c."Id"

EF Core version: 7.0.0
Database provider: Npgsql
Target framework: .NET 7.0
Operating system: Ubuntu 22.04
IDE: VS Code

I initially created an issue in the Npgsql repo but was told that this is related to EF Core itself.

roji commented 1 year ago

Since the two subqueries are present as-is in your original LINQ query (l.Course.Ratings.Average(r => r.Stars)), you're effectively asking EF to deduplicate them, i.e. to optimize your LINQ query.

Have you tried first selecting out to an intermediate anonymous object which performs the Average, and then to select out again for the DTOs? That would deduplicate the Average operation at the LINQ level.

aradalvand commented 1 year ago

Have you tried first selecting out to an intermediate anonymous object which performs the Average, and then to select out again for the DTOs? That would deduplicate the Average operation at the LINQ level.

@roji Presumably that's possible but it would make the type-checking of the Lesson awkward and difficult, which is in a way the whole point of this query. It's also an attempt to work around something that I believe should be solved at a lower level.

EF already de-duplicates simpler expressions like l.Title (as I mentioned earlier), so why not extend that to all expressions? Makes sense I think.
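
For what it's worth, duplication like this is mechanically detectable in the expression tree before translation. Here's a rough sketch (nothing like EF's actual pipeline; it just uses ToString() as a crude structural key, and the visitor name is made up for illustration):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Counts structurally identical method-call subexpressions in a LINQ expression tree.
// ToString() is a crude structural key: fine for illustration, not for a real implementation.
class DuplicateSubexpressionFinder : ExpressionVisitor
{
    private readonly Dictionary<string, int> _counts = new();

    public IEnumerable<string> Duplicates
        => _counts.Where(kv => kv.Value > 1).Select(kv => kv.Key);

    protected override Expression VisitMethodCall(MethodCallExpression node)
    {
        var key = node.ToString();
        _counts[key] = _counts.TryGetValue(key, out var n) ? n + 1 : 1;
        return base.VisitMethodCall(node);
    }
}

// Usage against the query above (before enumerating it):
// var finder = new DuplicateSubexpressionFinder();
// finder.Visit(query.Expression);
// foreach (var dup in finder.Duplicates) Console.WriteLine(dup);

Running something like this over the original query's expression would report the Ratings.Average(...) call twice, which is exactly the duplication that ends up in the SQL.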

roji commented 1 year ago

Presumably that's possible but it would make the type-checking of the Lesson awkward and difficult, which is in a way the whole point of this query.

How would it make it more difficult? Lesson is the abstract base class here, and Course is defined on it:

db.Lessons
    .Select(l => new {
        Lesson = l,
        CourseRating = l.Course.Ratings.Average(r => r.Stars)
    })
    .Select(x =>
        x.Lesson is ArticleLesson
        ? new ArticleLessonDto
        {
            Title = x.Lesson.Title,
            CourseRating = x.CourseRating,
            Text = ((ArticleLesson)x.Lesson).Text,
        }
        : x.Lesson is VideoLesson
        ? new VideoLessonDto
        {
            Title = x.Lesson.Title,
            CourseRating = x.CourseRating,
            Url = ((VideoLesson)x.Lesson).Url,
        }
        : (LessonDto?)null)
In fact, regardless of EF, I'd have advised rewriting your original LINQ query above to avoid repeating the CourseRating expression twice (duplication is always a bad idea). Note that I haven't tested whether this solves the problem on the SQL side, but I suspect it does.

More generally, when LINQ queries are written in a less condensed/efficient form, EF can't always optimize things out and de-duplicate everything for you; doing that would require more complexity in the query pipeline code and could also affect query compilation performance. We sometimes do this when it's easy/cheap, but we don't generally aim to do so in all cases; this may be what's happening with l.Title vs. the more complex query you're trying to execute.

aradalvand commented 1 year ago

How would it make it more difficult? Lesson is the abstract base class here, and Course is defined on it:

@roji I'm writing a GraphQL-to-LINQ transformer and something like that would make the implementation far more complex. I also have to account for navigation properties and whatnot, which I've found don't always play nicely with intermediate projections like that. I know that's a me problem, but I'm just saying it would have been more reasonable if the standard query yielded the optimal SQL in the first place.

In fact, regardless of EF, I'd have advised rewriting your original LINQ query above to avoid repeating the CourseRating expression twice (duplication is always a bad idea). Note that I haven't tested whether this solves the problem on the SQL side, but I suspect it does.

That would be true if I were writing these queries by hand, but they're being auto-generated. Also, someone could be using a tool like EntityFrameworkCore.Projectables or LINQKit, where you're not actually repeating expressions in your source code, but the expression gets repeated once it's all fed to EF.
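
For instance, with LINQKit's usual AsExpandable()/Invoke() composition pattern (a rough sketch, assuming the same model, DTO types, and db context instance as above), the averaging expression is written exactly once in the source, yet the expanded tree that EF receives still contains two copies of it:

using System;
using System.Linq;
using System.Linq.Expressions;
using LinqKit;

// The averaging logic exists exactly once in source code...
Expression<Func<Lesson, double>> courseRating =
    l => l.Course.Ratings.Average(r => r.Stars);

var dtos = db.Lessons
    .AsExpandable()
    .Select<Lesson, LessonDto?>(l =>
        l is ArticleLesson
        ? new ArticleLessonDto
        {
            Title = l.Title,
            CourseRating = courseRating.Invoke(l), // expanded copy #1
            Text = ((ArticleLesson)l).Text,
        }
        : l is VideoLesson
        ? new VideoLessonDto
        {
            Title = l.Title,
            CourseRating = courseRating.Invoke(l), // expanded copy #2
            Url = ((VideoLesson)l).Url,
        }
        : null)
    .ToList();

// ...but after LINQKit expands the Invoke calls, the expression tree handed to EF
// contains the Average subexpression twice, so the SQL presumably still ends up
// with the duplicated subquery shown at the top of this issue.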

More generally, when LINQ queries are written in a less condensed/efficient form, EF can't always optimize things out and de-duplicate everything for you; doing that would require more complexity in the query pipeline code and could also affect query compilation performance. We sometimes do this when it's easy/cheap, but we don't generally aim to do so in all cases; this may be what's happening with l.Title vs. the more complex query you're trying to execute.

Understandable. But I would argue EF should actually aim to do so generally. I'm not sure query compilation performance would be affected to any significant degree by something like this; also, given that queries are cached and usually compiled only once, I'd say the benefits of deduplication outweigh its costs, especially as the libraries I mentioned above, and similar solutions for expression/query "composability", become more widely adopted.

roji commented 1 year ago

I'm writing a GraphQL-to-LINQ transformer and something like that would make the implementation far more complex [...]

I understand and sympathize, but at the end of the day you're asking a lower layer (EF) to take the sub-optimal input you give it and make it optimal. I'm not sure exactly how you're generating the LINQ tree, but I doubt EF is in a better position than you to optimize that tree after it's produced. In general, it's always better to produce something optimized at the source, rather than producing something less optimized and expecting later components to somehow piece it together and make it better again.

Also, someone could be using a tool like EntityFrameworkCore.Projectables or LINQKit where you're not actually repeating expressions in your source code, but it gets repeated once it's all fed to EF

If those tools produce sub-optimal trees, I'd argue that they need to be improved (just like whatever you're doing to transform GraphQL to LINQ), rather than trying to make EF compensate after the fact.

This is a bit similar to writing bad C# code and expecting the compiler/JIT to always emit optimized binary code. Although in some simple cases the compiler can see through bad code and yield optimal results, it's ultimately infeasible/non-scalable to expect it to always do that.