Eager Loading Many to Many Relationship Doesn't Use the Index

da3dsoul commented 3 years ago

File a bug

Okay, so I'm using SQLite. That might matter because DBMS's are pointlessly different just for the sake of it.

I have a many to many relationship declared using List\<Entity> in each class. That generally works, but I know it's newer.

EF Core generates a query that SQLite doesn't use an index search for, and instead it uses a scan, which is slow.

Include your code

    public class Image
    {
        public int ImageId { get; set; }
        public int Width { get; set; }
        public int Height { get; set; }

        public virtual List<ImageTag> Tags { get; set; }
    }

    public class ImageTag
    {
        public int ImageTagId { get; set; }

        public string Name { get; set; }
        public string Description { get; set; }
        public string Type { get; set; }

        [IgnoreDataMember]
        public virtual List<Image> Images { get; set; }
    }

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            // nullability
            modelBuilder.Entity<Image>().Property(a => a.Width).IsRequired();
            modelBuilder.Entity<Image>().Property(a => a.Height).IsRequired();
            modelBuilder.Entity<ImageTag>().Property(a => a.Name).IsRequired();

            // indexes
            modelBuilder.Entity<Image>().HasIndex(a => a.Width);
            modelBuilder.Entity<Image>().HasIndex(a => a.Height);
            modelBuilder.Entity<ImageTag>().HasIndex(a => a.Name).IsUnique();
            modelBuilder.Entity<ImageTag>().HasIndex(a => a.Type);
        }

Query

var namesToLookup = new List<string> { "a", "b", "c" };
var existingTags = await Set<ImageTag>().Include(a => a.Images)
                .Where(a => namesToLookup.Contains(a.Name)).ToListAsync(token);

So that generates a query like so:

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId", "t"."Height", "t"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN (
    SELECT "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
    FROM "ImageImageTag" AS "i0"
    INNER JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
) AS "t" ON "i"."ImageTagId" = "t"."TagsImageTagId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId"

Which explains to this

MATERIALIZE 1
SCAN TABLE ImageImageTag AS i0
SEARCH TABLE Images AS i1 USING INTEGER PRIMARY KEY (rowid=?)
SEARCH TABLE ImageTags AS i USING INDEX IX_ImageTags_Name (Name=?)
SEARCH SUBQUERY 1 AS t USING AUTOMATIC COVERING INDEX (TagsImageTagId=?)
USE TEMP B-TREE FOR ORDER BY

And because of the SCAN TABLE, it's really slow, like minutes, which would be okay if it wasn't a really easy fix.

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
FROM "ImageTags" AS "i"
INNER JOIN "ImageImageTag" AS "i0" ON "i0"."TagsImageTagId" = "i"."ImageTagId"
INNER JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId"

Explains to

SEARCH TABLE ImageTags AS i USING INDEX IX_ImageTags_Name (Name=?)
SEARCH TABLE ImageImageTag AS i0 USING INDEX IX_ImageImageTag_TagsImageTagId (TagsImageTagId=?)
SEARCH TABLE Images AS i1 USING INTEGER PRIMARY KEY (rowid=?)
USE TEMP B-TREE FOR ORDER BY

That's fast. SQLite Studio just says [00:28:36] Query finished in 0.000 second(s).

So this issue is twofold.

Is it like that for a (good) reason?
Is there a workaround that doesn't require me to manually define my intermediate tables or execute raw sql?

I have a several hundred gig database already built, so I'm not running a migration to define the tables myself to then try to manually build a better query in LINQ. I'm not running raw sql because that defeats the purpose. If I was willing to hardcode sql, I wouldn't be using an ORM with so many known quirks. No offense, it's still by far the best option.

Sorry for not following the template exactly, but I've seen this reproduced in other issues. They just didn't complain about it, which makes me think it may be an SQLite <-> EFCore quirk. It's not a specific bug, but more of a logic issue.

Include provider and version information

EF Core version: 6.0.0-preview4... System.Data.Sqlite.Core is the same version Database provider: SQLite Target framework: net5.0

I would be willing to use net6.0, but this app runs on linux, and installing the net6.0 preview tools for linux is still a pita.

ErikEJ commented 3 years ago

Which provider exactly are you using?

Please share the packagereference from your .csproj.

da3dsoul commented 3 years ago

I'm not sure exactly what you mean. This is what fits on the screen in Rider. The left version is what is installed. The right is available from nuget, though it's not doing compatibility checks, so I can't actually install it, as the new ones only target net6.0.

EDIT: Oh I said System.Data instead of Microsoft.Data. That was my bad. It's Microsoft.Data.Sqlite.Core

da3dsoul commented 3 years ago

I have Proxies installed because I was being lazy before, but I don't use them, anymore. Forgot to uninstall it.

da3dsoul commented 3 years ago

I just found out that there's another overload for Include, so I tried it to see.

var existingTags = await Set<ImageTag>().Include("ImageImageTag.Image")
                .Where(a => namesToLookup.Contains(a.Name)).ToListAsync(token);

Returns an identical query

EDIT: tried another method.

var existingTags = await Set<Image>().SelectMany(a => a.Tags)
                .Where(a => namesToLookup.Contains(a.Name)).Include(a => a.Images).ToListAsync(token);

Returns an even worse query

SELECT "t"."ImageTagId", "t"."Description", "t"."Name", "t"."Type", "i"."ImageId", "t"."ImagesImageId", "t"."TagsImageTagId", "t0"."ImagesImageId", "t0"."TagsImageTagId", "t0"."ImageId", "t0"."Height", "t0"."Width"
FROM "Images" AS "i"
INNER JOIN (
    SELECT "i1"."ImageTagId", "i1"."Description", "i1"."Name", "i1"."Type", "i0"."ImagesImageId", "i0"."TagsImageTagId"
    FROM "ImageImageTag" AS "i0"
    INNER JOIN "ImageTags" AS "i1" ON "i0"."TagsImageTagId" = "i1"."ImageTagId"
) AS "t" ON "i"."ImageId" = "t"."ImagesImageId"
LEFT JOIN (
    SELECT "i2"."ImagesImageId", "i2"."TagsImageTagId", "i3"."ImageId", "i3"."Height", "i3"."Width"
    FROM "ImageImageTag" AS "i2"
    INNER JOIN "Images" AS "i3" ON "i2"."ImagesImageId" = "i3"."ImageId"
) AS "t0" ON "t"."ImageTagId" = "t0"."TagsImageTagId"
WHERE "t"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageId", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageTagId", "t0"."ImagesImageId", "t0"."TagsImageTagId", "t0"."ImageId"

EDIT2: I also tried updating to the latest version that supports net5.0, which is 6.0.0-preview.5.21301.9. No difference in query given

da3dsoul commented 3 years ago

@ErikEJ sorry for the ping. Some prefer it, some hate it. Any ideas on what I could try?

AndriySvyryd commented 3 years ago

@smitpatel cc

smitpatel commented 3 years ago

Both SQL you have posted generated different results and are not same. We explicit use LEFT JOIN because when ImageTag doesn't have any associated Image you still want it in your results. All ImageTags matching where predicate needs to be in result regardless of associated Images. Associated Images should populate collection if there are any. INNER JOIN will filter out ImageTag which doesn't have Image. So queries are the way it is for a reason.

If you understand what is the difference in the results above and are ok with it then you have to write your join manually in LINQ. EF Core will happily translate it to INNER JOIN if you use Join operator. Keep in mind that this join will require you to reference the implicitly created join entity also which means, either you have to use the name EF Core gives internally (which could be hard to find and bit fragile) or you need to use API to give it the name you want which will be added configuration code in model.

da3dsoul commented 3 years ago

I'm aware of the difference. The INNER JOIN was an oversight, but I figured it wouldn't make much difference vs a LEFT JOIN in the EXPLAIN. I can test it, but wouldn't

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId", "t"."Height", "t"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN (
    SELECT "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
    FROM "ImageImageTag" AS "i0"
    INNER JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
) AS "t" ON "i"."ImageTagId" = "t"."TagsImageTagId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId"

not be necessarily equivalent, but since I don't access the intermediate tables in this case, functionally equivalent to this? With a left join, after it goes back to EF, it would just be null. If ImageImageTags is null, then so is Images.

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN "ImageImageTag" AS "i0" ON "i0"."TagsImageTagId" = "i"."ImageTagId"
LEFT JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId"

My dev machine with a test database is not right next to me, so I'll test that query explain in a bit.

smitpatel commented 3 years ago

Those queries are not equivalent either since the nested inner join performs filter (by definition) and converting it to LEFT JOIN can generate different result. At the end of the day, queries are generated to produce accurate result in most performant way. We keep perf in mind but we don't change results for the sake of performance. As I said above, you can write your own LINQ query without using navigations to generate exactly SQL you want.

da3dsoul commented 3 years ago

This could be my misunderstanding of how SQL works in this case. The subquery:

    SELECT "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
    FROM "ImageImageTag" AS "i0"
    INNER JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"

This would run first, building an entire map of ImageImageTag to Image, while filtering out the ones that don't match. After that, the outer query would look up ImageImageTags via the ID from that set. It requires getting more results than it needed first (the INDEX SCAN in the explain), then additionally filtering it. Am I correct in that logic flow?

Internally the single statement is different.

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN "ImageImageTag" AS "i0" ON "i0"."TagsImageTagId" = "i"."ImageTagId"
LEFT JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId"

It builds a mapping of ImageTag to ImageImageTag, and if there is no result, it doesn't bother trying to map Images to the ImageImageTag. The result set should be the same, it just does the filtering in a different order, but more efficiently, as it knows in advance what it's looking for. Am I correct here?

I'm not trying to argue that I'm right here. I'm just trying to learn more about it.

" As I said above, you can write your own LINQ query without using navigations to generate exactly SQL you want." Any chance you can point me to how I could go about doing that? All of the examples I've found do it with navigation properties. I need these to be attached for change tracking and modification.

It's also possible that there's a much better way to do what I'm doing, but that seems to be out of the scope of this issue.

EDIT: The plan for LEFT JOINS in a flat query

SEARCH TABLE ImageTags AS i USING INDEX IX_ImageTags_Name (Name=?)
SEARCH TABLE ImageImageTag AS i0 USING INDEX IX_ImageImageTag_TagsImageTagId (TagsImageTagId=?)
SEARCH TABLE Images AS i1 USING INTEGER PRIMARY KEY (rowid=?)
USE TEMP B-TREE FOR ORDER BY

EDIT2: ~In the case of keeping the same style of query, it seems like it would be smarter to build the query scanning on the other edge, Image, as the number of records is going to be much smaller than the number of the relation mapping records, like so:~ NVM an explain of it yielded the same SCAN on ImageImageTag

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId", "t"."Height", "t"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN (
    SELECT "i0"."ImagesImageId", "i0"."TagsImageTagId", "i1"."ImageId", "i1"."Height", "i1"."Width"
    FROM "Images" AS "i1"
    INNER JOIN "ImageImageTags" AS "i0" ON "i0"."ImagesImageId" = "i1"."ImageId"
) AS "t" ON "i"."ImageTagId" = "t"."TagsImageTagId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId", "t"."ImagesImageId", "t"."TagsImageTagId", "t"."ImageId"

smitpatel commented 3 years ago

When performing inner join between ImageImageTags to Images if ImageImageTag doesn't have any Image then that row would be removed from result. And then it will perform LeftJoin with ImageTag table and fill in nulls when there is no associated row. When you perform both Left join on top level, if ImageImageTags matches an ImageTag (so it will have data from that LeftJoin) but if it doesn't have Image associated it will fill in null for those values. This is additional row which is not in result set of first query. Writing top level left joins doesn't perform filter inner join brings in implicitly.

da3dsoul commented 3 years ago

I see. If I used the intermediate tables, that would be different.

In this case, EF handles the intermediate ImageImageTags. I don't have it in the C# models, so I can't use them (afaik). EF controls that logic entirely, so it could use the flat query, as the C# Image.Tags object would still yield no results for either method.

I think in this case, that would be better for performance reasons, but it is a bigger change than I initially thought for the functionality of EF.

Is it possible to use LINQ without navigation properties while having attached entities with change tracking?

da3dsoul commented 3 years ago

After re-reading some, I think what you meant was to do something like this:

await Set<ImageTag>()
                .LeftJoin(Set<ImageImageTag>(), tag => tag.ImageTagId, imageTag => imageTag.TagImageId,
                    (tag, imageTag) => new { Tag = tag, Inter = imageTag }).LeftJoin(Set<Image>(),
                    items => items.Inter.ImageImageId, image => image.ImageId,
                    (inter, image) => new { inter.Tag, image });

ImageImageTag doesn't exist, so that's a compile time error. I've also not used the EF Join method, so that could also be incorrect code, but I think the intent is still there. Is there a way to do that without access to the intermediate type?

AndriySvyryd commented 3 years ago

Something like Set<Dictionary<string, object>>("ImageImageTag")

da3dsoul commented 3 years ago

Huh, would that map the Column Name to Value?

AndriySvyryd commented 3 years ago

No, that's just accessing the DbSet for the EntityType created automatically by EF

da3dsoul commented 3 years ago

Interesting, I'll play with it. That might solve my immediate problem, but I do think that this is something that EF should handle better for the (possibly distant) future.

Thanks much

AndriySvyryd commented 3 years ago

This is something out-of-scope for EF to do automatically as determining whether this optimization would be appropriate for a given query and model is not trivial and might depend on the data and user's intent.

da3dsoul commented 3 years ago

For future peeps that may come across this, I used the following code:

            var namesToLookup = new List<string> { "a", "b", "c" };
            var imageImageTags = Set<Dictionary<string, object>>("ImageImageTag");
            var tempTags = await (from imageTag in Set<ImageTag>()
                join imageImageTag in imageImageTags
                    on imageTag.ImageTagId equals EF.Property<int>(imageImageTag, "TagsImageTagId") into grouping
                from imageImageTag in grouping.DefaultIfEmpty()
                join image in Set<Image>()
                    on EF.Property<int>(imageImageTag, "ImagesImageId") equals image.ImageId into grouping2
                from image in grouping2.DefaultIfEmpty()
                where namesToLookup.Contains(imageTag.Name)
                orderby imageTag.ImageTagId
                select new { imageTag, image }).ToListAsync(token);
            var existingTags = tempTags.GroupBy(a => a.imageTag, a => a.image).Select(a =>
            {
                var tag = a.Key;
                tag.Images = a.ToList();
                return tag;
            }).ToList();
            AttachRange(existingTags);

To generate the following SQL:

SELECT "i"."ImageTagId", "i"."Description", "i"."Name", "i"."Type", "i1"."ImageId", "i1"."Height", "i1"."Width"
FROM "ImageTags" AS "i"
LEFT JOIN "ImageImageTag" AS "i0" ON "i"."ImageTagId" = "i0"."TagsImageTagId"
LEFT JOIN "Images" AS "i1" ON "i0"."ImagesImageId" = "i1"."ImageId"
WHERE "i"."Name" IN ('a', 'b', 'c')
ORDER BY "i"."ImageTagId"

The grouping, model relationship mapping, and context attaching needs to be post processed at the end, but it's still much faster than trying to run the subquery in my case. EDIT: After letting it run against real data for a few hours, it went from an average of 4:12s (rounding) to <1s. The mapping table in question is several hundred thousand records.

Because of the inaccessibility of the intermediate table, I think this could be included in EF with some optimization and cleanup. I can't think of any situation where the first left join would not be null without the second left join matching, as only EF can make them when a user provides a mapping of Image to ImageTag.

dotnet / efcore