dragnilar commented 4 years ago

Note:

Issue #20892 has been created to track actual work on split queries in EF Core 5.0. The new issue is locked so that it can be subscribed to for updates on the implementation without noise.
This older issue has been re-purposed to be discussion about different approaches, etc. It will not be locked because it is important to us that we don’t shut down channels of communication with the community.

Preview6 introduces AsSplitQuery API to load collections using multiple queries. Details of the API has been posted https://github.com/dotnet/efcore/issues/20892#issuecomment-644456029

A way to rewrite the query to avoid large result set & mimick previous version behavior is posted https://gist.github.com/smitpatel/d4cb3619e5b33e8d9ea24d3f2a88333a

Note: Per @roji's request, I am opening this in response to the comments on https://github.com/aspnet/EntityFrameworkCore/issues/17455

I have a very large query that contains about 57 includes that is being used to copy down a large entity so that it can be modified and cloned.

With EF 2.2.6, this large query ran successfully in about 1-3 seconds (variable). With the changes in 3.0 (all includes create one entire SQL statement with joins), the query takes significantly longer and always times out with the default execution timeout settings.

Steps to reproduce

Use a (IMO nasty) Linq query similar to the one as follows:

dbContext.Projects
    .Include(z => z.Financial)
    .Include(z => z.ProjectProtocol)
    .Include(z => z.ReportCategories)
    .Include(z => z.ReportSubCategories)
    .Include(x => x.SubProjects).ThenInclude(y => y.Address)
    .Include(x => x.SubProjects).ThenInclude(y => y.LegacyAddress)
    .Include(x => x.SubProjects).ThenInclude(z => z.BuildingTypes)
    .Include(x => x.SubProjects).ThenInclude(z => z.Buildings).ThenInclude(b => b.BuildingType)
    .Include(x => x.SubProjects).ThenInclude(z => z.Buildings).ThenInclude(b => b.Site)
    .Include(x => x.SubProjects).ThenInclude(z => z.Sites).ThenInclude(s => s.Address)
    .Include(x => x.SubProjects).ThenInclude(z => z.Participants).ThenInclude(p => p.Address)
    .Include(x => x.SubProjects).ThenInclude(z => z.ExcelFileJson)
    .Include(x => x.SubProjects).ThenInclude(z => z.CompanionAddress)
    .Include(x => x.SubProjects).ThenInclude(z => z.UtilityTypes)
    .Include(x => x.SubProjects).ThenInclude(z => z.InspectedUnits).ThenInclude(i => i.Building)
    .Include(x => x.SubProjects).ThenInclude(z => z.InspectedUnits).ThenInclude(i => i.UnitType)
    .Include(x => x.SubProjects).ThenInclude(z => z.Utilities).ThenInclude(i => i.UtilityType)
    .Include(x => x.SubProjects).ThenInclude(z => z.Units).ThenInclude(i => i.Building)
    .Include(x => x.SubProjects).ThenInclude(z => z.UnitTypes)
    .Include(x => x.SubProjects).ThenInclude(z => z.CommonAreas).ThenInclude(ca => ca.Building)
    .Include(x => x.SubProjects).ThenInclude(z => z.FixtureAreas).ThenInclude(w => w.Fixtures)
    .Include(x => x.SubProjects).ThenInclude(y => y.LineItems).ThenInclude(z => z.EnergyOneItems)
    .Include(x => x.SubProjects).ThenInclude(y => y.LineItems).ThenInclude(z => z.EnergyTwoItems)
    .Include(x => x.SubProjects).ThenInclude(y => y.LineItems)
    .ThenInclude(z => z.TraditionalItem).AsNoTracking().FirstOrDefault(x => x.Id == project.Id);

Results:

With EF Core 2.2.6 - I can see in the output via the SQL Server Profiler that EF is breaking up the LINQ statement into smaller queries. The overall process takes about 1-3 seconds.

With EF Core 3.0 - I can see in the output via the SQL Server Profiler that EF is emitting one massive query with lots and lots of joins. The overall process always times out with the default execution timeout setting.

At this point, I am open to the notion that this query needs to be either re-written or the process needs to be changed for handling the cloning. I would still like to hear if there are any workarounds, findings that this is a bug or other suggestions to avoid having to devote a significant amount of time on rewriting.

Edit For now we worked around this by splitting the query up manually using EF Plus' "Includes Optimized" method and then looping through the change tracker to set all of the entities as untracked so we can then reset their keys so that the graph can be comitted as a clone (this gave me a flashback to my EF 6 days).

Note: The model changed somewhat between the time this issue was first encountered and now due to user requests and other factors. I should also note that the system is now in production and users are pretty happy with the performance.

            var clone = dbContext.Projects
                .IncludeOptimized(z => z.Financial)
                .IncludeOptimized(z => z.ProjectProtocol)
                .IncludeOptimized(z => z.ReportCategories)
                .IncludeOptimized(z => z.ReportSubCategories)
                .IncludeOptimized(z => z.SubProject)
                .IncludeOptimized(z => z.SubProject.ValidationFlag)
                .IncludeOptimized(z => z.SubProject.ExcelFileJson)
                .IncludeOptimized(z => z.SubProject.EnergyAuditUtilities)
                .IncludeOptimized(z => z.SubProject.EnergyAuditData)
                .IncludeOptimized(z => z.SubProject.EnergyAuditData.EnergyAuditAreas)
                .IncludeOptimized(z => z.SubProject.Address)
                .IncludeOptimized(z => z.SubProject.Utilities)
                .IncludeOptimized(z => z.SubProject.UtilityTypes)
                .IncludeOptimized(z => z.SubProject.Units)
                .IncludeOptimized(z => z.SubProject.Units.Select(y => y.Building))
                .IncludeOptimized(z => z.SubProject.BuildingTypes)
                .IncludeOptimized(z => z.SubProject.Buildings)
                .IncludeOptimized(z => z.SubProject.Buildings.Select(y => y.BuildingType))
                .IncludeOptimized(z => z.SubProject.Buildings.Select(y => y.Site))
                .IncludeOptimized(z => z.SubProject.Sites)
                .IncludeOptimized(z => z.SubProject.Sites.Select(y => y.Address))
                .IncludeOptimized(z => z.SubProject.Participants)
                .IncludeOptimized(z => z.SubProject.Participants.Select(y => y.Address))
                .IncludeOptimized(z => z.SubProject.InspectedUnits)
                .IncludeOptimized(z => z.SubProject.InspectedUnits.Select(y => y.Building))
                .IncludeOptimized(z => z.SubProject.InspectedUnits.Select(y => y.UnitType))
                .IncludeOptimized(z => z.SubProject.UnitTypes)
                .IncludeOptimized(z => z.SubProject.CommonAreas)
                .IncludeOptimized(z => z.SubProject.CommonAreas.Select(y => y.Building))
                .IncludeOptimized(z => z.SubProject.CommonAreas.Select(y => y.Building).Select(z => z.Units))
                .IncludeOptimized(z => z.SubProject.FixtureAreas)
                .IncludeOptimized(z => z.SubProject.FixtureAreas.Select(y => y.Fixtures))
                .IncludeOptimized(x => x.SubProject.LineItems)
                .IncludeOptimized(x => x.SubProject.LineItems.Select(y => y.EnergyOneItem))
                .IncludeOptimized(x => x.SubProject.LineItems.Select(y => y.EnergyTwoTwoItem))
                .IncludeOptimized(x => x.SubProject.LineItems.Select(y => y.TraditionalItem))
                .FirstOrDefault(x => x.Id == project.Id);

            if (clone != null)
            {
                foreach (var entityEntry in dbContext.ChangeTracker.Entries())
                {
                    if (entityEntry.Entity != null)
                    {
                        entityEntry.State = EntityState.Detached;
                    }
                }
                return clone;
            }

My team was struggling with this at first due to the fact that EF was at first wiping out the entities when we detatched them due to an issue on https://github.com/dotnet/efcore/issues/18982. Using the work around that was posted there allowed for things to work. The overall performance is actually better due to the fact there isn't any client side evaluation. However, I would still prefer if the behavior of Includes Optimized (which is pretty much what EF 2.x did, splitting the query) was something that came out of the box with EF Core. I also do not like it how this entire scenario can no longer be done with a no tracking query (or so it seems), as it was possible to do it before with EF Core 2.x.

Further technical details

EF Core version: 3.0.0 Database provider: Microsoft.EntityFramework.SqlServer Target framework: .Net Core 3.0 Operating system: Windows 10 x64 / Windows Server 2016 IDE: Visual Studio 2019 Pro (16.3)

KevinMallinson commented 4 years ago

This also happens with me: #18017

It seems it's happening to quite a few people. I know my query sucks, as do you in OP, but it would be a shame to have to revert an upgrade due this.

dragnilar commented 4 years ago

@KevinMallinson I'm not surprised, it's a pretty significant change. I was hoping when I first read about it, that there would be some way to opt-out of this new behaviour, but unfortunately, that doesn't seem to be the case.

divega commented 4 years ago

Some observations:

The main source of redundancy in a query like (and the higher potential for improvement if you decide to refactor the queries in your application) this comes from the traversal of collection navigation properties with high cardinality. So we are not talking about having to split a query into ~57 separate queries, but about judiciously partition it where it matters.
As @smitpatel implied on https://github.com/aspnet/EntityFrameworkCore/issues/17455#issuecomment-534629896, the best case scenario for the multiple query approach taken in 2.2 is that some of the intermediary results will come out empty, in which case not all paths need to be traversed and hence we end up not executing all queries.
However, in the worst case scenario for the multiple query approach, there is data on all paths and several of the queries generated will have similar complexity to the one query we generate in 3.0. There is no data redundancy in the results, but every table on each path needs to be joined and predicates applied several times.
Although the approach we take in 3.0 is similar to EF6 and is relatively well understood, there might be still things we can do to improve the performance that we haven't investigated. I think there is a small chance that some profiling of the queries and trying different combinations may help get better results.
I expect that we will keep our new design to generate a single SQL query from each LINQ query because of all the benefits, but to mitigate cases like this we can consider providing API to explicitly generate and execute all the queries. I can see a spectrum of solutions, ranging from a utility method that takes the original query and does this automatically, to some extension method that simplifies creating a query for a related data based on the original query.

dragnilar commented 4 years ago

@divega Having an API or extension to use the previous behavior would be good for those who don't have the bandwidth to rewrite large queries like this one. I personally would like to rewrite mine since it's inefficient and cumbersome to read, but as it stands I'll have to remain on 2.2.6 for a while longer. I'm sure others are in the same boat.

As far as refactoring it goes, I seldom use explicit loading, so excuse my unfamiliarity with it, but would I be able to use that in tandem with No Tracking to be able to obtain a copy of the entity that can have its keys changed before saving it again?

divega commented 4 years ago

@dragnilar you just made me realize there are two interesting points I didn't think of:

Any workaround that relies on explicit loading or executing multiple queries would require fixup behavior that only happens for tracked queries.
It is possible that some of the negative performance impact of 3.0 comes from the fact that we no longer perform identity resolution in non-tracked queries, which should result in many more objects being created. There are two things you could measure that may give us a hint here:
1. Execute the original LINQ query with tracking enabled
2. Execute the generated SQL directly in SQL Management Studio

dragnilar commented 4 years ago

@divega

Now I remember now that I needed the fixup behaviour, which was why I couldn't use explicit loading in the first place for this particular query.

I tried out what you suggested:

Running the query through EF with tracking enabled has it finish successfully. It takes about ~25 seconds on average.

Running the generated SQL (I obtained it via the profiler) in SSMS takes almost 2 minutes to finish.

The generated SQL is pretty large in this case, but it's not as large as when I have tracking turned off.

Shane32 commented 4 years ago

I am very glad of the new behavior. Many times I had written a query expecting it to run in a single call, when it makes multiple roundtrips to the database to retrieve what I know to be very small amounts of data. Now it is in my hands as a developer whether to make one call to SQL for all the joins, or make separate requests to the database to retrieve all the pertinent information. Rewriting a few complex queries seems a small price to pay for the benefits gained in the majority of queries, at least certainly going forwards.

Ideally, we could provide SQL-compilation hints to the query, where it would run those queries separately. Or maybe opt-in behavior for the query as a whole. However, I never found EF 2.2 to produce very efficient queries in this scenario, so I would hope for more optimized behavior.

dragnilar commented 4 years ago

@shane32 I agree it's better that it adhire to what is expected but at the time it was the seemingly recommended approach based on the documentation provided and what other resources recommended for the cloning scenario. Im fine with rewriting it, but as I said in the OP, I'd rather explore all options first before committing to a rewrite. I'm not sure what others will say, for me it's a pain, but not a super huge impact on my team since my code base uses pretty simple queries for all other operations.

kakone commented 4 years ago

we can consider providing API to explicitly generate and execute all the queries.

Please, provide an API because this behavior change is very problematic. We need to rewrite a lot of our queries in EF Core 3. I can understand this change but the problem is that I don't see how I can easily go back to the previous behavior. I can enable tracking but I must write a lot of queries to do a ThenInclude(...).ThenInclude(...).ThenInclude(...).ThenInclude(...).

erikmf12 commented 4 years ago

I would like to chime in and say that we are also having this issue, and is very problematic for us. An .AsMultipleSqlStatements() extension method would suffice for those of us who need faster execution.

DoughtCom commented 4 years ago

Same here, seeing how the multiple queries returned their own subset of results then were consolidated in the framework that resulted in 100s of results. My query is fairly basic, it's really just: Get me products by brand, their images, their options and those option's values.

Example: await _context.Products.Include("Images").Include("Options.Values").ToListAsync();

To my surprise, when I opened SQL Inspector when I started getting timeouts, I found out it was returning 150,000+ results when running the command manually.

erikmf12 commented 4 years ago

For anyone wondering how to 'hack' your way around this to get 2.2 speed on 3.0, follow the idea on this comment. Basically, you define an initial query to filter your results to the entity you want, and then create dummy variables with .includes to include the related entities you need.

Important: I have only tested if this returns the correct data, nothing else. If you need to do something more than reading/displaying, make sure you test first.

For example, I went from this:

var equipment = await _context.Equipment.Include(x => x.Group)
                .Include(x => x.SystemInfo).ThenInclude(x => x.SystemUsers)
                .Include(x => x.SystemInfo).ThenInclude(x => x.Frameworks)
                .Include(x => x.SystemInfo).ThenInclude(x => x.VideoCards)
                .Include(x => x.SystemInfo).ThenInclude(x => x.StorageDrives)
                .Include(x => x.SystemInfo).ThenInclude(x => x.Software)
                .Include(x => x.SystemInfo).ThenInclude(x => x.NetworkAdapters)
                .Include(x => x.SystemInfo).ThenInclude(x => x.Printers)
                .Include(x => x.Parts).ThenInclude(x => x.ChildrenParts)
                .Include(x => x.Parts).ThenInclude(x => x.ParentParts)
                .Include(x => x.Parts).ThenInclude(x => x.Vendor)
                .Include(x => x.MaintenanceHours)
                .Include(x => x.Attachments)
                .Include(x => x.Notes)
                .Include(x => x.Status)
                .Include(x => x.Area)
                .Include(x => x.EquipmentType)
                .Include(x => x.Department)
                .Include(x => x.PMaintenance)
                .Include(x => x.Request)
                .FirstOrDefaultAsync(x => x.EquipmentId == id);

to this:

var equip = _context.Equipment.Include(x => x.Group).Where(x => x.EquipmentId == id);
var equipment = await equip.Include(x => x.MaintenanceHours)
                .Include(x => x.Attachments)
                .Include(x => x.Notes)
                .Include(x => x.Status)
                .Include(x => x.Area)
                .Include(x => x.EquipmentType)
                .Include(x => x.Department)
                .Include(x => x.PMaintenance)
                .Include(x => x.Request).FirstOrDefaultAsync();

            var d2 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.SystemUsers).FirstOrDefaultAsync();
            var d3 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.Frameworks).FirstOrDefaultAsync();
            var d4 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.VideoCards).FirstOrDefaultAsync();
            var d5 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.StorageDrives).FirstOrDefaultAsync();
            var d6 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.Software).FirstOrDefaultAsync();
            var d7 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.NetworkAdapters).FirstOrDefaultAsync();
            var d8 = await equip.Include(x => x.SystemInfo).ThenInclude(x => x.Printers).FirstOrDefaultAsync();
            var d9 = await equip.Include(x => x.Parts).ThenInclude(x => x.ChildrenParts).FirstOrDefaultAsync();
            var d10 = await equip.Include(x => x.Parts).ThenInclude(x => x.ParentParts).FirstOrDefaultAsync();
            var d11 = await equip.Include(x => x.Parts).ThenInclude(x => x.Vendor).FirstOrDefaultAsync();

You should only need to do this for the entities that need a .ThenInclude(). It's not the best at all, but it allows me to still use .NET Core 3.0 without having to sacrifice load times.

dragnilar commented 4 years ago

@erikmf12 not sure if this is a full work around. I think we tried something like that and ran into a problem with the entity not having been fixed up since we need it to be non tracked so we can reset the primary keys before recomitting the entity back to the database as a copy of the original. Currently we're considering just writing a manual clone.

erikmf12 commented 4 years ago

@dragnilar True, I have not tested anything besides making sure all the data is included. In my case, I just need to display the data on a detail page.

DoughtCom commented 4 years ago

Figured I would reference the original change request for this feature on here so it would be tracked back on that request.

https://github.com/aspnet/EntityFrameworkCore/issues/12098

Shane32 commented 4 years ago

@dragnilar Correct, my workaround only works with tracked entities. You should be able to write a snippet to untrack ALL entities from a dbcontext, if that works for you. Something like this: (not sure if this is a EF6 or EF Core snippet)

public void DetachAllEntities()
{
    var changedEntriesCopy = this.ChangeTracker.Entries()
        .Where(e => e.State == EntityState.Added ||
                    e.State == EntityState.Modified ||
                    e.State == EntityState.Deleted)
        .ToList();

    foreach (var entry in changedEntriesCopy)
        entry.State = EntityState.Detached;
}

If that works for you, you can retrieve the data with tracking, then detach all tracking, then use your existing code to clone the data.

You could probably also write code to inspect any entity object, and utilize the IModel to recursively iterate through all of its navigation members, creating a list of all related entities, then detach them all. Should be relatively easy to do. With a little extra work, it could reset primary keys and foreign keys, which may be necessary as well.

dragnilar commented 4 years ago

@dragnilar Correct, my workaround only works with tracked entities. You should be able to write a snippet to untrack ALL entities from a dbcontext, if that works for you. Something like this: (not sure if this is a EF6 or EF Core snippet)
public void DetachAllEntities()
{
    var changedEntriesCopy = this.ChangeTracker.Entries()
        .Where(e => e.State == EntityState.Added ||
                    e.State == EntityState.Modified ||
                    e.State == EntityState.Deleted)
        .ToList();

    foreach (var entry in changedEntriesCopy)
        entry.State = EntityState.Detached;
}
If that works for you, you can retrieve the data with tracking, then detach all tracking, then use your existing code to clone the data.

You could probably also write code to inspect any entity object, and utilize the IModel to recursively iterate through all of its navigation members, creating a list of all related entities, then detach them all. Should be relatively easy to do. With a little extra work, it could reset primary keys and foreign keys, which may be necessary as well.

Edit: I was probably being too polite. In actuality, we did try this first when we saw what the problem was, since it was an old trick we all had learned back in our EF6 days. Unfortunately it did not work due to a bug (see other comments I made). When we finally found out about the bug and the recommended work around though, the whole thing worked. Hence, the advice was appreciated though. However your remark about IModel did not mean anything to me. Also, I could not find any significant documentation of it besides some vague statements over at MSDN. 😑 It sounds cool but I didn't want to get wrapped up in chasing something blindly; alas I left it alone.

smitpatel commented 4 years ago

Triage decision:

We decided that single query mode is right decision we have made and we will keep the behavior
We will look into ways if we can improve perf of single SQL scenario.
Adding a hook which allows EF to generate split queries is infra costly and it has same bugs which made us change the behavior in first place. So we are not looking to add such a hook.
Due to being expression tree, and extension method which does processing would be much more expensive (perf & implementation cost both) than user rewriting the same query in split themselves. Linq is easy to write in code but hard to generate dynamically.
We will document how to rewrite the query to do split queries similar to how previous version did including stitching in tracking/non-tracking both scenarios.
Depending on complexity of above, we may look into provide some API to user where they can pass multiple queries and we would stitch the result. (Perhaps by new-ing up a context behind the hood to run multi-queries.

dragnilar commented 4 years ago

@smitpatel I think the last point would probably provide the best flexibility, but I'll understand if your team decides against it. In the least, the documentation definitely needs to be updated so others can avoid running into this situation.

sitepodmatt commented 4 years ago

I think this is quickly going to become the most frequent issue/question with EF Core 3.x once people to start to transition and have the WTF moment, and indeed whilst my EAV model is mandated by a legacy system and is a questionable design choice, this new restriction affects even the most simple scenarios - the canonical Blog/Tags/Comments/Upvotes. If we are really restricted to realistically 2 or 3 to .Include 1-to-many relations and/or carry the cognitive burden of computing the avg number of related rows to power of number of .Include so we don't stream too much redundant information then I can't see how EF Core is any better than the micro ORMs for the query side.

I understand the argument that splitting the queries can potentially carry consistency risks that aren't clear when not used with the right transaction modes and so wasn't perfect fit, however it was pragmatic and this strive for the right choice seems like a strive for purity rather than what the consumers of EF Core 3.0 desired.

dragnilar commented 4 years ago

@sitepodmatt Indeed, it was a pretty nasty surprise to see things like this with EF Core 3.0. As I implied, I am in a position where I can go either way and see both sides. And also for what it is worth, me and my team did notice certain areas perform better with 3.0 versus 2.6. Unfortunately, with the number of breaking changes (such as this) and the other issues that have cropped up on here over the past couple of days, we have decided to table upgrading to EF 3.0 for the time being.

Also, I think for the sake of backwards compatibility, they should have left in a hook or some type of flag on the context that you could toggle to use multi-query mode vs single query mode. I believe that would have been more pragmatic.

I almost wonder if @roji was being semi-prophetic when he said this on #12098:

Agreed, but we're not talking about a bad query, but rather about what happens with the default, standard EF Core join construct. At the end of the day, if the proposal is that EF Core translate the standard join syntax (Include()) via an SQL construct that results in extremely bad performance for the common scenario, then I think everyone will end up losing.

I doubt "everyone" lost with this design change. But at this stage I think it's too early to say definitively what was the net gain/loss. I am under the impression that most devs are not rushing to upgrade to Core 3.0, so it will take some time.

Shane32 commented 4 years ago

Just realize that there are a lot of devs, like me, that felt that EF Core 1.x / 2.x was a big step backwards from EF6 where in EF Core 2.x there is no way to issue certain queries without them being broken up into multiple queries. I can provide samples of seemingly simple queries on EF Core 2.x that would run a subquery individually for EVERY result in the primary table, totally breaking performance! I've had to write and rewrite queries many times to get EF 2.x to produce the desired SQL. For us, EF 3.0 is a pleasant relief, and while it would be nice if there was a 'multi-query mode' for EF 3.0, I'd rather take the 3.0 behavior over the 2.x behavior any day.

Ideally, I'd like to see a parameter added to Include to indicate multi-query behavior (e.g. .Include(x => x.Images, separateQuery: true)), plus an .AsSeparateQuery() method for splitting custom joins, but I understand that this might be a great deal of work. Hopefully an enhancement will make it into EF Core 5 to help with these scenarios.

dragnilar commented 4 years ago

@Shane32 I'm in agreement with you there, A LOT of the changes in EF 1.x/2.x were weird at first but me and the rest of the guys I work with got used to it. And again, we definitely saw improvements in a lot of the much simpler queries in the rest of our code base after the testing with EF 3, so I can see where the change back to "all at once" has a lot of benefit.

That being said, it definitely seems like there are others who are impacted by this change and they can't take advantage of the benefits of upgrading. And I imagine that this issue isn't the only factor, since there are a number of other breaking changes in EF 3.0.

Edit: Also, that would be more than ideal if we had API functionality like that added to EF (being able to adjust bits and pieces of the Linq chain as separate or whole query). But yes, as you said, I can also imagine that would be a lot of work due to the complexity of handling prioritization of queries parts, etc.

DoughtCom commented 4 years ago

If we're talking a lot of joins or anything complex I feel like EF shouldn't be used with LINQ but with something like a SQL View or Stored Procedure, especially when performance is of concern.

With that said, imagine my scenario:

I'm wanting to get back 20 products
Each with 5 Lifestyle Images
Each with 5 options
Each with 20 Option Values

Nothing crazy, nothing complicated, just asking for 3 includes. In EF 2.2 that would return 50 accumulative results. In EF 3.0 that would return 10,000 results. In a real world example using actually data where not every product has 5 images, 5 options and 20 values. It went from approximately 1000 results (the EF 2.2 way) to approximately 3500 results for a random brand I just tested.

Now keep in mind, that's someone using a statement like I posted above. await _context.Products.Include(x => x.Images).Include("Options.Values").Where(x => x.BrandId = 'xxx').ToListAsync();

I assume most people will be fine using this? And perhaps my use-case is more of an edge-case, however it seems like anything with an exponential join can be bad for performance. I came from using EF 6.0 and never ran into what I'm running into now, so maybe it is an edge-case.

In my situation I'm going to have to rewrite my entire Service/Repository layer. Because we pass in the includes to the repository so that each service request is only getting information relevant to what it needs, rather than wasteful includes. This will all have to be rewritten before going to EF 3.0, unless there's a way to change how it creates the statements.

With all of that said, I'm sure you all had a reason to do this, especially if it's proven to work well in EF 6.0 and if the frameworks are going to merge in the future that's just another reason. I also realize that doing a single async call for the new Async Streams that's new in core 3.0 is probably another good reason to go to a single query.

Shane32 commented 4 years ago

@doughtcom For simple queries like those, there's a simple workaround shown here https://github.com/aspnet/EntityFrameworkCore/issues/18017#issuecomment-535763068 that should work well for you without rewriting much code. Just write the code like this:

var products = _context.Products.Where(x => x.BrandId = brandId);
var ret = await products.Include(x => x.Images).ToListAsync();
var dummy1 = await products.Include("Options.Values").ToListAsync();
return ret;

This will split the SQL access into two queries, most importantly without returning duplicates of the images. No other code needs to change in your service layer (for this database call), as EF will stitch the models back together, and ret will have all the results (including both images and options/values).

smitpatel commented 4 years ago

I will be posting a detailed work-around later. For easier work-around in tracking queries, you need to rewrite in following way to get exactly same queries as before. Tracking is necessary part to do fix-up. Doing fix-up manually is not impossible but slightly more involved.

Support we have following entities & navigations in the model

Customer
Customer.Address (reference nav)
Customer.Orders (collection nav)
Order.OrderDetails (collection nav)
Order.OrderDiscount (reference nav)

You would need to rewrite only for collection navigations, reference navigations can be part of same query.

var baseQuery = db.Customers.Include(c => c.Address).Where(c => c.CustomerName == "John");
var result = baseQuery.ToList(); // Or async method, If doing FirstOrDefault, add Take(1) to base query
baseQuery.Include(c => c.Orders).ThenInclude(o => o.OrderDiscount).SelectMany(c => c.Orders).Load();
baseQuery.SelectMany(c => c.Orders).SelectMany(o => o.OrderDetails).Load();

This will generate 3 queries to server. It would be slightly more optimized SQL then what EF Core 2.2 generated. And StateManager will fix up navigations. It also avoids duplicating any records coming from server to client.

Generated SQL:

// Customer Include Address
SELECT [c].[Id], [c].[CustomerName], [a].[Id], [a].[City], [a].[CustomerId]
FROM [Customers] AS [c]
LEFT JOIN [Address] AS [a] ON [c].[Id] = [a].[CustomerId]
WHERE ([c].[CustomerName] = N'John') AND [c].[CustomerName] IS NOT NULL

// Order Include Order discount
SELECT [o].[Id], [o].[CustomerId], [o].[OrderDate], [o0].[Id], [o0].[Discount], [o0].[OrderId]
FROM [Customers] AS [c]
INNER JOIN [Order] AS [o] ON [c].[Id] = [o].[CustomerId]
LEFT JOIN [OrderDiscount] AS [o0] ON [o].[Id] = [o0].[OrderId]
WHERE ([c].[CustomerName] = N'John') AND [c].[CustomerName] IS NOT NULL

// OrderDetails
SELECT [o0].[Id], [o0].[OrderId], [o0].[ProductName]
FROM [Customers] AS [c]
INNER JOIN [Order] AS [o] ON [c].[Id] = [o].[CustomerId]
INNER JOIN [OrderDetail] AS [o0] ON [o].[Id] = [o0].[OrderId]
WHERE ([c].[CustomerName] = N'John') AND [c].[CustomerName] IS NOT NULL

Edit: If you do not have CLR navigation to use in SelectMany then you can use EF.Property to reference the collection navigation.

@Shane32 - Your work-around avoids large result set upto an extent but still doing Customer include with Orders will repeat customers in the records. By using SelectMany & selecting Customer separately you remove all duplication.

I have also linked this post from top.

Shane32 commented 4 years ago

@smitpatel ~~With many-to-many relationships, where EF is creating a hidden mapping table, will the suggested method of using SelectMany still work?~~ Not supported by EF Core, so doesn't matter Also, if using FirstOrDefault / Skip / Take, would not the baseQuery be required to be selected, or else only the first ## records of the SelectMany be returned?

smitpatel commented 4 years ago

@Shane32 - Slightly updated my code. The thing is, you would apply Skip/Take to first query and do SelectMany afterwards, so you are not selecting first ## of Orders but rather orders for first ## customers.

smitpatel commented 4 years ago

@Shane32 - M2M is not supported but the navigation may not always be public to use in SelectMany. Using EF.Property in select many also works. Edited the post to add that too.

jostapotcf commented 4 years ago

Hey folks. I must confess I am having a moment of cognitive dissonance with regards to the new Single Query behavior. I understand why we want low roundtrips and how we could argue it improves performance... but the attached zip has a pretty stark example of why EF 3.0 isn't going to be particularly useful for dealing with object graphs.

No Code Changes
Via Nuget, upgraded from EF 2.2 to 3.0
Ran the scenario both ways with SQL Profiler, copy pasted results into SSMS and saved the result sets to compare payload size.

Salient Query

             var result = db.FanProducts.Where(x => x.FanProductId == key)
                    .Include(x => x.FanModel)
                        .ThenInclude(x => x.Amca208fanType)
                    .Include(x => x.FanSize)
                    .Include(x => x.FanClass)
                    .Include(x => x.FanArrangement).ThenInclude(x => x.FanDriveMethod)
                    .Include(x => x.FanDischarge)
                        .ThenInclude(x => x.FanDischargeDirection)
                    .Include(x => x.FanProductAirPerformances)
                    .Include(x => x.InducedFlowFanProperty)
                    .Include(x => x.FanProductPerformanceCorrections)
                        .ThenInclude(y => y.FanPerformanceCorrection)
                        .ThenInclude(x => x.FanPerformanceCorrectionConstants)
                    .Include(x => x.FanProductPerformanceCorrections)
                        .ThenInclude(x => x.FanProductPerformanceCorrectionVariables)
                    .Include(x => x.FanProductImpellers)
                        .ThenInclude(x => x.Impeller)
                            .ThenInclude(x => x.ImpellerType)
                    .Include(x => x.FanProductImpellers)
                        .ThenInclude(x => x.Impeller)
                            .ThenInclude(x => x.Material)
                    .Include(x => x.FanProductImpellers)
                        .ThenInclude(x => x.Impeller)
                            .ThenInclude(x => x.ImpellerRpmderateSet)
                                .ThenInclude(x => x.ImpellerRpmderateFactors)
                    .Include(x => x.FanProductCertifications)
                        .ThenInclude(x => x.Certification)
                            .ThenInclude(x => x.CertifyingBody)
                    .Include(x => x.FanProductSpeedControlRpms)
                    .Include(x => x.FanProductMotorNameplates)
                    .FirstOrDefault();

EF v	Rountrips	Total Payload Size
2.2	10	~11.0KB
3.0	1	~1.22GB

Mileage is going to vary incredibly and exponentially based upon the number of Includes. But look at that payload difference... holy null values resulting in a hot mess batman. (the nulls causing a giant amount of not null data tagging along just so we can say 'yup... that recordset is empty')

I get it. Some people are OCD about roundtrips, and for small apps, it's probably fine... possibly better.

Certainly for our use case, which I would characterize as somewhat unique (read-only, highly mathematical with thousands of discrete and variable graphs), this recent change is a deal breaker. Even with clever stitching. In fact, if we have to tap 'off the beaten path' EF/Linq functionality to defeat the "Single Query" feature, are we truly better off? I'd have to say at that point, the likes of Dapper and hand-rolled serialization and fixup would the more responsible and efficient course of action.

Our API on EF2.2, from a total cold start and unprimed caches can turn the request out in ~2.5 seconds for 38 seed graphs. With EF 3.0, we can't complete 1 before the SqlConnection timeout kicks in and bombs the lot of it (what's that? 30s? 60s?)

Please give serious consideration to at least adding hooks to disable "Sinqle Query" dispatch and using the old approach. (or something that you can inject between Includes to allow manual tuning of the roundtrips. Think .Include().Go().Include().ThenInclude().Go().Include().ToList() )

-joel

PS> I'm happy to give anyone a personal demo if the zip is in doubt.

2.2v3.0_ScriptsAndResults.zip

dragnilar commented 4 years ago

@smitpatel I am concerned by this suggested workaround in the scenario of cloning a large entity. Can you elaborate on what you mean by performing the fix up manually? I presume you mean taking the entities retrieved from the query and manually setting the relationships via the navigation properties?

Also, for what it is worth: One of us had some free time this week and did try testing a multi-part tracking query and then detaching the entities by manipulating the change tracker, but the one who tested it told me it was giving them strange exceptions on save.

I'd have to get more details from the one who tested it, but they think they were running into the problem on https://github.com/aspnet/EntityFrameworkCore/issues/17997 (which btw, as I noted in that issue, affected some other spot in our codebase. Unfortunately, running into that problem made us decide that we cannot upgrade to any 3.X versions until it is verified as fixed).

(Edit: I spoke to them briefly after posting this, they said they were getting the exceptions saying the key already existed even though the ids had been changed before running SaveChanges).

smitpatel commented 4 years ago

I presume you mean taking the entities retrieved from the query and manually setting the relationships via the navigation properties?

The manual stitching is not about letting ChangeTracker do the fixup and then detach the entities. It is about doing that task in the code without involving EF or ChangeTracker.

dragnilar commented 4 years ago

I presume you mean taking the entities retrieved from the query and manually setting the relationships via the navigation properties?

The manual stitching is not about letting ChangeTracker do the fixup and then detach the entities. It is about doing that task in the code without involving EF or ChangeTracker.

@smitpatel Yes, that is what I meant. Setting the relationships for the objects manually in code, doing what the ChangeTracker would have done for us. Thanks for clarifying that.

At this point I think we will probably end up doing what we were trying to avoid doing in the first place - writing our own Clone() methods. The more I think about this, I am not sure if this (out of the box support for cloning huge entities) was really something your team intends to support right now, unless I am mistaken?

BartoszWiszniewski commented 4 years ago

Problem is orderby on every field instead like in 2.2.6 by id only changing orderby in ganareted sql fixes issue, but still i have felling that queries are slower in 3.0 than in 2.2.6 even when orederby changed

dragnilar commented 4 years ago

@BartoszWiszniewski You bring up an interesting point. DevExpress recently did a benchmark that shows that EF Core 3.0 is marginally slower in a lot of cases than 2.2.x:

https://github.com/DevExpress/XPO/tree/master/Benchmarks

It looks like in cases of doing deletes and inserts, the speed improved.

I'm hoping some other benchmarks become available soon to provide comparison/contrast.

Also... I did notice performance improvements when we switched to Net Core 3, which I first thought was due to EF, but even with EF 2.6 in use with the rest of the software using Core 3, the gains appear to have remained.

NetTecture commented 4 years ago

This is actually an important point. Data set size aside (1.22gb is what - 0.15 seconds on a modern backend network), the order by set is ridiculous and seems to speak from bad optimization. And that really puts all that data down into tempdb for a very heavy order by. THAT definitely must be fixed. It is broken SQL being WAY over protective on the required sorts, it seems.

jostapotcf commented 4 years ago

@NetTecture If it were in a format that was ready to be spooled over the network, sure it could be streamed that fast. But the SQL Server's got to do a lot of work serializing that, and then the client has to de-serialize it, all before EF ever even sees it.

The SQLTimeout would indicate that the TDS stream was simply taking too long (ie, EF never even saw the results)

(This was all running on my local machine with ample specs. the situation would be much worse hosting up on azure when DTU throttles come in to play)

NetTecture commented 4 years ago

Nope, not the way I read it ;)

I think the real culprit - assuming a modern high end network, not the 1gb end user stuff many companies use - is NOT the size but....

there is mention of a LOT of orderby hat is not really needed. OrderBy means that:

streaming can only start once order has been established and
1.22gb sorting in tempdb is something MOST servers will struggle with.

It would require some investigating (not that hard, hook into sql server and debug the sql statement) but I would bet we see some serious deloays before straming starts including overloading, possibly, the tempdb space.

jostapotcf commented 4 years ago

I'll grant ORDER BY is likely the biggest culprit in the SQLConnection timeout. Let's say the ORDER BY is instantaneous.

That's has to then get serialized over the wire to the client into memory, unpacked, POCO'ified, and finally fixed up.

I should have recorded the row count difference. I'll wager it is in the order of 10000:1 of useless data to useful.

Even if the SQLCmd and TDS Streaming happened instantly, we'd still end up wrecked by EF's time spend wrapping the results in objects (by virtue of having to contemplate a lot of useless rows)

dragnilar commented 4 years ago

@jostapotcf When we ran into this problem in our environment, we were using a test database running on a separate server that was not on the same machine. As you probably saw in my reply up above, it took too long for it to run in even SSMS (I think it takes like 2-3 minutes for the whole result to finish). The situation was pretty similar to yours, judging from the sample query you provided.

And like you said, I am concerned about the entirety of this change when it comes to dealing with large object graphs. I was under the impression that switching from single query to multi-query with EF Core initially was to get away from the Cartesian explosion problem... and now we're back to where we were before with classic EF. I find this odd.

smitpatel commented 4 years ago

I have create a sample app which shows a way to rewrite a multiple level collection include in LINQ to do split SQL queries and stitch that up. You can access the app as gist here https://gist.github.com/smitpatel/d4cb3619e5b33e8d9ea24d3f2a88333a

I have added code comments for better understanding. Feel free to ask if any questions. The sample covers tracking/non-tracking queries with buffering/streaming both cases. The sample is written for sync query but same can be done with async queries easily. You can also compose the base query to apply filters/orderings/paging.

dragnilar commented 4 years ago

@smitpatel Thanks for posting that gist, we may try something out like that once we feel like we are in a good spot to try upgrading to 3.x (probably 3.1 at this point).

Edit: @smitpatel One question that came to my mind (it's kind of outside the scope of this, sorry): back in my EF6 days, I remember another strategy that I saw being used for cloning large object graphs was using lazy loading and writing manual cloning procedures to allow one to maintain control over what parts of the graph that are cloned and which parts do not. Do you foresee any risks with doing that in EF Core 3.x?

smitpatel commented 4 years ago

@dragnilar - I don't know answer to that. Can you file a new issue for visibility to others?

smitpatel commented 4 years ago

Closing this issue as a simple way to rewrite the query has been posted. To document it in our docs is being tracked at https://github.com/aspnet/EntityFramework.Docs/issues/1854 To improve overall perf of query is being tracked at https://github.com/aspnet/EntityFrameworkCore/issues/15892 Providing a method which would do fix-up needs to create new context behind the scene anyway as the issue with shadow FK to connect objects. Which users can do themselves right now too.

dragnilar commented 4 years ago

@smitpatel Sure, I can do that. I appreciate the honesty.

franzo commented 4 years ago

Providing a method which would do fix-up needs to create new context behind the scene anyway as the issue with shadow FK to connect objects. Which users can do themselves right now too.

Thanks for the example @smipatel. Is there any intention to provide a general method to do fix-ups for AsNoTracking queries? Maybe something along the lines suggested by @jostapotcf above?

smitpatel commented 4 years ago

@franzo - No there is no intention as it defeats the purpose of the design. EF Core doing split queries had bugs and incorrect results in some cases. The sample app I have linked has a way to do reconnection for no-tracking queries too. If you want to have a hook to do split queries then just initialize new DbContext and do tracking query inside of it and throw away the context afterwards.

sitepodmatt commented 4 years ago

EF Core doing split queries had bugs and incorrect results in some cases.

Can you outline some examples out of interest? I ponder this briefly and I think most of them can be mitigated with the correct transaction modes.

smitpatel commented 4 years ago

Can you outline some examples out of interest? I ponder this briefly and I think most of them can be mitigated with the correct transaction modes.

12098 captures the discussion over the design which should satisfy anyone's interest.

roji commented 4 years ago

Specifically for the transactionality, see https://github.com/aspnet/EntityFrameworkCore/issues/9014 as well.

dotnet / efcore

Significant Query Slowdown When Using Multiple Joins Due To Changes In 3.0 #18022

This older issue has been re-purposed to be discussion about different approaches, etc. It will not be locked because it is important to us that we don’t shut down channels of communication with the community.

Steps to reproduce

Results:

Further technical details

12098 captures the discussion over the design which should satisfy anyone's interest.