dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.66k stars 3.15k forks source link

AsSplitQuery sporadically missing navigation properties #25556

Closed mareklinka closed 1 year ago

mareklinka commented 3 years ago

This is not going to be much of a bug report because I'm currently unable to reliably reproduce the issue described but I wanted to maybe get a conversation started that might point me in the right direction.

We use the EF Core 5.0.8 against MS SQL Server 2016 setup in a synchronized cluster.

The situation is as follows: we have a table which for each "entity" contains two rows - a baseline and a modified version. Since the two use the same schema, they are in the same table. There is no self-referencing key binding the original to the modified item.

We had a performance issue with a specific query (the table has a lot of columns + several child collections) so we decided to enable query splitting. Since then, we sporadically (< 5% of executions) get into a situation where one of the child collections fails to load, leading to data loss down the line.

As I said, I'm currently unable to replicate this anywhere except our production environment. The query in question looks like this:

var original = _context.VehicleInspections.Filter(...)
                .Include(i => i.Motors) // collection
                .Include(i => i.Vehicle) // single
                .Include(i => i.Obligations) // collection
                .Include(i => i.ServiceItems) // collection
                .Where(i => i.RecordType == InspectionRecordType.Original);

var modified = _context.VehicleInspections.Filter(...)
                .Where(i => i.RecordType == InspectionRecordType.Modified)
                .Include(i => i.Motors)
                .Include(i => i.Vehicle)
                .Include(i => i.Obligations)
                .Include(i => i.ServiceItems)
                .Where(i => i.State == InspectionState.Closed)
                .Where(i => !i.Acknowledged);

var leftJoin = from mod in modified
                join orig in original on new { mod.CaseId, mod.SequenceNumber } equals new { orig.CaseId, orig.SequenceNumber }
                    into joined
                from o in joined.DefaultIfEmpty()
                select new InspectionPair { Original = o, Modified = mod };
var inspections = await leftJoin.ToArrayAsync(cancellationToken);

This produces a left self-join over the table in question. The query split is performed correctly and the subqueries look reasonable to my eye. However, still the Obligations collection on the Modified entity sometimes remains empty even though we know there are child entities present. Without query splitting, this works as expected via a single large query.

Since I cannot easily produce a repro code, I guess I just wanted to task - is there anything suspicious in this code that could cause such behavior? I noticed there are some navigation expansion bugs being tracked for 5.x but this issue seems to be pop up non-deterministically - most executions still work as expected (or at least we haven't been able to isolate the cause yet).

mareklinka commented 3 years ago

I attempted to isolate the problem from our main code base and I'm now reliably able to repro the situation. It's still a non-deterministic problem (it doesn't occur every time I ran the problematic query) but I encounter several occurrences within minutes of running the repro code.

Repro code (console app, .NET Core 3.1, SDK 3.1.412, EF Core 5.0.8):

var loggerFactory = LoggerFactory.Create(c => c.AddSerilog().AddConsole().AddFile("debug.log"));
var log = loggerFactory.CreateLogger("Program");

var cs = "PRODUCTION CS";

var dbOptions = new DbContextOptionsBuilder<DbContext>().EnableDetailedErrors().EnableSensitiveDataLogging().LogTo(m => log.LogInformation(m)).UseSqlServer(cs);

while (true)
{
    log.LogInformation("Loop start");

    using var db = new DatabaseContext(dbOptions.Options);

    var original = db.VehicleInspections.FilterCanton(CONSTANT)
        .Include(i => i.Motors)
        .Include(i => i.Vehicle)
        .Include(i => i.Obligations)
        .Include(i => i.ServiceItems)
        .Where(i => i.RecordType == InspectionRecordType.Original);

    var modified = db.VehicleInspections.FilterCanton(CONSTANT)
        .Where(i => i.RecordType == InspectionRecordType.Modified)
        .Include(i => i.Motors)
        .Include(i => i.Vehicle)
        .Include(i => i.Obligations)
        .Include(i => i.ServiceItems)
        .Where(i => i.State == InspectionState.Closed)
        .Where(i => !i.Acknowledged);

    var leftJoin =
        from mod in modified
        join orig in original on new { mod.CaseId, mod.SequenceNumber } equals new { orig.CaseId, orig.SequenceNumber }
            into joined
        from o in joined.DefaultIfEmpty()
        select new InspectionPair { Original = o, Modified = mod };

    var inspections = await leftJoin.AsSplitQuery().ToArrayAsync();

    foreach (var pair in inspections)
    {
        if (pair.Modified.Obligations.Count == 0)
        {
            var actualCount = db.Obligations.Count(_ => _.VehicleInspectionId == pair.Modified.Id);

            if (actualCount > 0)
            {
                log.LogCritical($"Bad query detected for {pair.Modified.CaseId}, {pair.Modified.SequenceNumber} ({pair.Original.Obligations.Count} vs. {pair.Modified.Obligations.Count})");
            }
        }
    }

    await Task.Delay(15000);
}

This setup produces a ton of logs, including the SQL statements and which entities are being tracked by the context, e.g.

Context 'DatabaseContext' started tracking 'Obligation' entity with key '{Id: 2318357}'. (52b19afe)

And that's where I THINK (from what I can gather) lies the problem. It seems that under certain conditions, the entity tracker fails to either materialize or start tracking certain entities. In those cases, there are Obligations in the database that do not show in the Context 'DatabaseContext' started tracking 'Obligation' entity with key '...'.

I can see two situations where this could happen. Either the SQL server doesn't SEND the obligations or EF loses them somehow. My assumption is that the first is highly unlikely, therefore this report.

To confirm, I ran the code without AsSplitQuery for the whole day yesterday and received 0 issues. The behavior is almost certainly related to split queries and how EF puts the results back together.

ajcvickers commented 3 years ago

@mareklinka Please include the code for your entity types and DbContext so that we can run this and reproduce the issue.

mareklinka commented 3 years ago

I reduced the code to the bare minimum that still reproes the issue in our production. Please see the attached archive containing the minimal console app. There is also a file called reproDatabaseScript.sql that will create a testing database, including some sample data.

However: With the attached code, I was able to reproduce the problem in our production environment. I was not able to reproduce it anywhere else so far. The testing database might not be capable of reproducing the problem for various reasons. Our production DB contains orders of magnitude more data and is part of a 2-server always-on availability cluster. Neither of these factors are currently available to me so I cannot try to isolate either of them. Production is running SQL Server 2016 (13.0.5888.11)

ReproCode.zip

I will continue digging into this to see if I can isolate the problem further.

mareklinka commented 3 years ago

Tomorrow I will attempt to determine whether the SQL Server actually delivers the "missing" data to EF. I wrote a DbCommandInterceptor that reads the content of the data reader and logs the IDs of all the obligations within. This should, hopefully, confirm whether the problem lies with EF or the SQL Server.

mareklinka commented 3 years ago

Tried the interceptor but then I realized that my testing methodology was flawed, mainly due to how our system operates. I still don't know exactly where the issue is coming from but now I'm reasonably sure it's our application or environment, not EF. And since we have a fix/workaround available (don't use AsSplitQuery for this query) I will not be pursuing this further.

Sorry for the noise and using GH as a rubber duck. Closing.

ajcvickers commented 3 years ago

@mareklinka No problem; thanks for diligently pursuing this on your side.

AppleRyan commented 2 years ago

Met with the same problem in our prod environments when using splitting query.

This is not going to be much of a bug report because I'm currently unable to reliably reproduce the issue described but I wanted to maybe get a conversation started that might point me in the right direction.

We use the EF Core 5.0.8 against MS SQL Server 2016 setup in a synchronized cluster.

The situation is as follows: we have a table which for each "entity" contains two rows - a baseline and a modified version. Since the two use the same schema, they are in the same table. There is no self-referencing key binding the original to the modified item.

We had a performance issue with a specific query (the table has a lot of columns + several child collections) so we decided to enable query splitting. Since then, we sporadically (< 5% of executions) get into a situation where one of the child collections fails to load, leading to data loss down the line.

As I said, I'm currently unable to replicate this anywhere except our production environment. The query in question looks like this:

var original = _context.VehicleInspections.Filter(...)
                .Include(i => i.Motors) // collection
                .Include(i => i.Vehicle) // single
                .Include(i => i.Obligations) // collection
                .Include(i => i.ServiceItems) // collection
                .Where(i => i.RecordType == InspectionRecordType.Original);

var modified = _context.VehicleInspections.Filter(...)
                .Where(i => i.RecordType == InspectionRecordType.Modified)
                .Include(i => i.Motors)
                .Include(i => i.Vehicle)
                .Include(i => i.Obligations)
                .Include(i => i.ServiceItems)
                .Where(i => i.State == InspectionState.Closed)
                .Where(i => !i.Acknowledged);

var leftJoin = from mod in modified
                join orig in original on new { mod.CaseId, mod.SequenceNumber } equals new { orig.CaseId, orig.SequenceNumber }
                    into joined
                from o in joined.DefaultIfEmpty()
                select new InspectionPair { Original = o, Modified = mod };
var inspections = await leftJoin.ToArrayAsync(cancellationToken);

This produces a left self-join over the table in question. The query split is performed correctly and the subqueries look reasonable to my eye. However, still the Obligations collection on the Modified entity sometimes remains empty even though we know there are child entities present. Without query splitting, this works as expected via a single large query.

Since I cannot easily produce a repro code, I guess I just wanted to task - is there anything suspicious in this code that could cause such behavior? I noticed there are some navigation expansion bugs being tracked for 5.x but this issue seems to be pop up non-deterministically - most executions still work as expected (or at least we haven't been able to isolate the cause yet).

Sergey-Nosov-CloudFit commented 2 months ago

Also faced a similar issue. We had a very wide and ugly query, with 113 includes. I am not kidding, 113. The query was bad, but that is beside the point; it worked 99.9% of the time.

On very rare occasions, it would lose data silently; as if you forgot to include an include.

Our developers could not reproduce the issue locally. We could only reproduce it by running in an Azure function, under production-size load for a few hours. This could be an indication that the amount of available memory is a factor.

The query used AsSplitQuery, originally without AsNoTracking. Adding AsNoTracking did not help.

Using Entity Framework Core 8.0.6 and Azure SQL Database. I am sorry, I do not have a reproducible example for you.

adebelyi commented 3 weeks ago

@Sergey-Nosov-CloudFit https://github.com/dotnet/efcore/issues/33826

roji commented 3 weeks ago

On very rare occasions, it would lose data silently; as if you forgot to include an include.

Note that unless explicitly opt into a higher isolation level, split query is inherently prone to race conditions which may cause data corruption; this is called out in the documentation.

adebelyi commented 3 weeks ago

On very rare occasions, it would lose data silently; as if you forgot to include an include.

Note that unless explicitly opt into a higher isolation level, split query is inherently prone to race conditions which may cause data corruption; this is called out in the documentation.

Author of issue i refered to said "I found this scenario a little too unexpected". And in his case I agree with him.

After migration from EF Core 2.0 to EF Core 8.0 I faced similar issue. The first one EF Core used split queries by default without this issue.

Maybe you have to look again at the issue 33826 to see case