Translation of impure functions can result in wrong queries

ranma42 commented 6 months ago

The translation pipeline assumes that SQL expressions are pure, which is not true in general (for example EF.Functions.Random()). Under this assumption, it sometimes duplicates sub-expressions which can lead to inconsistent results, exceptions and (a minor issue, but in some cases still relevant) degraded performance.

See #32519 for a related issue that is specific to the translation performed by the SQL Server provider.

An example program that showcases the bug is:

using System;
using System.Data;
using System.Linq;
using Microsoft.EntityFrameworkCore;

using var db = new BloggingContext();

db.Database.EnsureDeleted();
db.Database.EnsureCreated();

var urls = db.Blogs
    .Select(x => (EF.Functions.Random() > 0.5 ? null : x.Url).Contains("5"))
    .ToList();

public class BloggingContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options
            .LogTo(Console.WriteLine, Microsoft.Extensions.Logging.LogLevel.Information)
            .UseSqlite($"Data Source=test.db");

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        for (var i = 1; i < 100; i++) {
            modelBuilder.Entity<Blog>().HasData(new Blog { BlogId = i, Url = $"url/{i}" });
        }
    }
}

public class Blog
{
    public int BlogId { get; set; }
    public required string Url { get; set; }
}

Exception

The program terminates with the following exception:

fail: 05/23/2024 04:25:52.836 CoreEventId.QueryIterationFailed[10100] (Microsoft.EntityFrameworkCore.Query) 
      An exception occurred while iterating over the results of a query for context type 'BloggingContext'.
      System.InvalidOperationException: Nullable object must have a value.
         at lambda_method19(Closure, QueryContext, DbDataReader, ResultContext, SingleQueryResultCoordinator)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.Enumerator.MoveNext()
Unhandled exception. System.InvalidOperationException: Nullable object must have a value.
   at lambda_method19(Closure, QueryContext, DbDataReader, ResultContext, SingleQueryResultCoordinator)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.Enumerator.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Program.<Main>$(String[] args)

because it is running the query

      SELECT CASE
          WHEN abs(random() / 9.2233720368547799E+18) > 0.5 THEN NULL
          ELSE "b"."Url"
      END IS NOT NULL AND instr(CASE
          WHEN abs(random() / 9.2233720368547799E+18) > 0.5 THEN NULL
          ELSE "b"."Url"
      END, '5') > 0
      FROM "Blogs" AS "b"

The two random() calls are evaluated independently, hence it is possible for this query to return NULL values (which IIUC the shaper/materializer rightly does not expect).

Include provider and version information

EF Core version: 8.0.5 Database provider: Microsoft.EntityFrameworkCore.Sqlite Target framework: .NET 8.0 Operating system: Linux (/WSL) IDE: Visual Studio Code 1.89.1

roji commented 6 months ago

See a somewhat similar problem with SQL Server's COALESCE, in #32519.

Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.

In any case, EF doesn't currently know which functions are pure and which aren't; this is something we could add to SqlFunctionExpression, and then possibly fail translation if it requires duplication. Of course, if we can avoid duplicating altogether that's ideal, but there are definitely cases (such as the null semantics duplication above) where it's difficult to imagine things otherwise (we may be able to do something with CTEs once we support them).

Any other thoughts on this @ranma42?

ranma42 commented 6 months ago

See a somewhat similar problem with SQL Server's COALESCE, in #32519.

Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.

In any case, EF doesn't currently know which functions are pure and which aren't; this is something we could add to SqlFunctionExpression, and then possibly fail translation if it requires duplication. Of course, if we can avoid duplicating altogether that's ideal, but there are definitely cases (such as the null semantics duplication above)

AFACT the translation for the example query should be:

      SELECT instr(CASE
          WHEN abs(random() / 9.2233720368547799E+18) > 0.5 THEN NULL
          ELSE "b"."Url"
      END, '5') > 0
      FROM "Blogs" AS "b"

I should check, but I believe that #33814 (function nullability) + #33757 (comparison nullability) already takes case of this specific case.

where it's difficult to imagine things otherwise (we may be able to do something with CTEs once we support them).

Any other thoughts on this @ranma42?

I will try to investigate the instances where the duplication can be avoided; I believe it is actually needed in very few cases.

roji commented 6 months ago

I will try to investigate the instances where the duplication can be avoided; I believe it is actually needed in very few cases.

If we can remove duplication, that would absolutely be very welcome - though I very much suspect that some cases will still require it, barring more fancy translations involving CTE etc.

In any case, your work and valuable thoughts on this are very much appreciated!

ranma42 commented 6 months ago

Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.

I just found out that apparently the issue is not just with duplicating, but also with checking whether they are identical:

var urls = db.Blogs
    .Select(x => EF.Functions.Random() == EF.Functions.Random())
    .ToList();

is translated to

      SELECT 1
      FROM "Blogs" AS "b"

while

var urls = db.Blogs
    .Select(x => EF.Functions.Random() == 0 + EF.Functions.Random())
    .ToList();

is translated to

      SELECT abs(random() / 9.2233720368547799E+18) = 0.0 + abs(random() / 9.2233720368547799E+18)
      FROM "Blogs" AS "b"

If the assumption is that each invocation of EF.Functions.Random() is independent, the first translation is invalid. If the assumption is that each invocation of a function in the query returns the same result, the second translation is invalid (SQLite evaluates each random() independently... which is what triggers the issue in the example program).

roji commented 5 months ago

@ranma42 yes, this is true. The general issue is that EF has no knowledge of which functions (or more generally, nodes) are pure/stable; and this sort of optimization can be very valuable in the general case, while functions like Random() are quite rare. So although it's definitely possible to produce contrived cases where this results

Just to have fun, I'd argue that EF's optimization here isn't technically incorrect - two invocations of Random() may happen to yield the same result, it's just quite improbable for that to happen ;)

And one final related thought... PostgreSQL has three categories of "get the current timestamp" functions: those that return the timestamp at transaction start, at statement start and at function invocation; the logic (as explained in the docs) is that it's very useful to have a concept of the "transaction time", as if the transaction occurred in a single instant (e.g. so that multiple modifications within the same transaction bear the same time stamp.). For those functions (as well as the statement start ones), optimizations such equality elimination are valid, whereas for the invocation-time ones they aren't.

But again, specifically for this problem, although we probably have incorrect behavior here, it seems quite edge-casey/contrived.

ranma42 commented 5 months ago

@ranma42 yes, this is true. The general issue is that EF has no knowledge of which functions (or more generally, nodes) are pure/stable; and this sort of optimization can be very valuable in the general case, while functions like Random() are quite rare. So although it's definitely possible to produce contrived cases where this results

Just to have fun, I'd argue that EF's optimization here isn't technically incorrect - two invocations of Random() may happen to yield the same result, it's just quite improbable for that to happen ;)

That's true, just like multiple invocations of gen_random_uuid might return the very same UUID. Note that the same could be said about replacing the invocation of random() with a constant number: random() could very well return that number (even repeatedly 😅 ). OTOH When using these functions, the correct assumption is not that they will basically sample a certain distribution; this could be verified by aggregating several samples.

And one final related thought... PostgreSQL has three categories of "get the current timestamp" functions: those that return the timestamp at transaction start, at statement start and at function invocation; the logic (as explained in the docs) is that it's very useful to have a concept of the "transaction time", as if the transaction occurred in a single instant (e.g. so that multiple modifications within the same transaction bear the same time stamp.). For those functions (as well as the statement start ones), optimizations such equality elimination are valid, whereas for the invocation-time ones they aren't.

AFAICT PostgreSQL provides quite a few ways to play around with impure functions:

gen_random_uuid/random/random_normal etc
setseed (very hard to use effectively without a very fine control of the queries being issues)
pg_sleep & co (optimizing them away or duplicating them would be quite bad; OTOH it is very unlikely for an EFCore user to create an expression containing them)
more?

(transaction_timestamp and statement_timestamp are just some named constants in the context of a single query, but they obviously cannot be optimized away as constants if they are being cached... for example EFCore cannot know what the result of a comparison of them against another constant will be; conversely each of them will definitely be equal to itself)

But again, specifically for this problem, although we probably have incorrect behavior here, it seems quite edge-casey/contrived.

This is definitely a corner case in general and even more so in the context of EFCore. From all of the examples I listed, I believe that only the random generation functions are somewhat relevant, but not really something to focus an effort on, even if the current translation might cause incorrect results when dealing with them.

ranma42 commented 5 months ago

This is a special case of

19897

dotnet / efcore

Translation of impure functions can result in wrong queries #33791

Exception

Include provider and version information

19897