Open ranma42 opened 6 months ago
See a somewhat similar problem with SQL Server's COALESCE, in #32519.
Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.
In any case, EF doesn't currently know which functions are pure and which aren't; this is something we could add to SqlFunctionExpression, and then possibly fail translation if it requires duplication. Of course, if we can avoid duplicating altogether that's ideal, but there are definitely cases (such as the null semantics duplication above) where it's difficult to imagine things otherwise (we may be able to do something with CTEs once we support them).
Any other thoughts on this @ranma42?
See a somewhat similar problem with SQL Server's COALESCE, in #32519.
Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.
In any case, EF doesn't currently know which functions are pure and which aren't; this is something we could add to SqlFunctionExpression, and then possibly fail translation if it requires duplication. Of course, if we can avoid duplicating altogether that's ideal, but there are definitely cases (such as the null semantics duplication above)
AFACT the translation for the example query should be:
SELECT instr(CASE
WHEN abs(random() / 9.2233720368547799E+18) > 0.5 THEN NULL
ELSE "b"."Url"
END, '5') > 0
FROM "Blogs" AS "b"
I should check, but I believe that #33814 (function nullability) + #33757 (comparison nullability) already takes case of this specific case.
where it's difficult to imagine things otherwise (we may be able to do something with CTEs once we support them).
Any other thoughts on this @ranma42?
I will try to investigate the instances where the duplication can be avoided; I believe it is actually needed in very few cases.
I will try to investigate the instances where the duplication can be avoided; I believe it is actually needed in very few cases.
If we can remove duplication, that would absolutely be very welcome - though I very much suspect that some cases will still require it, barring more fancy translations involving CTE etc.
In any case, your work and valuable thoughts on this are very much appreciated!
Yeah, we traditionally haven't given enough thought when duplicating SQL expressions in translation; in addition to the problem of impure functions, expression duplication can create serious performance issues as e.g. a heavy, complex subquery may end up getting evaluated more than once.
I just found out that apparently the issue is not just with duplicating, but also with checking whether they are identical:
var urls = db.Blogs
.Select(x => EF.Functions.Random() == EF.Functions.Random())
.ToList();
is translated to
SELECT 1
FROM "Blogs" AS "b"
while
var urls = db.Blogs
.Select(x => EF.Functions.Random() == 0 + EF.Functions.Random())
.ToList();
is translated to
SELECT abs(random() / 9.2233720368547799E+18) = 0.0 + abs(random() / 9.2233720368547799E+18)
FROM "Blogs" AS "b"
If the assumption is that each invocation of EF.Functions.Random()
is independent, the first translation is invalid.
If the assumption is that each invocation of a function in the query returns the same result, the second translation is invalid (SQLite evaluates each random()
independently... which is what triggers the issue in the example program).
@ranma42 yes, this is true. The general issue is that EF has no knowledge of which functions (or more generally, nodes) are pure/stable; and this sort of optimization can be very valuable in the general case, while functions like Random() are quite rare. So although it's definitely possible to produce contrived cases where this results
Just to have fun, I'd argue that EF's optimization here isn't technically incorrect - two invocations of Random() may happen to yield the same result, it's just quite improbable for that to happen ;)
And one final related thought... PostgreSQL has three categories of "get the current timestamp" functions: those that return the timestamp at transaction start, at statement start and at function invocation; the logic (as explained in the docs) is that it's very useful to have a concept of the "transaction time", as if the transaction occurred in a single instant (e.g. so that multiple modifications within the same transaction bear the same time stamp.). For those functions (as well as the statement start ones), optimizations such equality elimination are valid, whereas for the invocation-time ones they aren't.
But again, specifically for this problem, although we probably have incorrect behavior here, it seems quite edge-casey/contrived.
@ranma42 yes, this is true. The general issue is that EF has no knowledge of which functions (or more generally, nodes) are pure/stable; and this sort of optimization can be very valuable in the general case, while functions like Random() are quite rare. So although it's definitely possible to produce contrived cases where this results
Just to have fun, I'd argue that EF's optimization here isn't technically incorrect - two invocations of Random() may happen to yield the same result, it's just quite improbable for that to happen ;)
That's true, just like multiple invocations of gen_random_uuid
might return the very same UUID.
Note that the same could be said about replacing the invocation of random()
with a constant number: random()
could very well return that number (even repeatedly 😅 ).
OTOH When using these functions, the correct assumption is not that they will basically sample a certain distribution; this could be verified by aggregating several samples.
And one final related thought... PostgreSQL has three categories of "get the current timestamp" functions: those that return the timestamp at transaction start, at statement start and at function invocation; the logic (as explained in the docs) is that it's very useful to have a concept of the "transaction time", as if the transaction occurred in a single instant (e.g. so that multiple modifications within the same transaction bear the same time stamp.). For those functions (as well as the statement start ones), optimizations such equality elimination are valid, whereas for the invocation-time ones they aren't.
AFAICT PostgreSQL provides quite a few ways to play around with impure functions:
gen_random_uuid
/random
/random_normal
etcsetseed
(very hard to use effectively without a very fine control of the queries being issues)pg_sleep
& co (optimizing them away or duplicating them would be quite bad; OTOH it is very unlikely for an EFCore user to create an expression containing them)(transaction_timestamp
and statement_timestamp
are just some named constants in the context of a single query, but they obviously cannot be optimized away as constants if they are being cached... for example EFCore cannot know what the result of a comparison of them against another constant will be; conversely each of them will definitely be equal to itself)
But again, specifically for this problem, although we probably have incorrect behavior here, it seems quite edge-casey/contrived.
This is definitely a corner case in general and even more so in the context of EFCore. From all of the examples I listed, I believe that only the random generation functions are somewhat relevant, but not really something to focus an effort on, even if the current translation might cause incorrect results when dealing with them.
This is a special case of
The translation pipeline assumes that SQL expressions are pure, which is not true in general (for example
EF.Functions.Random()
). Under this assumption, it sometimes duplicates sub-expressions which can lead to inconsistent results, exceptions and (a minor issue, but in some cases still relevant) degraded performance.See #32519 for a related issue that is specific to the translation performed by the SQL Server provider.
An example program that showcases the bug is:
Exception
The program terminates with the following exception:
because it is running the query
The two
random()
calls are evaluated independently, hence it is possible for this query to returnNULL
values (which IIUC the shaper/materializer rightly does not expect).Include provider and version information
EF Core version: 8.0.5 Database provider: Microsoft.EntityFrameworkCore.Sqlite Target framework: .NET 8.0 Operating system: Linux (/WSL) IDE: Visual Studio Code 1.89.1