dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.

https://docs.microsoft.com/ef/

MIT License

13.81k stars 3.2k forks source link

SQL translation should be more generic and operate on end results #12552

Closed mqudsi closed 2 years ago

mqudsi commented 6 years ago

Following up on the discussion in #12517, I believe there's a fundamental issue with the way SQL translation is currently handled. If there is an expression that cannot be directly mapped to an SQL equivalent, but that expression evaluates to a data type that can be directly plugged in to an SQL query, then that expression should be evaluated and its result used instead of the SQL translation giving up and the query being evaluated locally rather than on the database server.

To illustrate with a concrete example:

    public class Database : DbContext
    {
        ....
        public IQueryable<Listing> ActiveListings => Listings
            .Where(x => x.Status != ListingStatus.Sold)
            .OrderByDescending(x => x.RenewTime)
            .Where(x => x.RenewTime > DateTime.UtcNow.AddDays(-7));
    }

In the latest published EFCore w/ SQLite provider releases, this cannot be evaluated in the database as AddDays(xx) does not have a translation implemented.

However, simply doing the following makes SQL translation work and the query is fully evaluated on the DB server rather than the ASP.NET Core host:

    public class Database : DbContext
    {
        ....
        DateTime ExpiryCompare => DateTime.UtcNow.AddDays(-7);
        public IQueryable<Listing> ActiveListings => Listings
            .Where(x => x.Status != ListingStatus.Sold)
            .OrderByDescending(x => x.RenewTime)
            .Where(x => x.RenewTime > ExpiryCompare);
    }

More generally, if the translation of a given subexpression does not have a translation expressly provided by the provider, and that subexpression does not depend on the value of the record being evaluated (i.e. for a lambda x => x == Foo(..), Foo does not utilize the value of x) and the return type of Foo(...) can be used to directly evaluate the expression on the database server, then that subexpression should be evaluated and its result plugged in before translation is attempted once more.

tuespetre commented 6 years ago

I agree. I think a ‘second pass’ at untranslatable expressions that can extract client-side evaluatable expressions into parameters is the way to go. I have mentioned this elsewhere before but I am not sure what the team thinks of it.

jzabroski commented 6 years ago

This should not be a second pass, but rather a first pass, before calling EF.

This can be done outside EF, or on top of EF. Please see the excellent expression transformation library, erecruit.Expr.

This library is agnostic to EF, although there are some clever optimizations that handle problems with how EF caches queries with IEnumerable.Contains expressions.

The library takes a divide-and-conquer approach, by implementing a few basic abstractions:

Common sub-expression substitution via Expression.ReplaceParameter and Subst.Expr ** e.g., suppose you have a interface ICodeName { string Code { get; set; } string Name { get; set; } }, then you can Subst.Expr(objectInheritingICodeName, (ICodeName codeName) => new CodeName { Code = codeName.Code, Name = codeName.Name }); and voila, you have created a re-usable way to extract common fields from the database in your query projection
Boolean combinators, such as And and Or
List comprehensions such as Fold
Function composition via Compose
Syntactic sugar for building expression trees via Expr.Create: var trueStatement = Expr.Create(() => true) will generate a boolean expression that returns true
Expression expansion via Expand

If you're not familiar with this library, it is fundamentally superior to Joseph Albahari's PredicateBuilder. Note that erecruit.Expr does not leak to the caller, while PredicateBuilder does: With PredicateBuilder, you must call AsExpandable() first, meaning the API designer must know ahead of time you want to use PredicateBuilder. Instead, erecruit.Expr lets you just call Expand() at the very end of your query chain. Also, the only penalty for calling Expand multiple times is performance.

The only gotcha with using erecruit.Expr is it has a max recursion depth equal to the available call stack of the underlying .NET host platform.

There are various reasons why doing this request internally is a mistake:

Handling of sargability
Harder to see developer's intentions.
Crazier stack traces.

Please close.

smitpatel commented 6 years ago

Duplicate of #12338

smitpatel commented 6 years ago

I would recommend reading this comment and later discussion.

tuespetre commented 6 years ago

@jzabroski I do not mean a ‘second pass’ as in a pass over the entire expression tree. When EF hits a a selector/predicate/etc. it tries to rewrite certain expressions into SQL-translatable expressions. If it encounters, for instance, a member access or method call expression it cannot rewrite, it tends to return null and the translation of that subtree is aborted. What I am saying is that at that point, there is a ‘second chance’ at extracting that portion of the subtree into a parameterized expression rather than aborting translation. This is not specific to the DateTime/SQLite issue; it is a general solution to a general problem and the ‘base’ SQL query provider (EF Core’s relational provider in this case) is the most suitable place for such a mechanism.

jzabroski commented 6 years ago

@tuespetre There's a couple reasons you're incorrect or coming up with the wrong solution.

Your proposal will add overhead, as you will not be able to cache your query until you do the EF pass! erecruit.Expr caches in front of EF the hoisting mechanism you want. This means EF can just hash the expression tree, without computing anything.
Preserving the semantic meaning of queries is also a concern, as @smitpatel points out in his comment
Why not write a Roslyn code analyzer for EF to detect "If it encounters, for instance, a member access or method call expression it cannot rewrite, it tends to return null and the translation of that subtree is aborted."

A Roslyn code analyzer for EF could also potentially do sarg-ability analysis. I believe there are many ways people have been tripped up with the sarg-ability of LINQ queries in my career:

Linq 2 Sql handled CHAR(n) data type comparisons poorly from t in dbContext.Trades where t.Side == 'B' select t; would generate as: SELECT [t0].Quantity, [t0].Price, [t0].Side FROM [dbo].[Trade] AS [t0] WHERE UNICODE([t0].[Side]) = @p1
string.Equals behaving differently from == operator
Converting enums to strings and comparing the value to another string.

Then there is just "API Coverage" concerns:

Linq 2 Sql didn't support many primitive string methods: from t in dbContext.Persons where string.IsNullOrWhitespace(t.Name) select t; /* RUN-TIME ERROR! */

tuespetre commented 6 years ago

@jzabroski it does not prevent caching. The subtree is pulled out into a lambda expression that receives the correct closure instance as a parameter at runtime, when it is executed to produce the value for that execution of the query. I have implemented it, tested it, and used it in production. It’s good :+1:

mqudsi commented 6 years ago

At the risk of commenting on a decision that's been sealed beyond reconsideration, I feel like there is room for a healthy discussion here out of which some good can come.

Sargability is extremely important to me and I've filed defect reports with several open source database engines and products to offer some (too-clever) workarounds to get certain datetime operations to use the index, so I am not blithely making suggestions that would break sargability here.

As someone that has just begun using EFCore and did not use EF previously, my "fresh take" on it was that the default would be for all efforts to be focused on getting my C# LINQ expression translated to code that could most optimally run on the db server. The existence of the EF.Functions extension class implied to me that to take advantage of certain db-level optimizations (such as maintaining sargability of lookups) I should use these functions from the db provider implementation to get optimal results (instead of doing the equivalent DateTime.xxx or string.xxx in C#).

Anyone that's handwritten queries knows about optimizing the indexes and queries together to come up with fast, efficient queries and is familiar with the benefits of limiting the amount of rows carried back from the db server, etc. But what EFCore has done here is take those common concerns and give them the backseat to an even bigger problem we have to worry about now: whether or not the code will even be executed on the db in the first place. The benefits of C#-to-SQL translation are enormous and it's an extremely laudable effort, but I feel like it should be done in a way where precedence is given to in-db execution instead of to maximized lambda support at the cost of runtime warnings masking massive performance issues for cases that could be trivially supported as a matter of design decision.

All that aside and with the caveat that I don't know how the code that does this currently looks like, I don't understand why SQL translation can't consider the return type of a function call not dependent on the value of a record column in its decision to execute-then-translate or translate-then-execute flow.

For the case where the lambda invokes a C# function on the column/output of the database, it's a clearcut case of "the user wants to invoke local code to evaluate db values, without writing a for loop" which then proceeds to answer the question "is there an optimization we can perform to eliminate the need to bring this data back to the application server for local evaluation" in which case a db construct can be used instead of the local C# function call and the translation would be over. In case the answer is no, it's a shortcut to "unfortunately this needs to be evaluated locally" and again the translation would be over.

But in the case that the db column/value is not an input to a subexpression in the lambda, the same steps can be applied only there's a recourse instead of giving up. You start with the same question, "is there a db-level construct that can optimize the entire expression" only when the result is no, the question becomes "can we reduce the c# expression into a different one that we could perhaps optimize?" which can be iteratively evaluated until that answer becomes "no."

As concrete examples,

.Where(x => x.Day == foo.Day) has a hardcoded rule for DateTime.Day == DateTime.Day that can be applied and the translation generates a sargable expression to be evaluated on the db server
.Where(x => Foo(x)) and EFCore has no knowledge of Foo(), this must be evaluated locally, early abort of translation
.Where(x => x == Foo() where Foo() is something EFCore knows about, for instance, DateTime.UtcNow.ToLocalTime() and a query to do this in the db can be used
.Where(x => x == Foo() where Foo() is not something EFCore knows about but with Foo() evaluating to a DateTime EFCore should be able to reduce that to an expression it can translate
.Where(x => x.Day == Foo().Day exhibits the best of both worlds, where a previously unsupported LINQ expression can be evaluated a step at a time, until it arrives at <DateTime>.Day which is supported yielding the sargable <DateTime>.Day == <DateTime>.Day expression, which can ultimately be evaluated in the db for maximum performance and use the index to boot.

divega commented 6 years ago

Reopening because I agree there are some points in this thread that can use more discussion. I will catch up with @smitpatel next week to make sure I understand his point of view.

smitpatel commented 6 years ago

@mqudsi - Explain in detail the last case .Where(x => x.Day == Foo().Day) would be translated to in SQL and why it is sargable compared to what EF Core currently generates for it.

mqudsi commented 6 years ago

I think you're misunderstanding my proposal, @smitpatel. The same translation would be used, only further C# code would be executed prior to the execution of the translated db query.

Here is my test project:

For the following LINQ expression:

            var today = DateTime.UtcNow;
            var likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == today.Day)
                .FirstOrDefaultAsync();

this is what EFCore currently maps it out to with the SQLite provider:

SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__today_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1

With a simple function returning a DateTime object:

        static DateTime Now()
        {
            return DateTime.UtcNow;
        }

            likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == Now().Day)
                .FirstOrDefaultAsync();

the following SQL is translated:

SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1

which I was not expecting, as that behavior is actually what I am requesting. But perhaps the value was optimized by the compiler, so trying again:

        static DateTime Now2()
        {
            if (DateTime.UtcNow.Year == 2018)
            {
                return DateTime.UtcNow;
            }
            else
            {
                throw new Exception("It's the end of the world");
            }
        }

            likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == Now2().Day)
                .FirstOrDefaultAsync();

which again generates the optimized SQL I was looking for:

SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1

So maybe it's not the function call in and of itself that is the problem but rather the fact that the lambda is really simple?

//Not on the nightlies, so .AddDays() is not yet supported
            likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == DateTime.UtcNow.AddDays(1).AddDays(-1).Day)
                .FirstOrDefaultAsync();

Which finally gives me the situation I was looking for:

Microsoft.EntityFrameworkCore.Query:Warning: The LINQ expression where ([x].Timestamp.Day == DateTime.UtcNow.AddDays(1).AddDays(-1).Day) could not be translated and will be evaluated locally.

and the generated SQL returns all results for local filtering:

SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
ORDER BY "x"."Timestamp"

So you're actually already 99% of the way there. You're already evaluating functions and boiling them down to their return values before translating (if the initial expression could not be mapped). The only thing missing is support for call chains.

As it currently stands, x => x == Foo.Day is fully supported, as is x => Foo().Day where static DateTime Foo() => DateTime.UtcNow.AddDays(1).AddDays(-1);

My proposal calls for only a very modest change (and I don't mean that in a Swiftesque manner). Just as

var foo = db.Foos.Where(x => x.Day == Foo().Day)

is translated to

//pretend this is quoted correctly:
var sql = "SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1)";
using (var cmd = db.Database.Connection.CreateCommand(sql)) {
    cmd.Parameters["@__Now2_Day_0"] = Now2().Day;
    using (var reader = cmd.ExecuteReader()) {
    ....
    }
....
}

I would see

var foo = db.Foos.Where(x => x.Day == DateTime.UtcNow.Day);

translated to

//pretend this is quoted correctly:
var sql = "SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1)";
using (var cmd = db.Database.Connection.CreateCommand(sql)) {
    cmd.Parameters["@__DateTime_UtcNow_Day"] = DateTime.UtcNow.Day;
    using (var reader = cmd.ExecuteReader()) {
    ....
    }
....
}

mqudsi commented 6 years ago

I just realized I didn't address sargability; I was excited to find that substitution was already pretty powerful.

.Where(x => x.Year = Foo.Year) can be expressed as .Where(x => x >= @date1 AND x < @date2) with

var date1 = new DateTime(Foo.Year, 1, 1);
var date2 = new DateTime(Foo.Year + 1, 1, 1);

for any expression matching x => x.Year == <datetime or expression evaluating to datetime>.Year, regardless if the lambda is x => x.Year == Foo.Year, x => x.Year == Foo().Year or x => new Date(long.Parse(Console.ReadLine())).Year

ralmsdeveloper commented 6 years ago

@mqudsi already tried something like this?

static DateTime Now => DateTime.UtcNow;

likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == Now.Day)
                .FirstOrDefaultAsync();

mqudsi commented 6 years ago

Yes, creating a property works (see initial post).

tuespetre commented 6 years ago

Yes! 👏 @mqudsi is doing a much better job of visually demonstrating what I was talking about:

If a subtree cannot be translated into SQL, but it:

produces a value of translatable type and
does not bind any of its enclosing lambdas' parameters,

that subtree can be substituted with a parameter expression and lifted out into a block that is executed and applied to the corresponding DbParameter each time the query is run.

divega commented 6 years ago

@mqudsi, @tuespetre, I wanted to talk to @smitpatel because my initial reaction was that the behavior you described as the current way EF Core works was against some of our fundamental designs. But after talking to him I understand that things a bit more complicated :smile:

Anyway, I want to thank you both for raising this. It helped us realize we needed to think harder.

I will try to summarize the discussion @smitpatel and I had here so that you can understand what our thinking is and keep giving us feedback on it. We may later need to create new issues to track the specific improvements we would like to make to EF Core.

In general, when we find a subexpression that is not store-correlated, we should be able to evaluate it on the client and pass the value in a parameter (i.e. funcletize it).

From your comments, I believe we are all in agreement on this.

So, why are we not doing it for this query?

    likeToday = await db.Records
        .OrderBy(x => x.Timestamp)
        .Where(x => x.Timestamp.Day == DateTime.Now.AddDays(1).AddDays(-1).Day)
        .FirstOrDefaultAsync();

The answer is that the query uses DateTime.Now, which is in our list of non-deterministic expressions.

What we believe we got right

When a provider connects to a server, it is important to use the current time from the server in queries for correctness and consistency. We want to evaluate such expressions n the server because if the time zone or current time of the server and the client are different we could return very inconsistent results. This applies to DateTime.Now and similar variations with UTC and DateTimeOffset.

I am not sure if you will agree with this. From reading your comments it is not clear that this has been a concern for you, but it is for us.

What we believe can be improved

Currently, we achieve this goal by just having DataTime.Now (and the other variations) listed in EvaluatableExpressionFilter, which acts as a blacklist for funcletization.

Unfortunately, for the example query, as soon as we realize that we cannot translate AddDays(), we switch the evaluation of the whole predicate to the client side. If the goal was to make sure we always evaluate DateTime.Now on the server, that is obviously not a great solution.

We should instead:

Assume that DateTime.Now is a function that doesn't have an appropriate client implementation when using a provider that connects to a server (e.g. any relational provider).
Fix https://github.com/aspnet/EntityFrameworkCore/issues/11466, which would result in issuing SELECT GETDATE() as a separate subquery.

Moreover, how many times should we invoke this? Right now EvaluatableExpressionFilter conflates together non-determinism (or the need to evaluate an expression per row) with the need to evaluate something on the server. The blacklist contains two kinds of expressions:

Expressions that we want to evaluate on the server, like DateTime.Now and friends.
Expressions that need to be evaluated per row, like Guid.NewGuid() and Random.Next().

If we believed that DateTime.Now needs to be evaluated once per row to get correct results, the right soltuion would be in fact to issue N+1 queries. The only difference from what we are doign now is that the value of DateTime.Now would be obtained by executing SELECT GETDATE().

But that isn't exactly the case. Although both DateTime.Now and Guid.NewGuid() are strictly speaking non-deterministic (they can return different result every time you call them), DateTime.Now will return the same value if you call it repeated times soon enough, and the server (in this case, SQL Server) can wisely avoid invoking it per row. For example look at the query plans for these two queries:

  -- This evaluates GETDATE() only once and the result becomes a constant
  select r.*, GETDATE() 
  from Records as r
  where r.OrderDate < GETDATE()
  order by GETDATE();

  -- This evaluates NEWID() once per row
  select r.*, NEWID()  
  from Records as r
  order by NEWID()

I am trying to come up with a better explanation about the difference between these two. The best I have is that the non-determnism in GETDATE() is "soft", while the non-determinism in NEWID() is "hard" :smile:.

Furthermore, having Guid.NewGuid() in EvaluatableExpressionFilter has never really been about evaluating it on the server. Besides NEWSEQUENTIALID() (which we could expose as a function) there shouldn't be a difference between evaluating it on the server or on the client. The only requirement is that it needs to be evaluated for each row.

Summary

If we separate how we deal with DateTime.Now (and friends) and how we deal with Guid.NewGuid() and Random.Next() on the other hand, we can probably come up with something like what you are asking for, with the only difference that we would make a single call to SELECT GETDATE().

On sargability

@mqudsi the approach to achieve sargability that you describe sounds quite clever. If this is something you think we could generally rewrite queries to (rather than having customers write the query like that), please create a separate issue.

tuespetre commented 6 years ago

Great writeup.

The best I have is that the non-determnism in GETDATE() is "soft", while the non-determinism in NEWID() is "hard"

It looks like the term SQL Server has used is “nondeterministic runtime constant” for GETDATE().

divega commented 6 years ago

It looks like the term SQL Server has used is “nondeterministic runtime constant” for GETDATE().

Thanks! I will adopt this term. At least it sounds a bit better than soft non-determinism.

jzabroski commented 6 years ago

I keep going back to a conversation I had with Smit back in February regarding group by constant behavior. Why do you hardcode stuff with a fixed strategy?

Why does EF need a "THE PHILOSOPHY OF EF"? While I can appreciate conventions that avoid API consumers from making mistakes, there is also no room to say otherwise.

As for how to separate things that operate per row vs. per batch, why not just call them as you described it. OncePerBatch, OncePerRow, and OncePerOnce.

OncePerOnce literally means every symbolic reference to a non-deterministic expression be evaluated server-side. E.g.

var tmp = (from f in dbContext.Persons select new { Now = DateTime.Now, OvermorrowNow = DateTime.Now.AddDays(2), YesterdayNow = DateTime.Now.AddDays(-1) }).ToList();

In OncePerOnce mode, every occurrence of the "term rewrite rule" DateTime.Now := GetDate() is evaluated locally (in-place), with no distribution law applied. In mathematics, this is basically wrapping the sub expression into a "cast to algebraic group that does not allow law of distributivity". Then, any other rewrite rule knows it cant touch this sub-expression further.

Voila, we have decoupled PHILOSOPHY of EF from the raw mathematics.

Further, if you bite on my gambit so far, this raises an interesting API idea: Directly reifying the rewrite engine to the end user, so that they can choose OncePerOnce, OncePerRow, or OncePerBatch. It also would practically require to reify the "zone" where you put batch stuff like DECLARE @now datetime = GETDATE();

Then reifying the batch zone could open up a whole slew of awesomeness to EF. For example, imagine defining a table #tmp0 (ID int primary key clustered) and seeding it with values, and doing a join on that table, instead of a typical IN clause translation. (This would allow EF to circumvent SQL Server's 2,100 parameter limit on stored procedures, because the temp table would hold the parameters, thereby eliminating the dreaded error The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.)

Let me know if that makes sense. This is exactly how the query generator I wrote 10 years ago worked. Back then, I was greatly inspired by the academic language Maude, which implements a reflective rewriting logic as a language.

Maude calls the ability to put constants like DateTime.Now in-line as "on-the-fly variable declaration", which is convenient because that's exactly what we're trying to achieve.

jzabroski commented 6 years ago

It looks like the term SQL Server has used is “nondeterministic runtime constant” for GETDATE(). Thanks! I will adopt this term. At least it sounds a bit better than soft non-determinism.

But that's only describing it's run-time mechanism of action. You appear to want to describe also the compile-time lexical scope of the action and whether each occurrence is a unique variable or unique assignable reference.

jzabroski commented 6 years ago

@mqudsi already tried something like this?
static DateTime Now => DateTime.UtcNow;

likeToday = await db.Records
                .OrderBy(x => x.Timestamp)
                .Where(x => x.Timestamp.Day == Now.Day)
                .FirstOrDefaultAsync();
Yes, creating a property works (see initial post).

Caveat: The property must be a non-Func. Func<DateTime> = () => DateTime.Now does not work.

Any time the dependency is "deferred", it won't work. You have to capture the value:

Works

Func<IRepository<X>> Xs { get; set; }

public IQueryable<X> GetXs()
{
   var xs = Xs().All;
   return (from x in xs select x);
}

Doesn't Work

Func<IRepository<X>> Xs { get; set; }

public IQueryable<X> GetXs()
{
   return (from x in Xs().All select x);
}

This is rather annoying for integration tests, as you can see the problem especially if I capture DateTime.Now as Now() outside the query: EF won't see the DateTime.Now expression at all, and instead substitute in the client-side evaluated value. Perhaps this example best demonstrates the generality of my solution.

tuespetre commented 6 years ago

@jzabroski I understand your points but there does need to be a “philosophy of EF”. Projects aren’t worth much without some kind of philosophy to them 😛 and even your points have a philosophy behind them.

I agree that behaviors should be easy to modify in a query provider like this but that doesn’t mean the provider can’t have some (philosophically) sensible defaults that can be added to or removed from to taste. In EF Core there are a couple of really composable areas, namely in member and method handling. Past that I feel it’s a bit too rigid. Perhaps it could become less so in the future.

All in all I think we are after the same pie in the sky here: a highly composable and useful query provider.

divega commented 6 years ago

@jzabroski I may be missing some details of what you are saying, but I believe I understand where you are going in general.

I see the concrete goal in this case as coming up with a design that makes something like DateTime.Now behave consistently in the expected way regardless of whether it is being evaluated on the database server or on the client. The server already exists and it has its rules. Regardless of what we may want to do, the simplest way for it to be consistent is to try to build something that follows similar rules when the evaluation needs to happen in memory.

Moreover, the whole problem space is very large, and the work we do involves a lot of divide-and-conquer and prioritizing stuff that we believe customers actually will use. Plus, as @tuespetre mentions, we need sensible defaults because only then you can use the product without deeply understanding all the details.

In that sense, while we could build something that let's the user choose how many times a functions is evaluated at the query or invocation level (I am going to assume that is what you are proposing), I am somewhat skeptical about how much value that would yield. It seems for existing functions, it is reasonable for the expectation to be hard-coded (we just got it wrong in the original design). For new user defined or new function mappings, it certainly seems a good idea to have a way to configure it on a per function basis (cc @pmiddleton, @smitpatel).

Re what you mention about creating temp tables and joining them with queries, it seems to me that for the scenarios in which this would actually help, e.g. there is a collection of values, TVPs on SQL Server would probably be a better option. For all other scenarios, if we just follow the same rules as the server, it seems that either inlining the function or using a simple parameter would be sufficient.

divega commented 6 years ago

@smitpatel could you please create new issues to cover what we discussed in triage?

I think the main thing is separating the concerns currently conflated in the list in EvaluatableExpressionFilter.

Part of the solution is already tracked at https://github.com/aspnet/EntityFrameworkCore/issues/11466.

I wonder if we should have something about configuring this aspect functions when mapped. I remember we talked about having an attribute at some point, but I am not sure we ever implemented it. I'd like to hear what @pmiddleton thinks about it.

Am I missing anything else?

pmiddleton commented 6 years ago

I have talked with @smitpatel on parts of this already.

However Im currently at Disneyland and don't have time to write it up right now. I'll comment next week when I get home. It's funny how you never have any free time when you are on vacation :)

jzabroski commented 6 years ago

...And I thought I was in the land of make believe.

My own 2 cents:

I agree with Diego that we should address a smaller problem first, then seek to generalize the pieces. Biting off too much at once is risky and can demoralize anyone undertaking such a heroic effort to truly open up the query generator.
I never said defaults were bad, but I'm against them if they're hard-coded and cannot be changed. Default and fixed is not the same thing...
I think the defaults EF team picked were great, modulo maybe some surprises @mqudsi and myself have pointed out here.
I think some of the conventions I mentioned might form good vocabulary for making things more generic. If not, I'd be curious to learn why - always looking to be better.

Thanks for listening to all of us!

mqudsi commented 6 years ago

Sorry for not getting a chance to chime in earlier, but I want to thank everyone for the vibrant, open, and healthy discussion here. For what it's worth, @divega's detailed response perfectly explained the situation and it now "makes sense" why things work the way they work. In particular, the bit about the DateTime blacklist explains pretty much all of the weird cases I was running into. I do suggest that perhaps the warning generated by EFCore could include a mention of "blacklisted type/method xxxx" which would make it eminently more useful, though.

@divega my first encounter with EFCore lambda translation had me anxious to see whether values such as UtcNow would be explicitly cached, selected per-row, or left to the database itself to decide, actually. Thanks for mentioning that concern in your explanation! In answer to your question, since I'm actually (currently) using the SQLite provider, there is literally no difference whether the time is evaluated on the db server or the asp net core server, but I can definitely appreciate how that begins to matter immediately in the real world.

@mqudsi the approach to achieve sargability that you describe sounds quite clever. If this is something you think we could generally rewrite queries to (rather than having customers write the query like that), please create a separate issue.

Thank you! It came about when I was trying to understand why WordPress queries were so inefficient out of the box despite plenty of indices sprinkled about and realized that the default permalink format involved a lookup against the year and month. The YEAR(xx) example was easiest to illustrate, but it's easy to extrapolate that (at least in your head) for (year, month, [day]) constraints and even (with some mental gymnastics) (month, [day]) (or even just day but I imagine whether or not that ends up helping really depends on the size of your data!).

But to be honest, I'm not clear on how (blacklisting of DateTime/UtcNow aside) the expansion of chained/nested expressions affects the concerns @smitpatel or @jzabroski had with regards to sargability; I just want to make sure I'm not missing something and perhaps they are right and the presence/absence of subexpression expansion affects the heuristic governing that in some way?

jzabroski commented 6 years ago

@divega

Re what you mention about creating temp tables and joining them with queries, it seems to me that for the scenarios in which this would actually help, e.g. there is a collection of values, TVPs on SQL Server would probably be a better option. For all other scenarios, if we just follow the same rules as the server, it seems that either inlining the function or using a simple parameter would be sufficient.

Why would a TVP on SQL Server be a better option for the EFCore Query generator? With a TVP, you need to create a user-defined table type. Whose responsibility is that going to be, and why would EFCore want to take a dependency on whether the system catalog table sys.types has an appropriate TVP type mapping in place?

The scenario I am suggesting here is where you have a lot of values. If anything, you want to create a temp table with a clustered index so that you can join quickly. Currently, this is what we do with EFCore: We use EntityFramework.Extended and bulk insert into a non-temp table the id's we need to join on. Alternatively, we use .Batch() extension method and then Union in-memory all the separate batches. Either one will avoid a run-time error, but the pseudo-temp table approach is the most performant. The downside is that it requires two columns, a uniqueidentifier column and the actual id int column to populate in the select, so that the sub-select becomes SELECT tx0.[Id] FROM [dbo].[PseudoTempTable] tx0 WHERE tx0.[Guid]='c65e4bea-fdbd-422a-adc9-ed962923f27b') - and the downside to this is you now need some garbage collection process to make sure stuff gets cleaned up after the end of the connection. Moreover, you've now reduced the index seek-ability because the primary key has to first be clustered on that uniqueidentifier. Bogus.

jzabroski commented 6 years ago

But to be honest, I'm not clear on how (blacklisting of DateTime/UtcNow aside) the expansion of chained/nested expressions affects the concerns @smitpatel or @jzabroski had with regards to sargability; I just want to make sure I'm not missing something and perhaps they are right and the presence/absence of subexpression expansion affects the heuristic governing that in some way?

You're a very clever and curious engineer, as evidenced by your amazing WordPress anecdote... but there are indeed some basic academic concepts that can answer your question and perhaps raise your awareness to what your clever intuition and case analysis has revealed to you.

You can think of any compiler as a rewrite engine. Compilers are constantly rewriting things. In rewriting logics, we think of the following properties:

Church-Rosser Theorem or "Confluence" (that rewrite rules will always reduce down the same way every time to what we call a "canonical form" or stopping condition at which point the rewrite engine knows to "give up" and be happy)
Termination guarantees (that rewrite rules don't go in an indirect, infinite loop, e.g. a -> b; b -> c; c-> a;)

The main reason I'm primarily in favor of using erecruit.Expr instead of complicating EntityFrameworkCore's internal query optimizer is that I can guarantee each optimization is limited to a specific "peep hole".

However, at the same time, there are some optimizations I simply cannot do external to EF as they require some extensibility to EF itself.

I'll update this very post with some persuasive examples, but for now I have to run out the door.

divega commented 6 years ago

@mqudsi @jzabroski I am still not sure if there is any direct connection between sargability and the original issue reported here. I believe @smitpatel was just curious to know if anything in @mqudsi proposal would make the query more sargable.

Anyway, I would still like to encourage anyone that can propose sargability improvements we cann apply to EF Core query translation to do so in a separate issue.

While I find this particular discussion very interesting, from the perspective of project management, it helps to keep issues that are narrow in focus and can easily be assigned to a person and a milestone for addressing.

@jzabroski, TVPs solve the problem of streaming an arbitrary number of values up to the server. If you use a temp table, you still have the problem of populating it with the values. No doubts there are ways to do that, e.g. in chunks, but TVPs would likely be more efficient.

An yes, it is unfortunate that TVPs require so much ceremony. I would much prefer to be able to declare the type inline or to be able to use existing tables a the template for TVPs. That said, if we enabled TVPs, some of the table types that we need (e.g. for inserts) would probably be generated and maintained by EF Core Migrations. For completely ad-hoc queries using Enumerable.Contains, we would probably need a table type with a single column of a specific scalar type. We could consider sending DDL to create the table type if it doesn't exist just before executing the query. All in all, I think the resulting experience could become quite good.

smitpatel commented 6 years ago

11466 Once per row server side

12651 Once per server side

pmiddleton commented 6 years ago

IMHO based on my past experience dealing with time and time zones really sucks... like a lot :)

I am in agreement that we should try to evaluate DateTime.Now/DateTime.UTCNow on the server whenever possible. However this is going to be a hard and messy issue to solve, because there are a bunch of edge cases.

Take this query for example (it assumes the current released implementation of SqlLite without AddDays support)

db.Records
.Where(x => x.Timestamp < DateTime.UtcNow)
.Where(x => x.Timestamp < DateTime.UtcNow.AddDays(-1))

This will end up using the current datetime from both the server and the client. The first where clause will be evaluated server side with a mapped call and the second will run client side with a local call. This is because the two lambdas are evaluated and translated in isolation.

To solve this we will need to scan the expression tree while keeping context of what other sub-expressions have already run into, and then backtracking to perform substitutions if we run into a non-translatable method call.

What is the rule in a N+1 query where the outer and inner both have DateTime.UtcNow calls but one is translatable and the other is not? How do we keep track of that decision as all the queries are built?

How many more edge cases like this are lying out there that no one has thought of yet?

Before we can even begin to tackle it we will need a metadata store for all mapped functions (see my comment in #12527) in order to have the data we need to make the evaluation decisions.

A lot of the examples laid out in this thread are pretty contrived examples of taking things to the extreme (mine included). I think resources would be better spent trying to get more of the common DateTime methods translated for more providers. That will solve the majority of real use cases issues.

jzabroski commented 6 years ago

Take this query for example (it assumes the current released implementation of SqlLite without AddDays support)

db.Records .Where(x => x.Timestamp < DateTime.UtcNow) .Where(x => x.Timestamp < DateTime.UtcNow.AddDays(-1))

Why not just add Roslyn analysis to warn engineers of mixed eval?

Before we can even begin to tackle it we will need a metadata store for all mapped functions (see my comment in #12527) in order to have the data we need to make the evaluation decisions.

Probably also true to add a reliable Roslyn Analyzer. I've never written one but imagine with IoC it might be hard to statically infer much. In this example above, an analyzer can't suggest much without knowing the metadatastore. However, suppose the example was:

db.Records
  .Where(x => x.Timestamp < DateTime.UtcNow)
  .Where(x => x.Timestamp > DateTime.UtcNow.AddDays(-1))

Then the analyzer could suggest writing the above as a between statement, in addition to eliminating mixed eval.

Edit: I see that @ajcvickers has already scoped a Roslyn analyzer in #11659 , but the ticket doesn't explicitly scope out query analysis

pmiddleton commented 6 years ago

The list of which functions are mapped is not fully known until run time when the model is generated. This will preclude a Roslyn analyzer from being able to make any determinations.

jzabroski commented 6 years ago

@pmiddleton Preclude means "prevent from happening; make impossible."

You stated yourself that DateTime is such a concern and headache. Are you saying we can't write a Roslyn analyzer for:

DateTime methods
String methods
maybe one or two other functions I'm not thinking of

This would limit the usefulness but why wouldn't that alone be worth it? For functions not analyzable due to run-time model generation, why not emit Information-level severity?

Lastly, I must confess, I'm not an expert in Roslyn analyzers, but how does the basic Code Analysis in Visual Studio allow a CustomDictionary.xml? Couldn't EFCore follow a similar convention? Then architects could share their recommended "dictionaries" of "known server-side verbs" and "known server-side nouns" with others and override a plain default.

EDIT BELOW THIS LINE 7/18/2018 I found this StackOverflow Q&A helpful: What's the best way to make a Roslyn analyzer configurable?

You can pass additional files to the analyzers. These can then be reached from the analysis context. But this approach is not that developed yet in Roslyn. For example if the file changes, the analyzers are not notified about the change. For an example you can check out the SonarLint repository. Also, keep an eye on this GitHub issue, where the discussion is going on how parameters and data sharing should be done in the upcoming Roslyn version.

aspnet-hello commented 6 years ago

We periodically close 'discussion' issues that have not been updated in a long period of time.

We apologize if this causes any inconvenience. We ask that if you are still encountering an issue, please log a new issue with updated information and we will investigate.