OrderBy is ignored when used with Distinct and then Concatenated

bittola commented 2 years ago

I'm selecting from few different tables, projecting to a common model object and concatenating them (translated as UNION ALL) to one single result collection. I also apply OrderBy to each select. Example:

var result = dbSetA.Where(...).AsModel().OrderBy(x => x.Name)
.Concat(dbSetB.Where(...).AsModel().OrderBy(x => x.Name))
.ToList();

This example works as expected. All subqueries are order by name and then UNION ALL is applied. the result is [Sorted result of dbSetA] + [Sorted result of dbSetB]

However when I apply Distinct (doesn't matter on what position in the subquery), the OrderBy is ignored (removed from translated query) example:

var result = dbSetA.Where(...).AsModel().Dictinct().OrderBy(x => x.Name)
.Concat(dbSetB.Where(...).AsModel().Dictinct().OrderBy(x => x.Name))
.ToList();

In this case the translated subqueries contain Distinct but no Order By, so the result is [Unsorted distinct result of dbSetA] + [Unsorted distinct result of dbSetB]

I believe this is a bug.

The only workaround I was able to find was to execute subqueries separately and merge using List methods: example:

var result =  dbSetA.Where(...).AsModel().Dictinct().OrderBy(x => x.Name).ToList();
result.AddRange(dbSetB.Where(...).AsModel().Dictinct().OrderBy(x => x.Name).ToList());

smitpatel commented 2 years ago

See https://github.com/dotnet/efcore/issues/25696 & https://github.com/dotnet/efcore/issues/23267

bittola commented 2 years ago

See #25696 & #23267

Those are not related!!! Please do a better job in triaging

ajcvickers commented 2 years ago

@roji to investigate.

roji commented 2 years ago

tl;dr there seems to be different behavior across databases on this:

SQL Server generally doesn't support ordering in set operation operands - the above query doesn't seem translatable.
~PostgreSQL does seem to support this, and apparently preserves ordering with UNION ALL (but not with UNION) (link).~ There's apparently no guarantee for this, see comment below.

At least for SQL Server, we should probably warn/error. ~Ideally we wouldn't block this for PG where it works, not sure how easy that is.~

SQL Server investigation

SQL Server doesn't support ordering in the operands of set operations, only on the result of the set operation:

DROP TABLE IF EXISTS data;
CREATE TABLE data (
    id int IDENTITY PRIMARY KEY,
    foo int UNIQUE
);
INSERT INTO data (foo) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

-- Fails with incorrect syntax:
SELECT id FROM data ORDER BY id
UNION ALL
SELECT id FROM data;

-- Works, order by is applied on the result of the set operation:
SELECT id FROM data
UNION ALL
SELECT id FROM data
ORDER BY id;

An operand can be a subquery, but ordering isn't preserved by UNION ALL:

SELECT * FROM (SELECT TOP 100 PERCENT id FROM data ORDER BY id DESC) y
UNION ALL
SELECT * FROM (SELECT TOP 5 id FROM data ORDER BY id DESC) y;

Output:

Note that the first series is ascending, although we've specified DESC in the subquery.

roji commented 2 years ago

An update to this: I got confirmation from a PostgreSQL maintainer that UNION ALL also doesn't guarantee ordering. So basically all set operations are exactly like DISTINCT in that they remove ordering.

ajcvickers commented 2 years ago

@roji So we should probably throw during translation?

roji commented 2 years ago

Yeah, we should do whatever we do for OrderBy before Distinct (do we throw, warn?)

stevendarby commented 2 years ago

I think you can order by if you add offset 0 rows in the sub query - a trick I've used before but not sure if there are downsides / if it applies in this case.

bittola commented 2 years ago

TOP 100 PERCENT Doesn't work

But TOP N works and the subset is sorted.

CREATE TABLE temp.Temp1
(
    Id INT PRIMARY KEY,
    Name NVARCHAR(20)
);

INSERT INTO temp.Temp1
 VALUES (1, 'Beta'), (2, 'Alpha'), (3, 'Gamma'), (4, 'Delta')

CREATE TABLE temp.Temp2
(
    Id INT PRIMARY KEY,
    Name NVARCHAR(20)
);

INSERT INTO temp.Temp2
 VALUES (1, 'Gamma'), (2, 'Alpha'), (3, 'Delta'), (4, 'Beta')

SELECT t.Name FROM (SELECT TOP 100 * FROM temp.Temp1 ORDER BY Name DESC) t
UNION ALL
SELECT t2.Name FROM (SELECT TOP 100  * FROM temp.Temp2  ORDER BY Name DESC) t2

And the result is Gamma Delta Beta Alpha Gamma Delta Beta Alpha

roji commented 2 years ago

@bittola the point is that it may happen to work for your specific example, but no guarantee is provided from the database. That means that results could come back unordered based on any number of factors - some index change, the planner deciding to execute the query in parallel, etc.

In other words, the database doesn't support this, and so we can't translate that LINQ query...

bittola commented 2 years ago

@bittola the point is that it may happen to work for your specific example, but no guarantee is provided from the database. That means that results could come back unordered based on any number of factors - some index change, the planner deciding to execute the query in parallel, etc.

In other words, the database doesn't support this, and so we can't translate that LINQ query...

@roji

SELECT t.Name FROM (SELECT TOP 100 * FROM temp.Temp1 ORDER BY Name DESC) t
UNION ALL
SELECT t2.Name FROM (SELECT TOP 100  * FROM temp.Temp2  ORDER BY Name DESC) t2

is a valid SQL expression and therefore it should be allowed to be translated to. Should be up to a developer to decide whether it's legit for his environment

ajcvickers commented 2 years ago

@roji

Yeah, we should do whatever we do for OrderBy before Distinct (do we throw, warn?)

The difference is that Distinct and Union in LINQ can always discard ordering, but I'm not convinced the same if true for Concat.

roji commented 2 years ago

is a valid SQL expression and therefore it should be allowed to be translated to. Should be up to a developer to decide whether it's legit for his environment

The fact that an SQL expression is valid doesn't mean that it's a valid translation for a LINQ query that the user provides. In the LINQ query, the OrderBy guarantees the ordering, whereas in SQL, the ORDER BY is sometimes taken into account, sometimes not. That's a crucial behavioral discrepancy, even if superficially both LINQ and SQL have the words "order by" in them.

@ajcvickers

The difference is that Distinct and Union in LINQ can always discard ordering, but I'm not convinced the same if true for Concat

DISTINCT and UNION also may preserve ordering in some cases; maybe they do that in less cases than UNION ALL, but we can see clearly that the ordering isn't guaranteed... Wouldn't it be an incorrect translation?

Think of it from the user's perspective - this would be the worst kind of bug, where the user specifies OrderBy and it happens to sort, so they assume it always well. Then, at some future point something changes and boom...

stevendarby commented 2 years ago

One thing I would like to clarify. In this SQL, posted above:

SELECT * FROM (SELECT TOP 100 PERCENT id FROM data ORDER BY id DESC) y
UNION ALL
SELECT * FROM (SELECT TOP 5 id FROM data ORDER BY id DESC) y;

It is not the case that the UNION ALL that breaks the ordering, it's just that the ORDER BY is ignored in those sub-queries using TOP 100 PERCENT. A quirk of SQL Server. You can see this if you just run the top one (remember, these are all DESC, so you expect 10 at the top)

SELECT * FROM (SELECT TOP 100 PERCENT id FROM data ORDER BY id DESC) y

A way you can get ordering working in sub-queries is to use OFFSET 0 ROWS instead:

SELECT * FROM (SELECT id FROM data ORDER BY id DESC OFFSET 0 ROWS) y

For me, the order is also preserved when doing a UNION ALL:

SELECT * FROM (SELECT id FROM data ORDER BY id DESC OFFSET 0 ROWS) y
UNION ALL
SELECT * FROM (SELECT id FROM data ORDER BY id DESC OFFSET 0 ROWS) y

However, I understand there is a doubt about whether UNION ALL preservers the order - I just wanted to explain it probably wasn't the UNION ALL breaking the ordering in your examples. I hope info may be useful in general or in arriving at a solution.

roji commented 2 years ago

[...] A quirk of SQL Server

@stevendarby the fact that ordering isn't preserved for rows coming out of subqueries isn't a quirk: it's a fundamental point in the SQL language in general... A good way to see this is the following:

SELECT * FROM (SELECT 1 AS foo ORDER BY foo) AS bar;

Trying to run this errors with: The ORDER BY clause is invalid ... unless TOP, OFFSET or FOR XML is also specified. In other words, SQL Server only allows ORDER BY in subqueries for the purpose of determining which rows come out, and not how they're ordered.

A way you can get ordering working in sub-queries is to use OFFSET 0 ROWS instead:

If you see ordered rows coming out, that's not guaranteed; it's accidental and could change, and programs cannot rely on it.

You're right that above, the loss of order is related more to the subquery and less to UNION ALL specifically, but it doesn't change the fact that there's no way to guarantee that ordering gets preserved in the operands of UNION ALL.

stevendarby commented 2 years ago

Thanks! SQL Server could do better in explaining this fundamental point in the error given (your explanation was much better). I’ve seen many including myself workaround that error by doing OFFSET 0 ROWS - which appears to preserve order - not realising this is just a lucky implementation detail.

roji commented 2 years ago

Yeah... Their intention wasn't bad - they don't allow ORDER BY in subqueries without top/offset precisely to "communicate" that it doesn't make sense. But then you can just work around that restriction with OFFSET 0, so... :)

For a bit of an anecdote... EF Core has very extensive "specification tests" which any provider can (and should) implement; these ensure that the provider is behaving correctly etc. Every release there's a handful of new tests which implicitly depend on ordering which isn't guaranteed, but which happens to be good on SQL Server. When I run those tests on PostgreSQL they fail, since PostgreSQL tends to not preserve behavior and move things around, when order isn't explicitly specified. Just a bit of fun I go through.

roji commented 2 years ago

Triage: we should warn when OrderBy is specified before any set operation (like we do before Distinct), since relational databases do not guarantee them at all.

The Concat case seems particularly bad, since the enumerable implementation does preserve ordering (whereas the other set operations most probably do not).

I would strongly suggest considering throwing by default (and allowing opting out); the consequences of users assuming order preservation and getting surprised at runtime in production are far worse than the opposite direction. If we do this, working code (where ordering happens to be preserved without the guarantee, or where ordering doesn't actually matter) would start to throw; but we'd guide users to either remove the ordering or suppress the error, which seems very trivial.

dotnet / efcore

OrderBy is ignored when used with Distinct and then Concatenated #28399

SQL Server investigation