Prefer equality when comparing values

ranma42 commented 2 months ago

Most databases can take advantage of indexes (and in some cases even auto-generate them) when queries perform equality comparisons (instead of inequality).

In some places EFCore already tries to lean towards equality https://github.com/dotnet/efcore/blob/20edb631265527a73cb9b9ab474cd0484c49a771/src/EFCore.Relational/Query/SqlExpressionFactory.cs#L657-L666

In other places it disregards this and in fact can even replace equality comparisons with inequalities https://github.com/dotnet/efcore/blob/20edb631265527a73cb9b9ab474cd0484c49a771/src/EFCore.Relational/Query/SqlNullabilityProcessor.cs#L1748-L1757 https://github.com/dotnet/efcore/blob/20edb631265527a73cb9b9ab474cd0484c49a771/src/EFCore.Relational/Query/SqlNullabilityProcessor.cs#L1803-L1804

It would be better to avoid converting equalities into inequalities and, when possible, convert inequalities into equalities.

ranma42 commented 1 month ago

On Sqlite

create table test (a, b);

inequality has a much shorter plan:

explain select * from test as t1, test as t2 where t1.a != t2.b;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     16    0                    0
1     OpenRead       0     2     0     2              0
2     OpenRead       1     2     0     2              0
3     Rewind         0     15    0                    0
4       Rewind         1     15    0                    0
5         Column         0     0     1                    0
6         Column         1     1     2                    0
7         Eq             2     13    1     BINARY-8       81
8         Column         0     0     3                    0
9         Column         0     1     4                    0
10        Column         1     0     5                    0
11        Column         1     1     6                    0
12        ResultRow      3     4     0                    0
13      Next           1     5     0                    1
14    Next           0     4     0                    1
15    Halt           0     0     0                    0
16    Transaction    0     0     1     0              1
17    Goto           0     1     0                    0

but it performs a full table scan of the whole table for each row.

Equal-to-not has a more complex plan:

explain select * from test as t1, test as t2 where t1.a == NOT(t2.b);
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     30    0                    0
1     OpenRead       1     2     0     2              0
2     OpenRead       0     2     0     2              0
3     Rewind         1     29    0                    0
4       Once           0     16    0                    0
5       OpenAutoindex  2     3     0     k(3,B,,)       0
6       Explain        6     0     0     BLOOM FILTER ON t1 (a=?) 0
7       Blob           10000 1     0                    0
8       Rewind         0     16    0                    0
9         Column         0     0     3                    0
10        Column         0     1     4                    0
11        Rowid          0     5     0                    0
12        MakeRecord     3     3     2                    0
13        FilterAdd      1     0     3     1              0
14        IdxInsert      2     2     0                    16
15      Next           0     9     0                    3
16      Column         1     1     2                    0
17      Not            2     6     0                    0
18      IsNull         6     28    0                    0
19      Filter         1     28    6     1              0
20      SeekGE         2     28    6     1              0
21        IdxGT          2     28    6     1              0
22        Column         2     0     7                    0
23        Column         2     1     8                    0
24        Column         1     0     9                    0
25        Column         1     1     10                   0
26        ResultRow      7     4     0                    0
27      Next           2     21    0                    0
28    Next           1     4     0                    1
29    Halt           0     0     0                    0
30    Transaction    0     0     1     0              1
31    Goto           0     1     0                    0

which automatically creates a temporary index for column a and takes advantage of the index.

ranma42 commented 1 month ago

IIUC similar results also hold true in postgres:

create table test (a boolean, b boolean);

explain select * from test as t1, test as t2 where t1.a != t2.b;

QUERY PLAN                                                                     |
-------------------------------------------------------------------------------+
Nested Loop  (cost=0.00..111057.20 rows=3699200 width=4)                       |
  Join Filter: (t1.a <> t2.b)                                                  |
  ->  Seq Scan on test t1  (cost=0.00..37.20 rows=2720 width=2)                |
  ->  Materialize  (cost=0.00..50.80 rows=2720 width=2)                        |
        ->  Seq Scan on test t2  (cost=0.00..37.20 rows=2720 width=2)          |
JIT:                                                                           |
  Functions: 6                                                                 |
  Options: Inlining false, Optimization false, Expressions true, Deforming true|

explain select * from test as t1, test as t2 where t1.a = NOT(t2.b);

QUERY PLAN                                                           |
---------------------------------------------------------------------+
Hash Join  (cost=71.20..41731.20 rows=3699200 width=4)               |
  Hash Cond: (t1.a = (NOT t2.b))                                     |
  ->  Seq Scan on test t1  (cost=0.00..37.20 rows=2720 width=2)      |
  ->  Hash  (cost=37.20..37.20 rows=2720 width=2)                    |
        ->  Seq Scan on test t2  (cost=0.00..37.20 rows=2720 width=2)|

ranma42 commented 1 month ago

SqlServer 2022:

create table test (a bit, b bit);

SET SHOWPLAN_ALL ON;

StmtText                                                                                                           |StmtId|NodeId|Parent|PhysicalOp  |LogicalOp |Argument                                                                            |DefinedValues     |EstimateRows|EstimateIO|EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                            |Warnings|Type    |Parallel|EstimateExecutions|
-------------------------------------------------------------------------------------------------------------------+------+------+------+------------+----------+------------------------------------------------------------------------------------+------------------+------------+----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
select * from test as t1, test as t2 where t1.a != t2.b;                                                           |     1|     1|     0|            |          |1                                                                                   |                  |         1.0|          |           |          |      0.00657068|                                      |        |SELECT  |       0|                  |
  |--Nested Loops(Inner Join, WHERE:([master].[dbo].[test].[a] as [t1].[a]<>[master].[dbo].[test].[b] as [t2].[b]))|     1|     2|     1|Nested Loops|Inner Join|WHERE:([master].[dbo].[test].[a] as [t1].[a]<>[master].[dbo].[test].[b] as [t2].[b])|                  |         1.0|       0.0| 0.00000418|         9|      0.00657068|[t1].[a], [t1].[b], [t2].[a], [t2].[b]|        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t2]))                                                       |     1|     3|     2|Table Scan  |Table Scan|OBJECT:([master].[dbo].[test] AS [t2])                                              |[t2].[a], [t2].[b]|         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t2].[a], [t2].[b]                    |        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t1]))                                                       |     1|     4|     2|Table Scan  |Table Scan|OBJECT:([master].[dbo].[test] AS [t1])                                              |[t1].[a], [t1].[b]|         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t1].[a], [t1].[b]                    |        |PLAN_ROW|       0|               1.0|

StmtText                                                                                                              |StmtId|NodeId|Parent|PhysicalOp    |LogicalOp     |Argument                                                                                 |DefinedValues                                    |EstimateRows|EstimateIO|EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                            |Warnings|Type    |Parallel|EstimateExecutions|
----------------------------------------------------------------------------------------------------------------------+------+------+------+--------------+--------------+-----------------------------------------------------------------------------------------+-------------------------------------------------+------------+----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
select * from test as t1, test as t2 where t1.a = ~(t2.b)                                                             |     1|     1|     0|              |              |1                                                                                        |                                                 |         1.0|          |           |          |     0.024334392|                                      |        |SELECT  |       0|                  |
  |--Hash Match(Inner Join, HASH:([Expr1004])=([t1].[a]), RESIDUAL:([master].[dbo].[test].[a] as [t1].[a]=[Expr1004]))|     1|     2|     1|Hash Match    |Inner Join    |HASH:([Expr1004])=([t1].[a]), RESIDUAL:([master].[dbo].[test].[a] as [t1].[a]=[Expr1004])|                                                 |         1.0|       0.0|0.017765092|         9|     0.024334392|[t1].[a], [t1].[b], [t2].[a], [t2].[b]|        |PLAN_ROW|       0|               1.0|
       |--Compute Scalar(DEFINE:([Expr1004]=~[master].[dbo].[test].[b] as [t2].[b]))                                  |     1|     3|     2|Compute Scalar|Compute Scalar|DEFINE:([Expr1004]=~[master].[dbo].[test].[b] as [t2].[b])                               |[Expr1004]=~[master].[dbo].[test].[b] as [t2].[b]|         1.0|       0.0|  0.0000001|         9|       0.0032832|[t2].[a], [t2].[b], [Expr1004]        |        |PLAN_ROW|       0|               1.0|
       |    |--Table Scan(OBJECT:([master].[dbo].[test] AS [t2]))                                                     |     1|     4|     3|Table Scan    |Table Scan    |OBJECT:([master].[dbo].[test] AS [t2])                                                   |[t2].[a], [t2].[b]                               |         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t2].[a], [t2].[b]                    |        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t1]))                                                          |     1|     5|     2|Table Scan    |Table Scan    |OBJECT:([master].[dbo].[test] AS [t1])                                                   |[t1].[a], [t1].[b]                               |         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t1].[a], [t1].[b]                    |        |PLAN_ROW|       0|               1.0|

ranma42 commented 1 month ago

An example session showing the performance difference on Sqlite:

SQLite version 3.46.0 2024-05-23 13:25:27
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .timer on
sqlite> create table test (a, b);
Run Time: real 0.001 user 0.000172 sys 0.000000
sqlite> insert into test (a,b) select value & 1, NOT(value & 2) from generate_series(1, 10000);
Run Time: real 0.001 user 0.000942 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
50000000
Run Time: real 3.297 user 3.366575 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
50000000
Run Time: real 1.150 user 1.174196 sys 0.000000
sqlite> insert into test (a,b) select value & 1, NOT(value & 2) from generate_series(1, 10000);
Run Time: real 0.001 user 0.000879 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
200000000
Run Time: real 13.137 user 13.424913 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
200000000
Run Time: real 5.320 user 4.663320 sys 0.000000
sqlite> delete from test;
Run Time: real 0.000 user 0.000198 sys 0.000000
sqlite> insert into test (a,b) select
    CASE random() % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END,
    CASE random() % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END
from generate_series(1, 20000);
Run Time: real 0.004 user 0.003828 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
968190
Run Time: real 12.874 user 12.323147 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
968190
Run Time: real 0.030 user 0.031159 sys 0.000000

TL;DR: about 3 times faster for a 50/50 distribution of true/false; much faster for 5/5/90 false/true/null

ranma42 commented 1 month ago

A similar experiment on SqlServer 2022:

set statistics time ON

create table test (a BIT, b BIT);

insert into test (a,b) select value & 1, CAST(value & 2 AS BIT) from generate_series(1, 10000000);

select COUNT_BIG (*) from test as t1, test as t2 where t1.a != t2.b;
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 4027 ms,  elapsed time = 2281 ms.

select COUNT_BIG (*) from test as t1, test as t2 where t1.a = ~(t2.b);
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 3578 ms,  elapsed time = 198 ms.

delete from test;
insert into test (a,b) select
    CASE value % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END,
    CASE (value / 20) % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END
from generate_series(1, 10000000);

select COUNT_BIG (*) from test as t1, test as t2 where t1.a != t2.b;
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 5176 ms,  elapsed time = 3350 ms.

select COUNT_BIG (*) from test as t1, test as t2 where t1.a = ~(t2.b);
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 3682 ms,  elapsed time = 199 ms.

TL;DR: SqlServer saves about 15% of the CPU time when filtering with = and it apparently can run it in parallel (10x faster wall-time).

ErikEJ commented 1 month ago

I see the same perf pattern on my informal test based on the script above!

ranma42 commented 1 month ago

Ah, I mentioned it in the comment https://github.com/dotnet/efcore/pull/34166#discussion_r1675762053 but maybe it should also be repeated here: these benchmarks are completely synthetic and do not represent a real-world workload. They are meant to demonstrate that a = NOT(b) is usually optimized better than <>. One might expect these optimizations to be automatically handled by strictly-typed DB engines, but apparently that is not the case.

dotnet / efcore

Prefer equality when comparing values #34164