duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License
22.88k stars 1.82k forks source link

Incorrect results might caused by `INDEX` #6860

Closed DerZc closed 1 year ago

DerZc commented 1 year ago

What happens?

Consider the following program

CREATE TABLE t0(c0 INTEGER);
INSERT INTO t0(c0) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);
CREATE INDEX i3 ON t0(c0);
UPDATE t0 SET c0=(0.1);
INSERT INTO t0(c0) VALUES (-1);

SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0; --32

SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>(SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0)); --32

there are only two values in t0.c0: 0 and -1, both of 0>>32 and -1>>32 equal to 0, so the second SELECT should has empty results. if I remove the CREATE INDEX statement, this query can have correct result.

To Reproduce

build the last version of source code, save this program in test.sql, then run duckdb < test.sql

OS:

ubuntu 22.04

DuckDB Version:

commit version b8cf6a98

DuckDB Client:

CLI

Full Name:

Chi Zhang

Affiliation:

Nanjing University, National University of Singapore

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

taniabogatsch commented 1 year ago

Hi @DerZc. Since your shift depends on the value of the row IDs, the result can indeed differ in the presence of an index. Without the index, the implicit cast to integer during the UPDATE, which does not change the values in c0, does not perform an actual update, so the row IDs stay the same during that statement. In the presence of the index, we do not notice that the update does not cause any changes, so we perform an INSERT of the new values (increasing the row IDs) and then a DELETE of the old values internally. So when shifting, we shift differently depending on the scenario.

I hope I understood your issue correctly and that this helps. Let me know if I missed something. :)

Without index.

CREATE TABLE t0(c0 INTEGER);
INSERT INTO t0(c0) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);
SELECT t0.rowid FROM t0;
┌───────┐
│ rowid │
│ int64 │
├───────┤
│     0 │
│     1 │
...
D UPDATE t0 SET c0=(0.1);
D SELECT t0.rowid FROM t0;
┌───────┐
│ rowid │
│ int64 │
├───────┤
│     0 │
│     1 │
...

With index.

CREATE TABLE t0(c0 INTEGER);
INSERT INTO t0(c0) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);
CREATE INDEX i3 ON t0(c0);
SELECT t0.rowid FROM t0;
┌───────┐
│ rowid │
│ int64 │
├───────┤
│     0 │
│     1 │
...
D UPDATE t0 SET c0=(0.1);
D SELECT t0.rowid FROM t0;
┌───────┐
│ rowid │
│ int64 │
├───────┤
│    16 │
│    17 │
...
DerZc commented 1 year ago

Hi @taniabogatsch , thank you for your reply!

There are two SELECT queries in this program, and the t0.rowid should be unchanged in these two queries.

The first query return 32, so the inner query in the second should also return 32. So the second query is equal to SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>(32)); as there are only 0 and -1 in t0.c0 and both of 0>>32 and -1>>0 equal to 0. because for this query SELECT t0.c0, t0.c0>>32 FROM t0; I get this results:

┌───────┬───────────────┐
│  c0   │ (t0.c0 >> 32) │
│ int32 │     int32     │
├───────┼───────────────┤
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│     0 │             0 │
│    -1 │             0 │
├───────┴───────────────┤
│ 17 rows     2 columns │
└───────────────────────┘

so the condition of the second query always return 0, which equal to FALSE. So the second query should has empty results.

I run this query SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>(32)); and get the result:

┌───────┐
│  c0   │
│ int64 │
├───────┤
│       │
└───────┘

But the current second query return 32.

taniabogatsch commented 1 year ago

Ah, I got you now. I will look into this.

CREATE TABLE t0(c0 INTEGER);
INSERT INTO t0(c0) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);
CREATE INDEX i3 ON t0(c0);
UPDATE t0 SET c0=(0.1);
INSERT INTO t0(c0) VALUES (-1);
SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0;
┌───────┐
│  c0   │
│ int64 │
├───────┤
│    32 │
└───────┘
SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>32);
┌───────┐
│  c0   │
│ int64 │
├───────┤
│       │
└───────┘
# The expected output should be' NULL' when copy-pasting the above query into this query (as a subquery).
SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>(SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0));
┌───────┐
│  c0   │
│ int64 │
├───────┤
│    32 │
└───────┘
taniabogatsch commented 1 year ago

Just putting some progress here, and providing a smaller reproducible example. This seems unrelated to the existence of the index. Instead, implicit conversions from INTEGER to BIGINT 'invalidate' the hard-coded 32-bit shift result.

┌─────────────┴─────────────┐                             
│           FILTER          │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│CAST((CAST(col AS BIGINT) >│                             
│  > SUBQUERY) AS BOOLEAN)  │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           EC: 0           │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│             1             │                             
│          (0.00s)          │                             
└─────────────┬─────────────┘ 
# here, our col is of type INTEGER
statement ok
CREATE TABLE tbl(col INTEGER);

statement ok
INSERT INTO tbl(col) VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0), (0);

statement ok
INSERT INTO tbl(col) VALUES (-1);

query I
SELECT tbl.rowid FROM tbl WHERE tbl.col;
----
32

query I
SELECT col FROM tbl WHERE tbl.col >> 32;
----

# rowid is of type BIGINT, so we cast tbl.col to BIGINT, too (we cast to the 'bigger' type)
# this 'incorrectly' outputs -1 instead of zero rows, because we now shift a BIGINT by 32 bits
query I
SELECT col FROM tbl WHERE tbl.col >> (SELECT tbl2.rowid FROM tbl AS tbl2 WHERE tbl2.col);
----

# an explicit cast fixes this
query I
SELECT col FROM tbl WHERE tbl.col >> (SELECT tbl2.rowid FROM tbl AS tbl2 WHERE tbl2.col)::INTEGER;
----
taniabogatsch commented 1 year ago

Accordingly, we can add a typecast to the initial example.

statement ok
CREATE TABLE t0(c0 INTEGER);

statement ok
INSERT INTO t0(c0) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);

statement ok
CREATE INDEX i3 ON t0(c0);

statement ok
UPDATE t0 SET c0=(0.1);

statement ok
INSERT INTO t0(c0) VALUES (-1);

query I
SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0;
----
32

# casting the subquery to INTEGER avoids implicitly converting t0.c0 to BIGINT, resulting in the expected shift result
query I
SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE ((t0.c0)>>(SELECT (MIN(t0.rowid)) AS c0 FROM t0 WHERE t0.c0)::INTEGER);
----
NULL
DerZc commented 1 year ago

I got it, thank you for your patience in explaining!