cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.58k stars 3.71k forks source link

opt: incorrect query plans for trigram similarity filters when `pg_trgm.similarity_threshold=0` #122443

Open mgartner opened 3 months ago

mgartner commented 3 months ago

The optimizer produces incorrect query plans for queries with trigram similarity filters, e.g., col % 'foo', when pg_trgm.similarity_threshold is set to 0. These incorrect query plans produce incorrect results. For example:

CREATE TABLE t (
  k INT PRIMARY KEY,
  s STRING,
  INVERTED INDEX i (s gin_trgm_ops)
);

INSERT INTO t VALUES (1, 'foo'), (2, 'bar');

SET pg_trgm.similarity_threshold =  0;

SELECT * FROM t@primary WHERE s % 'foo';
--   k |  s
-- ----+------
--   1 | foo
--   2 | bar
-- (2 rows)

SELECT * FROM t@i WHERE s % 'foo';
--   k |  s
-- ----+------
--   1 | foo
-- (1 row)

When pg_trgm.similarity_threshold is set to 0, all strings are similar:


SET pg_trgm.similarity_threshold =  0;

SELECT 'foo' % 'bar';
--   ?column?
-- ------------
--      t
-- (1 row)

SELECT 'foo' % '';
--   ?column?
-- ------------
--      t
-- (1 row)

SELECT '' % '';
--   ?column?
-- ------------
--      t
-- (1 row)

Therefore, there is no subset of trigrams in the inverted index we can scan in order to find all rows matching the filter. We'd need to scan the entire inverted index. But that would be less efficient than scanning the entire primary index, so we simply shouldn't plan an inverted index scan at all.

Jira issue: CRDB-37893

mgartner commented 3 months ago

This was discovered by @michae2 here: https://github.com/cockroachdb/cockroach/pull/121973#pullrequestreview-1993090966

mgartner commented 3 months ago

This affects versions v22.2.0+. It has existed since trigram inverted indexes were introduced.