ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end
Apache License 2.0
476 stars 92 forks source link

Use startsWith, endsWith, and hasToken instead of LIKE #173

Closed taylorchu closed 1 year ago

taylorchu commented 1 year ago

https://github.com/ClickHouse/ClickHouse/issues/15835#issuecomment-706711503

We try to speed up a string column with tokenbf_v1.

Unfortunately, for this driver, it converts metabase contains, startswith, endswith...etc to clickhouse LIKE with %. However, clickhouse LIKE only works if there is a delimiter.

taylorchu commented 1 year ago

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#functions-support we should also make use of this: hasTokenCaseInsensitive. But it only works with lower instead of lowerUtf8.

slvrtrn commented 1 year ago

@taylorchu, please open an issue regarding the linked comment issue in the main repo.

As for now, hasToken does not seem to be a direct replacement for LIKE %...% for all the use cases - some MB tests do not pass.

What if we tried position and positionCaseInsensitiveUTF8 as temporary workaround? Will it work in your use case?

EDIT: I checked with the example from the comment. position will work faster than LIKE %...%, but it still does not use the index properly. As of now, we can release a new version using at least startsWith/endsWith while investigating what is possible to do with :contains.

taylorchu commented 1 year ago

sgtm, also could you share some failed cases?

my understanding is that hasToken will generate some false negatives.

taylorchu commented 1 year ago

we use contain a lot. I wonder if there is a possibility to check if hasToken returns true first; if not, use position. This should work around false negatives (it is a bit ugly, and I wish there is a generic "contains" in clickhouse).

slvrtrn commented 1 year ago

@taylorchu, failing tests:

FAIL in metabase.api.field-test/search-values-with-field-same-as-search-field-test (field_test.clj:693)
make sure it also works if you use the same Field twice 
:clickhouse
expected: [["Fred 62" "Fred 62"] ["Red Medicine" "Red Medicine"]]
  actual: (["Red Medicine" "Red Medicine"])

https://github.com/metabase/metabase/blob/v0.46.5/test/metabase/api/field_test.clj#L690-L703

FAIL in metabase.api.field-test/search-values-test (field_test.clj:660)         8
make sure `search-values` works on with our various drivers 
:clickhouse
expected: [[1 "Red Medicine"] [10 "Fred 62"]]
  actual: [[1 "Red Medicine"]]
    diff: - [nil [10 "Fred 62"]]

https://github.com/metabase/metabase/blob/v0.46.5/test/metabase/api/field_test.clj#L657-L666

Explanation: hasToken(str) does not work here cause Fred 62 does not have a token red.

Regarding possible hasToken or position workaround: it will basically invalidate the index usage.

bee173f4c51c :) select Count(*) from MY_TABLE where hasToken(Route, '3119550599')

SELECT Count(*)
FROM MY_TABLE
WHERE hasToken(Route, '3119550599')

┌─count()─┐
│       2 │
└─────────┘

1 row in set. Elapsed: 0.083 sec. Processed 48.26 thousand rows, 5.58 MB (579.45 thousand rows/s., 67.02 MB/s.)

bee173f4c51c :) select Count(*) from MY_TABLE where hasToken(Route, '3119550599') or position(Route, '3119550599') > 0

SELECT Count(*)
FROM MY_TABLE
WHERE hasToken(Route, '3119550599') OR (position(Route, '3119550599') > 0)

┌─count()─┐
│       2 │
└─────────┘

1 row in set. Elapsed: 0.125 sec. Processed 100.10 million rows, 3.21 GB (798.99 million rows/s., 25.63 GB/s.)

1.1.7 will include startsWith/endsWith usage; as for hasToken specific scenarios, I think the only workaround is using the raw query editor for the heaviest queries for now.

slvrtrn commented 1 year ago

@taylorchu, I see a response in the linked issue. Could you try an inverted index for your use case instead of tokenbf_v1?

taylorchu commented 1 year ago

I saw and tried inverted index for our log storage use case. it is an experimental feature, which crashed prod clickhouse, but in staging it is fine. clickhouse inverted index probably needs more work.

I also tried ngrambf, but the false positive rate is high.

For our use case, contains does not need to consider subword. ie. red will not need to match fred. it just needs to match Red, red carpet, a red etc. I don't know whether there is a special mode in this driver to support this special version of "contains".

slvrtrn commented 1 year ago

@taylorchu, for :contains, is using raw queries in the editor an option in your use case? I see no clear solution here on the driver's side.

taylorchu commented 1 year ago

no, it is too cumbersome to use. I wish there is an env var that we can set to switch between LIKE and hasToken. Of course, hasToken is opted in.

slvrtrn commented 1 year ago

@taylorchu, I will check what is possible with advanced UI options in the driver configuration.

taylorchu commented 1 year ago

any update? thanks!

slvrtrn commented 1 year ago

@taylorchu, yes, PR is up: https://github.com/ClickHouse/metabase-clickhouse-driver/pull/176 We will release 1.1.8 on ~Monday.

taylorchu commented 1 year ago

thanks!

slvrtrn commented 1 year ago

See the discussion: https://github.com/ClickHouse/metabase-clickhouse-driver/pull/176