Closed taylorchu closed 1 year ago
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#functions-support we should also make use of this: hasTokenCaseInsensitive. But it only works with lower instead of lowerUtf8.
@taylorchu, please open an issue regarding the linked comment issue in the main repo.
As for now, hasToken
does not seem to be a direct replacement for LIKE %...%
for all the use cases - some MB tests do not pass.
What if we tried position
and positionCaseInsensitiveUTF8
as temporary workaround? Will it work in your use case?
EDIT: I checked with the example from the comment. position
will work faster than LIKE %...%
, but it still does not use the index properly. As of now, we can release a new version using at least startsWith
/endsWith
while investigating what is possible to do with :contains
.
sgtm, also could you share some failed cases?
my understanding is that hasToken will generate some false negatives.
we use contain
a lot. I wonder if there is a possibility to check if hasToken returns true first; if not, use position. This should work around false negatives (it is a bit ugly, and I wish there is a generic "contains" in clickhouse).
@taylorchu, failing tests:
FAIL in metabase.api.field-test/search-values-with-field-same-as-search-field-test (field_test.clj:693)
make sure it also works if you use the same Field twice
:clickhouse
expected: [["Fred 62" "Fred 62"] ["Red Medicine" "Red Medicine"]]
actual: (["Red Medicine" "Red Medicine"])
https://github.com/metabase/metabase/blob/v0.46.5/test/metabase/api/field_test.clj#L690-L703
FAIL in metabase.api.field-test/search-values-test (field_test.clj:660) 8
make sure `search-values` works on with our various drivers
:clickhouse
expected: [[1 "Red Medicine"] [10 "Fred 62"]]
actual: [[1 "Red Medicine"]]
diff: - [nil [10 "Fred 62"]]
https://github.com/metabase/metabase/blob/v0.46.5/test/metabase/api/field_test.clj#L657-L666
Explanation: hasToken(str)
does not work here cause Fred 62
does not have a token red
.
Regarding possible hasToken
or position
workaround: it will basically invalidate the index usage.
bee173f4c51c :) select Count(*) from MY_TABLE where hasToken(Route, '3119550599')
SELECT Count(*)
FROM MY_TABLE
WHERE hasToken(Route, '3119550599')
┌─count()─┐
│ 2 │
└─────────┘
1 row in set. Elapsed: 0.083 sec. Processed 48.26 thousand rows, 5.58 MB (579.45 thousand rows/s., 67.02 MB/s.)
bee173f4c51c :) select Count(*) from MY_TABLE where hasToken(Route, '3119550599') or position(Route, '3119550599') > 0
SELECT Count(*)
FROM MY_TABLE
WHERE hasToken(Route, '3119550599') OR (position(Route, '3119550599') > 0)
┌─count()─┐
│ 2 │
└─────────┘
1 row in set. Elapsed: 0.125 sec. Processed 100.10 million rows, 3.21 GB (798.99 million rows/s., 25.63 GB/s.)
1.1.7 will include startsWith
/endsWith
usage; as for hasToken
specific scenarios, I think the only workaround is using the raw query editor for the heaviest queries for now.
@taylorchu, I see a response in the linked issue. Could you try an inverted index for your use case instead of tokenbf_v1?
I saw and tried inverted index for our log storage use case. it is an experimental feature, which crashed prod clickhouse, but in staging it is fine. clickhouse inverted index probably needs more work.
I also tried ngrambf, but the false positive rate is high.
For our use case, contains does not need to consider subword. ie. red
will not need to match fred
. it just needs to match Red, red carpet, a red
etc. I don't know whether there is a special mode in this driver to support this special version of "contains".
@taylorchu, for :contains
, is using raw queries in the editor an option in your use case? I see no clear solution here on the driver's side.
no, it is too cumbersome to use. I wish there is an env var that we can set to switch between LIKE
and hasToken. Of course, hasToken is opted in.
@taylorchu, I will check what is possible with advanced UI options in the driver configuration.
any update? thanks!
@taylorchu, yes, PR is up: https://github.com/ClickHouse/metabase-clickhouse-driver/pull/176 We will release 1.1.8 on ~Monday.
thanks!
See the discussion: https://github.com/ClickHouse/metabase-clickhouse-driver/pull/176
https://github.com/ClickHouse/ClickHouse/issues/15835#issuecomment-706711503
We try to speed up a string column with tokenbf_v1.
Unfortunately, for this driver, it converts metabase contains, startswith, endswith...etc to clickhouse
LIKE with %
. However, clickhouseLIKE
only works if there is a delimiter.