ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end
Apache License 2.0
461 stars 84 forks source link

Switch betwen LIKE and hasToken for :contains #176

Closed slvrtrn closed 9 months ago

slvrtrn commented 1 year ago

Summary

Resolves #173 Adds an option to switch between LIKE %str% (default behavior) and hasToken via an advanced UI setting (the description does not look very good, but I couldn't do anything about it).

image

Checklist

taylorchu commented 1 year ago

could you add a test case for separator? you might need to extract tokens from needle, and chain all with AND hasToken

clickhouse-logging-0.clickhouse-logging-cluster.production.svc.cluster.local :) select count(1) from app.logs where hasToken(message, 'test test')

SELECT count(1)
FROM app.logs
WHERE hasToken(message, 'test test')

Query id: e3fa6d5b-b6db-4f0a-a3f0-5853a37392aa

0 rows in set. Elapsed: 2.392 sec.

Received exception from server (version 23.5.1):
Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Needle must not contain whitespace or separator characters: while executing 'FUNCTION hasToken(message :: 0, 'test test' :: 1) -> hasToken(message, 'test test') UInt8 : 2'. (BAD_ARGUMENTS)
slvrtrn commented 1 year ago

@taylorchu, if I understand it correctly (split foo bar into hasToken(foo) AND hasToken(bar)), I don't think this is necessary for the driver, as it is possible to chain filters in the UI:

image

taylorchu commented 1 year ago

it is inconvenient to use because for LIKE the needle can be any string, but for hasToken work, it has to be manually split in this pr. I think splitting should be easy?

slvrtrn commented 1 year ago

@taylorchu,

but for hasToken work, it has to be manually split in this pr

It is likely that I still don't get it. Can you please provide an example where just two or more "contains" filters won't work?

taylorchu commented 1 year ago

we often search logs with needle like "app_version": "7ed76f5" or "jid": "l18dlrtvyXFqt13Db7XdL23R". In this pr, even if it is on, hasToken will return errors because it contains separator.

It is an usability issue to manually split token because we don't previously have to do that with LIKE (ie. it is implementation detail/limitation for hasToken.)

slvrtrn commented 1 year ago

@taylorchu, have you considered creating a materialized view with those properties extracted from the JSON blob as separate fields?

taylorchu commented 1 year ago

Yes but there are columns that do not make sense to be extracted

slvrtrn commented 1 year ago

@taylorchu, I am a bit concerned about adding extra split logic as this is way too specific (beyond hasToken/LIKE switch feature itself, as it definitely has some general use).

There are also some concerning corner cases (what to consider a separator? Space or comma, or something else? Do we need a configuration? Are we just going to split by some predefined known separators?).

I suggest we merge this as-is, release 1.1.8, close #173, and open a new issue regarding possible enhancements to the hasToken behavior. If it is easy to implement, the logic is sound, and can be thoroughly tested, we can add it to the driver.

One more thing: we welcome community contributions, and I want to encourage you to try to make a PoC PR for that feature on top of this one, as you have much more information about a possible use case of that enhancement. And I am always available in the community Slack in case you need help with your contribution.

Please let me know what you think. Cheers.

taylorchu commented 1 year ago

The definition of separator is from clickhouse's tokens function. (any non-alphanumeric chars).

While I do not think it is hard to add, I am not familiar with how metabase driver is implemented.

slvrtrn commented 1 year ago

@taylorchu, considering your scenario

"app_version": "7ed76f5" or "jid": "l18dlrtvyXFqt13Db7XdL23R"

SELECT tokens('"app_version": "7ed76f5"') AS tokens

┌─tokens──────────────────────┐
│ ['app','version','7ed76f5'] │
└─────────────────────────────┘

which means that there might be false positives (for example, if you had a field "name": "app" as well, that will be matched?).

UUIDs will be split as well:

SELECT tokens('"app_version": "5f88ebd0-1beb-11ee-be56-0242ac120002"') AS tokens

┌─tokens───────────────────────────────────────────────────────────┐
│ ['app','version','5f88ebd0','1beb','11ee','be56','0242ac120002'] │
└──────────────────────────────────────────────────────────────────┘

which looks odd.

taylorchu commented 1 year ago

Yeah, this can be better if tokenbf supports custom tokenizers but it is already more efficient than ngrambf

slvrtrn commented 9 months ago

After several attempts to fix this, I found no clean way to implement this feature correctly. Unfortunately, this particular scenario will have to rely on raw SQL queries.