Closed slvrtrn closed 9 months ago
could you add a test case for separator?
you might need to extract tokens from needle, and chain all with AND hasToken
clickhouse-logging-0.clickhouse-logging-cluster.production.svc.cluster.local :) select count(1) from app.logs where hasToken(message, 'test test')
SELECT count(1)
FROM app.logs
WHERE hasToken(message, 'test test')
Query id: e3fa6d5b-b6db-4f0a-a3f0-5853a37392aa
0 rows in set. Elapsed: 2.392 sec.
Received exception from server (version 23.5.1):
Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Needle must not contain whitespace or separator characters: while executing 'FUNCTION hasToken(message :: 0, 'test test' :: 1) -> hasToken(message, 'test test') UInt8 : 2'. (BAD_ARGUMENTS)
@taylorchu, if I understand it correctly (split foo bar
into hasToken(foo) AND hasToken(bar)
), I don't think this is necessary for the driver, as it is possible to chain filters in the UI:
it is inconvenient to use because for LIKE the needle can be any string, but for hasToken work, it has to be manually split in this pr. I think splitting should be easy?
@taylorchu,
but for hasToken work, it has to be manually split in this pr
It is likely that I still don't get it. Can you please provide an example where just two or more "contains" filters won't work?
we often search logs with needle like "app_version": "7ed76f5"
or "jid": "l18dlrtvyXFqt13Db7XdL23R"
.
In this pr, even if it is on, hasToken will return errors because it contains separator.
It is an usability issue to manually split token because we don't previously have to do that with LIKE (ie. it is implementation detail/limitation for hasToken.)
@taylorchu, have you considered creating a materialized view with those properties extracted from the JSON blob as separate fields?
Yes but there are columns that do not make sense to be extracted
@taylorchu, I am a bit concerned about adding extra split logic as this is way too specific (beyond hasToken
/LIKE
switch feature itself, as it definitely has some general use).
There are also some concerning corner cases (what to consider a separator? Space or comma, or something else? Do we need a configuration? Are we just going to split by some predefined known separators?).
I suggest we merge this as-is, release 1.1.8, close #173, and open a new issue regarding possible enhancements to the hasToken
behavior. If it is easy to implement, the logic is sound, and can be thoroughly tested, we can add it to the driver.
One more thing: we welcome community contributions, and I want to encourage you to try to make a PoC PR for that feature on top of this one, as you have much more information about a possible use case of that enhancement. And I am always available in the community Slack in case you need help with your contribution.
Please let me know what you think. Cheers.
The definition of separator is from clickhouse's tokens
function. (any non-alphanumeric chars).
While I do not think it is hard to add, I am not familiar with how metabase driver is implemented.
@taylorchu, considering your scenario
"app_version": "7ed76f5" or "jid": "l18dlrtvyXFqt13Db7XdL23R"
SELECT tokens('"app_version": "7ed76f5"') AS tokens
┌─tokens──────────────────────┐
│ ['app','version','7ed76f5'] │
└─────────────────────────────┘
which means that there might be false positives (for example, if you had a field "name": "app"
as well, that will be matched?).
UUIDs will be split as well:
SELECT tokens('"app_version": "5f88ebd0-1beb-11ee-be56-0242ac120002"') AS tokens
┌─tokens───────────────────────────────────────────────────────────┐
│ ['app','version','5f88ebd0','1beb','11ee','be56','0242ac120002'] │
└──────────────────────────────────────────────────────────────────┘
which looks odd.
Yeah, this can be better if tokenbf supports custom tokenizers but it is already more efficient than ngrambf
After several attempts to fix this, I found no clean way to implement this feature correctly. Unfortunately, this particular scenario will have to rely on raw SQL queries.
Summary
Resolves #173 Adds an option to switch between
LIKE %str%
(default behavior) andhasToken
via an advanced UI setting (the description does not look very good, but I couldn't do anything about it).Checklist