GreptimeTeam / docs

Document for GreptimeDB
https://docs.greptime.com/
Apache License 2.0

Fulltext case-sensitive index behavior #1145

Open gar1t opened 2 months ago

gar1t commented 2 months ago

In Quick Start, the query:

SELECT 
  ts,
  api_path,
  log
FROM
  app_logs
WHERE
  matches(log, 'timeout');

shows results that are case-sensitive:

+---------------------+------------------+--------------------+
| ts                  | api_path         | log                |
+---------------------+------------------+--------------------+
| 2024-07-11 20:00:10 | /api/v1/billings | Connection timeout |
| 2024-07-11 20:00:10 | /api/v1/resource | Connection timeout |
+---------------------+------------------+--------------------+
2 rows in set (0.01 sec)

However, the table definition is:

Create Table: CREATE TABLE IF NOT EXISTS `app_logs` (
...
`log` STRING NULL FULLTEXT WITH(analyzer = 'English', case_sensitive = 'false'),
...)

The docs for CREATE indicate that the default value of case_sensitive for FULLTEXT is true. Based on what I'm seeing when following Quick Start, the default is false.

In any event, the query behavior is case-sensitive.

Issues as I see them:

zhongzc commented 2 months ago

Thank you for your thorough review; the issue does indeed exist.

The root cause is that matches is evaluated separately in the frontend and in the datanode. The datanode does respect the case_sensitive configuration, but this part has not yet been completed in the frontend (see the TODO): https://github.com/GreptimeTeam/greptimedb/blob/9c1704d4cbbfab8af07a77da598a1cfe2a5e7b22/src/common/function/src/scalars/matches.rs#L75-L95. As it stands, the implementation is case-sensitive.

Therefore, until this work is completed, I think we have two options for keeping the behavior consistent. We can hardcode this configuration to true and make it unchangeable. Alternatively, we can hardcode it to false and change https://github.com/GreptimeTeam/greptimedb/blob/9c1704d4cbbfab8af07a77da598a1cfe2a5e7b22/src/common/function/src/scalars/matches.rs#L205 to use ilike, which would be more practical.
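To illustrate the difference between the two options, here is a minimal sketch of like-style versus ilike-style matching. It is not the code in matches.rs: the function name contains_term and the sample rows other than "Connection timeout" are made up for illustration, and it ignores wildcard handling entirely.

```rust
/// Hypothetical helper for illustration only; this is NOT GreptimeDB's
/// actual `matches` implementation.
/// Returns true if `text` contains `term`, honoring `case_sensitive`.
fn contains_term(text: &str, term: &str, case_sensitive: bool) -> bool {
    if case_sensitive {
        // Roughly what a `LIKE '%term%'` filter does: exact-case substring match.
        text.contains(term)
    } else {
        // Roughly what an `ILIKE '%term%'` filter does: compare lowercased copies.
        text.to_lowercase().contains(&term.to_lowercase())
    }
}

fn main() {
    // "Connection timeout" comes from the Quick Start output; the other
    // two rows are invented sample data.
    let logs = ["Connection timeout", "Request Timeout", "Disk full"];
    for log in logs {
        println!(
            "{log:?}: case_sensitive=true -> {}, case_sensitive=false -> {}",
            contains_term(log, "timeout", true),
            contains_term(log, "timeout", false)
        );
    }
}
```

With case_sensitive = true, only rows containing the exact lowercase "timeout" match, which is the behavior reported in this issue. With case_sensitive = false, a row such as "Request Timeout" would match as well.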

In any case, it was indeed an oversight, and I will arrange for a prompt fix.

cc @waynexia