dolthub / dolt

Dolt – Git for Data
Apache License 2.0
17.58k stars 498 forks source link

REGEXP isn't collation aware #8117

Closed kennethklee closed 1 month ago

kennethklee commented 1 month ago

In MySQL, the regular expression is case insensitive.

However, dolt's is case sensitive:

MySQL [dolt]> create table test (text text);
Query OK, 0 rows affected (0.001 sec)

MySQL [dolt]> insert into test values ('abcDEF');
Query OK, 1 row affected (0.001 sec)

MySQL [dolt]> select * from test where text REGEXP 'abc';
+--------+
| text   |
+--------+
| abcDEF |
+--------+
1 row in set (0.001 sec)

MySQL [dolt]> select * from test where text REGEXP 'def';
Empty set (0.001 sec)

Meanwhile in MySQL:

mysql> create table test (text text);
Query OK, 0 rows affected (0.02 sec)

mysql> insert into test values ('abcDEF');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where text REGEXP 'abc';
+--------+
| text   |
+--------+
| abcDEF |
+--------+
1 row in set (0.01 sec)

mysql> select * from test where text REGEXP 'def';
+--------+
| text   |
+--------+
| abcDEF |
+--------+
1 row in set (0.00 sec)

Thanks dolt team!

fulghum commented 1 month ago

Hi @kennethklee, thanks for the issue. Part of what's going on here is that Dolt uses a different default collation than MySQL, and unlike on MySQL, that default collation is not case-insensitive on Dolt. You can find more details about that in https://github.com/dolthub/dolt/issues/7856. Feel free to provide any other feedback on that issue if the difference in default collation is causing issues for you.

However... when I recreated the table on Dolt using a case-insensitive collation (i.e. the same collation as MySQL is using), I still didn't get the same result as MySQL:

create table test (text text) collate=utf8mb4_0900_ai_ci;

insert into test values ('abcDEF');

select * from test where text REGEXP('def' COLLATE utf8mb4_0900_ai_ci);
Empty set (0.00 sec)

So... in addition to the difference in the default collation, I think there's another issue here where REGEXP doesn't always use the right collation. @Hydrocharged is much more of a collation expert than me though, so he may see something I'm just not specifying correctly in the query.

fulghum commented 1 month ago

I heard an update from the team that this is a known issue. There's also a potential workaround with like (instead of REGEXP), which does honor the collation of the column being compared. Dolt's behavior matches MySQL here.

So, for the example above... a workaround on Dolt could be:

-- Override the default (case-sensitive) collation to be case-insensitive
create table test (text text) collate=utf8mb4_0900_ai_ci;

insert into test values ('abcDEF');

-- like will use the collation of the column, which is case-insensitive in this case
select * from test where text like '%def%';
+--------+
| text   |
+--------+
| abcDEF |
+--------+
1 row in set (0.00 sec)

Let us know if the workaround above won't work for any reason and we'll be happy to see what we can do to help.

kennethklee commented 1 month ago

Hi @fulghum, thanks for the updates. I understand the difference now.

The use case is actually a rules engine with regex, and the request is to add case insensitive flag. I'm just gonna lowercase everything in the SQL =)

All good. I actually prefer it case sensitive, but MySQL.

Do you want me to close this or leave it to track the REGEXP collation awareness?

fulghum commented 1 month ago

Perfect! Happy to hear you aren't blocked on this.

It's a valid issue, so I think we should keep this open in case other people hit this and need help understanding what's going on and possible workarounds.

fulghum commented 1 month ago

Daylon was able to get this one sorted out, and we've got the change queued up for a release tomorrow.

Thanks for helping us find this! 🙏

fulghum commented 1 month ago

This fix has been released in Dolt version 1.42.0.

Thank you for taking the time to report this issue so we could get it fixed up. 🙏 Let us know if we can do anything else to help you build with Dolt!