StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

Regex does not recognize non-english characters as word characters #43184

Open andr-c opened 6 months ago

andr-c commented 6 months ago

Steps to reproduce the behavior (Required)

mysql -u root  --host 127.0.0.1 --port 9030
mysql> select regexp_replace('**FAQ 常见问题', '\\W+', '');

Expected behavior (Required)

The above command should give a string with only star characters removed:

'FAQ 常见问题'

Real behavior (Required)

In fact, it removes all non-English characters:

+--------------------------------------------------+
| regexp_replace('**FAQ 常见问题', '\\W+', '')     |
+--------------------------------------------------+
| FAQ                                              |
+--------------------------------------------------+
1 row in set (0,00 sec)
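
The difference between the expected and the real result can be illustrated outside StarRocks. The following is a minimal sketch in Python, assuming that the observed behavior comes from `\W` being evaluated with ASCII-only semantics; Python's `re` module is only a stand-in here and is not the engine StarRocks uses.

```python
import re

text = "**FAQ 常见问题"

# ASCII-only \W: anything outside [A-Za-z0-9_] counts as a non-word character,
# so the stars, the space, and every CJK character are stripped.
ascii_result = re.sub(r"\W+", "", text, flags=re.ASCII)

# Unicode-aware \W (the Python 3 default for str patterns): CJK characters
# count as word characters, so only the stars and the space are stripped.
unicode_result = re.sub(r"\W+", "", text)

print(ascii_result)    # FAQ
print(unicode_result)  # FAQ常见问题
```

Note that a Unicode-aware `\W+` applied to every match also removes the space, so it yields `FAQ常见问题` rather than the `'FAQ 常见问题'` given above.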

StarRocks version (Required)

version info
Version: 3.2.3
Git: a40e2f8
Build Info: StarRocks@localhost
Build Time: 2024-02-08 19:28:21

was run from registry.starrocks.io/starrocks/allin1-ubuntu

bytebishal commented 6 months ago

Hi @andr-c, is anyone working on this issue? If not, could you please assign it to me?

Please correct me if I have misunderstood: you are trying to remove only the star characters from a string, but all non-English characters are being removed instead.

Proposed solution for this issue: to remove only the leading star characters, the command should be mysql> select regexp_replace('**FAQ 常见问题', '^\\**', '');

Expected behaviour (Required)

'FAQ 常见问题'

Real behaviour (Required)

+---------------------------------------------------+
| regexp_replace('**FAQ 常见问题', '^\\**', '')     |
+---------------------------------------------------+
| FAQ 常见问题                                      |
+---------------------------------------------------+
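
The anchored pattern proposed in this comment sidesteps `\W` entirely by matching literal stars at the start of the string. A minimal Python sketch of the same idea follows; the helper name `strip_leading_stars` is hypothetical, and Python's `re` module is used only as a stand-in for the SQL function.

```python
import re

def strip_leading_stars(text: str) -> str:
    """Remove any run of '*' characters at the start of the string."""
    # ^ anchors at the start of the string; \* is a literal star, repeated zero or more times.
    return re.sub(r"^\**", "", text)

print(strip_leading_stars("**FAQ 常见问题"))  # FAQ 常见问题
print(strip_leading_stars("FAQ ** 常见问题"))  # unchanged: the stars are not leading
```

This works around the reported case but does not change how `\W` treats non-ASCII characters, which is what the issue is about.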

andr-c commented 6 months ago

Hi @bytebishal, yes, correct: I would expect only the star characters to be removed, and all text (both English and Chinese) to be left intact. At least this is the behavior I see for a similar query in Postgres. Not sure I have the rights to assign it to anyone though...
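
One possible contributing factor to the Postgres comparison, not confirmed anywhere in this thread: PostgreSQL's `regexp_replace` replaces only the first match unless the `g` flag is passed. The sketch below models that in Python with `count=1`; it assumes the Postgres query was run without `g`, which the thread does not state.

```python
import re

text = "**FAQ 常见问题"

# First match only (modelling a replace without a "global" flag):
# the leading "**" is the first run of non-word characters, so only it is removed.
first_only = re.sub(r"\W+", "", text, count=1)

# Every match, Unicode-aware: the space after "FAQ" is removed as well.
all_matches = re.sub(r"\W+", "", text)

print(first_only)   # FAQ 常见问题
print(all_matches)  # FAQ常见问题
```

Under that assumption, the space survives because replacement stops after the first match, not because of how `\W` classifies the space.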

bytebishal commented 6 months ago

Hi @andr-c,

Please let me know if I can work on this. Thank you.

github-actions[bot] commented 1 day ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!