Meaningful difference between single-byte and UTF8 strings

AlexBekhtin commented 4 days ago

-- Firebird-5.0.1.1469-0-windows-x64
-- Database page size = 16384
-- Connection charset = WIN1251

CREATE TABLE TEST_WIN1251 (
  VAL VARCHAR(1024) CHARACTER SET WIN1251
)

CREATE TABLE TEST_UTF8 (
  VAL VARCHAR(1024) CHARACTER SET UTF8
)

-- Import/Generate 100'000 records with 1024 characters
-- Strings like 'OXIkQmIVcGdGNlQSIW7Pg4Q6FIbImfZct5wWCaMug8Rr7lzjxnAX0CPhKRz...'
-- Use test_generate_data.sql

SELECT COUNT(*) FROM TEST_WIN1251
-- COUNT
-- 100 000
SELECT COUNT(*) FROM TEST_UTF8
-- COUNT
-- 100 000

-- All next queries return the same result ~65465

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL CONTAINING 'az';
-- Execute time = 938ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL CONTAINING 'az'
-- Execute time = 922ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 203ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 204ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SELECT COUNT(*) FROM TEST_WIN1251 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 985ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 1s 63ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL LIKE '%az%'
                                     OR VAL LIKE '%Az%'
                                     OR VAL LIKE '%aZ%'
                                     OR VAL LIKE '%AZ%'
-- Execute time = 422ms

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL LIKE '%az%'
                                  OR VAL LIKE '%Az%'
                                  OR VAL LIKE '%aZ%'
                                  OR VAL LIKE '%AZ%'
-- Execute time = 812ms

CREATE COLLATION WIN1251_CI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT SENSITIVE

CREATE COLLATION WIN1251_CI_AI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT INSENSITIVE  

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI LIKE '%az%'
-- Execute time = 281ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI_AI LIKE '%az%'
-- Execute time = 235ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI LIKE '%az%'
-- Execute time = 907ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN
1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- Execute time = 218ms (!)
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI) > 0
-- Execute time = 891ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

test_generate_data.zip

AlexBekhtin commented 4 days ago

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

mrotteveel commented 3 days ago

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

asfernandes commented 2 days ago

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SIMILAR TO is done with libre2 using UTF-8. So the first case is converted to UTF-8, it's expected to be slower.

asfernandes commented 2 days ago

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

asfernandes commented 2 days ago

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)

SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

The slow operation is ICU utrans_transUChars for removal of accents.

Do you have a better alternative?

AlexBekhtin commented 1 day ago

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

The basic premise is that there is a lot of loss when dealing with strings. The internal DBMS engine is slower than libre2. In my opinion, the regular expression engine is more complex and potentially slower. But it outperforms the internal DBMS engine methods, even the simplest ones like position.

AlexBekhtin commented 1 day ago

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

AlexBekhtin commented 1 day ago

My fault, I didn't explain it well at first.

UNICODE_CI_AI is very slow
libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

+ POSITION with COLLATE WIN1251_CI produces unexpected results

asfernandes commented 1 day ago

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

asfernandes commented 1 day ago

2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

AlexBekhtin commented 1 day ago

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

Do you mean FOR EXTERNAL clause?

Is this should work or am I doing something wrong again?

CREATE COLLATION WIN1251_EX_CI
  FOR WIN1251
  FROM EXTERNAL ('WIN1251_UNICODE')
  CASE INSENSITIVE
  ACCENT SENSITIVE

-- Latin letters
SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

-- Cyrillic letters
SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'фя' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'ФЯ' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

Case-Insensitive Searching

For strings in a character set that has a case-insensitive collation available, you can apply the collation, to compare the search argument and the searched strings directly. For example, using the WIN1251 character set, the collation PXW_CYRL is case-insensitive for this purpose:

AlexBekhtin commented 1 day ago

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

The main point is precisely the slowness of the ICU. ICU is a standard for many projects. Is there any way to change this in future versions of Firebird?

sim1984 commented 21 hours ago

There is no point in comparing what works correctly and what works incorrectly in terms of speed. When the result is identical, then comparisons in terms of speed still make sense.

FirebirdSQL / firebird

Meaningful difference between single-byte and UTF8 strings #8329