Open AlexBekhtin opened 4 days ago
Two queries have different results
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms
Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comments (someone may interpret it differently than you do because they make different assumptions).
SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)
SIMILAR TO is implemented with libre2, which works on UTF-8. In the first case the WIN1251 data therefore has to be converted to UTF-8 first, so it is expected to be slower.
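A minimal Python sketch of the extra step described above, not Firebird's actual code. It assumes a matcher that only accepts UTF-8 bytes; Python's `re` stands in for libre2, and the function names and sample value are hypothetical.

```python
import re

# Stand-in for a UTF-8-only regex engine; mirrors SIMILAR TO '%[Aa][Zz]%'.
UTF8_PATTERN = re.compile(rb"[Aa][Zz]")

def match_utf8(value: bytes) -> bool:
    """Match directly: the value is already UTF-8 encoded."""
    return UTF8_PATTERN.search(value) is not None

def match_win1251(value: bytes) -> bool:
    """Transcode WIN1251 bytes to UTF-8 first, then reuse the UTF-8 matcher."""
    utf8_value = value.decode("cp1251").encode("utf-8")
    return match_utf8(utf8_value)

sample = "Фя-Az-01"  # hypothetical test value
print(match_win1251(sample.encode("cp1251")))
print(match_utf8(sample.encode("utf-8")))
```

The WIN1251 path does strictly more work per row (one transcoding pass plus the match), which is consistent with the 250ms versus 156ms timings reported above.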
Two queries have different results
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms
And:
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316
So what? Where is the error?
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%' -- Execute time = 8s 344ms (!)
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0 -- Execute time = 8s 281ms
The slow operation is ICU's utrans_transUChars, which is used to remove accents.
Do you have a better alternative?
The basic premise is that a lot of performance is lost when working with strings.
The internal DBMS engine is slower than libre2. In my opinion, a regular expression engine is more complex and should potentially be slower, yet it outperforms the internal DBMS methods, even the simplest ones such as POSITION.
The strings contain only Latin letters and digits, so the first query, with WIN1251_CI, must return the same result. Or am I mistaken and using the collation incorrectly?
My fault, I didn't explain it well at first.
1. UNICODE_CI_AI is very slow
2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated
3. POSITION with COLLATE WIN1251_CI produces unexpected results
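The expectation about POSITION with a case-insensitive collation can be sketched in Python. This is only an illustration of the reasoning, under the assumption stated in this thread that the data is pure ASCII; the helper names and sample values are hypothetical and nothing here reproduces Firebird's collation code.

```python
from itertools import product

def case_variants(needle):
    """All upper/lower spellings of needle, e.g. 'az' -> az, aZ, Az, AZ."""
    choices = [(ch.lower(), ch.upper()) for ch in needle]
    return {"".join(combo) for combo in product(*choices)}

def found_case_insensitive(value, needle):
    """What a case-insensitive substring search is expected to report."""
    return needle.lower() in value.lower()

def found_any_variant(value, needle):
    """Case-sensitive search for every spelling, combined with OR."""
    return any(variant in value for variant in case_variants(needle))

samples = ["xxAZ01", "az", "9aZ9", "Az7", "abc", "ZA"]  # hypothetical data
ci_count = sum(found_case_insensitive(s, "az") for s in samples)
or_count = sum(found_any_variant(s, "az") for s in samples)
print(ci_count, or_count)
```

For ASCII-only data the two counts must agree, which is why a single case-insensitive POSITION call returning far fewer rows than the four-way OR looks like a defect rather than a collation subtlety.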
Not all old collations correctly support options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.
That is why we created the option to use collation names in the format <charset>_unicode.
2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated
libre2 does not support accent-insensitive patterns, so we need to call ICU before calling re2. That is more operations, so it is certainly slower. The major problem is that the ICU transform is very slow.
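The two-stage pipeline described here (strip accents first, then hand the result to the regex engine) can be sketched in Python. This is an illustration under stated assumptions, not Firebird's implementation: `unicodedata` stands in for the ICU transform, `re` stands in for re2, and both function names are hypothetical.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_ci_ai(haystack, needle):
    """Accent-insensitive, case-insensitive substring test.

    Stage 1: remove accents from both operands (the costly extra pass).
    Stage 2: run an ordinary case-insensitive regex on the results.
    """
    pattern = re.compile(re.escape(strip_accents(needle)), re.IGNORECASE)
    return pattern.search(strip_accents(haystack)) is not None

print(contains_ci_ai("ÁZ-42", "az"))
print(contains_ci_ai("abc", "az"))
```

Stage 1 touches every character of every row before any matching starts, so its cost is paid even for rows that cannot match.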
Do you mean the FROM EXTERNAL clause?
Should this work, or am I doing something wrong again?
CREATE COLLATION WIN1251_EX_CI
FOR WIN1251
FROM EXTERNAL ('WIN1251_UNICODE')
CASE INSENSITIVE
ACCENT SENSITIVE
-- Latin letters
SELECT
(_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE
SELECT
(_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE
-- Cyrillic letters
SELECT
(_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'фя' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE
SELECT
(_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'ФЯ' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE
For strings in a character set that has a case-insensitive collation available, you can apply the collation, to compare the search argument and the searched strings directly. For example, using the WIN1251 character set, the collation PXW_CYRL is case-insensitive for this purpose:
The main point is precisely the slowness of ICU, which is a standard component in many projects. Is there any way to change this in future versions of Firebird?
There is no point in comparing the speed of something that works correctly with something that works incorrectly. Speed comparisons only make sense once the results are identical.
test_generate_data.zip