FirebirdSQL / firebird

Firebird server, client and tools
https://www.firebirdsql.org/
1.26k stars 217 forks source link

Sort order is wrong when ordering by multiple columns starting with collate UNICODE_CI or UNICODE_CI_AI [CORE5940] #6196

Open firebird-automations opened 6 years ago

firebird-automations commented 6 years ago

Submitted by: Hiro Nonomura (hiro)

Votes: 2

If a sorting was ordered by a single column with collate UNICODE, the result could be shared with collate UNICODE_CI or UNICODE_CI_AI. But when it comes to ordering by multiple columns, it should be a different story.

UNICODE_CI and UNICODE_CI_AI are working as if requested by collate UNICODE even in ordering by multiple columns.

I have tested in Firebird 2.5 and 3.0. The following is a result by Firebird 3.0; iSQL (Windows 7)

SQL> CREATE DATABASE 'TEST.FDB'; SQL> CREATE TABLE A (F1 INTEGER NOT NULL PRIMARY KEY, F2 VARCHAR(10) CHARACTER SET UTF8 NOT NULL); SQL> INSERT INTO A (F1,F2) VALUES (1,'a'); SQL> INSERT INTO A (F1,F2) VALUES (2,'A'); SQL> INSERT INTO A (F1,F2) VALUES (3,'a'); SQL> SELECT F2,F1 FROM A ORDER BY F2 COLLATE UNICODE_CI, F1;

F2 F1 ========== ============ a 1 a 3 A 2

Correct result should be:

F2 F1 ========== ============ a 1 A 2 a 3

firebird-automations commented 6 years ago

Commented by: @livius2

Your observation is correct. Especially that for:

SELECT F2,F1 FROM A ORDER BY F2 COLLATE UNICODE_CI, F1 DESC;

we get only ordering in "a" 'group'

F2 F1 ========== ============ a 3 a 1 A 2

for me it is a bug

----------------------- Interesting - how group by should work in this situation?

SELECT a.F2 COLLATE UNICODE_CI, COUNT(*) FROM A a GROUP BY A.F2 COLLATE UNICODE_CI

F2 F1 ========== ============ a 3

i know that 'a' and 'A' is same in UNICODE_CI but ... ;-)

firebird-automations commented 6 years ago

Commented by: @asfernandes

Firebird uses non-interleaved keys in sort.

That means multi-level keys are placed one after another, instead of mixing levels.

Interleaved keys has others problems, like make multi-level indices unusable in some conditions.

firebird-automations commented 6 years ago

Commented by: @livius2

Adriano,

can you bring more light on this? Because sort of two column with integers work ok. Is this about key storage or collation processing?

firebird-automations commented 6 years ago
Modified by: Hiro Nonomura (hiro) Version: 2\.5\.3 \[ 10461 \]
firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

Adriano,

For me, rather it seems the matter of what to save as sort key value. My suggestion is to apply icu's collation strength properly to make sort key according to the field's collate property; default(unicode), primary(unicode_ci_ai) and secondary(unicode_ci)

Hiro

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

Having checked ICU's site and then browsed the source code of FB though I'm not familier with cpp :) And awared that the sort key to save is ok as unicode(default strength) for those collations.

Problem seems the usage of the key. If the index was unique, FB seemed to check the key properly. I'm not so sure but having observed and tested those problematic cases, my guess is that if the index is not unique or the sorting is not related to an unique index, FB compares falsely the full length of the key for CI or CI_AI.

The followings are sort key examples from (http://demo.icu-project.org/icu-bin/collation.html )

for collate unicode: c 2D , 05 , 05 . C 2D , 05 , DC . ç 2D , 45 A0 , 06 .

unicode_ci_ai: ç 2D . C 2D . c 2D .

unicode_ci: C 2D , 05 . c 2D , 05 . ç 2D , 45 A0 .

firebird-automations commented 6 years ago

Commented by: @asfernandes

For a key with fields (a, b) with multi-level collations, Firebird creates sort key as: <level 1 for a>...<level n for a><level 1 for b>...<level n for b>

To make what you want, it needs to generate <level 1 for a><level 1 for b>...<level n for a><level n for b>

That requires various changes.

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

If the key fields were of integers or of an unique index, the sort works perfectly. So Firebird is doing just good for those cases, but not if it was utf8 charset and not related to an unique index. It's hard for me to understand why this is relating to interleaved or compound sort key...

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

What I understood from my test cases, Firebird's methods for creating sort key is just fine. Because even in the case that I reported here, the result would not be produced if the methods were wrong or the sort keys created were wrong. There was nothing that lacks. She is just always checking three buckets where one or two shoulld be enough :)

If a result was produced as if collate was unicode_ci or ci_ai for unicode, then we would be able to say that the method or the key created might be wrong. Because it shows there was a lack.

firebird-automations commented 6 years ago

Commented by: @asfernandes

https://firebirdsql.org/refdocs/langrefupd21-ddl-collation.html

If multi-level is not important for you, just create a collation with MULTI-LEVEL=0 attribute.

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

If it was not important for me I would have not posted this issue...

firebird-automations commented 6 years ago

Commented by: @asfernandes

You definitively didn't understood the problem nor the solution.

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

I hate to tell you this but have you read our posts carefully? Our questions on your comments were not that simple?

We commented that we didn't understand your explanation why that interleaved or compound sort key was relating to this issue. You ignored those and just repeated that interleaved thing.

I have observed that if ordering keys were of an unique index, Firebird sorted it perfectly even for CI and CI_AI. In this very case as I reported as a bug, how ever, it was not related to an unique index.

May I ask you again, why and how this issue is relating to that interleaved thing? Why only uniue indices are covered in multi-level for ci and ci_ai?

firebird-automations commented 6 years ago

Commented by: @asfernandes

> My suggestion is to apply icu's collation strength properly to make sort key according to the field's collate property; > default(unicode), primary(unicode_ci_ai) and secondary(unicode_ci)

It already does:

    if \(\(attributes & \(TEXTTYPE\_ATTR\_CASE\_INSENSITIVE \| TEXTTYPE\_ATTR\_ACCENT\_INSENSITIVE\)\) ==
        \(TEXTTYPE\_ATTR\_CASE\_INSENSITIVE \| TEXTTYPE\_ATTR\_ACCENT\_INSENSITIVE\)\)
    \{
        icu\-\>ucolSetAttribute\(compareCollator, UCOL\_STRENGTH, UCOL\_PRIMARY, &status\);
    \}
    else if \(attributes & TEXTTYPE\_ATTR\_CASE\_INSENSITIVE\)
        icu\-\>ucolSetAttribute\(compareCollator, UCOL\_STRENGTH, UCOL\_SECONDARY, &status\);

But it does for the compare collator only, not to the sort collator.

And why it doesn't?

Because the time I asked for MULTI-LEVEL=0 a long time ago the team didn't supported.

Even not supporting it, I added to the narrow collations. See the same test with WIN_PTBR:

RECREATE TABLE B (F1 INTEGER NOT NULL PRIMARY KEY, F2 VARCHAR(10) CHARACTER SET WIN1252 NOT NULL); INSERT INTO B (F1,F2) VALUES (1,'a'); INSERT INTO B (F1,F2) VALUES (2,'A'); INSERT INTO B (F1,F2) VALUES (3,'a');

SELECT F2,F1 FROM B ORDER BY F2 COLLATE WIN_PTBR, F1;

F2 F1 ========== ============ a 1 A 2 a 3

F1 orders correctly. But F2 orders inconsistently (at the case level).

To fix both problems at the same time, the keys should be generated differently (see about <level 1 for a>...<level n for a><level 1 for b>...<level n for b> vs <level 1 for a><level 1 for b>...<level n for a><level n for b>.

This is also a good reference: http://firebird.1100200.n4.nabble.com/UNICODE-case-accent-insensitive-bugs-td1125686.html

I personally support the idea of allow MULTI-LEVEL=0 attribute for UNICODE collations. Every one could use it if wants on customized collations.

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

Adriano,

Thank you for the explanation.

>It already does: I know and have already corrected my previous comment.

>But it does for the compare collator only, not to the sort collator. It also DOES for the sort collater IF the ordering fields were of UNIQUE index. That's why I asked you again.

>WIN1252 >COLLATE WIN_PTBR Sorry. I'm not sure about these.

>To fix both problems at the same time, the keys should be generated differently Ok, let me explain:

(My concern is only about UTF8 + ICU collations in multiple fields ordering)

The results that I reported shows there are two possibilites of the bug: 1) What to save as sort key was wrong. Or, 2) What to compare was wrong.

What was saved as for "unicode_ci"? The reported case and all of my test cases show that those are of "unicode" *(A).

*(A) can be used for all of those icu collations. (See my comment on 16/Oct/18 04:07 PM) In sorting routine, if collate was "unicode_ci", then two buckets out of three must be compared. But three buckets were compared falsely. The part to be fixed might be only this.

But hey, if it is of unique index the sort result is ok. For unique index, she compares two buckets properly or saves two buckets only.

No matter if it was good or bad, it is clear that she has all of the buckets needed for sorting in any case. Therefore I can't agree with your explanation.

firebird-automations commented 6 years ago

Commented by: @asfernandes

Unique keys (for UNICODE_CI) cannot have 'a' and 'A' at the same time, so this key's level is eliminated as it will not matter for sort.

With an unique UNICODE_CI, you probably will have the same problem when you go to the accent level:

SQL> INSERT INTO A (F1,F2) VALUES (1,'a'); SQL> INSERT INTO A (F1,F2) VALUES (2,'Á'); SQL> INSERT INTO A (F1,F2) VALUES (3,'a');

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

>...cannot have 'a' and 'A' at the same time

There is no problem, right? ('A', 1) and ('a', 2) collate unicode_ci (in non unique case) 'a', 2 'A', 1 ('A', 1) and ('a', 2) collate unicode_ci (in unique case) 'A', 1 'a', 2

I have already tested those three of icu collations in multi columns. The results were ok only for unique index

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

>...cannot have 'a' and 'A' at the same time

that's a story where the index consists of a single column we're talking abount multi column ordering

firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

In the source code cited below, what about to add one more condition other than INTL_KEY_SORT, something like : if (attributes & TEXTTYPE_ATTR_CASE_INSENSITIVE)... to determine whether compareCollator or sortCollator?

https://github.com/Alexpux/firebird-git-svn/blob/master/src/common/unicode_util.cpp#L1460

    case INTL\_KEY\_UNIQUE:
        coll = compareCollator;
        srcLenLong \*= sizeof\(\*src\);
        normalize\(&srcLenLong, &src, true, buffer\);
        srcLenLong /= sizeof\(\*src\);
        break;

    case INTL\_KEY\_SORT:
        coll = sortCollator;
        break;
firebird-automations commented 6 years ago

Commented by: Hiro Nonomura (hiro)

Though I think the following should be applied also for sortCollator ...:

https://github.com/Alexpux/firebird-git-svn/blob/master/src/common/unicode_util.cpp#L1334

    if \(\(attributes & \(TEXTTYPE\_ATTR\_CASE\_INSENSITIVE \| TEXTTYPE\_ATTR\_ACCENT\_INSENSITIVE\)\) ==
        \(TEXTTYPE\_ATTR\_CASE\_INSENSITIVE \| TEXTTYPE\_ATTR\_ACCENT\_INSENSITIVE\)\)
    \{
        icu\-\>ucolSetAttribute\(compareCollator, UCOL\_STRENGTH, UCOL\_PRIMARY, &status\);
    \}
    else if \(attributes & TEXTTYPE\_ATTR\_CASE\_INSENSITIVE\)
        icu\-\>ucolSetAttribute\(compareCollator, UCOL\_STRENGTH, UCOL\_SECONDARY, &status\);
firebird-automations commented 6 years ago

Commented by: Luis Forra (luisforra)

Hiro suggestion was tested and doesn't solve the problem.