duckdb / duckdb

DuckDB is an in-process SQL OLAP Database Management System
http://www.duckdb.org
MIT License
17.2k stars 1.49k forks source link

Extended Collation Support #604

Open Mytherin opened 4 years ago

Mytherin commented 4 years ago

There are still a few places where collations have to be properly handled where they are currently not considered:

Mytherin commented 1 year ago

Collations also do not work yet with the IN clause:

set default_collation='nocase.noaccent';
select 'e'='E';
┌─────────────┐
│ ('e' = 'E') │
├─────────────┤
│ true        │
└─────────────┘
select 'e' in ('E','E');
┌─────────────────────┐
│ ('e' IN ('E', 'E')) │
├─────────────────────┤
│ false               │
└─────────────────────┘
github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been stale for 30 days with no activity.

Mytherin commented 8 months ago

Still planned

dioptre commented 7 months ago

Just want to add this here so it doesnt get lost https://github.com/duckdb/duckdb/issues/8413

AlexKAparavi commented 5 months ago

@Mytherin Hello! Is there known release version to have it fixed?

tanner-lai commented 3 months ago

9724 supports in ops

tkhuawei commented 2 weeks ago

There are still a few places where collations have to be properly handled where they are currently not considered:

  • LIKE/REGEX queries (this is not really possible for arbitrary collations, but possible for NOCASE).

Hi @Mytherin, just for better understating this point, why LIKE/REGEX operator are not possible to work with other collations, may you share your thoughts on that?

Mytherin commented 2 weeks ago

Maybe it is possible, but it would require more in-depth knowledge of the collations which is quite an endeavour for ICU collations. It is my understanding that ICU collations have mappings where e.g. different byte sequences are considered equivalent, which would complicate the semantics for e.g. regular expressions.

At the very least you would need to decompose the regular expressions/like statements, then apply collations to the "static" sequences, but how that would work for the matching components is unclear to me.

tkhuawei commented 2 weeks ago

Maybe it is possible, but it would require more in-depth knowledge of the collations which is quite an endeavour for ICU collations. It is my understanding that ICU collations have mappings where e.g. different byte sequences are considered equivalent, which would complicate the semantics for e.g. regular expressions.

At the very least you would need to decompose the regular expressions/like statements, then apply collations to the "static" sequences, but how that would work for the matching components is unclear to me.

Thank for all explanation. What do you think before the LIKE operator:

  1. Applying the nfc_normalize on both side: input string and pattern string;
  2. Applying the specified (arbitrary) collation;
  3. Applying the LIKE operator;
  4. Projecting the original matched string.
Mytherin commented 2 weeks ago

I don't think the nfc_normalize function has to be used - the collation should take care of this already.

For the like operator, what I think might work is to parse the pattern into a sequence (e.g. literal%literal_ becomes literal, %, literal, _) and then apply the collation only to the literal parts, then reconstruct the pattern. Note that you might need to escape % and _ that are introduced into the literal as well by the collation.