Open Mytherin opened 4 years ago
Collations also do not work yet with the IN
clause:
set default_collation='nocase.noaccent';
select 'e'='E';
┌─────────────┐
│ ('e' = 'E') │
├─────────────┤
│ true │
└─────────────┘
select 'e' in ('E','E');
┌─────────────────────┐
│ ('e' IN ('E', 'E')) │
├─────────────────────┤
│ false │
└─────────────────────┘
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stale for 30 days with no activity.
Still planned
Just want to add this here so it doesnt get lost https://github.com/duckdb/duckdb/issues/8413
@Mytherin Hello! Is there known release version to have it fixed?
There are still a few places where collations have to be properly handled where they are currently not considered:
- LIKE/REGEX queries (this is not really possible for arbitrary collations, but possible for NOCASE).
Hi @Mytherin, just for better understating this point, why LIKE/REGEX operator are not possible to work with other collations, may you share your thoughts on that?
Maybe it is possible, but it would require more in-depth knowledge of the collations which is quite an endeavour for ICU collations. It is my understanding that ICU collations have mappings where e.g. different byte sequences are considered equivalent, which would complicate the semantics for e.g. regular expressions.
At the very least you would need to decompose the regular expressions/like statements, then apply collations to the "static" sequences, but how that would work for the matching components is unclear to me.
Maybe it is possible, but it would require more in-depth knowledge of the collations which is quite an endeavour for ICU collations. It is my understanding that ICU collations have mappings where e.g. different byte sequences are considered equivalent, which would complicate the semantics for e.g. regular expressions.
At the very least you would need to decompose the regular expressions/like statements, then apply collations to the "static" sequences, but how that would work for the matching components is unclear to me.
Thank for all explanation. What do you think before the LIKE operator:
I don't think the nfc_normalize
function has to be used - the collation should take care of this already.
For the like operator, what I think might work is to parse the pattern into a sequence (e.g. literal%literal_
becomes literal
, %
, literal
, _
) and then apply the collation only to the literal parts, then reconstruct the pattern. Note that you might need to escape %
and _
that are introduced into the literal as well by the collation.
There are still a few places where collations have to be properly handled where they are currently not considered: