apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

`COLLATION` Support #9192

Open alamb opened 7 months ago

alamb commented 7 months ago

Is your feature request related to a problem or challenge?

"Collation" generically means how to compare and sort string values.

Some databases, most notably Postgres, allow you to change the default collation so that comparison and sort order match what the user wants rather than the standard sort order.
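To make the difference concrete, here is a minimal Rust sketch (standard library only) contrasting the default byte-wise ordering of strings with a toy case-insensitive ordering. The function names are made up for illustration, and lowercasing is only a stand-in for a real collation, which would follow locale-specific rules.

```rust
/// Sort by raw byte order, which is what a plain `str` comparison does in Rust.
fn sort_bytewise(mut values: Vec<&str>) -> Vec<&str> {
    values.sort();
    values
}

/// Sort under a toy case-insensitive "collation" (an assumption for
/// illustration, not a real locale-aware collation): compare the lowercased
/// forms, then fall back to byte order so ties resolve deterministically.
fn sort_case_insensitive(mut values: Vec<&str>) -> Vec<&str> {
    values.sort_by(|a, b| {
        a.to_lowercase()
            .cmp(&b.to_lowercase())
            .then_with(|| a.cmp(b))
    });
    values
}
```

With the input `["banana", "Cherry", "apple", "Apple"]`, the byte-wise sort yields `["Apple", "Cherry", "apple", "banana"]` (all uppercase letters sort before lowercase ones), while the case-insensitive sort yields `["Apple", "apple", "banana", "Cherry"]`.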

Someone asked about this on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1205554368292855868

Here are some details on how this works in Postgres: https://www.postgresql.org/docs/current/collation.html

Describe the solution you'd like

Someone to design and implement COLLATION

This would probably look like a SessionConfig setting to control collation at the session level, and possibly some way to define it as part of the table definition.

Describe alternatives you've considered

No response

Additional context

This would likely require adding collation support to arrow-rs as well, though I am not 100% sure

alamb commented 7 months ago

cc @gruuya as I believe you mentioned you might also be interested in this feature

tustvold commented 7 months ago

I think a first step would be to identify a mature Rust library for supporting collations, as I suspect this is not something we wish to implement ourselves, much like we use chrono for temporal functionality.

I also wonder if there might be a middle ground where we provide specialised UDFs or similar for manipulating collations, as full native support would be a very substantial undertaking. This would also provide a good story for making this functionality optional.
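As a rough illustration of the UDF-style middle ground, one option is a function that maps each string to a sort key, so that ordering by the key gives the collated order without the engine knowing anything about collations. The sketch below is plain Rust with no DataFusion APIs. `collation_key` is a hypothetical name, and lowercasing again stands in for whatever a real collation library would compute.

```rust
/// Hypothetical helper: derive a sort key for a toy case-insensitive
/// collation. A real implementation would delegate to a collation library.
fn collation_key(value: &str) -> String {
    value.to_lowercase()
}

/// Order values by their collation key, roughly what a query like
/// `ORDER BY collation_key(col)` would do if such a UDF were registered.
/// The sort is stable, so values with equal keys keep their input order.
fn sort_by_collation_key(mut values: Vec<&str>) -> Vec<&str> {
    values.sort_by_cached_key(|v| collation_key(v));
    values
}
```

The appeal of this shape is that the key function can live in an optional crate or feature, while sorting and comparison keep operating on ordinary strings.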

alamb commented 7 months ago

I believe collation is an important feature for postgres compatibility, but it is not as widely used in other databases.

I agree that having something optional would be ideal.

gruuya commented 7 months ago

Yeah, for reference, DuckDB also provides optional general collations through an extension based on the ICU project: https://duckdb.org/docs/sql/expressions/collations.html#icu-collations

There's a corresponding Rust crate as well: https://docs.rs/icu/latest/icu/collator/index.html