delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.62k stars 1.71k forks source link

[Feature Request] Adding collations in the kernel #3699

Open stefankandic opened 2 months ago

stefankandic commented 2 months ago

Feature request

Which Delta project/connector is this regarding?

Overview & Motivation

Collations are a set of rules that define how string comparison and sorting should be handled, particularly in terms of character sequence and case sensitivity. They are vital in managing string-based operations, ensuring that data is compared and ordered consistently across different languages, regions, and character sets.

In many databases and data processing systems, collations play a crucial role in enabling case-insensitive comparisons or language-specific sorting rules, which are essential for improving both query performance and result accuracy.

The upcoming Spark 4.0 release introduces collation support, allowing users to define and apply collation rules during string comparisons and data sorting operations. This enhancement will enable more accurate data processing in multi-language datasets and offer better control over how string data is managed. Furthermore, with collation-aware expressions and predicates.

In the context of Delta Lake, this feature will be integrated to ensure that schema handling and predicate pushdowns are collation-aware, leading to more consistent data access and better performance for case-insensitive or locale-specific queries. This feature request focuses on extending Delta’s capabilities to support these new collation rules, starting with updates to the StringType schema and adding collation-aware data skipping for enhanced query optimization.

Project plan

ID Task description PR Status Author
1 Update kernel's StringType to have a collation information TODO NOT STARTED @ilicmarkodb
2 Support schema read/write for collations as per the delta protocol TODO NOT STARTED @ilicmarkodb
3 Add collation aware expressions/predicates TODO NOT STARTED @ilicmarkodb
4 Add ability to do data skipping for predicates with collations TODO NOT STARTED @ilicmarkodb

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?