facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Document Unicode version supported in Velox #10370

Open mbasmanova opened 4 months ago

mbasmanova commented 4 months ago

Description

Velox uses a copy of utf8proc 2.5.0 library which supports Unicode 13.0: https://juliastrings.github.io/utf8proc/releases/

It would be helpful to document that Velox supports Unicode 13.0 and also to figure out how to continuously upgrade utf8proc so Velox can pick up the latest Unicode version.

The latest Unicode version is currently 15.1: https://www.unicode.org/versions/Unicode15.1.0/
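As a quick check, the Unicode version targeted by whichever utf8proc build is linked can be read from the library itself. A minimal standalone C++ sketch (not Velox code; assumes the utf8proc headers are available and the binary is linked with -lutf8proc):

```cpp
// Prints the library version and the Unicode version it supports,
// e.g. "utf8proc 2.5.0, Unicode 13.0.0".
#include <utf8proc.h>
#include <cstdio>

int main() {
  printf("utf8proc %s, Unicode %s\n",
         utf8proc_version(), utf8proc_unicode_version());
  return 0;
}
```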

More context: https://github.com/prestodb/presto/issues/22975

CC: @rui-mo @PHILO-HE @majetideepak @amitkdutta @kagamiori

mbasmanova commented 4 months ago

@rui-mo The Unicode version used in Velox affects the results of lower / upper and other functions that use the utf8proc library and handle Unicode characters. We are seeing some discrepancies with Presto Java, which supports Unicode 11.0 (different from the Unicode 13.0 supported in Velox). I'm wondering what Unicode version is supported by different versions of Spark and whether you are also seeing discrepancies.

rui-mo commented 4 months ago

@mbasmanova Thanks for the notice. I tried the query to_hex(cast(lower('Შ') as varbinary)) in Spark on JDK 11 and got the same result, E1B2A8, as Presto on JDK 11. At the code level, it looks like Spark follows Unicode 10.0: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L71
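For context on why the results diverge: 'Შ' is U+1CA8 (GEORGIAN MTAVRULI LETTER SHIN), which was only added in Unicode 11.0. Under Unicode 10.0 tables (what JDK 11 ships), it has no lowercase mapping, so it passes through unchanged and the varbinary is its own UTF-8 encoding, E1B2A8. Under Unicode 11.0 and later it lowercases to 'შ' (U+10E8), whose UTF-8 encoding is E183A8, which is what Velox's bundled utf8proc 2.5.0 (Unicode 13.0) produces. A minimal C++ sketch (not Velox code; assumes linking with -lutf8proc):

```cpp
// Lowercases U+1CA8 ('Შ') with utf8proc and prints the UTF-8 bytes of
// the result. With utf8proc 2.5.0 (Unicode 13.0) this prints E183A8
// ('შ', U+10E8); a Unicode 10.0 implementation leaves the character
// unchanged, giving E1B2A8.
#include <utf8proc.h>
#include <cstdio>

int main() {
  utf8proc_int32_t lower = utf8proc_tolower(0x1CA8);
  utf8proc_uint8_t buf[4];
  utf8proc_ssize_t n = utf8proc_encode_char(lower, buf);
  for (utf8proc_ssize_t i = 0; i < n; ++i) {
    printf("%02X", buf[i]);
  }
  printf("\n");
  return 0;
}
```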

PHILO-HE commented 4 months ago

By investigating the Spark code, I found that in all of Spark's releases, the lower/upper functions depend on the JDK's java.lang.String to do the conversion. But in Spark's master branch, icu4j (the ICU library for Java) has been introduced to do the conversion. The ICU version used is 75.1, which should support the Unicode 15.1 standard.
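For a native-side comparison, ICU4C (the C/C++ counterpart of icu4j) exposes the same case-mapping data. A minimal sketch, not Spark or Gluten code, assuming ICU is installed and the binary is linked with -licuuc:

```cpp
// Prints the Unicode version the linked ICU supports and the lowercase
// mapping of U+1CA8 ('Შ'). ICU 75 should report Unicode 15.1 and map
// U+1CA8 to U+10E8 ('შ'), matching Spark's master branch behavior.
#include <unicode/uchar.h>
#include <unicode/uversion.h>
#include <cstdio>

int main() {
  UVersionInfo v;
  char buf[U_MAX_VERSION_STRING_LENGTH];
  u_getUnicodeVersion(v);
  u_versionToString(v, buf);
  UChar32 lower = u_tolower(0x1CA8);
  printf("Unicode %s, tolower(U+1CA8) = U+%04X\n", buf, (unsigned)lower);
  return 0;
}
```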

mbasmanova commented 4 months ago

@rui-mo @PHILO-HE Thank you for checking the Unicode version support in Spark. It sounds like the Unicode version in Gluten / Velox does not align with Spark, which implies that the behavior of functions like lower / upper would differ. A particularly confusing scenario is when Spark's code is used for constant folding, in which case a user may experience inconsistent behavior within a single query.

PHILO-HE commented 4 months ago

> @rui-mo @PHILO-HE Thank you for checking the Unicode version support in Spark. It sounds like the Unicode version in Gluten / Velox does not align with Spark, which implies that the behavior of functions like lower / upper would differ. A particularly confusing scenario is when Spark's code is used for constant folding, in which case a user may experience inconsistent behavior within a single query.

@mbasmanova, exactly.

mbasmanova commented 4 months ago

@PHILO-HE Do you know if Gluten uses Velox for constant folding? If not, are there any plans to do something like that? Presto is working on this: https://github.com/prestodb/rfcs/pull/13

PHILO-HE commented 4 months ago

> @PHILO-HE Do you know if Gluten uses Velox for constant folding? If not, are there any plans to do something like that? Presto is working on this: prestodb/rfcs#13

@mbasmanova, currently constant folding is done by Spark. If we disable it via a Spark config, Gluten will pass constant expressions to Velox for evaluation at execution time, just like non-constant expressions. I don't think we have any plan to let Gluten use Velox for constant folding. cc @FelixYBW