Open mbasmanova opened 4 months ago
@rui-mo Unicode version used in Velox affects the results of lower / upper and other functions that use utf8proc library and handle Unicode characters. We are seeing some discrepancies with Presto Java which supports Unicode 11.0 (different from Unicode 13 supported in Velox). Wondering what Unicode version is supported by different versions of Spark and whether you are also seeing discrepancies.
@mbasmanova Thanks for the notice. I tried the query to_hex(cast(lower('Შ') as varbinary)) in Spark on JDK 11 and got E1B2A8, the same result as Presto on JDK 11. From the code, it looks like Spark follows Unicode 10.0:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L71
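For reference, the same probe can be reproduced outside either engine. The sketch below is plain Python, not Spark or Presto code; the helper name to_hex_lower is made up for illustration, and the lowered result depends on the Unicode tables bundled with the interpreter (reported by unicodedata.unidata_version).

```python
import unicodedata


def to_hex_lower(text: str) -> str:
    """Lowercase text, then return its UTF-8 bytes as uppercase hex."""
    return text.lower().encode("utf-8").hex().upper()


# 'Შ' is U+1CA8, a Georgian Mtavruli letter added in Unicode 11.0.
MTAVRULI_SHIN = "\u1CA8"

print("Unicode tables:", unicodedata.unidata_version)
print("input bytes:   ", MTAVRULI_SHIN.encode("utf-8").hex().upper())
print("lowered bytes: ", to_hex_lower(MTAVRULI_SHIN))
```

E1B2A8 is the UTF-8 encoding of U+1CA8 itself, so that result means lower left the character unchanged. With case tables from Unicode 11.0 or later, the letter maps to U+10E8, whose UTF-8 encoding is E183A8.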
Looking at Spark's code, I found that in all Spark releases the lower/upper functions rely on JDK's java.lang.String for the conversion. In Spark's master branch, however, icu4j (the ICU library for Java) has been introduced for the conversion. The ICU version used is 75.1, which should support the Unicode 15.1 standard.
@rui-mo @PHILO-HE Thank you for checking Unicode version support in Spark. It sounds like the Unicode version in Gluten / Velox does not align with Spark's, which implies that functions like lower / upper would behave differently. A particularly confusing scenario is when Spark's code is used for constant folding, in which case a user may experience inconsistent behavior within a single query.
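The constant-folding scenario can be illustrated with a small simulation. This is a sketch only: the two functions below are hypothetical stand-ins for a planner and an execution engine with different Unicode tables, not Spark, Gluten, or Velox code, and they model only the Georgian Mtavruli case mappings introduced in Unicode 11.0.

```python
# Mtavruli capitals U+1C90..U+1CBA and U+1CBD..U+1CBF lowercase to
# U+10D0..U+10FA and U+10FD..U+10FF (a constant offset) since Unicode 11.0.
MTAVRULI_OFFSET = 0x1C90 - 0x10D0


def _is_mtavruli(cp: int) -> bool:
    return 0x1C90 <= cp <= 0x1CBA or 0x1CBD <= cp <= 0x1CBF


def lower_pre_unicode11(s: str) -> str:
    """Hypothetical planner: tables predate Unicode 11, Mtavruli unchanged."""
    return "".join(ch if _is_mtavruli(ord(ch)) else ch.lower() for ch in s)


def lower_unicode11_plus(s: str) -> str:
    """Hypothetical execution engine: knows the Mtavruli case mappings."""
    return "".join(
        chr(ord(ch) - MTAVRULI_OFFSET) if _is_mtavruli(ord(ch)) else ch.lower()
        for ch in s
    )


# lower('Შ') as a literal is folded by the planner at planning time...
folded_constant = lower_pre_unicode11("\u1CA8")
# ...while lower(col) on a row holding the same value runs in the engine.
runtime_value = lower_unicode11_plus("\u1CA8")

# A predicate such as lower(col) = lower('Შ') would not match this row.
print(folded_constant == runtime_value)
```

Both sides of the comparison start from the same input, yet they disagree because each was lowered with a different set of case tables.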
@mbasmanova, exactly.
@PHILO-HE Do you know if Gluten uses Velox for constant folding? If not, are there any plans to do something like that? Presto is working on this: https://github.com/prestodb/rfcs/pull/13
@mbasmanova, currently constant folding is done by Spark. If we disable it via Spark config, Gluten will pass constant expressions to Velox for evaluation at execution time, just like non-constant expressions. I don't think we have any plans to let Gluten use Velox for constant folding. cc @FelixYBW
Description
Velox uses a copy of the utf8proc 2.5.0 library, which supports Unicode 13.0: https://juliastrings.github.io/utf8proc/releases/
It would be helpful to document that Velox supports Unicode 13.0 and also figure out how to continuously upgrade utf8proc to add support for the latest Unicode version.
The latest Unicode version is now 15.1: https://www.unicode.org/versions/Unicode15.1.0/
More context: https://github.com/prestodb/presto/issues/22975
CC: @rui-mo @PHILO-HE @majetideepak @amitkdutta @kagamiori