apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.22k stars 439 forks source link

[GLUTEN-7749][VL] Trim ISOControl characters in string for casting to integral type #7806

Closed wForget closed 3 weeks ago

wForget commented 3 weeks ago

What changes were proposed in this pull request?

Fixes: #7749, to align with Spark's change introduced by https://github.com/apache/spark/pull/41535.

How was this patch tested?

added unit test

github-actions[bot] commented 3 weeks ago

https://github.com/apache/incubator-gluten/issues/7749

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI on x86

wForget commented 3 weeks ago

Run Gluten Clickhouse CI on x86

wForget commented 3 weeks ago

@wForget, thanks for your fix! Could you give me some details about where these control chars are skipped when casting to integral types in Spark? I only found white space characters are checked and skipped.

Introduced from #41535, the relevant calls are as follows:

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L785

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1617

github-actions[bot] commented 3 weeks ago

Run Gluten Clickhouse CI on x86

PHILO-HE commented 3 weeks ago

@wForget, I see. What I checked is Spark-3.3.1, which doesn't cover that change.

Let's note this is not applicable to all supported Spark versions. But I think it may be acceptable to end users.