apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Column name containing parts of Cyrillic cannot be read correctly #6843

Open zml1206 opened 1 month ago

zml1206 commented 1 month ago

Backend

VL (Velox)

Bug description

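      import testImplicits._
      // Write one row with two Cyrillic column names; only the first starts with "Т".
      Seq((1, 2)).toDF("Товары", "овары").write.mode("overwrite").parquet("tmp/t1")
      // Read the file back while Gluten is enabled.
      spark.read.parquet("tmp/t1").show()
      // Disable Gluten and read the same file with vanilla Spark.
      spark.conf.set("spark.gluten.enabled", false)
      spark.read.parquet("tmp/t1").show()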

With Gluten enabled:

+------+-----+
|Товары|овары|
+------+-----+
|  null|    2|
+------+-----+

With Gluten disabled:

+------+-----+
|Товары|овары|
+------+-----+
|     1|    2|
+------+-----+

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

zml1206 commented 4 weeks ago

@zhztheplayer @rui-mo Is this a Velox problem? Can it be solved on the native side, or should the scan fall back to vanilla Spark when a column name contains Cyrillic characters?
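To make the fallback idea concrete, here is a minimal sketch of how such a check might look. The object `NonAsciiColumns` and its methods are hypothetical and are not part of Gluten; the "any character above 127" rule is an assumption, and it is deliberately conservative (it would also flag "овары", which reads correctly in the example above).

```scala
object NonAsciiColumns {
  // Hypothetical rule: a name is "non-ASCII" if any UTF-16 code unit is above 127.
  def isNonAscii(name: String): Boolean = name.exists(_ > 127)

  // Names that would trigger a fallback under the assumed rule.
  def nonAsciiNames(columnNames: Seq[String]): Seq[String] =
    columnNames.filter(isNonAscii)

  def main(args: Array[String]): Unit = {
    val names = Seq("id", "Товары", "овары")
    println(nonAsciiNames(names).mkString(", "))
  }
}
```

In a real integration the input would come from the scan's schema field names; where such a check would be wired into Gluten's validation is not covered here.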

rui-mo commented 4 weeks ago

Hi @zml1206, are you saying a full Cyrillic name can be read correctly while a mixed name cannot?

zml1206 commented 4 weeks ago

No, it's just that some Cyrillic letters cannot be parsed, for example "Т".

rui-mo commented 4 weeks ago

Could you please check the written file's content to determine whether the problem is on read or write?

zml1206 commented 4 weeks ago

Confirmed: the problem is on the read side.

zml1206 commented 3 weeks ago

Roman numeral characters are not read correctly either, for example the column name 国Ⅵ.
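When narrowing down which characters are affected, it may help to print the Unicode code points and UTF-8 bytes of each column name. The following is a small diagnostic sketch in plain Scala; `InspectColumnName` is a hypothetical helper name, and nothing here depends on Spark or Gluten.

```scala
import java.nio.charset.StandardCharsets

object InspectColumnName {
  // Code points of the string, formatted as U+XXXX.
  def codePoints(s: String): Seq[String] =
    s.codePoints().toArray.toSeq.map(cp => f"U+$cp%04X")

  // UTF-8 encoding of the string as space-separated lowercase hex bytes.
  def utf8Hex(s: String): String =
    s.getBytes(StandardCharsets.UTF_8).map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit =
    Seq("Товары", "овары", "国Ⅵ").foreach { name =>
      println(s"$name  ${codePoints(name).mkString(" ")}  [${utf8Hex(name)}]")
    }
}
```

Comparing the byte sequences of names that read correctly with those that do not could help when reporting the issue upstream.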

rui-mo commented 3 weeks ago

Non-ASCII characters are not well supported in the Velox tokenizer. I opened https://github.com/facebookincubator/velox/issues/10796 to discuss supporting them. Thanks.