NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0

Use ISO-8859 codec to load CSV files #171

Closed wjxiz1992 closed 1 year ago

wjxiz1992 commented 1 year ago

To close #170 .

before this change:

// df for "customer" table
scala> df.select("c_birth_country").filter(df("c_birth_country").contains("IVOIRE")).show()
+---------------+
|c_birth_country|
+---------------+
|  C�TE D'IVOIRE|
...
+---------------+

after:

// df for "customer" table
scala> df.select("c_birth_country").filter(df("c_birth_country").contains("IVOIRE")).show()
+---------------+
|c_birth_country|
+---------------+
|  CÔTE D'IVOIRE|
...
+---------------+
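The garbling in the "before" output is classic mojibake: the generated data is ISO-8859-1 (Latin-1) encoded, but it was being decoded as UTF-8. A minimal plain-Scala illustration (no Spark required; the string literal is just the sample value from the table above):

```scala
import java.nio.charset.StandardCharsets

// "CÔTE D'IVOIRE" as written by an ISO-8859-1 encoder:
// 'Ô' is the single byte 0xD4 in Latin-1.
val latin1Bytes = "CÔTE D'IVOIRE".getBytes(StandardCharsets.ISO_8859_1)

// Decoding those bytes as UTF-8 yields the replacement character U+FFFD,
// because a lone 0xD4 is not a valid UTF-8 sequence.
val asUtf8 = new String(latin1Bytes, StandardCharsets.UTF_8)

// Decoding with the correct charset recovers the original text.
val asLatin1 = new String(latin1Bytes, StandardCharsets.ISO_8859_1)
```

This is why passing the right codec to the CSV loader fixes the displayed value without touching the data files.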

One concern with this change: the plugin only supports loading UTF-8 encoded CSV data (https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCSVScan.scala#L121), so we may see a performance drop when running the conversion with the spark-rapids plugin. A better solution is to check all tables for special characters and specify ISO-8859-1 only for the tables that actually contain international characters. I will check how many tables have this issue; if there are not many, I'll apply that table-specific handling.