NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0

[BUG] nds_transcode.py is not handling international characters correctly #170

Closed jbrennan333 closed 1 year ago

jbrennan333 commented 1 year ago

Describe the bug

The raw data files from TPC-DS are encoded in ISO-8859-1. In this encoding, the Ô character is the single byte 0xd4. In nds_transcode.py, we read these raw CSV files with the default encoding of UTF-8, so the international characters are not handled correctly.

Comment from TPC-DS spec:

The data generated by dsdgen includes some international characters. Examples of international
characters are Ô and É. The database must preserve these characters during loading and processing by using a
character encoding such as ISO/IEC 8859-1 that includes these characters

If we do the transcoding with the GPU, the 0xd4 byte is passed through unchanged to the output file, where it is not a valid UTF-8 sequence. I ran into this when comparing CPU-transcoded data (specifically the customer table) with GPU-transcoded data. The CPU path replaces the invalid byte with 0xefbfbd, the UTF-8 encoding of the U+FFFD replacement character, so comparing the two output files reports every row containing these international characters as different. Both the CPU and GPU outputs are incorrect, however: neither contains the proper UTF-8 encoding of the original characters.

If you modify nds_transcode.py by adding .option("encoding", "ISO-8859-1") to the csv read, then the Ô character is correctly transcoded to 0xc394 when the output is written in UTF-8.
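The byte values quoted above can be checked without Spark. The following is a minimal sketch in plain Python, not the nds_transcode.py code path itself; the sample bytes in `RAW` are an assumption modelled on the CÔTE D'IVOIRE value in the customer table.

```python
# Assumed sample: "CÔTE D'IVOIRE" as ISO-8859-1 bytes, where Ô is the single byte 0xd4.
RAW = b"C\xd4TE D'IVOIRE"

# Reading the bytes as UTF-8 with replacement turns the lone 0xd4 into U+FFFD,
# which re-encodes as ef bf bd. This matches the CPU output reported in this issue.
wrong = RAW.decode("utf-8", errors="replace").encode("utf-8")

# Reading the bytes with the correct source encoding preserves the character,
# which then encodes to c3 94 in UTF-8.
right = RAW.decode("iso-8859-1").encode("utf-8")

print(wrong.hex(" "))
print(right.hex(" "))
```

The second result lines up with the c394 bytes in the hex dump of the corrected output.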

Steps/Code to reproduce bug

Use the nds_transcode.py script to transcode the customer file to parquet format with no compression. Use a binary viewer like xxd to examine the output file and verify that the character is correct. It appears in the string CÔTE D'IVOIRE, so I usually search for VOIR and then look at the encoding for the Ô character. For example:

04199e50: 0043 c394 5445 2044 2749 564f 4952 450d  .C..TE D'IVOIRE.
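The same check can be scripted instead of done by eye in xxd. This is a rough sketch under the assumption that the string is stored as plain bytes in the uncompressed output, as in the dump above; `classify_o_circumflex` is a hypothetical helper, not part of the repository.

```python
def classify_o_circumflex(data: bytes) -> str:
    """Report how the O-circumflex before "TE D'IVOIRE" is encoded in data."""
    idx = data.find(b"TE D'IVOIRE")
    if idx < 1:
        return "not found"
    if data[max(0, idx - 2):idx] == b"\xc3\x94":
        return "utf-8 (correct)"
    if data[max(0, idx - 3):idx] == b"\xef\xbf\xbd":
        return "replacement character (CPU symptom)"
    if data[idx - 1:idx] == b"\xd4":
        return "raw ISO-8859-1 byte (GPU symptom)"
    return "unknown"


# The 16 bytes shown in the hex dump above.
sample = bytes.fromhex("0043c394544520442749564f4952450d")
print(classify_o_circumflex(sample))
```

To run it against a real file, read the file in binary mode and pass the bytes to the function; it only inspects the first occurrence of the string.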

Expected behavior

International characters should be transcoded correctly from ISO-8859-1 to the output encoding.

wjxiz1992 commented 1 year ago

Thanks for narrowing this down to the root cause, I'll make a fix for it!