Describe the bug
The raw data files from TPC-DS are generated in ISO-8859-1 encoding. In that encoding, the Ô character is encoded as 0xd4.
In nds_transcode.py, we read these raw CSV files with the default encoding of UTF-8, so we don't handle the international characters correctly.
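For reference, the byte values involved can be checked directly in Python (this snippet is just an illustration, not part of nds_transcode.py):

```python
# The same character under the two encodings in question:
print("Ô".encode("iso-8859-1").hex())  # d4
print("Ô".encode("utf-8").hex())       # c394
```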
Comment from TPC-DS spec:
The data generated by dsdgen includes some international characters. Examples of international characters are Ô and É. The database must preserve these characters during loading and processing by using a character encoding such as ISO/IEC 8859-1 that includes these characters.
If we do the transcoding with the GPU, the 0xd4 byte is passed through to the output file, where it is invalid UTF-8. I ran into this when comparing CPU-transcoded data (specifically the customer table) to GPU-transcoded data. The CPU path replaces the invalid byte with 0xefbfbd (the UTF-8 replacement character U+FFFD), so when you compare the resulting output files, every row containing these international characters differs. But both the CPU- and GPU-generated files are incorrect: in each case the international characters end up with the wrong bytes rather than being transcoded.
If you modify nds_transcode.py to add .option("encoding", "ISO-8859-1") to the CSV read, the character is correctly transcoded to 0xc394 when written out as UTF-8.
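A sketch of what that change could look like; the pipe delimiter, file paths, and write options here are assumptions for illustration, not quotes from nds_transcode.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the dsdgen output in the encoding it was actually written in, so Spark
# transcodes Ô (0xd4) to valid UTF-8 (0xc3 0x94) instead of passing the raw byte through.
df = (
    spark.read
    .option("delimiter", "|")            # assumed: dsdgen emits pipe-delimited files
    .option("encoding", "ISO-8859-1")    # the proposed fix
    .csv("customer.dat")                 # placeholder input path
)
df.write.mode("overwrite").option("compression", "none").parquet("customer")
```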
Steps/Code to reproduce bug
Use the nds_transcode.py script to transcode the customer file to parquet format with no compression.
Use a binary viewer like xxd to examine the output file and verify that the character is correct.
It appears in the string CÔTE D'IVOIRE, so I usually search for VOIR and then look at the encoding of the Ô character.
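The same check can be scripted instead of using xxd; this is a minimal sketch that assumes the value is stored uncompressed so the raw bytes are visible in the file, and the file name is a placeholder:

```python
# Find "VOIR" in the output file and dump the nearby bytes: a correct UTF-8
# file shows c3 94 for Ô, while the buggy pass-through shows the raw byte d4.
with open("customer.parquet", "rb") as f:  # placeholder file name
    data = f.read()

idx = data.find(b"VOIR")
if idx >= 0:
    print(data[max(idx - 8, 0):idx + 8].hex(" "))
```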
Expected behavior
International characters should be transcoded correctly from ISO-8859-1 to the output encoding.