Feng-Jiang28 opened this issue 2 months ago
Note that cudf is capable of writing uncompressed data when the compressed version would be larger, which can happen in exceptional cases such as random data.
It looks like the issue is with the test. The data being compressed is a single value in each column, and that value will not compress well. The compressed version will be larger than the uncompressed version. libcudf's write code on the GPU will not use compression when the compressed data is larger than the uncompressed data.
I recommend updating the test to use data that is expected to produce a smaller size when compressed, and then checking whether the issue persists.
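As an illustration of why a single tiny value does not compress (this is not the exact code path libcudf takes, just a sketch assuming snappy-java, org.xerial.snappy, is on the classpath, as it normally is in a spark-shell):

import org.xerial.snappy.Snappy

// A single short value: Snappy's framing overhead makes the output larger than the input,
// which is the case where the writer keeps the data uncompressed.
val single = "42".getBytes("UTF-8")
println(s"single value: ${single.length} B -> ${Snappy.compress(single).length} B compressed")

// Highly repetitive data compresses well, so the compressed version is kept.
val repetitive = ("42" * 1000).getBytes("UTF-8")
println(s"repetitive:   ${repetitive.length} B -> ${Snappy.compress(repetitive).length} B compressed")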
I tried increasing the number of rows in the table and then writing it with the "SNAPPY" option, and the metadata then shows SNAPPY.
import csv
# Define the data
data = [{"col1": i, "p": i % 2 } for i in range(1, 1001)] # Creates 100 rows
# Specify the file name
csv_file = "data.csv"
# Write to the CSV file
with open(csv_file, mode='w', newline='') as file:
writer = csv.DictWriter(file, fieldnames=["col1", "p"])
writer.writeheader()
writer.writerows(data)
print(f"CSV file '{csv_file}' created successfully.")
import org.apache.spark.sql.execution.datasources.parquet
import org.apache.parquet.hadoop.{Footer, ParquetFileReader, ParquetFileWriter, ParquetOutputFormat}
import org.apache.hadoop.fs.{FileSystem, Path}
val df = spark.read.option("header", "true").csv("/home/fejiang/Downloads/data.csv")
(df.write
  .mode("overwrite")
  .format("parquet")
  .option("path", "/home/fejiang/Downloads/compressionTmp")
  .option("parquet.compression", "SNAPPY")
  .partitionBy("p")
  .saveAsTable("tableName")
)
// readAllFootersInParallel needs a Hadoop Configuration and a FileSystem handle;
// in the spark-shell they can come from the active session
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)
val path2 = new Path("/home/fejiang/Downloads/compressionTmp/p=1")
ParquetFileReader.readAllFootersInParallel(hadoopConf, fs.getFileStatus(path2))
metadata: {org.apache.spark.version=3.3.0, org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}}]}}}, blocks: [BlockMetaData{500, 3472 [ColumnMetaData{SNAPPY [col1] optional binary col1 (STRING) [PLAIN, RLE], 4}]}]}}]
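To make the check explicit, one option (a sketch reusing hadoopConf, fs, and path2 from the snippet above) is to iterate over the footers and print the codec of every column chunk, so an UNCOMPRESSED fallback stands out immediately:

import scala.collection.JavaConverters._

// Print the compression codec of every column chunk in every row group.
val footers = ParquetFileReader.readAllFootersInParallel(hadoopConf, fs.getFileStatus(path2)).asScala
for {
  footer <- footers
  block  <- footer.getParquetMetadata.getBlocks.asScala
  column <- block.getColumns.asScala
} println(s"${footer.getFile} ${column.getPath} -> ${column.getCodec}")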
So let's adjust the test case to use more data to verify the SNAPPY compression (that is, drop the original test case and write a new one of our own).
When rapids creates a table with 'SNAPPY' compression, it results in an UNCOMPRESSED ColumnMetaData.
You can replace SNAPPY with GZIP or ZSTD and find that you still get an UNCOMPRESSED ColumnMetaData. Reproduce:
CPU:
GPU: