databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Write custom metadata to output files with dataframe.to_parquet? #2212

Open thehomebrewnerd opened 2 years ago

thehomebrewnerd commented 2 years ago

Is it possible to save custom metadata in the file when writing to a parquet file?

For example, with Dask, users can add custom metadata to the output files like this:

import dask.dataframe  # assumes `dataframe` is a dask.dataframe.DataFrame

custom_metadata = {"custom_metadata": "my custom metadata"}
dataframe.to_parquet(path, custom_metadata=custom_metadata)

This adds the custom key/value metadata to the footer of each saved parquet file, and the metadata can then be read back with pyarrow.parquet.read_metadata.

Is it possible to do something similar with Koalas? So far, I have not been able to find a way. I also tried manually updating the metadata in the files after writing them with ks.DataFrame.to_parquet, but that causes a checksum mismatch when reading the files back into a dataframe with koalas.read_parquet.

HyukjinKwon commented 2 years ago

Can we file a JIRA in the Apache Spark JIRA (https://issues.apache.org/jira/projects/SPARK)? This repository is in maintenance mode.