Closed: ConstantinoSchillebeeckx closed this issue 2 years ago.
This is the culprit. From PyArrow's docs:
> Spark places some constraints on the types of Parquet files it will read. The option flavor='spark' will set these options automatically and also sanitize field characters unsupported by Spark SQL.
In short, when `flavor` is set to `spark`, Arrow sanitizes the column names to ensure they comply with Spark SQL.
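For illustration, a minimal sketch of that behavior using PyArrow directly (the file path is arbitrary):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A column name with a character Spark SQL does not support.
df = pd.DataFrame({"my column": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# flavor="spark" sanitizes field names at write time.
pq.write_table(table, "/tmp/example.parquet", flavor="spark")

print(pq.read_table("/tmp/example.parquet").column_names)
# ['my_column'] -- the space was replaced with an underscore
```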
One option could be to keep `spark` as the default flavor but allow the user to override it through the `pyarrow_additional_kwargs` argument to `to_parquet`. Will raise a PR and evaluate the impact.
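As a sketch of what that override could look like (the path is a placeholder, and passing `flavor` through `pyarrow_additional_kwargs` assumes the proposed change is in place):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"col with spaces": [1, 2, 3]})

# Hypothetical usage once the override is supported: flavor=None
# would skip the Spark-specific column-name sanitization.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-file.parquet",  # placeholder path
    dataset=False,
    sanitize_columns=False,
    pyarrow_additional_kwargs={"flavor": None},
)
```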
**Describe the bug**
I'm writing a dataframe to parquet with `dataset=False`; the DataFrame has spaces in its column names. Even with `sanitize_columns=False`, the column names seem to be getting sanitized (spaces replaced with underscores).

**Environment**
**To Reproduce**
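A minimal sketch of the kind of code that shows the problem (bucket and key are placeholders):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"column with spaces": [1, 2, 3]})

# Sanitization is explicitly disabled here...
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/test.parquet",  # placeholder path
    dataset=False,
    sanitize_columns=False,
)

# ...but reading back shows the spaces replaced with underscores.
print(wr.s3.read_parquet("s3://my-bucket/test.parquet").columns)
# Index(['column_with_spaces'], dtype='object')
```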