pwmcintyre opened this issue 3 years ago
Basically, when you pass the database and table name, the code will sanitize the columns.
We have this note in the docs https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.to_parquet.html#awswrangler.s3.to_parquet
Note
If database and table arguments are passed, the table name and all column names will be automatically sanitized using wr.catalog.sanitize_table_name and wr.catalog.sanitize_column_name. Please, pass sanitize_columns=True to enforce this behaviour always.
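As an illustration of what that sanitization does, here is a rough approximation in plain Python (a sketch of the naming rule only, not awswrangler's exact implementation):

```python
import re

def sanitize_name_approx(name: str) -> str:
    # Rough approximation of wr.catalog.sanitize_column_name:
    # lowercase the name and replace characters Athena disallows
    # (like "." or "-") with underscores.
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

print(sanitize_name_approx("meta.format"))  # meta_format
print(sanitize_name_approx("Meta-Format"))  # meta_format
```

So a column like `meta.format` would come out as `meta_format` after sanitization.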
The code lines below also show this behavior: https://github.com/awslabs/aws-data-wrangler/blob/54a266c860a1a9a6d4ec31aaaf6f22ccf92c1b5c/awswrangler/s3/_write_parquet.py#L504
```python
# Sanitize table to respect Athena's standards
if (sanitize_columns is True) or (database is not None and table is not None):
    df, dtype, partition_cols = _sanitize(df=df, dtype=dtype, partition_cols=partition_cols)
```
If both database and table are present, the code will call the _sanitize function.
The idea behind sanitize_columns is to always sanitize when you work with the Data Catalog; if you are working only with S3, you can choose whether you want sanitization or not.
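Put differently, the gating condition quoted above can be sketched as a small helper (hypothetical name, mirroring the condition in `_write_parquet.py`):

```python
def will_sanitize(sanitize_columns, database, table):
    # Sanitize when explicitly requested, or whenever both a database
    # and a table are given (i.e. Glue Catalog integration is in play).
    return (sanitize_columns is True) or (database is not None and table is not None)

# S3-only write: the flag decides
print(will_sanitize(False, None, None))        # False
# Catalog write: sanitized even with sanitize_columns=False
print(will_sanitize(False, "my_db", "my_tbl"))  # True
```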
For example:

```python
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({"meta.format": [1, 2], "value": ["foo", "boo"]})
print(df.head())

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://case-5720283871/dataset/",
    dataset=True,
    sanitize_columns=False,
)
```
The script above will write the files with the column names as-is.
@pwmcintyre, could you please share with me why you can't use the normal format?
@pwmcintyre what engines are you using integrated with the Glue Catalog? (Athena, Redshift Spectrum, Hive, PrestoDB, Spark)?
OK, so the best way here is to make sanitize_columns True by default, allowing users to disable it at any time, regardless of integration with the Glue Catalog.
This will be a significant breaking change and should be addressed in the next major release (3.0.0).
hey @igorborgest
the setup we have is: JSON files as input, and Glue Jobs (PySpark) to convert to Parquet on S3. Then Athena to query (PrestoDB).
We found that nested structures cause read errors in Athena when the schema evolves, and we had success flattening everything first. So we use Relationalize in our jobs.
This makes our `meta: { id: string }` look like `meta.id: string`.
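To make the flattening concrete, here is a minimal sketch of the dotted-name convention (simplified; Glue's Relationalize also handles arrays and produces separate frames for them):

```python
def flatten(record, prefix=""):
    # Recursively flatten nested dicts, joining keys with "."
    # in the same spirit as Glue's Relationalize transform.
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

print(flatten({"meta": {"id": "abc"}, "value": 1}))
# {'meta.id': 'abc', 'value': 1}
```

It is exactly these dotted column names that the sanitizer then rewrites.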
FWIW, I can just turn off database and sanitize ... I just think it's a powerful feature to prevent unintended schema corruption.
@patrick-muller — what's a "normal format"?
@pwmcintyre thanks for the details. Your use case/scenario makes total sense. We will address that in version 3.0.0.
Hi
Similar to existing issues:
I have Glue tables with `-` in them, and fields with `.` in them, and while I understand they're not supported, they do work! I'm trying to use this package to help write some parquet files, but this "feature" is preventing me.
Table I'm trying to add data to:
Dataset I'm trying to write:
Sample:
Code:
Error:
Seeking advice.