aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

possible to disable sanitize_columns? #533

Open · pwmcintyre opened this issue 3 years ago

pwmcintyre commented 3 years ago

Hi

Similar to existing issues:

I have Glue tables with - in their names, and fields with . in them, and while I understand they're not supported, they do work!

I'm trying to use this package to help write some parquet files, but this "feature" is preventing me.

Table

Table I'm trying to add data to:

[screenshot: the Glue table definition]

Dataset

Dataset I'm trying to write:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 27 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   p_version                                        1 non-null      object 
 1   asset                                            1 non-null      object 
 2   date                                             1 non-null      object 
 3   meta.format                                      1 non-null      object 
...

Sample:

df_flat.head()

[screenshot: df_flat.head() output]

Code

wr.s3.to_parquet(
    df=df_flat,
    path=path,
    dataset=True,
    mode="append",
    database=database,
    table=table,
    sanitize_columns=False, # ignored!
    partition_cols=partition_cols,
    schema_evolution=False, # prevent accidental Catalogue updates
)

Error

InvalidArgumentValue: Schema change detected: New column meta_format with type string. Please pass schema_evolution=True to allow new columns behaviour.

Seeking advice.

patrick-muller commented 3 years ago

Basically, when you pass the database and table name, the code is going to sanitize the columns.

We have this note in the docs https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.to_parquet.html#awswrangler.s3.to_parquet

Note

If database and table arguments are passed, the table name and all column names will be automatically sanitized using wr.catalog.sanitize_table_name and wr.catalog.sanitize_column_name. Please, pass sanitize_columns=True to enforce this behaviour always.

The behavior can also be seen in these lines of code: https://github.com/awslabs/aws-data-wrangler/blob/54a266c860a1a9a6d4ec31aaaf6f22ccf92c1b5c/awswrangler/s3/_write_parquet.py#L504

# Sanitize table to respect Athena's standards
if (sanitize_columns is True) or (database is not None and table is not None):
    df, dtype, partition_cols = _sanitize(df=df, dtype=dtype, partition_cols=partition_cols)

If database and table are present, the code will call the _sanitize function.
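
You can check what the sanitizer produces by calling the helpers directly (a quick sketch; the exact rules may vary by version):

import awswrangler as wr

# the same helpers to_parquet applies under the hood
print(wr.catalog.sanitize_column_name("meta.format"))  # expected: meta_format
print(wr.catalog.sanitize_table_name("my-table"))      # expected: my_table

So meta.format becomes meta_format, which the catalog treats as a brand new column, hence the schema-change error above.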

patrick-muller commented 3 years ago

The idea behind sanitize_columns is to always sanitize when you work with the Data Catalog; if you are working only with S3, you can choose whether to sanitize or not.

For example

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"meta.format": [1, 2], "value": ["foo", "boo"]})

print(df.head())

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://case-5720283871/dataset/",
    dataset=True,
    sanitize_columns=False,
)

The above script will write the files with the column names as-is.
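
Reading it back confirms the original names were kept (a sketch, assuming the same path as above):

import awswrangler as wr

# the dotted column name survives the round trip
df2 = wr.s3.read_parquet(path="s3://case-5720283871/dataset/", dataset=True)
print(df2.columns.tolist())  # expected: ['meta.format', 'value']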

patrick-muller commented 3 years ago

@pwmcintyre, could you please share with me why you can't use the normal format?

igorborgest commented 3 years ago

@pwmcintyre which engines are you using with the Glue Catalog (Athena, Redshift Spectrum, Hive, PrestoDB, Spark)?

igorborgest commented 3 years ago

OK, so the best way forward here is to make sanitize_columns True by default, allowing users to disable it at any time, regardless of integration with the Glue Catalog.

This will be a significant breaking change and should be addressed in the next major release (3.0.0).

pwmcintyre commented 3 years ago

hey @igorborgest

The setup we have is: JSON files as input, Glue Jobs (PySpark) to convert to Parquet on S3, then Athena (PrestoDB) to query.

We found that nested structures cause read errors in Athena when the schema evolves, and we had success flattening everything first. So we use the Relationalize transform in our jobs.

This makes our meta: { id: string } look like meta.id: string
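
For illustration, plain pandas does the same kind of flattening (just a sketch, not our actual Glue job):

import pandas as pd

# nested keys become dot-separated column names, like Relationalize
records = [{"asset": "a1", "meta": {"id": "x", "format": "json"}}]
df_flat = pd.json_normalize(records)
print(df_flat.columns.tolist())  # expected: ['asset', 'meta.id', 'meta.format']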

FWIW, I can just turn off database and sanitize ... I just think it's a powerful feature for preventing unintended schema corruption.

@patrick-muller, what's a "normal format"?

igorborgest commented 3 years ago

@pwmcintyre thanks for the details. Your use case/scenario makes complete sense. We will address this in version 3.0.0.