aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

to_parquet with dataset=False does not respect sanitize_columns=False #1045

Closed ConstantinoSchillebeeckx closed 2 years ago

ConstantinoSchillebeeckx commented 2 years ago

Describe the bug

I'm writing a dataframe to Parquet with `dataset=False`; the dataframe has spaces in its column names. Even with `sanitize_columns=False`, the column names are being sanitized (spaces replaced with underscores).

Environment

```
Package                           Version
--------------------------------- ---------
agate 1.6.1
alabaster 0.7.12
asn1crypto 1.4.0
attrs 21.2.0
Authlib 0.15.5
autopep8 1.5.7
awswrangler 2.12.1
azure-common 1.1.27
azure-core 1.20.1
azure-storage-blob 12.9.0
Babel 2.9.1
backports.entry-points-selectable 1.1.1
bcrypt 3.2.0
beautifulsoup4 4.10.0
boto3 1.20.3
botocore 1.23.3
bump2version 1.0.1
CacheControl 0.12.10
cachetools 4.2.4
cachy 0.3.0
Cerberus 1.3.4
certifi 2021.10.8
cffi 1.15.0
cfgv 3.3.1
chardet 4.0.0
charset-normalizer 2.0.7
cleo 0.8.1
click 7.1.2
clikit 0.6.2
cloudpickle 2.0.0
colorama 0.4.4
colorlog 6.6.0
coverage 6.1.2
crashtest 0.3.1
croniter 1.0.15
cryptography 3.4.7
dask 2021.11.1
dbt 0.20.2
dbt-bigquery 0.20.2
dbt-core 0.20.2
dbt-extractor 0.4.0
dbt-postgres 0.20.2
dbt-redshift 0.20.2
dbt-snowflake 0.20.2
decorator 5.1.0
Deprecated 1.2.13
distlib 0.3.3
distributed 2021.11.1
doc8 0.9.1
docker 5.0.3
docutils 0.16
dwh 0.22.0
epc 0.0.5
et-xmlfile 1.1.0
filelock 3.3.2
flake8 3.9.2
freezegun 1.1.0
fsspec 2021.11.0
ftfy 6.0.3
google-api-core 1.31.4
google-auth 1.35.0
google-cloud-bigquery 2.30.1
google-cloud-core 1.7.2
google-crc32c 1.3.0
google-resumable-media 2.1.0
googleapis-common-protos 1.53.0
graphviz 0.17
grpcio 1.41.1
HeapDict 1.0.1
hologram 0.0.14
html5lib 1.1
identify 2.3.5
idna 2.10
imagesize 1.3.0
importlib-metadata 4.8.2
importmagic 0.1.7
iniconfig 1.1.1
isodate 0.6.0
isort 5.10.1
Jinja2 2.11.3
jmespath 0.10.0
json-rpc 1.13.0
jsonpath-ng 1.5.3
jsonschema 3.1.1
keyring 21.8.0
leather 0.3.4
locket 0.2.1
lockfile 0.12.2
Logbook 1.5.3
logging-tree 1.9
lxml 4.6.4
mailchecker 4.0.16
MarkupSafe 2.0.1
marshmallow 3.14.0
marshmallow-oneofschema 3.0.1
mashumaro 2.5
mccabe 0.6.1
minimal-snowplow-tracker 0.0.2
more-itertools 8.11.0
moto 2.2.13
msgpack 1.0.2
msrest 0.6.21
mypy-extensions 0.4.3
networkx 2.6.3
nodeenv 1.6.0
numpy 1.21.4
oauthlib 3.1.1
openpyxl 3.0.9
opensearch-py 1.0.0
oscrypto 1.2.1
packaging 20.9
pandas 1.3.4
paramiko 2.8.0
parsedatetime 2.6
partd 1.2.0
pastel 0.2.1
pbr 5.7.0
pendulum 2.1.2
pexpect 4.8.0
pg8000 1.21.3
phonenumbers 8.12.36
pip 21.3.1
pkginfo 1.7.1
platformdirs 2.4.0
pluggy 1.0.0
ply 3.11
poetry 1.1.11
poetry-core 1.0.7
pre-commit 2.13.0
prefect 0.15.5
progressbar2 3.55.0
proto-plus 1.19.8
protobuf 3.19.1
psutil 5.8.0
psycopg2-binary 2.9.1
ptvsd 4.3.2
ptyprocess 0.7.0
py 1.11.0
pyarrow 5.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycodestyle 2.7.0
pycparser 2.21
pycryptodomex 3.11.0
pydantic 1.8.2
pyflakes 2.3.1
PyGithub 1.55
Pygments 2.10.0
PyJWT 2.3.0
pylev 1.4.0
PyMySQL 1.0.2
PyNaCl 1.4.0
pyOpenSSL 20.0.1
pyparsing 3.0.5
pyrsistent 0.18.0
pysftp 0.2.9
pytest 6.2.5
pytest-cov 2.12.1
pytest-freezegun 0.4.2
pytest-sftpserver 1.3.0
python-benedict 0.24.3
python-box 5.4.1
python-dateutil 2.8.2
python-decouple 3.4
python-dotenv 0.19.2
python-fsutil 0.5.0
python-slugify 5.0.2
python-utils 2.5.6
pytimeparse 1.1.8
pytz 2021.3
pytzdata 2020.1
PyYAML 6.0
redshift-connector 2.0.889
requests 2.25.1
requests-aws4auth 1.1.1
requests-mock 1.9.3
requests-oauthlib 1.3.0
requests-toolbelt 0.9.1
responses 0.15.0
restructuredtext-lint 1.3.2
rsa 4.7.2
s3path 0.3.2
s3transfer 0.5.0
scramp 1.4.1
setuptools 57.4.0
sexpdata 0.0.3
shellingham 1.4.0
simple-salesforce 1.11.4
six 1.16.0
smart-open 5.2.1
snowballstemmer 2.1.0
snowflake-connector-python 2.4.6
sortedcontainers 2.4.0
soupsieve 2.3.1
Sphinx 4.3.0
sphinx-rtd-theme 0.5.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sqlparse 0.3.1
stevedore 3.5.0
tabulate 0.8.9
tblib 1.7.0
text-unidecode 1.3
toml 0.10.2
tomlkit 0.7.2
toolz 0.11.2
tornado 6.1
typer 0.4.0
typing-extensions 3.10.0.2
urllib3 1.26.7
virtualenv 20.10.0
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.2.1
Werkzeug 2.0.2
wrapt 1.13.3
xmltodict 0.12.0
zict 2.0.0
zipp 3.6.0
```

To Reproduce

```python
import pandas as pd
from awswrangler import s3

df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    columns=["col 1", "col 2"],
)

out = "s3://big-bad-bucket/foo.parquet"

s3.to_parquet(df, out, sanitize_columns=False, dataset=False)

# Returns Index(['col_1', 'col_2'], dtype='object'),
# not Index(['col 1', 'col 2'], dtype='object')
s3.read_parquet(out).columns
```
jaidisido commented 2 years ago

This is the culprit: `to_parquet` hardcodes `flavor="spark"` when writing through PyArrow. From PyArrow's docs:

> Spark places some constraints on the types of Parquet files it will read. The option flavor='spark' will set these options automatically and also sanitize field characters unsupported by Spark SQL.

In short, when `flavor` is set to `spark`, Arrow sanitizes the column names to ensure they comply with Spark SQL, regardless of the `sanitize_columns` setting.
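The PyArrow behavior is easy to reproduce on its own, with no S3 or awswrangler involved. A minimal local sketch (file names are arbitrary):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([["a", "b"]], columns=["col 1", "col 2"])
table = pa.Table.from_pandas(df, preserve_index=False)

# Default flavor: column names are written as-is.
pq.write_table(table, "plain.parquet")
print(pq.read_table("plain.parquet").column_names)  # ['col 1', 'col 2']

# flavor="spark": names are sanitized for Spark SQL compatibility.
pq.write_table(table, "spark.parquet", flavor="spark")
print(pq.read_table("spark.parquet").column_names)  # ['col_1', 'col_2']
```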

One option could be to keep `spark` as the default flavor but allow the user to override it through a `pyarrow_additional_kwargs` argument to `to_parquet`. Will raise a PR and evaluate the impact.
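For reference, a sketch of what that override could look like from the caller's side; the keyword name and its semantics follow the proposal above and are not final until the PR lands:

```python
import pandas as pd
import awswrangler as wr

df = pd.DataFrame([["a", "b"]], columns=["col 1", "col 2"])

# Hypothetical usage of the proposed escape hatch: forwarding
# flavor=None to PyArrow would skip the Spark-specific sanitization,
# so the spaces in the column names survive the round trip.
wr.s3.to_parquet(
    df,
    path="s3://big-bad-bucket/foo.parquet",
    dataset=False,
    sanitize_columns=False,
    pyarrow_additional_kwargs={"flavor": None},
)
```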