aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Handling encoding issues in athena.read_sql_query #317

Closed. DavideBossoli88 closed this issue 4 years ago

DavideBossoli88 commented 4 years ago

Hi,

I ran a query (with ctas_approach = False) in a SageMaker processing job and got encoding errors. The same query, with the same setting, ran fine in a SageMaker notebook instance. The processing job also works when I set ctas_approach = True, but I have no idea why.

In general, what is the best way to handle encodings in athena.read_sql_query? Would it be possible to pass extra pandas parameters?

Thanks!

igorborgest commented 4 years ago

Hi @DavideBossoli88, thanks for reaching out.

Could you copy and paste the error Traceback (log) here, please?

DavideBossoli88 commented 4 years ago

Hi Igor,

Below is the error from the SageMaker processing job.

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/00_create_base_contratti.py", line 135, in <module>
    boto3_session = session
  File "/usr/local/lib/python3.6/site-packages/awswrangler/athena.py", line 542, in read_sql_query
    session=session,
  File "/usr/local/lib/python3.6/site-packages/awswrangler/athena.py", line 665, in _resolve_query_without_cache
    boto3_session=session,
  File "/usr/local/lib/python3.6/site-packages/awswrangler/s3/_read.py", line 437, in read_csv
    pandas_kwargs,
  File "/usr/local/lib/python3.6/site-packages/awswrangler/s3/_read.py", line 129, in _read_text
    for p in paths
  File "/usr/local/lib/python3.6/site-packages/awswrangler/s3/_read.py", line 129, in <listcomp>
    for p in paths
  File "/usr/local/lib/python3.6/site-packages/awswrangler/s3/_read.py", line 198, in _read_text_full
    df: pd.DataFrame = parser_func(f, pandas_kwargs)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, kwds)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 720, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4857: ordinal not in range(128)
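
The final line of the traceback says that UTF-8 encoded bytes were decoded with the 'ascii' codec. One plausible explanation, which is an assumption and is not confirmed anywhere in this thread, is that the processing job container runs with a C/POSIX locale, so Python 3.6 falls back to ASCII as its default text encoding, while the notebook instance uses a UTF-8 locale. The sketch below reproduces the same class of error using only the standard library. The sample data and the `read_rows` helper are hypothetical and are not part of awswrangler.

```python
import csv
import io

# Hypothetical sample of a UTF-8 CSV result file. The degree sign encodes
# to b"\xc2\xb0", so the payload contains the byte 0xc2 from the traceback.
raw = "id,label\n1,20°C\n".encode("utf-8")


def read_rows(data: bytes, encoding: str) -> list:
    """Decode CSV bytes with an explicit encoding and return the parsed rows."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text))


# Decoding as ASCII fails on the first non-ASCII byte.
try:
    read_rows(raw, "ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the same bytes as UTF-8 succeeds.
rows = read_rows(raw, "utf-8")

print(ascii_error)
print(rows)
```

If the locale really is the cause, making the text encoding explicit wherever the file is opened, or giving the container a UTF-8 locale, should avoid the failure. Whether awswrangler exposes such a setting for this code path is not established in this thread.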

igorborgest commented 4 years ago

Thank you @DavideBossoli88, we will try to replicate it.

igorborgest commented 4 years ago

@DavideBossoli88, I was not able to replicate your issue. Do you have any news on your side?

Also, could you provide more details?

  1. What kind of query are you running? Simple SELECT queries?
  2. Are you passing any encryption or workgroup arguments?
  3. Is your workgroup configured with encryption?

igorborgest commented 4 years ago

@DavideBossoli88 in the meantime, could you test the version in our development branch?

pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev

We've refactored the Athena module (PR #325), and it may solve your issue.

igorborgest commented 4 years ago

Released in 1.7.0!

igorborgest commented 4 years ago

Hi @DavideBossoli88

This is totally off-topic, but we are starting a "Who uses AWS Data Wrangler?" section, so feel free to add yourself if you want 😄.

cotrariello84 commented 1 year ago

Has the bug been fixed?

malachi-constant commented 1 year ago

@cotrariello84 Yes, the fix was released in 1.7.0.

cotrariello84 commented 1 year ago

Hi @malachi-constant, I still have the bug in version 2.17.0. I had to use pyathena to work around it, but I'd like to use wrangler.

malachi-constant commented 1 year ago

@cotrariello84 Can you open a new issue with specifics on the error you're encountering as well as steps to replicate the scenario?
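
When opening such an issue, it may help to include the encoding-related settings of the environment where the error occurs, since the original report failed in one environment and worked in another. The following is a minimal sketch using only the standard library. The `encoding_report` helper and its choice of fields are hypothetical and are not something the maintainers asked for.

```python
import locale
import os
import sys


def encoding_report() -> dict:
    """Collect the interpreter and locale settings that affect text decoding."""
    return {
        "python_version": sys.version.split()[0],
        # Default used by open() when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Environment variables that influence the locale; None if unset.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running it in both the failing and the working environment and comparing the two outputs would show whether the default encodings differ.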