aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.86k stars 683 forks source link

UnicodeDecodeError in wr.athena.read_sql_query() #1131

Closed diegopmayer closed 2 years ago

diegopmayer commented 2 years ago

Describe the bug This error had not occurred in Linux, it's all ok, works perfectly. but has occurred when I ran the code on windows (me and other devs in the team), and just give the error when I put the parameter ctas_approach=False, if it is =True, there no error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 606: character maps to

Environment pip freeze asn1crypto==1.4.0 awswrangler==2.13.0 beautifulsoup4==4.10.0 boto3==1.20.26 botocore==1.23.26 certifi==2021.10.8 charset-normalizer==2.0.9 cramjam==2.5.0 decorator==5.1.0 et-xmlfile==1.1.0 fastparquet==0.7.1 fsspec==2021.11.0 idna==3.3 jmespath==0.10.0 jsonpath-ng==1.5.3 lxml==4.7.1 numpy==1.21.4 openpyxl==3.0.9 opensearch-py==1.0.0 packaging==21.3 pandas==1.3.4 pg8000==1.22.1 ply==3.11 progressbar2==3.55.0 pyarrow==6.0.0 PyMySQL==1.0.2 pyodbc==4.0.32 pyparsing==3.0.6 python-dateutil==2.8.2 python-utils==2.6.3 pytz==2021.3 redshift-connector==2.0.902 requests==2.26.0 requests-aws4auth==1.1.1 s3fs==0.4.2 s3transfer==0.5.0 scramp==1.4.1 six==1.16.0 soupsieve==2.3.1 thrift==0.15.0 urllib3==1.26.7

To Reproduce Code import pandas as pd
import awswrangler as wr

df = wr.athena.read_sql_query( sql="SELECT * FROM benefits WHERE CAST(landing_date as date) = CAST(current_date as date)", database="bronze", ctas_approach=False)

jaidisido commented 2 years ago

Hi @diegopmayer - we don't have access to a Windows machine to replicate this sadly.

To help us narrow down the issue, could you please provide the full error trace so we can at least know at which point it breaks?

Ideally, we would need some sample anonymised data too

diegopmayer commented 2 years ago

Hello @jaidisido, first of all, thanks for your return, and following the code with an error, thanks.

import pandas as pd
import awswrangler as wr
df = wr.athena.read_sql_query( sql="SELECT * FROM benefits WHERE CAST(landing_date as date) = CAST(current_date as date)", database="bronze", ctas_approach=False)

Traceback (most recent call last):
File "", line 1, in
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler_config.py", line 450, in wrapper
return function(args)
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\athena_read.py", line 863, in read_sql_query
return _resolve_query_without_cache(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\athena_read.py", line 581, in _resolve_query_without_cache
return _resolve_query_without_cache_regular(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\athena_read.py", line 512, in _resolve_query_without_cache_regular
return _fetch_csv_result(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\athena_read.py", line 296, in _fetch_csv_result
ret = s3.read_csv(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\s3_read_text.py", line 294, in read_csv
return _read_text(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\s3_read_text.py", line 149, in _read_text
ret = _read_text_file(
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\awswrangler\s3_read_text.py", line 91, in _read_text_file
df: pd.DataFrame = parser_func(f,
pandas_kwargs) File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\util_decorators.py", line 311, in wrapper
return func(*args, kwargs)
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer,
kwds)
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 811, in init
self._engine = self._make_engine(self.engine)
File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 1040, in _make_engine
return mapping[engine](self.f, self.options) # type: ignore[call-arg] File "C:\data_lake\elt_vilesoft.venv\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 69, in init
self._reader = parsers.TextReader(self.handles.handle,
kwds) File "pandas_libs\parsers.pyx", line 542, in pandas._libs.parsers.TextReader.cinit
header, table_width, unnamed_cols = self._get_header(prelim_header) File "pandas_libs\parsers.pyx", line 642, in pandas._libs.parsers.TextReader._get_header
self._tokenize_rows(hr + 2)
File "pandas_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
raise_parser_error('Error tokenizing data', self.parser)
File "pandas_libs\parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
raise old_exc UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 606: character maps to

jaidisido commented 2 years ago

So the error occurs here after the results of the query are saved as a csv file and Wrangler attempts to read it. The default encoding seems to be working fine on Linux but causing issues on Windows.

As you mention ctas_approach=True works because it's reading a parquet file, not csv. In the next version of Wrangler (2.14.0 - soon to be released), the unload_approach will enable reading a file as a parquet file instead of csv with ctas set to False (see unload_approach=True and ctas_approach=False in the latest docs link)

diegopmayer commented 2 years ago

Thanks @jaidisido