aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.94k stars 701 forks source link

From redshift DB, `read_sql_query` distorts values, but `unload` does not #2216

Open lampretl opened 1 year ago

lampretl commented 1 year ago

Describe the bug

In our private redshift database on AWS, I have a table with a column probability of type numeric(38, 20), its values are between 0 and 1. In AWS SageMaker jupyter notebook, I query/download the content of that table. The values obtained via read_sql_query are also negative, but the ones obtained via unload are all positive.

How to Reproduce

When executing the following code

query = "SELECT user_id, probability p FROM models.churn_short_term_predictions"
df1 = wr.redshift.read_sql_query(query, con=con)
print(df1.p.min())
df2 = wr.redshift.unload(query, con=con, path=staging_area()) 
print(df2.p.min())

the output is unexpectedly:

-0.09223216322371685616
0.00028089172873606330

Expected behavior

Outputs df1.p.min() and df2.p.min() should be equal. And certainly, both must be between 0 and 1.

Your project

No response

Screenshots

photo_2023-04-20_18-37-54

Screenshot at 2023-04-20 18-41-48

OS

Linux (AWS)

Python version

3.8.16

AWS SDK for pandas version

3.0.0

Additional context

No response

kukushking commented 1 year ago

Thanks @lampretl looking into it

kukushking commented 1 year ago

HI @lampretl I've added a test case to simulate overflow which is what most likely happens here but wasn't able to reproduce so far. Is this consistently reproducing for you? What are the actual MIN and MAX values in Redshift?

lampretl commented 1 year ago

If I run that jupyter cell several times, over different days, I always get the slightly negative result with read_sql_query.

According to DBeaver, the min and max values in my table are 0.00028089172873606330 and 0.98937731753215710000. The table has 2360571 rows.

kukushking commented 1 year ago

Could you share the output of pip freeze, please?

lampretl commented 1 year ago

Of course, if I run ! pip freeze in a jupyter notebook in SageMaker Studio on AWS, I get:

alembic==1.10.3
appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
awscli==1.27.70
awsio @ https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
awswrangler==3.0.0
backcall==0.2.0
bcrypt==4.0.1
beautifulsoup4==4.12.2
bokeh==2.4.3
boto3==1.26.70
botocore==1.29.70
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764652625/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
certifi==2022.12.7
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179356964/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
click==8.1.3
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1674202310934/work
cmaes==0.9.1
colorama==0.4.4
colorlog==6.7.0
conda==22.11.1
conda-content-trust @ file:///home/conda/feedstock_root/build_artifacts/conda-content-trust_1621370699668/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1669907009957/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1669733752472/work
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1675828607636/work
cycler==0.11.0
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1673054071802/work
decorator==5.1.1
dgl==0.9.1.post1
dill==0.3.6
docutils==0.16
executing==1.2.0
fonttools==4.38.0
fsspec==2023.1.0
gevent==22.10.2
google-pasta==0.2.0
greenlet==2.0.2
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1675704810568/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
imageio==2.25.1
importlib-metadata==4.13.0
importlib-resources==5.10.2
inotify-simple==1.2.1
ipykernel==5.5.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1663332044897/work
jsonschema==4.17.3
jupyter-client==6.1.5
jupyter-core==4.9.2
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/libmambapy
lightgbm==3.3.5
llvmlite==0.36.0
lxml==4.9.2
Mako==1.2.4
mamba @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/mamba
MarkupSafe==2.1.2
matplotlib==3.5.3
matplotlib-inline==0.1.6
multiprocess==0.70.14
natsort==8.3.1
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1673151334029/work
numba==0.53.1
numpy==1.21.6
opencv-python==4.7.0.68
optuna==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1673482170163/work
pandas==1.3.5
paramiko==3.0.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathos==0.3.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
plotly==5.9.0
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1667232663820/work
pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1643032624649/work
portalocker==2.7.0
pox==0.3.2
ppft==1.7.6.6
prompt-toolkit==3.0.36
protobuf==3.20.2
protobuf3-to-dict==0.1.5
psutil==5.9.0
psycopg2-binary==2.9.6
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==11.0.0
pyasn1==0.4.8
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666834293299/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyfunctional==1.4.3
Pygments==2.14.0
pyinstrument==3.4.2
pyinstrument-cext==0.2.4
PyNaCl==1.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1672659226110/work
pyparsing==3.0.9
pyrsistent==0.19.3
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1673864280276/work
PyYAML==5.4.1
pyzmq==19.0.0
redshift-connector==2.0.910
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1673863902341/work
retrying==1.3.4
rsa==4.7.2
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1666827402316/work
ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1670412724006/work
s3fs==0.4.2
s3transfer==0.6.0
sagemaker==2.146.0
sagemaker-experiments==0.1.42
sagemaker-pytorch-training==2.7.0
sagemaker-training==4.4.5
schema==0.7.5
scikit-learn==1.0.2
scipy==1.7.3
scramp==1.4.4
seaborn==0.12.2
seedir==0.4.2
shap @ file:///home/conda/feedstock_root/build_artifacts/shap_1655716944211/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
slicer @ file:///home/conda/feedstock_root/build_artifacts/slicer_1608146800664/work
smclarify==0.3
smdebug==1.0.24b20230214
smdebug-rulesconfig==1.0.1
soupsieve==2.4.1
SQLAlchemy==1.4.47
sqlalchemy-redshift==0.8.14
stack-data==0.6.2
tabulate==0.9.0
tenacity==8.2.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1643647933166/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch @ https://aws-pytorch-unified-cicd-binaries.s3.us-west-2.amazonaws.com/r1.12.1_sm/20230106-033032/626f1a6ec58817e524705e0917dbab97c81b440e/torch-1.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cpu/torchaudio-0.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchdata @ https://download.pytorch.org/whl/test/torchdata-0.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.13.1%2Bcpu-cp38-cp38-linux_x86_64.whl
tornado==6.2
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1662214488106/work
traitlets==5.9.0
typing_extensions==4.4.0
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1673452138552/work
wcwidth==0.2.6
Werkzeug==2.2.2
widgetsnbextension==4.0.7
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1675982654259/work
zope.event==4.6
zope.interface==5.5.2
zstandard==0.19.0
lampretl commented 1 year ago

@kukushking Have you been able to reproduce the bug? If not, shall I post here a dataframe containing the table that causes the issue?

kukushking commented 1 year ago

@lampretl Unfortunately no I wasn't able to reproduce this. Yes please, if you could share here a test case that would be much appreciated!

lampretl commented 1 year ago

@kukushking I was not able to reproduce the wrong output by uploading a dataframe, but I have found the source of the problem: internal casting of redshift's numeric(38,20) to pandas's float64. Have a look at my screenshot:

Screenshot at 2023-05-31 16-54-49 When I run the usual query, read_sql_query fetches the data incorrectly, but when I use explicit casting with ::numeric(18,17), the obtained result seems correct. Also, you can see how much the results differ, e.g. in the first row, we have -2% instead of 70%.

So my questions are:

kukushking commented 1 year ago

Hi @lampretl yes looks like that's the issue. I still wasn't able to reproduce with the same numpy, pandas, and arrow versions - all returned values are decimals wrapped in pandas object, but there is something I think you can do:

import pyarrow as pa

df1 = wr.redshift.read_sql_query(query, con=redshift_con, dtype={"c0": "int64", "p": pa.decimal128(38, 37)})

You can force read_sql_query to return a decimal with expected precision/scale. Please give it a try. If that matches your db schema, you should have no issues.