Open: lampretl opened this issue 1 year ago
Thanks @lampretl looking into it
Hi @lampretl, I've added a test case to simulate an overflow, which is most likely what happens here, but I wasn't able to reproduce it so far. Is this consistently reproducing for you? What are the actual MIN and MAX values in Redshift?
If I run that Jupyter cell several times, over different days, I always get the slightly negative result with `read_sql_query`. According to DBeaver, the min and max values in my table are 0.00028089172873606330 and 0.98937731753215710000. The table has 2360571 rows.
Could you share the output of `pip freeze`, please?
Of course. If I run `! pip freeze` in a Jupyter notebook in SageMaker Studio on AWS, I get:
alembic==1.10.3
appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
awscli==1.27.70
awsio @ https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
awswrangler==3.0.0
backcall==0.2.0
bcrypt==4.0.1
beautifulsoup4==4.12.2
bokeh==2.4.3
boto3==1.26.70
botocore==1.29.70
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764652625/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
certifi==2022.12.7
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179356964/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
click==8.1.3
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1674202310934/work
cmaes==0.9.1
colorama==0.4.4
colorlog==6.7.0
conda==22.11.1
conda-content-trust @ file:///home/conda/feedstock_root/build_artifacts/conda-content-trust_1621370699668/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1669907009957/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1669733752472/work
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1675828607636/work
cycler==0.11.0
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1673054071802/work
decorator==5.1.1
dgl==0.9.1.post1
dill==0.3.6
docutils==0.16
executing==1.2.0
fonttools==4.38.0
fsspec==2023.1.0
gevent==22.10.2
google-pasta==0.2.0
greenlet==2.0.2
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1675704810568/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
imageio==2.25.1
importlib-metadata==4.13.0
importlib-resources==5.10.2
inotify-simple==1.2.1
ipykernel==5.5.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1663332044897/work
jsonschema==4.17.3
jupyter-client==6.1.5
jupyter-core==4.9.2
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/libmambapy
lightgbm==3.3.5
llvmlite==0.36.0
lxml==4.9.2
Mako==1.2.4
mamba @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/mamba
MarkupSafe==2.1.2
matplotlib==3.5.3
matplotlib-inline==0.1.6
multiprocess==0.70.14
natsort==8.3.1
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1673151334029/work
numba==0.53.1
numpy==1.21.6
opencv-python==4.7.0.68
optuna==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1673482170163/work
pandas==1.3.5
paramiko==3.0.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathos==0.3.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
plotly==5.9.0
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1667232663820/work
pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1643032624649/work
portalocker==2.7.0
pox==0.3.2
ppft==1.7.6.6
prompt-toolkit==3.0.36
protobuf==3.20.2
protobuf3-to-dict==0.1.5
psutil==5.9.0
psycopg2-binary==2.9.6
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==11.0.0
pyasn1==0.4.8
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666834293299/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyfunctional==1.4.3
Pygments==2.14.0
pyinstrument==3.4.2
pyinstrument-cext==0.2.4
PyNaCl==1.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1672659226110/work
pyparsing==3.0.9
pyrsistent==0.19.3
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1673864280276/work
PyYAML==5.4.1
pyzmq==19.0.0
redshift-connector==2.0.910
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1673863902341/work
retrying==1.3.4
rsa==4.7.2
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1666827402316/work
ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1670412724006/work
s3fs==0.4.2
s3transfer==0.6.0
sagemaker==2.146.0
sagemaker-experiments==0.1.42
sagemaker-pytorch-training==2.7.0
sagemaker-training==4.4.5
schema==0.7.5
scikit-learn==1.0.2
scipy==1.7.3
scramp==1.4.4
seaborn==0.12.2
seedir==0.4.2
shap @ file:///home/conda/feedstock_root/build_artifacts/shap_1655716944211/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
slicer @ file:///home/conda/feedstock_root/build_artifacts/slicer_1608146800664/work
smclarify==0.3
smdebug==1.0.24b20230214
smdebug-rulesconfig==1.0.1
soupsieve==2.4.1
SQLAlchemy==1.4.47
sqlalchemy-redshift==0.8.14
stack-data==0.6.2
tabulate==0.9.0
tenacity==8.2.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1643647933166/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch @ https://aws-pytorch-unified-cicd-binaries.s3.us-west-2.amazonaws.com/r1.12.1_sm/20230106-033032/626f1a6ec58817e524705e0917dbab97c81b440e/torch-1.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cpu/torchaudio-0.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchdata @ https://download.pytorch.org/whl/test/torchdata-0.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.13.1%2Bcpu-cp38-cp38-linux_x86_64.whl
tornado==6.2
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1662214488106/work
traitlets==5.9.0
typing_extensions==4.4.0
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1673452138552/work
wcwidth==0.2.6
Werkzeug==2.2.2
widgetsnbextension==4.0.7
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1675982654259/work
zope.event==4.6
zope.interface==5.5.2
zstandard==0.19.0
@kukushking Have you been able to reproduce the bug? If not, shall I post here a dataframe containing the table that causes the issue?
@lampretl Unfortunately no, I wasn't able to reproduce this. Yes please, if you could share a test case here, that would be much appreciated!
@kukushking I was not able to reproduce the wrong output by uploading a dataframe, but I have found the source of the problem: internal casting of Redshift's `numeric(38,20)` to pandas's `float64`. Have a look at my screenshot:
When I run the usual query, `read_sql_query` fetches the data incorrectly, but when I use explicit casting with `::numeric(18,17)`, the obtained result seems correct. Also, you can see how much the results differ; e.g. in the first row, we get -2% instead of 70%.
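The overflow suspected earlier in this thread can be sketched with the standard library alone: a `numeric(38, 20)` value is backed by an unscaled integer (value × 10^20), which can exceed the signed 64-bit range; a two's-complement wraparound then turns a valid probability into a small negative number. The value 0.72 below is an assumption, chosen only because it lands near the -2% visible in the screenshot.

```python
from decimal import Decimal

def wrap_int64(n: int) -> int:
    """Simulate storing n in a signed 64-bit integer (two's-complement wrap)."""
    n &= (1 << 64) - 1
    return n - (1 << 64) if n >= (1 << 63) else n

# numeric(38, 20) is stored as an unscaled integer: value * 10**20.
value = Decimal("0.72")            # hypothetical probability near the screenshot's 70%
unscaled = int(value.scaleb(20))   # 72000000000000000000, which exceeds int64 max
assert unscaled > 2**63 - 1

# Wrapping that unscaled integer and rescaling gives a slightly negative result.
wrapped = Decimal(wrap_int64(unscaled)).scaleb(-20)
print(wrapped)                     # -0.01786976294838206464, i.e. roughly -2%
```

This reproduces the shape of the symptom (a value near 0.7 coming back as roughly -0.02) without any database involved.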
So my questions are:

- Is this a bug in `read_sql_query`?
- As a workaround, should I use `unload`?

Hi @lampretl, yes, it looks like that's the issue. I still wasn't able to reproduce it with the same numpy, pandas, and arrow versions (all returned values are decimals wrapped in pandas objects), but there is something I think you can do:
```python
import pyarrow as pa

df1 = wr.redshift.read_sql_query(query, con=redshift_con, dtype={"c0": "int64", "p": pa.decimal128(38, 37)})
```

You can force `read_sql_query` to return a decimal with the expected precision/scale. Please give it a try. If that matches your db schema, you should have no issues.
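As background on why forcing a decimal dtype matters even when nothing overflows, here is a minimal stdlib sketch using the maximum value reported earlier in this thread: `float64` carries only about 15 to 17 significant decimal digits, so a 20-digit `numeric(38, 20)` value cannot survive a round-trip through it.

```python
from decimal import Decimal

# The maximum value reported above, with the full 20-digit scale.
exact = Decimal("0.98937731753215710000")

# Casting to float64 produces the nearest binary approximation; the
# original decimal digits are silently rounded and not recoverable.
as_float = float(exact)
assert Decimal(as_float) != exact
```

Keeping the column as `decimal.Decimal` objects (which the `dtype` override above does) avoids this lossy cast entirely.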
**Describe the bug**

In our private Redshift database on AWS, I have a table with a column `probability` of type `numeric(38, 20)`; its values are between 0 and 1. In an AWS SageMaker Jupyter notebook, I query/download the content of that table. The values obtained via `read_sql_query` include negative ones, but the ones obtained via `unload` are all positive.

**How to Reproduce**
When executing the code shown in the screenshot, the output is unexpectedly a slightly negative value.
**Expected behavior**

Outputs `df1.p.min()` and `df2.p.min()` should be equal. And certainly, both must be between 0 and 1.

**Your project**
No response
**Screenshots**

**OS**

Linux (AWS)

**Python version**

3.8.16

**AWS SDK for pandas version**

3.0.0

**Additional context**
No response