8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

Bump Scikit version (Issue #53) #54

Closed nielsuit227 closed 3 years ago

nielsuit227 commented 3 years ago

Ran the tests and all passed. One future warning from scikit for version 1.1:

FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
FlorianWetschoreck commented 3 years ago

Thank you for trying this out!

It would be great if we can require sklearn <2 instead of having to rely on <1.1

Do you understand the FutureWarning and where it is triggered within ppscore? And how might we fix it right away?

nielsuit227 commented 3 years ago

I don't know where it happens in calculation.py. The warning is triggered by the last assertion in test_score, which uses the categorical variable 'Sex_object' as input, passed as a string.

fwetdb commented 3 years ago

Understood, thank you for that insight. It would be highly appreciated if you could dig one level deeper to see where it occurs within ppscore/calculation.py, and to understand what we can do to prevent the warning entirely so that we are future-safe (or whether we cannot, for example if it is raised inside numpy).
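One common way to locate the origin of a warning like this is to promote FutureWarning to an exception, so that the traceback points at the call that raised it. The following is a minimal sketch using only the standard library; `emit_future_warning` and `find_warning_origin` are hypothetical names introduced for illustration, and the stand-in function is not the actual ppscore or sklearn code path.

```python
import traceback
import warnings


def emit_future_warning():
    # Hypothetical stand-in for a library call that emits a FutureWarning.
    warnings.warn("this behavior is deprecated", FutureWarning)
    return "ok"


def find_warning_origin(func):
    """Run func with FutureWarning promoted to an exception.

    Returns the formatted traceback if a FutureWarning was raised,
    otherwise None.
    """
    with warnings.catch_warnings():
        # Inside this block, any FutureWarning is raised as an exception.
        warnings.simplefilter("error", FutureWarning)
        try:
            func()
        except FutureWarning:
            return traceback.format_exc()
    return None


trace = find_warning_origin(emit_future_warning)
print(trace)
```

When running a test suite, pytest's `-W error::FutureWarning` option has a similar effect: the test that triggers the warning fails with a full traceback.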

JDM288 commented 3 years ago

Hey, I copied the master branch to my own system and changed "scikit-learn>=0.20.2,<1.0.0" to "scikit-learn>=0.20.2,<2.0.0", and I did not get this error. Has any more work been done to find or fix this issue?

fwetdb commented 3 years ago

@JDM288 thanks for weighing in - can you share the pip list for your environment?

JDM288 commented 3 years ago

Package Version

absl-py 0.12.0
alabaster 0.7.12
albumentations 0.1.12
altair 4.1.0
appdirs 1.4.4
argcomplete 1.12.3
argon2-cffi 21.1.0
arviz 0.11.4
astor 0.8.1
astropy 4.3.1
astunparse 1.6.3
atari-py 0.2.9
atomicwrites 1.4.0
attrs 21.2.0
audioread 2.1.9
autograd 1.3
Babel 2.9.1
backcall 0.2.0
beautifulsoup4 4.6.3
bleach 4.1.0
blis 0.4.1
bokeh 2.3.3
Bottleneck 1.3.2
branca 0.4.2
bs4 0.0.1
CacheControl 0.12.10
cached-property 1.5.2
cachetools 4.2.4
catalogue 1.0.0
certifi 2021.10.8
cffi 1.15.0
cftime 1.5.1.1
chardet 3.0.4
charset-normalizer 2.0.7
click 7.1.2
cloudpickle 1.3.0
cmake 3.12.0
cmdstanpy 0.9.5
colorcet 2.0.6
colorlover 0.3.0
community 1.0.0b1
contextlib2 0.5.5
convertdate 2.3.2
coverage 3.7.1
coveralls 0.5
crcmod 1.7
cufflinks 0.17.3
cvxopt 1.2.7
cvxpy 1.0.31
cycler 0.11.0
cymem 2.0.6
Cython 0.29.24
daft 0.0.4
dask 2.12.0
datascience 0.10.6
debugpy 1.0.0
decorator 4.4.2
defusedxml 0.7.1
descartes 1.1.0
dill 0.3.4
distributed 1.25.3
dlib 19.18.0
dm-tree 0.1.6
docopt 0.6.2
docutils 0.18
dopamine-rl 1.0.5
earthengine-api 0.1.288
easydict 1.9
ecos 2.0.7.post1
editdistance 0.5.3
en-core-web-sm 2.2.5
entrypoints 0.3
ephem 4.1
et-xmlfile 1.1.0
fa2 0.3.5
fastai 1.0.61
fastdtw 0.3.4
fastprogress 1.0.0
fastrlock 0.8
fbprophet 0.7.1
feather-format 0.4.1
filelock 3.3.2
firebase-admin 4.4.0
fix-yahoo-finance 0.0.22
Flask 1.1.4
flatbuffers 2.0
folium 0.8.3
future 0.16.0
gast 0.4.0
GDAL 2.2.2
gdown 3.6.4
gensim 3.6.0
geographiclib 1.52
geopy 1.17.0
gin-config 0.5.0
glob2 0.7
google 2.0.3
google-api-core 1.26.3
google-api-python-client 1.12.8
google-auth 1.35.0
google-auth-httplib2 0.0.4
google-auth-oauthlib 0.4.6
google-cloud-bigquery 1.21.0
google-cloud-bigquery-storage 1.1.0
google-cloud-core 1.0.3
google-cloud-datastore 1.8.0
google-cloud-firestore 1.7.0
google-cloud-language 1.2.0
google-cloud-storage 1.18.1
google-cloud-translate 1.5.0
google-colab 1.0.0
google-pasta 0.2.0
google-resumable-media 0.4.1
googleapis-common-protos 1.53.0
googledrivedownloader 0.4
graphviz 0.10.1
greenlet 1.1.2
grpcio 1.41.1
gspread 3.0.1
gspread-dataframe 3.0.8
gym 0.17.3
h5py 3.1.0
HeapDict 1.0.1
hijri-converter 2.2.2
holidays 0.10.5.2
holoviews 1.14.6
html5lib 1.0.1
httpimport 0.5.18
httplib2 0.17.4
httplib2shim 0.0.3
humanize 0.5.1
hyperopt 0.1.2
ideep4py 2.0.0.post3
idna 2.10
imageio 2.4.1
imagesize 1.3.0
imbalanced-learn 0.8.1
imblearn 0.0
imgaug 0.2.9
importlib-metadata 4.8.2
importlib-resources 5.4.0
imutils 0.5.4
inflect 2.1.0
iniconfig 1.1.1
intel-openmp 2021.4.0
intervaltree 2.1.0
ipykernel 4.10.1
ipython 5.5.0
ipython-genutils 0.2.0
ipython-sql 0.3.9
ipywidgets 7.6.5
itsdangerous 1.1.0
jax 0.2.21
jaxlib 0.1.71+cuda111
jdcal 1.4.1
jedi 0.18.0
jieba 0.42.1
Jinja2 2.11.3
joblib 1.1.0
jpeg4py 0.1.4
jsonschema 2.6.0
jupyter 1.0.0
jupyter-client 5.3.5
jupyter-console 5.2.0
jupyter-core 4.9.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.2
kaggle 1.5.12
kapre 0.3.5
keras 2.7.0
Keras-Preprocessing 1.1.2
keras-vis 0.4.1
kiwisolver 1.3.2
korean-lunar-calendar 0.2.1
libclang 12.0.0
librosa 0.8.1
lightgbm 2.2.3
llvmlite 0.34.0
lmdb 0.99
LunarCalendar 0.0.9
lxml 4.2.6
Markdown 3.3.4
MarkupSafe 2.0.1
matplotlib 3.2.2
matplotlib-inline 0.1.3
matplotlib-venn 0.11.6
missingno 0.5.0
mistune 0.8.4
mizani 0.6.0
mkl 2019.0
mlxtend 0.14.0
more-itertools 8.11.0
moviepy 0.2.3.5
mpmath 1.2.1
msgpack 1.0.2
multiprocess 0.70.12.2
multitasking 0.0.9
murmurhash 1.0.6
music21 5.5.0
natsort 5.5.0
nbclient 0.5.8
nbconvert 5.6.1
nbformat 5.1.3
nest-asyncio 1.5.1
netCDF4 1.5.8
networkx 2.6.3
nibabel 3.0.2
nltk 3.2.5
notebook 5.3.1
numba 0.51.2
numexpr 2.7.3
numpy 1.19.5
nvidia-ml-py3 7.352.0
oauth2client 4.1.3
oauthlib 3.1.1
okgrade 0.4.3
opencv-contrib-python 4.1.2.30
opencv-python 4.1.2.30
openpyxl 2.5.9
opt-einsum 3.3.0
osqp 0.6.2.post0
packaging 21.2
palettable 3.3.0
pandas 1.1.5
pandas-datareader 0.9.0
pandas-gbq 0.13.3
pandas-profiling 1.4.1
pandocfilters 1.5.0
panel 0.12.1
param 1.12.0
parso 0.8.2
pathlib 1.0.1
patsy 0.5.2
pep517 0.12.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.1.2
pip 21.1.3
pip-tools 6.2.0
plac 1.1.3
plotly 4.4.1
plotnine 0.6.0
pluggy 0.7.1
pooch 1.5.2
portpicker 1.3.9
ppscore 1.2.0
prefetch-generator 1.0.1
preshed 3.0.6
prettytable 2.4.0
progressbar2 3.38.0
prometheus-client 0.12.0
promise 2.3
prompt-toolkit 1.0.18
protobuf 3.17.3
psutil 5.4.8
psycopg2 2.7.6.1
ptyprocess 0.7.0
py 1.11.0
pyarrow 3.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycocotools 2.0.2
pycoingecko 2.2.0
pycparser 2.21
pyct 0.4.8
pydata-google-auth 1.2.0
pydot 1.3.0
pydot-ng 2.0.0
pydotplus 2.0.2
PyDrive 1.3.1
pyemd 0.5.1
pyerfa 2.0.0.1
pyglet 1.5.0
Pygments 2.6.1
pygobject 3.26.1
pymc3 3.11.4
PyMeeus 0.5.11
pymongo 3.12.1
pymystem3 0.2.0
PyOpenGL 3.1.5
pyparsing 2.4.7
pyrsistent 0.18.0
pysndfile 1.3.8
PySocks 1.7.1
pystan 2.19.1.1
pytest 3.6.4
python-apt 0.0.0
python-chess 0.23.11
python-dateutil 2.8.2
python-louvain 0.15
python-slugify 5.0.2
python-utils 2.5.6
pytrends 4.7.3
pytz 2018.9
pyviz-comms 2.1.0
PyWavelets 1.2.0
PyYAML 3.13
pyzmq 22.3.0
qdldl 0.1.5.post0
qtconsole 5.2.0
QtPy 1.11.2
regex 2019.12.20
requests 2.23.0
requests-oauthlib 1.3.0
resampy 0.2.2
retrying 1.3.3
rpy2 3.4.5
rsa 4.7.2
scikit-image 0.18.3
scikit-learn 1.0.1
scipy 1.4.1
screen-resolution-extra 0.0.0
scs 2.1.4
seaborn 0.11.2
semver 2.13.0
Send2Trash 1.8.0
setuptools 57.4.0
setuptools-git 1.2
Shapely 1.8.0
simplegeneric 0.8.1
six 1.15.0
sklearn 0.0
sklearn-pandas 1.8.0
smart-open 5.2.1
snowballstemmer 2.1.0
sortedcontainers 2.4.0
SoundFile 0.10.3.post1
spacy 2.2.4
Sphinx 1.8.5
sphinxcontrib-serializinghtml 1.1.5
sphinxcontrib-websupport 1.2.4
SQLAlchemy 1.4.26
sqlparse 0.4.2
srsly 1.0.5
statsmodels 0.10.2
sympy 1.7.1
tables 3.4.4
tabulate 0.8.9
tblib 1.7.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.7.0
tensorflow-datasets 4.0.1
tensorflow-estimator 2.7.0
tensorflow-gcs-config 2.7.0
tensorflow-hub 0.12.0
tensorflow-io-gcs-filesystem 0.22.0
tensorflow-metadata 1.4.0
tensorflow-probability 0.14.1
termcolor 1.1.0
terminado 0.12.1
testpath 0.5.0
text-unidecode 1.3
textblob 0.15.3
Theano-PyMC 1.1.2
thinc 7.4.0
threadpoolctl 3.0.0
tifffile 2021.11.2
toml 0.10.2
tomli 1.2.2
toolz 0.11.2
torch 1.10.0+cu111
torchsummary 1.5.1
torchtext 0.11.0
torchvision 0.11.1+cu111
tornado 5.1.1
tqdm 4.62.3
traitlets 5.1.1
tweepy 3.10.0
TwitterAPI 2.7.7
typeguard 2.7.1
typing-extensions 3.10.0.2
tzlocal 1.5.1
uritemplate 3.0.1
urllib3 1.24.3
vega-datasets 0.9.0
wasabi 0.8.2
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 1.0.1
wheel 0.37.0
widgetsnbextension 3.5.2
wordcloud 1.5.0
wrapt 1.13.3
xarray 0.18.2
xgboost 0.90
xkit 0.0.0
xlrd 1.1.0
xlwt 1.3.0
yellowbrick 1.3.post1
zict 2.0.0
zipp 3.6.0

fwetdb commented 3 years ago

Thank you. Since you have sklearn 1.0.x, this suggests that the warning might only be introduced in 1.1.

JDM288 commented 3 years ago

@fwetdb Ah, any luck yet in figuring out what is causing it?

JDM288 commented 3 years ago

@fwetdb I figured out what was causing it and opened a new pull request with the changed files.

When calculation.py called cross_val_score and mean_absolute_error from sklearn, it passed pandas Series as inputs. Those Series defaulted to "Int64", but the dtype was never stated explicitly. My original solution was to call .to_numpy() on the Series, but after digging a little deeper I realized that .astype("int64"), .astype("float"), etc. work just as well. The dtype just needs to be declared explicitly.

The warning only appears when a pandas Series with an undeclared dtype is sent to a sklearn function from sklearn version <=1.0.0. The only places I could find where that happens are _calculate_model_cvscore() and _mae_normalizer() in calculation.py. Converting those Series to numpy arrays or using .astype() fixes the issue.

Let me know if you have any more questions. I hope this helps make some progress.
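The kind of explicit conversion described above can be sketched as follows. This is an illustration only, not the code from the pull request: `as_explicit_array` is a hypothetical helper name, the sample data is made up, and the example assumes pandas and numpy are installed. The resulting arrays are what would then be handed to sklearn functions such as cross_val_score or mean_absolute_error.

```python
import numpy as np
import pandas as pd


def as_explicit_array(series: pd.Series, dtype: str = "float64") -> np.ndarray:
    """Hypothetical helper: cast a Series to an explicit dtype and
    return a plain NumPy array."""
    return series.astype(dtype).to_numpy()


# A Series using pandas' nullable "Int64" extension dtype (made-up data).
nullable_ints = pd.Series([1, 0, 1], dtype="Int64")

# A Series of numeric strings stored as generic Python objects (made-up data).
string_labels = pd.Series(["1", "0", "1"], dtype="object")

y_from_nullable = as_explicit_array(nullable_ints, "int64")
y_from_strings = as_explicit_array(string_labels, "int64")

print(y_from_nullable.dtype, y_from_strings.dtype)
```

Note that casting to "int64" fails if the Series contains missing values, so a float dtype may be the safer choice when NaNs are possible.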

JDM288 commented 3 years ago

@tkrabel @8080labs ^^^^^^^^^^^^^

fwetdb commented 3 years ago

Thank you a lot for looking into this! I added a small comment to #57 and would appreciate your input

8080labs commented 3 years ago

Closed in favor of #57