databricks / spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark
Apache License 2.0

Update to latest scikit-learn release for deprecation and compatibility #53

Open dsackin opened 7 years ago

dsackin commented 7 years ago

Using the current head 0.2.0 release of spark-sklearn and the current release of scikit-learn (0.18.1), I'm getting the following deprecation warning:

/.../python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

The library needs to be updated to use the new model_selection module and its iterator interfaces.
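For illustration, here is a minimal sketch of the kind of change the warning asks for, assuming scikit-learn 0.18 or later is installed. The toy data and variable names are made up for this example and are not taken from spark-sklearn:

```python
# New location; sklearn.cross_validation.KFold is the deprecated one.
from sklearn.model_selection import KFold

X = [[0], [1], [2], [3], [4], [5]]  # toy data, six samples (illustrative only)

# Old interface (deprecated in 0.18): KFold(len(X), n_folds=3) took the
# sample count up front and was itself iterable.
# New interface: the splitter is configured without the data, and
# split(X) yields (train_indices, test_indices) pairs.
cv = KFold(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The practical difference is that the new splitters do not see the data until split() is called, so any code that iterated over a CV object directly has to be changed, not just the import line.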

In addition, due to changes in sklearn.model_selection.GridSearchCV, the attributes available on the fitted spark-sklearn.GridSearchCV are out of date.

sklearn.model_selection.GridSearchCV now has:

While spark-sklearn.GridSearchCV has:

The most critical difference is that sklearn added the more comprehensive cv_results_, which carries data that the formerly compatible grid_scores_ lacks.
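To make the difference concrete, here is a rough sketch of how the newer cv_results_ dictionary relates to the older grid_scores_ list. It uses plain Python only; the helper name to_legacy_grid_scores, the LegacyScore tuple, and the sample numbers are hypothetical and are not part of sklearn or spark-sklearn. The field names mirror the entries of the old grid_scores_ tuples, and the sketch assumes the single-metric key layout (params, mean_test_score, split0_test_score, ...):

```python
from collections import namedtuple

# Hypothetical stand-in for the entries of the old grid_scores_ list.
LegacyScore = namedtuple(
    "LegacyScore",
    ["parameters", "mean_validation_score", "cv_validation_scores"],
)


def to_legacy_grid_scores(cv_results):
    """Build a grid_scores_-style list from a cv_results_-style dict."""
    # Count the per-split columns: split0_test_score, split1_test_score, ...
    n_splits = 0
    while "split{}_test_score".format(n_splits) in cv_results:
        n_splits += 1

    scores = []
    for i, params in enumerate(cv_results["params"]):
        per_split = [
            cv_results["split{}_test_score".format(s)][i] for s in range(n_splits)
        ]
        scores.append(LegacyScore(params, cv_results["mean_test_score"][i], per_split))
    return scores


# Made-up numbers shaped like a two-candidate, two-fold search.
cv_results = {
    "params": [{"C": 1}, {"C": 10}],
    "mean_test_score": [0.8, 0.9],
    "split0_test_score": [0.7, 0.85],
    "split1_test_score": [0.9, 0.95],
}

legacy = to_legacy_grid_scores(cv_results)
for entry in legacy:
    print(entry.parameters, entry.mean_validation_score, entry.cv_validation_scores)
```

The conversion only works in this direction: cv_results_ also holds columns such as fit times and ranks that have no counterpart in the older structure.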

ajaysaini725 commented 7 years ago

I've been working on this and almost have a PR ready. It will be out this upcoming Monday.

dsackin commented 7 years ago

Thank you for the quick attention. Is anything required of me? I see I was CCed on the related issue, but it looks like that was just for info.

ajaysaini725 commented 7 years ago

An update making spark-sklearn compatible with sklearn version >= 0.18.1 has been merged.

dsackin commented 7 years ago

I'm just about to adopt this update. Can you tag a new release on GitHub and update the version on PyPI? I currently rely on pip for the installs in my environments, and I was hoping not to have to switch to git just for this package.

gordontsai commented 7 years ago

@dsackin Did you end up doing the git install? I'm also running into version issues when installing through pip.

dsackin commented 7 years ago

No. I haven't updated yet. I was hoping they would push it to PyPI before I switched to a git install.

gordontsai commented 7 years ago

Got it. Just an fyi, ended up doing the git install, and it worked.

emceemouli commented 7 years ago

Can you please let us know when a new release will be tagged and pushed to PyPI?

emceemouli commented 7 years ago

@gordontsai @dsackin I am quite new to git installs. Can you tell me how to do one while we wait for this to be pushed to PyPI?

srowen commented 5 years ago

@thunterdb this is more about what it might take to support 0.20. We have a related issue about not setting things like best_params_ at https://github.com/databricks/spark-sklearn/issues/73, which seems like an easy fix but the simple fix doesn't run. This PR might also contain some of the necessary changes: https://github.com/databricks/spark-sklearn/pull/74 . This much I haven't looked into yet.

thunterdb commented 5 years ago

I see, this is more than pointing to the right package. The 0.20 release is less than 2 months old, so let us focus on the 0.1x releases until there is a more general need for that. What are your thoughts?

srowen commented 5 years ago

Yeah, certainly more concerned this second with a new release to fix some bugs, and maybe get random search in. If you have a sec to look at https://github.com/databricks/spark-sklearn/issues/73 you might know the quick answer; that might also be a quick fix relevant to 0.19