amueller commented 6 years ago

scikit-learn will release 0.19.1 on 10/22 or 10/23 using this tag: https://github.com/scikit-learn/scikit-learn/releases/tag/0.19.1

It would be cool if you could package it and let us know, so we can do a simultaneous release. Thanks folks!

msarahan commented 6 years ago

I can do this first thing tomorrow (10/23) morning.

amueller commented 6 years ago

@msarahan sweet, thanks :)

dheerajinampudi commented 6 years ago

This could resolve the issue:

`"""California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.

"""
# Authors: Peter Prettenhofer
# License: BSD 3 clause

from os.path import exists
from os import makedirs, remove
import tarfile

import numpy as np
import logging

from .base import get_data_home
from .base import _fetch_remote
from .base import _pkl_filepath
from .base import RemoteFileMetadata
from ..utils import Bunch
from ..externals import joblib

# The original data can be found at:
# http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
ARCHIVE = RemoteFileMetadata(
    filename='cal_housing.tgz',
    url='https://ndownloader.figshare.com/files/5976036',
    checksum=('aaa5c9a6afe2225cc2aed2723682ae40'
              '3280c4a3695a2ddda4ffb5d8215ea681'))

# Grab the module-level docstring to use as a description of the
# dataset
MODULE_DOCS = __doc__

logger = logging.getLogger(__name__)


def fetch_california_housing(data_home=None, download_if_missing=True):
    """Loader for the California housing dataset from StatLib.

    Read more in the :ref:`User Guide <datasets>`.

    Parameters
    ----------
    data_home : optional, default: None
        Specify another download and cache folder for the datasets. By default
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

    download_if_missing : optional, True by default
        If False, raise a IOError if the data is not locally available
        instead of trying to download the data from the source site.

    Returns
    -------
    dataset : dict-like object with the following attributes:

    dataset.data : ndarray, shape [20640, 8]
        Each row corresponding to the 8 feature values in order.

    dataset.target : numpy array of shape (20640,)
        Each value corresponds to the average house value in units of 100,000.

    dataset.feature_names : array of length 8
        Array of ordered feature names used in the dataset.

    dataset.DESCR : string
        Description of the California housing dataset.

    Notes
    ------

    This dataset consists of 20,640 samples and 9 features.
    """
    data_home = get_data_home(data_home=data_home)
    if not exists(data_home):
        makedirs(data_home)

    filepath = _pkl_filepath(data_home, 'cal_housing.pkz')
    if not exists(filepath):
        if not download_if_missing:
            raise IOError("Data not found and `download_if_missing` is False")

        logger.info('Downloading Cal. housing from {} to {}'.format(
            ARCHIVE.url, data_home))

        archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

        with tarfile.open(mode="r:gz", name=archive_path) as f:
            cal_housing = np.loadtxt(
                f.extractfile('CaliforniaHousing/cal_housing.data'),
                delimiter=',')
            # Columns are not in the same order compared to the previous
            # URL resource on lib.stat.cmu.edu
            columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
            cal_housing = cal_housing[:, columns_index]

            joblib.dump(cal_housing, filepath, compress=6)
        remove(archive_path)

    else:
        cal_housing = joblib.load(filepath)

    feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                     "Population", "AveOccup", "Latitude", "Longitude"]

    target, data = cal_housing[:, 0], cal_housing[:, 1:]

    # avg rooms = total rooms / households
    data[:, 2] /= data[:, 5]

    # avg bed rooms = total bed rooms / households
    data[:, 3] /= data[:, 5]

    # avg occupancy = population / households
    data[:, 5] = data[:, 4] / data[:, 5]

    # target in units of 100,000
    target = target / 100000.0

    return Bunch(data=data,
                 target=target,
                 feature_names=feature_names,
                 DESCR=MODULE_DOCS)`
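The column reordering and per-household averages at the end of the loader are easier to follow on a single row. The sketch below is only an illustration: it uses plain Python lists instead of NumPy so it runs without the dataset download, the row values are made up, and the raw column order is inferred from `columns_index` and `feature_names` in the pasted code rather than taken from the data file itself.

```python
# Hypothetical raw row. Assumed raw column order (inferred from the loader):
# longitude, latitude, house age, total rooms, total bedrooms,
# population, households, median income, median house value.
raw_row = [-122.0, 37.0, 40.0, 900.0, 150.0, 300.0, 100.0, 8.0, 450000.0]

# Same reordering as the loader: target first, then the 8 feature columns.
columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
row = [raw_row[i] for i in columns_index]

target, data = row[0], row[1:]

# data[5] still holds the household count here. It must be used for the
# first two divisions before it is overwritten by the average occupancy.
data[2] /= data[5]             # avg rooms = total rooms / households
data[3] /= data[5]             # avg bedrooms = total bedrooms / households
data[5] = data[4] / data[5]    # avg occupancy = population / households

target = target / 100000.0     # target in units of 100,000

feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]
features = dict(zip(feature_names, data))
print(features, target)
```

The order of the three divisions matters: swapping the occupancy line ahead of the others would divide rooms and bedrooms by the occupancy instead of the household count.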

msarahan commented 6 years ago

@dheerajsharma21 I don't understand. This issue was @amueller asking us to build scikit-learn packages. What does your comment have to do with that?

dheerajinampudi commented 6 years ago

Hi Mike,

There's a permission issue in the california_housing.py dataset file shipped with Anaconda's scikit-learn. It is fixed in the scikit-learn repository but not yet in Anaconda, so I thought this comment could help anyone who wants to use the housing dataset before this release. You can remove the comment. I'm sorry if it disturbed you. Thank you.

Regards, Dheeraj


msarahan commented 6 years ago

I see. Since we are building scikit-learn's latest code, I think this will be fixed already. Thanks for letting us know.

msarahan commented 6 years ago

Packages are built and staged on the c3i_test2 channel: https://anaconda.org/c3i_test2/scikit-learn

win-32 python 2.7 required disabling a test. I'm building that package again now. I need to run some errands, and will finish this up in about an hour.

amueller commented 6 years ago

Thanks! Can you tell us which test failed and ideally the traceback?


msarahan commented 6 years ago

The test was test_predict_proba_binary. There was an ~8% mismatch in values somehow. None of the values shown in the traceback disagreed; the mismatched ones were hidden in the abbreviated array output. Sorry, the traceback is gone. I can rebuild it if it's really helpful to you.
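When the failing elements are hidden by the abbreviated array printout, listing them explicitly is one way to see what actually disagreed. The following is a rough sketch in plain Python, not the scikit-learn test itself: `find_mismatches` is a hypothetical helper, the arrays are made up, and the tolerance rule is assumed to be the one NumPy documents for `assert_allclose` (`|actual - desired| <= atol + rtol * |desired|`).

```python
def find_mismatches(actual, desired, rtol=1e-7, atol=0.0):
    """Return (index, actual, desired) for every element outside tolerance."""
    return [
        (i, a, d)
        for i, (a, d) in enumerate(zip(actual, desired))
        if abs(a - d) > atol + rtol * abs(d)
    ]


# Made-up probabilities standing in for two predict_proba outputs.
expected = [0.10, 0.25, 0.50, 0.75, 0.90, 0.05,
            0.33, 0.66, 0.80, 0.20, 0.45, 0.55]
observed = list(expected)
observed[7] = 0.6601  # one hypothetical disagreement

bad = find_mismatches(observed, expected)
percent_bad = 100.0 * len(bad) / len(expected)

for index, a, d in bad:
    print("index {}: actual={!r} desired={!r}".format(index, a, d))
print("mismatch: {:.2f}%".format(percent_bad))
```

Printing only the offending indices sidesteps the abbreviated output, since every reported element is one that failed the check.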

We are also setting atol values on other tests:

From 3b230279c36d061e746976e087178c3371754c16 Mon Sep 17 00:00:00 2001
From: Ray Donnelly <mingw.android@gmail.com>
Date: Thu, 14 Sep 2017 21:23:38 +0100
Subject: [PATCH] Add a few atols

---
 sklearn/utils/estimator_checks.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sklearn/utils/estimator_checks.py b/sklearn/utils/estimator_checks.py
index ba83535..82bfe03 100644
--- a/sklearn/utils/estimator_checks.py
+++ b/sklearn/utils/estimator_checks.py
@@ -1223,7 +1223,7 @@ def check_supervised_y_2d(name, estimator_orig):
         assert_greater(len(w), 0, msg)
         assert_true("DataConversionWarning('A column-vector y"
                     " was passed when a 1d array was expected" in msg)
-    assert_allclose(y_pred.ravel(), y_pred_2d.ravel())
+    assert_allclose(y_pred.ravel(), y_pred_2d.ravel(), atol=1e-9)

 @ignore_warnings(category=(DeprecationWarning, FutureWarning))
@@ -1437,7 +1437,7 @@ def check_class_weight_balanced_linear_classifier(name, Classifier):
     classifier.set_params(class_weight=class_weight)
     coef_manual = classifier.fit(X, y).coef_.copy()

-    assert_allclose(coef_balanced, coef_manual)
+    assert_allclose(coef_balanced, coef_manual, atol=1e-9)

 @ignore_warnings(category=(DeprecationWarning, FutureWarning))
--
2.10.1

We can submit the atol patch to you guys if you want (and if you agree with it). I'm not sure what the answer is for win-32 or how important it is.
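For context on why an `atol` can matter: with a purely relative tolerance, two values that should both be zero but differ by floating-point noise can never compare as close, because the allowed error scales with the expected value. The sketch below illustrates this in plain Python. `is_close` is a hypothetical helper that assumes the criterion NumPy documents for `assert_allclose` (`|actual - desired| <= atol + rtol * |desired|`, with defaults `rtol=1e-7` and `atol=0`), and the sample values are invented rather than taken from the failing builds.

```python
def is_close(actual, desired, rtol=1e-7, atol=0.0):
    """Scalar version of the assumed assert_allclose criterion."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# Away from zero, the relative tolerance alone absorbs small noise.
away_from_zero = is_close(1.00000001, 1.0)

# At zero, rtol * |desired| is 0, so any nonzero noise fails the check.
noise = 3e-12
near_zero_default = is_close(noise, 0.0)

# With atol=1e-9, as in the patch, the same noise is accepted.
near_zero_with_atol = is_close(noise, 0.0, atol=1e-9)

print(away_from_zero, near_zero_default, near_zero_with_atol)
```

Whether `1e-9` is the right absolute tolerance for these particular checks is a judgment call for the scikit-learn maintainers; the sketch only shows the mechanism.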

msarahan commented 6 years ago

0.19.1 packages are available on the new "main" channel. Closing this issue.

amueller commented 6 years ago

Thanks @msarahan! It doesn't show up in https://repo.continuum.io/pkgs/rss.xml, is that not the right place to look?

msarahan commented 6 years ago

I generally test with `conda search scikit-learn`. We're still working out some issues with the RSS feed after changing over to a new workflow. I'll look into the RSS feed.

ContinuumIO / anaconda-issues

Scikit-learn 0.19.1 release #6809
