amueller closed this issue 6 years ago
I can do this first thing tomorrow (10/23) morning.
@msarahan sweet, thanks :)
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297. """
from os.path import exists
from os import makedirs, remove
import tarfile

import numpy as np
import logging

from .base import get_data_home
from .base import _fetch_remote
from .base import _pkl_filepath
from .base import RemoteFileMetadata
from ..utils import Bunch
from ..externals import joblib

ARCHIVE = RemoteFileMetadata(
    filename='cal_housing.tgz',
    url='https://ndownloader.figshare.com/files/5976036',
    checksum=('aaa5c9a6afe2225cc2aed2723682ae40'
              '3280c4a3695a2ddda4ffb5d8215ea681'))
# Keep the module docstring to expose it as the dataset description.
MODULE_DOCS = __doc__

logger = logging.getLogger(__name__)
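

def fetch_california_housing(data_home=None, download_if_missing=True):
    """Loader for the California housing dataset from StatLib.

    Read more in the :ref:`User Guide <datasets>`.

    Parameters
    ----------
    data_home : optional, default: None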
        Specify another download and cache folder for the datasets. By default
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

    download_if_missing : optional, True by default
        If False, raise a IOError if the data is not locally available
        instead of trying to download the data from the source site.

    Returns
    -------
    dataset : dict-like object with the following attributes:

    dataset.data : ndarray, shape [20640, 8]
        Each row corresponding to the 8 feature values in order.

    dataset.target : numpy array of shape (20640,)
        Each value corresponds to the average house value in units of 100,000.

    dataset.feature_names : array of length 8
        Array of ordered feature names used in the dataset.

    dataset.DESCR : string
        Description of the California housing dataset.

    Notes
    ------

    This dataset consists of 20,640 samples and 9 features.
    """
    data_home = get_data_home(data_home=data_home)
    if not exists(data_home):
        makedirs(data_home)

    filepath = _pkl_filepath(data_home, 'cal_housing.pkz')
    if not exists(filepath):
        if not download_if_missing:
            raise IOError("Data not found and `download_if_missing` is False")

        logger.info('Downloading Cal. housing from {} to {}'.format(
            ARCHIVE.url, data_home))
        archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

        with tarfile.open(mode="r:gz", name=archive_path) as f:
            cal_housing = np.loadtxt(
                f.extractfile('CaliforniaHousing/cal_housing.data'),
                delimiter=',')
            # Columns are not in the same order compared to the previous
            # URL resource on lib.stat.cmu.edu
            columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
            cal_housing = cal_housing[:, columns_index]

            joblib.dump(cal_housing, filepath, compress=6)
        remove(archive_path)

    else:
        cal_housing = joblib.load(filepath)

    feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                     "Population", "AveOccup", "Latitude", "Longitude"]

    target, data = cal_housing[:, 0], cal_housing[:, 1:]

    # avg rooms = total rooms / households
    data[:, 2] /= data[:, 5]

    # avg bed rooms = total bed rooms / households
    data[:, 3] /= data[:, 5]

    # avg occupancy = population / households
    data[:, 5] = data[:, 4] / data[:, 5]

    # target in units of 100,000
    target = target / 100000.0

    return Bunch(data=data,
                 target=target,
                 feature_names=feature_names,
                 DESCR=MODULE_DOCS)
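
For readers following the loader above, here is a minimal sketch of the per-household divisions it applies, run on two made-up rows rather than the real dataset. The numbers are invented for illustration; only the column layout and the three divisions are taken from the pasted file.

```python
import numpy as np

# Made-up rows, for illustration only. Column order before the divisions:
# MedInc, HouseAge, total rooms, total bedrooms, population, households,
# Latitude, Longitude
data = np.array([
    [8.0, 40.0, 900.0, 150.0, 300.0, 100.0, 37.9, -122.2],
    [3.5, 20.0, 2000.0, 500.0, 1000.0, 400.0, 34.1, -118.3],
])

# avg rooms = total rooms / households
data[:, 2] /= data[:, 5]
# avg bed rooms = total bed rooms / households
data[:, 3] /= data[:, 5]
# avg occupancy = population / households; this overwrites the households
# column, so it has to run after the two divisions above
data[:, 5] = data[:, 4] / data[:, 5]

# Columns 2..5 now hold AveRooms, AveBedrms, Population, AveOccup
print(data[:, 2:6])
```

The order of the three statements matters: column 5 is the divisor for the first two and is replaced by the third.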
@dheerajsharma21 I don't understand. This issue was @amueller asking us to build scikit-learn packages. What does your comment have to do with that?
Hi Mike,
There's a permission issue in the california_housing.py dataset file shipped with Anaconda's scikit-learn. It is fixed in the scikit-learn repository but not in the Anaconda package, so I thought this comment could help anyone who wants to use the housing dataset before this release. Feel free to remove the comment. I'm sorry if it disturbed you. Thank you.
Regards, Dheeraj
I see. Since we are building scikit-learn's latest code, I think this will be fixed already. Thanks for letting us know.
Packages are built and staged on the c3i_test2 channel: https://anaconda.org/c3i_test2/scikit-learn
win-32 python 2.7 required disabling a test. I'm building that package again now. I need to run some errands, and will finish this up in about an hour.
Thanks! Can you tell us which test failed and ideally the traceback?
The test was test_predict_proba_binary. There was an ~8% mismatch in values somehow. None of the values in the traceback disagreed - they were buried in the abbreviated array output. Sorry, the traceback is gone. I can rebuild it if it's really helpful to you.
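
When the disagreeing values are hidden by abbreviated array output, one way to find them is to compute the mismatch mask directly. This is a rough sketch on synthetic arrays, not the data from test_predict_proba_binary; the array size, the perturbation and the tolerances are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-ins for the expected and actual arrays (hypothetical data).
rng = np.random.RandomState(0)
expected = rng.rand(1000)
actual = expected.copy()
actual[::12] += 1e-3  # perturb every 12th entry, roughly 8% of the array

# Same criterion as assert_allclose with its defaults (rtol=1e-7, atol=0):
# |actual - expected| <= atol + rtol * |expected|
mismatch = ~np.isclose(actual, expected, rtol=1e-7, atol=0)
idx = np.flatnonzero(mismatch)

print("mismatch: {:.1%}".format(mismatch.mean()))
print("first mismatching indices:", idx[:5])
print("largest absolute difference:", np.abs(actual - expected)[idx].max())
```

Printing only the entries selected by `idx` avoids the truncated array repr that appears in the assertion message.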
We are also setting atol values on other tests:
From 3b230279c36d061e746976e087178c3371754c16 Mon Sep 17 00:00:00 2001
From: Ray Donnelly <mingw.android@gmail.com>
Date: Thu, 14 Sep 2017 21:23:38 +0100
Subject: [PATCH] Add a few atols
---
sklearn/utils/estimator_checks.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sklearn/utils/estimator_checks.py b/sklearn/utils/estimator_checks.py
index ba83535..82bfe03 100644
--- a/sklearn/utils/estimator_checks.py
+++ b/sklearn/utils/estimator_checks.py
@@ -1223,7 +1223,7 @@ def check_supervised_y_2d(name, estimator_orig):
assert_greater(len(w), 0, msg)
assert_true("DataConversionWarning('A column-vector y"
" was passed when a 1d array was expected" in msg)
- assert_allclose(y_pred.ravel(), y_pred_2d.ravel())
+ assert_allclose(y_pred.ravel(), y_pred_2d.ravel(), atol=1e-9)
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
@@ -1437,7 +1437,7 @@ def check_class_weight_balanced_linear_classifier(name, Classifier):
classifier.set_params(class_weight=class_weight)
coef_manual = classifier.fit(X, y).coef_.copy()
- assert_allclose(coef_balanced, coef_manual)
+ assert_allclose(coef_balanced, coef_manual, atol=1e-9)
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
--
2.10.1
We can submit the atol patch to you guys if you want (and if you agree with it). I'm not sure what the answer is for win-32 or how important it is.
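
For context on what the atol change does: `numpy.testing.assert_allclose` defaults to `rtol=1e-7, atol=0`, so the allowed difference is `rtol * abs(expected)`, which is zero wherever the expected value is zero. The sketch below uses made-up values, not output from the scikit-learn checks, to show a comparison that fails with the defaults and passes with `atol=1e-9`.

```python
import numpy as np
from numpy.testing import assert_allclose

# Made-up values: the second expected entry is exactly zero.
expected = np.array([1.0, 0.0])
actual = np.array([1.0 + 1e-12, 1e-12])

# Defaults: a purely relative tolerance gives zero slack at expected == 0,
# so the 1e-12 difference in the second entry is reported as a mismatch.
try:
    assert_allclose(actual, expected)
    default_passes = True
except AssertionError:
    default_passes = False

print("passes with defaults:", default_passes)

# With an absolute tolerance the same comparison succeeds.
assert_allclose(actual, expected, atol=1e-9)
```

Whether 1e-9 is an appropriate bound for a given estimator check is a separate question that depends on the scale of the values being compared.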
0.19.1 packages are available on the new "main" channel. Closing this issue.
Thanks @msarahan! It doesn't show up in https://repo.continuum.io/pkgs/rss.xml, is that not the right place to look?
I generally test with `conda search scikit-learn`. We're still getting some issues worked out with the RSS feed after changing over to a new workflow. I'll look into the RSS feed.
scikit-learn will release 0.19.1 on 10/22 or 10/23 using this tag: https://github.com/scikit-learn/scikit-learn/releases/tag/0.19.1
It would be cool if you could package it and let us know, so we can do a simultaneous release. Thanks folks!