amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016
Creative Commons Zero v1.0 Universal

nb additions #4

Closed rasbt closed 8 years ago

amueller commented 8 years ago

sweet lgtm. I'd probably use np.random.RandomState(seed=1234) for reproducibility.

rasbt commented 8 years ago

This is not really the recommended way, right? Would you do it like this? That can be tricky, I think.

You mean in contrast to sth like this?

>>> import numpy as np
>>> rndst = np.random.RandomState(1234)

Not sure, but this only works for e.g., randint and some others, right?

>>> rndst.randint(3)
2
>>> rndst.random((3, 5))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'mtrand.RandomState' object has no attribute 'random'

Hm, aren't np.random.RandomState(seed=1234) and np.random.seed(1234) essentially the same? http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html

amueller commented 8 years ago

Well, random.seed changes a global (private?) random state, while random.RandomState is an explicit object that you can pass around. Both depend on execution order, but I feel the object makes that more explicit. It works with all distributions, but not all aliases. random is an alias for random_sample. random is also just a special case of uniform, right? I tend to use uniform because that's more explicit to me.
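To make the difference concrete, here is a minimal sketch, assuming only NumPy. The variable names are illustrative and not taken from the notebooks. It compares the global seed with an explicit RandomState object, and checks that random_sample draws the same values as uniform on [0, 1).

```python
import numpy as np

# Global state: np.random.seed reseeds the module-level generator, so any
# other code that calls np.random.* in between shifts the sequence.
np.random.seed(1234)
a_global = np.random.uniform(size=3)

# Explicit state: the RandomState object owns its own stream and can be
# passed to whatever function needs randomness.
rng = np.random.RandomState(seed=1234)
a_local = rng.uniform(size=3)

# Same seed and same generator type, so the first draws should agree.
same_stream = np.allclose(a_global, a_local)

# random_sample is the uniform distribution on [0, 1), so two generators
# with the same seed should produce matching values either way.
rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)
same_draws = np.allclose(
    rng_a.random_sample((3, 5)),
    rng_b.uniform(0.0, 1.0, size=(3, 5)),
)

print(same_stream, same_draws)
```

Whether a RandomState object also exposes the shorter random alias depends on the NumPy version, which is one more reason to stick with uniform or random_sample.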

rasbt commented 8 years ago

Both depend on execution order, but I feel with the object it is more explicit.

I agree, will swap it out later when I get home!

amueller commented 8 years ago

thanks :)

rasbt commented 8 years ago

Just updated the RandomState! General question (related to the third notebook), the dataset that is available via load_digits, where's it coming from? (Is it a lower-resolution subset of MNIST?) -- I think someone at the tutorials will likely ask ;)

amueller commented 8 years ago

It's unrelated to MNIST, I think, but also collected by NIST. The DESCR attribute should say:

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.
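
The block-counting step described in the DESCR can be sketched with NumPy alone. This is only an illustration under stated assumptions: the 32x32 bitmap below is random stand-in data, not a real digit, and the NIST normalization routines are not reproduced.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Hypothetical stand-in for one normalized, binarized 32x32 bitmap
# (1 = "on" pixel, 0 = "off" pixel).
bitmap = rng.randint(0, 2, size=(32, 32))

# View the image as an 8x8 grid of non-overlapping 4x4 blocks:
# blocks[i, a, j, b] == bitmap[4 * i + a, 4 * j + b]
blocks = bitmap.reshape(8, 4, 8, 4)

# Count the on pixels inside each block; every entry lies in 0..16.
features = blocks.sum(axis=(1, 3))

print(features.shape)
print(features.min(), features.max())
```

Flattening features with features.ravel() would give the 64 attributes per sample listed in the DESCR.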

rasbt commented 8 years ago

Thanks, I dunno why I didn't just check the DESCR :/. (On a side note: it would maybe be useful to add the desc .rst files somehow to the function docstrings so that they also appear in the API doc online?)

Btw I can merge the changes once in a while so that you don't lose the overview here ;). Hehe, going through notebook 01.4, I must say that using random_state=1999 (to get the 0.33 proportion in the iris test/train split) was a tad sneaky :); I changed it to use the new stratify=y option.
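
For reference, a minimal sketch of such a stratified split, assuming scikit-learn's train_test_split from sklearn.model_selection and the bundled iris data. The random_state=123 here is an arbitrary value chosen for illustration, not the one used in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks the splitter to keep the class proportions of y
# (50/50/50 in iris) roughly equal in the training and test parts,
# so the result no longer hinges on a lucky seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123, stratify=y
)

# Per-class sample counts in each part.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(train_counts, test_counts)
```

With three equally sized classes, the test counts should differ by at most one sample between classes, whatever seed is used.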

amueller commented 8 years ago

The DESCR should be in the user guide, but it looks like it is not. We should probably fix that. And yeah, feel free to merge. I'm a bit caught up still in my book stuff.

rasbt commented 8 years ago

Okay, maybe we should open an issue then. No worries, take your time; I also want to get through all the notebooks this weekend, hopefully, so that I can tackle the other things we discussed (presentation figures, the linear regression implementation, etc.). I will merge the changes for now, then!