ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Questions to Chapter 2 #470

Open FritzPeleke opened 5 years ago

FritzPeleke commented 5 years ago

Hi Aurelien,

I have some questions about GridSearchCV and RandomizedSearchCV. First, regarding RandomizedSearchCV: what exactly do scipy.stats.reciprocal and scipy.stats.expon do? Why is it important to specify n_iter? What does n_jobs do? Second, what could be possible reasons for my grid search running for a very long time? Third, what role does the backend play? I ran my grid search with C values of up to 300,000 and got an error concerning the backend. How could this be solved? 1) The code below just keeps running:

processed = full_pipeline.fit_transform(practice_set)

#training model
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
svm_reg = SVR(gamma='auto')
predictor = svm_reg.fit(processed,practice_set_labels)
predictions = svm_reg.predict(processed)
# use a new name so the imported mean_squared_error function is not shadowed
mse = mean_squared_error(y_true=practice_set_labels,y_pred=predictions)
rmse = np.sqrt(mse)

#cross validation
#score = cross_val_score(svm_reg,processed,practice_set_labels,scoring='neg_mean_squared_error',cv=10)
#cv_score = np.sqrt(-score)

def display(score):
    print('scores\n',score)
    print(score.mean())
    print(score.std())
params = [{'kernel':['linear','rbf'],'C':[1.,3.,30.,100.]}]
grid_search = GridSearchCV(svm_reg,param_grid=params,scoring='neg_mean_squared_error',cv=5)
grid_search.fit(processed,practice_set_labels)
print(grid_search.best_score_)

2) Running this one, which is identical to the code in the Jupyter notebook, raises an error:

processed = full_pipeline.fit_transform(practice_set)

#training model
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
svm_reg = SVR(gamma='auto')
predictor = svm_reg.fit(processed,practice_set_labels)
predictions = svm_reg.predict(processed)
# use a new name so the imported mean_squared_error function is not shadowed
mse = mean_squared_error(y_true=practice_set_labels,y_pred=predictions)
rmse = np.sqrt(mse)

#cross validation
#score = cross_val_score(svm_reg,processed,practice_set_labels,scoring='neg_mean_squared_error',cv=10)
#cv_score = np.sqrt(-score)

def display(score):
    print('scores\n',score)
    print(score.mean())
    print(score.std())

param_grid = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]

svm_reg = SVR()
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search.fit(processed, practice_set_labels)

ERROR MESSAGE:

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
exception calling callback for <Future at 0x2274f53ca90 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\multiprocessing\queues.py", line 109, in get
    self._sem.release()
OSError: [WinError 6] The handle is invalid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\multiprocessing\queues.py", line 111, in get
    self._rlock.release()
OSError: [WinError 6] The handle is invalid
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 309, in __call__
    self.parallel.dispatch_next()
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 731, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\_parallel_backends.py", line 510, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 151, in submit
    fn, *args, **kwargs)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\process_executor.py", line 1022, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
ERROR: The process "37772" not found.
ERROR: The process "32116" not found.
ERROR: The process "37652" not found.
ERROR: The process "7540" not found.
joblib.externals.loky.process_executor._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\multiprocessing\queues.py", line 109, in get
    self._sem.release()
OSError: [WinError 6] The handle is invalid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\multiprocessing\queues.py", line 111, in get
    self._rlock.release()
OSError: [WinError 6] The handle is invalid
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/fritz/PycharmProjects/Hello/Chap-2_Excercise.py", line 110, in <module>
    grid_search.fit(processed, practice_set_labels)
  File "C:\Users\fritz\Hello\lib\site-packages\sklearn\model_selection\_search.py", line 688, in fit
    self._run_search(evaluate_candidates)
  File "C:\Users\fritz\Hello\lib\site-packages\sklearn\model_selection\_search.py", line 1149, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "C:\Users\fritz\Hello\lib\site-packages\sklearn\model_selection\_search.py", line 667, in evaluate_candidates
    cv.split(X, y, groups)))
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 934, in __call__
    self.retrieve()
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\fritz\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 309, in __call__
    self.parallel.dispatch_next()
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 731, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\_parallel_backends.py", line 510, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 151, in submit
    fn, *args, **kwargs)
  File "C:\Users\fritz\Hello\lib\site-packages\joblib\externals\loky\process_executor.py", line 1022, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Process finished with exit code 1

ageron commented 5 years ago

Hi @FritzPeleke, For the reciprocal and expon distributions, please see my explanations in #282. Regarding n_iter: in the RandomizedSearchCV class, it determines how many hyperparameter combinations are sampled, so the model is trained n_iter times the number of folds. For example, if n_iter=10 and cv=3, then the model will be trained 30 times. The larger n_iter is, the more likely you are to end up with a good set of hyperparameters, but the more times the model will be trained, so the search takes proportionally longer.
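
To make the n_iter arithmetic concrete, here is a minimal sketch on a tiny synthetic dataset. The data, the distribution ranges and the variable names are all made up for illustration; they are not the notebook's settings.

```python
import numpy as np
from scipy.stats import expon, reciprocal
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Tiny synthetic regression problem (made up for illustration).
rng = np.random.RandomState(42)
X = rng.rand(60, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(60)

# Assumed search space: one value is drawn from each entry per candidate.
param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": reciprocal(1, 1000),    # log-uniform between 1 and 1000
    "gamma": expon(scale=1.0),   # exponential distribution with mean 1
}

n_iter, cv = 10, 3
search = RandomizedSearchCV(SVR(), param_distributions, n_iter=n_iter, cv=cv,
                            scoring="neg_mean_squared_error", random_state=42)
search.fit(X, y)

# One entry per sampled candidate; each candidate is trained once per fold.
n_candidates = len(search.cv_results_["params"])
total_fits = n_candidates * cv
print(n_candidates, "candidates x", cv, "folds =", total_fits, "fits")
```

With the default refit=True, Scikit-Learn also trains the best candidate one more time on the full training set at the end.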

If GridSearchCV runs for a very long time, this is probably because (1) the hyperparameter space is large (i.e., there are many combinations of hyperparameters to evaluate), and/or (2) training the model once takes a long time, perhaps because the dataset is large.
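
As a rough way to gauge point (1) before launching a search, you can count the combinations up front. This sketch uses Scikit-Learn's ParameterGrid on the grid quoted in the question above; the variable names are just for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the second snippet in the question.
param_grid = [
    {"kernel": ["linear"],
     "C": [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {"kernel": ["rbf"],
     "C": [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     "gamma": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
cv = 5

# 8 linear candidates + 7 * 6 rbf candidates
n_candidates = len(ParameterGrid(param_grid))
# every candidate is trained once per cross-validation fold
total_fits = n_candidates * cv
print(n_candidates, "candidates,", total_fits, "fits")
```

If each SVR fit takes a minute on your dataset, that many fits on a single core already adds up to hours, so the search can look like it "just keeps running".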

n_jobs tells Scikit-Learn how many jobs to run in parallel for this task. If n_jobs=4, for example, then Scikit-Learn will use up to 4 CPU cores in parallel, which can speed up the search by a factor of up to 4. Setting n_jobs=-1 means "use all available CPU cores". If you are getting joblib errors, try n_jobs=1 to avoid parallelism. If this fixes the problem, then the issue lies in your joblib installation, and you will need to fix it if you want to run jobs in parallel. I believe there are issues with Windows+Anaconda+joblib, so please check with the Anaconda or joblib projects (I don't use Windows, so I can't help you on this).

Hope this helps!

mathewsmutethia commented 5 years ago

Hi Aurelien, What would you suggest to a person who is totally new to all of this? Do you recommend first learning Python and then reading the book? I am currently in Kenya and would like guidance in understanding and mastering machine learning skills. Please help.

ageron commented 5 years ago

Hi @mathewsmutethia , You're embarking on a great journey, welcome to Machine Learning! 👍

I hope this helps! There's a lot to learn, so it will take some time, but there's nothing super difficult if you take your time. Be patient and never give up, and you'll make it, don't worry!

I hope this helps, Aurélien

FritzPeleke commented 5 years ago

Hi @ageron, I found a way around the joblib problem. The dask-ml package handles n_jobs itself and avoids the joblib problem I ran into with sklearn. All one needs to do is install dask-ml and import GridSearchCV from it; the rest of the code is exactly the same as with sklearn. Its GridSearchCV is also said to be faster than sklearn's. I hope this helps others who have the same problem on Windows.

from dask_ml.model_selection import GridSearchCV

mathewsmutethia commented 5 years ago

Hi @ageron Took your advice. The journey looks long, but I'll hopefully get there on time. Regards, Mathews

jasonjoe2019 commented 4 years ago

Hi ageron,

Could you explain more about the income category? I don't understand what it is or how to apply it.

Thanks, Jason

ageron commented 4 years ago

Hi @jasonjoe2019 ,

Thanks for your question. Suppose we know that the median income is a very important feature to predict the median housing price in a district. Then it is important to ensure that the training set, the validation set and the test set are all representative of the overall distribution of incomes. For example, if 10% of all districts in California have a median income between $30k and $40k (just making up some numbers here), then you want to ensure that in the training set, the validation set and the test set, as close as possible to 10% of the districts have a median income in this range.

It's a bit like in political surveys: if you suspect that votes depend strongly on, say, the voter's gender, then you need to ensure that the sample of people you include in your survey have the same ratio of men and women as in the overall population. For example, if you know that 51% of all voters are female, and your survey sample includes 1,000 people, then you want 510 females and 490 males in your sample. This is called stratified sampling.
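
Here is a minimal sketch of that survey example using Scikit-Learn's train_test_split with its stratify argument. The population of 100,000 voters and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 100,000 voters, exactly 51% female.
gender = np.array(["F"] * 51_000 + ["M"] * 49_000)
voter_ids = np.arange(len(gender))

# Draw a 1,000-person sample whose gender ratio follows the population's.
sample_ids, _ = train_test_split(voter_ids, train_size=1_000,
                                 stratify=gender, random_state=42)

sample_gender = gender[sample_ids]
n_female = int((sample_gender == "F").sum())
n_male = int((sample_gender == "M").sum())
print(n_female, "women and", n_male, "men in the sample")
```

Without stratify, the split is purely random, so the number of women in the sample would fluctuate around 510 from one draw to the next.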

Gender is a categorical feature (it has a small, finite number of possible values), so it's easy to ensure that the distributions match. But income is continuous (there's an infinite number of possible values), so it's harder to ensure that the income distribution in each set closely matches the overall income distribution. One approach is to bucketize the income feature and use the resulting bucketized feature (which is categorical) for stratified sampling. For example, incomes from $0 to $15k are assigned to bucket 1, incomes from $15k to $30k to bucket 2, incomes from $30k to $45k to bucket 3, incomes from $45k to $60k to bucket 4, and incomes higher than $60k to bucket 5. Then we can estimate the proportion of districts in each bucket for the overall population, and ensure that the training set, validation set and test set have the same proportions (or as close as possible).

To create the bucketized feature described above, we can write:

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

Here's another approach that gives the same categories, stored as floats rather than as a categorical column (it's much less elegant, but that's what I used in the 1st edition of my book):

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

The first line divides the income by 1.5 and rounds the result up. For example, if the income is 4 (which represents $40,000 since incomes are already divided by 10,000 in the dataset), then the income category will be 4/1.5=2.67 rounded up, which is 3.

The second line replaces any income category greater than 5 with 5. The where() method confused many readers: it says "if income category < 5 then keep it, or else replace it with 5" (many readers thought it was saying the opposite).

You can easily check that these two approaches are equivalent, but the cut() approach is much clearer.
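
One possible way to run that check is sketched below on synthetic incomes (random values, not the real housing data). Note that cut() returns a categorical column while the ceil() approach returns floats, so the comparison converts the categories to floats first. The where() call is written in its assignment form here instead of inplace=True, which is my variation and not the book's code.

```python
import numpy as np
import pandas as pd

# Synthetic median incomes in units of $10,000 (made up for illustration).
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1_000)})

# Approach 1: explicit bins with pd.cut().
cut_cat = pd.cut(housing["median_income"],
                 bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5])

# Approach 2: divide by 1.5, round up, then cap the category at 5.
ceil_cat = np.ceil(housing["median_income"] / 1.5)
ceil_cat = ceil_cat.where(ceil_cat < 5, 5.0)  # keep values < 5, else use 5

# Compare the two after converting the categorical labels to floats.
same = bool((cut_cat.astype(float) == ceil_cat).all())
print("same categories:", same)
```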

I hope all of this is clear and that you'll enjoy the book!

jasonjoe2019 commented 4 years ago

Hi ageron,

Thanks for your great help! The explanation is very clear, it's absolutely very useful.

Thanks, Jason

jasonjoe2019 commented 4 years ago

Hi Ageron,

For the "Handling Text and Categorical Attributes" part of the book: after transforming the categorical data into the 2D binary array, how do I use it to prepare the training data? I think the other training data is not 1D either, so how do I combine the original numerical training data with the new 2D binary array, and then fit the model?

Thanks, Jason

ageron commented 4 years ago

Hi @jasonjoe2019 , Thanks for your question. Scikit-Learn estimators expect the input data to be a 2D array, with one row per instance, and one column per feature. Suppose your original input data has two numerical features and one categorical feature (for example, the weekday):

[
  [1.5, 2.0, "monday"],
  [2.5, 1.7, "wednesday"],
  [2.5, 1.7, "monday"],
  ...
]

The categorical feature (the weekday in this case) can be simply converted to a number directly, for example mapping "monday" to 0., "tuesday" to 1., and so on:

[
  [1.5, 2.0, 0.],
  [2.5, 1.7, 2.],
  [2.5, 1.7, 0.],
  ...
]

For some categorical features which have a natural order, that's fine. For weekdays, it's probably all right. But for other categorical features, it's not ideal, since there's no natural order. For example, suppose we're looking at a news article's category, which can be "financial", "fashion", "international", or "tech". There's no natural order between these categories, so we typically prefer a one-hot encoding (or else the model would wrongly assume that categories with nearby numbers are similar when in fact they're not). For example, "financial" will be mapped to [1., 0., 0., 0.], "fashion" will be mapped to [0., 1., 0., 0.], and so on. In that case, a dataset that originally looks like this:

[
  [1.5, 2.0, "fashion"],
  [2.5, 1.7, "fashion"],
  [2.5, 1.7, "financial"],
  ...
]

will become this:

[
  [1.5, 2.0, 0., 1., 0., 0.],
  [2.5, 1.7, 0., 1., 0., 0.],
  [2.5, 1.7, 1., 0., 0., 0.],
  ...
]
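
In code, one way to build this combined array is Scikit-Learn's ColumnTransformer, which passes the numerical columns through and one-hot encodes the categorical one. This is only a sketch: the column names x1, x2 and category are made up, and the category order is fixed by hand so that it matches the mapping above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The three example rows, with hypothetical column names.
data = pd.DataFrame({
    "x1": [1.5, 2.5, 2.5],
    "x2": [2.0, 1.7, 1.7],
    "category": ["fashion", "fashion", "financial"],
})

# Fix the column order of the one-hot block to match the text.
categories = ["financial", "fashion", "international", "tech"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["x1", "x2"]),  # keep numerical features as-is
        ("cat", OneHotEncoder(categories=[categories]), ["category"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X = preprocess.fit_transform(data)
print(X)
```

The resulting X is an ordinary 2D array with one row per instance, so it can be passed straight to an estimator's fit() method along with the labels.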

Does this help?

jasonjoe2019 commented 4 years ago

So one categorical feature is transformed into several binary features? Thanks.

ageron commented 4 years ago

Yes, when doing one-hot encoding, that's exactly right. 👍

Now there are other ways to encode categories. One of the most popular is to use an embedding matrix. This is just a matrix containing float values, with one row per category and an arbitrary number of columns (it's a hyperparameter you can tweak). For example, assuming the embedding vectors are 3D (i.e., there are 3 columns) and there are 4 categories (the same as earlier), we may have an embedding matrix like this:

[
    [0.5, 0.3, 0.1],  # financial
    [0.1, 0.7, 0.3],  # fashion
    [0.3, 0.7, 0.1],  # international
    [0.9, 0.9, 0.2]   # tech
]

In this case, each category is simply represented by the corresponding embedding vector, so assuming the original dataset is:

[
  [1.5, 2.0, "fashion"],
  [2.5, 1.7, "fashion"],
  [2.5, 1.7, "financial"],
  ...
]

then the resulting dataset is:

[
  [1.5, 2.0, 0.1, 0.7, 0.3],
  [2.5, 1.7, 0.1, 0.7, 0.3],
  [2.5, 1.7, 0.5, 0.3, 0.1],
  ...
]

Each category is just replaced with 3 features.
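
The lookup itself is just row indexing. Here is a minimal NumPy sketch using the same made-up embedding matrix and rows as above; the variable names are only for illustration.

```python
import numpy as np

categories = ["financial", "fashion", "international", "tech"]
embedding_matrix = np.array([
    [0.5, 0.3, 0.1],  # financial
    [0.1, 0.7, 0.3],  # fashion
    [0.3, 0.7, 0.1],  # international
    [0.9, 0.9, 0.2],  # tech
])

numeric = np.array([[1.5, 2.0],
                    [2.5, 1.7],
                    [2.5, 1.7]])
labels = ["fashion", "fashion", "financial"]

# Map each category name to its row index in the embedding matrix.
index = {name: i for i, name in enumerate(categories)}
rows = [index[label] for label in labels]

# Replace each category with its embedding vector and append it
# to the numerical features.
X = np.hstack([numeric, embedding_matrix[rows]])
print(X)
```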

Now you may wonder how to define the embedding matrix. When training neural networks, the most common solution is to make it part of the model, so its values can be tweaked by gradient descent. This way, during training the embedding vectors will gradually change, hopefully in a way that results in the most useful embedding vectors possible. For example, similar categories will likely end up with similar embedding vectors. Another solution is to use pretrained embeddings. You may have heard of pretrained "word embeddings": that's just a matrix containing one vector per word. You can just use it to encode words before feeding them to a machine learning algorithm. They were trained on huge datasets, so they're actually pretty good. For example, the embedding vector for the word "fantastic" is probably pretty close to the embedding for the word "awesome".

Hope this helps!

jasonjoe2019 commented 4 years ago

Thank you so much for your help! I learned a lot.

Tim4497 commented 4 years ago

Hey, I hope you can help me with this short question (I have the German translation of your book).

In Chapter 2, when analysing the best model, you list the most important features. Then you say that I can drop insignificant features like "NEAR OCEAN", "ISLAND", etc.

These are categorical features. My solution would be to change housing_cat to a binary category, which has 1 for INLAND and 0 for all others. Is that the right way? Or did you mean something different?

ageron commented 4 years ago

Hi @Tim4497 , Thanks for your question. Yes, you got it, that could definitely be one way to do this. However, I was thinking of dropping the unimportant features after the categorical feature is one-hot encoded. The result is the same, but this solution makes it perhaps a bit easier to try different options (e.g., dropping just ISLAND, or dropping NEAR OCEAN & ISLAND, etc.). I hope this makes sense. But your solution would work, so do whatever you feel more comfortable with.
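
To illustrate both options, here is a small pandas sketch on a handful of made-up rows. It uses get_dummies for brevity rather than the book's pipeline, so treat it as an illustration of the idea only.

```python
import pandas as pd

# A few made-up rows covering the five ocean_proximity categories.
cat = pd.Series(
    ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"],
    name="ocean_proximity",
)

# One-hot encode first, then drop whichever columns turn out unimportant.
one_hot = pd.get_dummies(cat).astype(float)
reduced = one_hot.drop(columns=["ISLAND", "NEAR OCEAN"])
print(list(reduced.columns))

# Keeping only the INLAND column gives the same values as a single
# binary "is INLAND" feature built directly from the category.
binary_inland = (cat == "INLAND").astype(float)
same = bool((one_hot["INLAND"] == binary_inland).all())
print("same as binary INLAND feature:", same)
```

Dropping columns after encoding makes it easy to try several subsets without changing the encoding step each time.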