Hello Aurelien, it's Fritz. I left a tweet for you and you redirected me to GitHub. I started working through the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, using the PyCharm IDE. Whenever I run the code, I get a totally new output. Sometimes the output (e.g., the root mean squared error for RandomForestRegressor) looks better, and when I run it again it gets worse or better than before. The same thing happens when I use GridSearchCV to tune hyperparameters: I can get two different answers for what the best estimator is.

Hi Fritz, Thanks for your question! There are several possible causes for randomness in Machine Learning programs. The first is simply that many algorithms rely on stochasticity. For example, RandomForestRegressor obviously relies on randomness. To make sure its outputs are reproducible, you need to use the same random seed every time you run it. That's the purpose of the random_state arguments. So first, make sure you set this argument (e.g., to random_state=42) for every class or function that has this argument. For example, the DecisionTreeRegressor and RandomForestRegressor classes use randomness, so you should set their random_state argument when creating an instance. In this case, this will probably suffice.
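As a minimal sketch of this, the snippet below trains the same forest twice with the same seed and compares the scores. It uses synthetic data rather than the housing dataset, and the helper name train_and_score is made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for this sketch, not the book's dataset).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

def train_and_score(seed):
    """Fit a small forest with the given seed and return its training RMSE."""
    model = RandomForestRegressor(n_estimators=10, random_state=seed)
    model.fit(X, y)
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# Same seed, same data: the two runs should give identical scores.
rmse_a = train_and_score(42)
rmse_b = train_and_score(42)
print(rmse_a == rmse_b)
```

The same idea applies to GridSearchCV: the estimator you pass in, and any cross-validation splitter that shuffles, each need their own fixed random_state.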

FYI, another source of randomness is Python's hash function, which comes into play whenever you rely on the order of items in a set (or in a dictionary, on Python versions before 3.7, where dictionaries did not yet guarantee insertion order). For example, open a Python shell, type list(set("abcdef")), look at the output, then close the shell, open a new one and try again. If you use Python 3, you will most likely get two different outputs. This is a security feature that protects against an (unlikely) denial-of-service attack. To disable it and get reproducible outputs, you must set the PYTHONHASHSEED environment variable to 0 (before starting Python). For example:

$ PYTHONHASHSEED=0 python
>>> list(set("abcdef")) # always returns the same order:
['b', 'a', 'd', 'c', 'f', 'e']
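One way to check this from within Python is to launch two fresh interpreters with the variable set and compare what they print. This is only a sketch: the helper name set_order is made up, and it assumes Python 3.7 or later for the capture_output argument:

```python
import os
import subprocess
import sys

def set_order(hash_seed):
    """Run a fresh interpreter with PYTHONHASHSEED set and return its output."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    completed = subprocess.run(
        [sys.executable, "-c", 'print(list(set("abcdef")))'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()

# With the seed fixed to 0, both child interpreters order the set identically.
first = set_order("0")
second = set_order("0")
print(first == second)
```

Setting the variable inside an already running interpreter has no effect on that interpreter, which is why the sketch starts new processes.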

Here are a few other sources of non-reproducibility:

To learn more, check out my video on this topic.

Hope this helps!

housing['income_cat'].where(housing['income_cat'] < 5.0, other=5.0, inplace=True)

Sir, what is happening here?

ageron commented 5 years ago

Hi @Samrat666 ,

Thanks for your question. The where() method can be confusing. It takes a condition as its first argument: wherever the condition is True, the result keeps the corresponding value from the original Series or DataFrame; everywhere else, it uses the value given by the other argument. With inplace=True, the original object is modified directly (otherwise a new object is created and returned).

So in this particular case, the line says "keep the income_cat column as it is when the value is <5.0, but for any value ≥5.0 use the other value, which is 5.0". In other words, it replaces all values greater than 5.0 with 5.0.
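To make this concrete, here is a minimal sketch on a small made-up Series (the values are invented, not taken from the housing data). It uses the non-inplace form, which returns a new Series and leaves the original untouched:

```python
import pandas as pd

# Hypothetical income categories, for illustration only.
income_cat = pd.Series([1.0, 3.0, 5.0, 7.0, 11.0])

# Keep each value where the condition (< 5.0) is True; use 5.0 elsewhere.
capped = income_cat.where(income_cat < 5.0, other=5.0)

# For this particular capping, clip(upper=5.0) gives the same result.
clipped = income_cat.clip(upper=5.0)

print(capped.tolist())
print(capped.equals(clipped))
```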

Hope this helps.

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))

Sir, could you please explain where the lambda parameter, i.e. id_, gets its value from, what value it would get, and why? Please help if you can.

ageron commented 4 years ago

Hi @Samrat666 ,

When you call apply() on a Pandas Series object, you must pass it a function with a single argument, and Pandas will call this function for each and every item in the Series. For example:

import pandas as pd
s = pd.Series([1,2,3])

def triple(x):
    print("I will triple", x)
    return x * 3

result = s.apply(triple)
print(result)

When you run this code, you get this output:

I will triple 1
I will triple 2
I will triple 3
0    3
1    6
2    9
dtype: int64

As you can see, the triple() function was called once per element in the Series.

Now back to the split_train_test_by_id() function: its data argument is a Pandas DataFrame. The first line of the function gets the column whose name is specified in the id_column argument. This returns a Pandas Series object, so that's what ids is. Next we call the apply() method on this Series, so it calls the given lambda once for each id in the column.

In the notebook, I assume that the ids are 64-bit integers, so id_ would just be an integer.
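Here is a small sketch that records what the lambda receives. The DataFrame is made up, and is_even() is a hypothetical stand-in for test_set_check(), not the notebook's actual logic:

```python
import pandas as pd

# Made-up data; the "index" column plays the role of id_column.
data = pd.DataFrame({"index": [10, 11, 12], "value": ["a", "b", "c"]})
ids = data["index"]

received = []  # every value that gets passed in as id_

def is_even(id_):
    # Hypothetical check, used only to show which values arrive here.
    received.append(id_)
    return id_ % 2 == 0

# Pandas calls the lambda once per item, passing that item as id_.
in_test_set = ids.apply(lambda id_: is_even(id_))

print(received)
print(in_test_set.tolist())
```

The received list ends up holding the ids from the column, in order, which shows that id_ is simply each id in turn.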

Side note: I chose to name the argument id_ rather than id because id is the name of a built-in function in Python. In fact, I should probably have called the hash argument hash_, since hash is a built-in function as well (this is why syntax highlighting displays it in blue).

I hope this helps.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Could you please help me understand how the variables strat_train_set and strat_test_set, instead of getting overwritten on every iteration, get updated with new values as if an append function were being used? I can't see one anywhere in the code above.

ageron commented 4 years ago

Hi @Samrat666 , Actually the for loop only runs once, since n_splits=1. If we set n_splits=2 or more, then the strat_train_set and strat_test_set variables would indeed get overwritten on each iteration.

Perhaps I should have used the following code; it might have been clearer:

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
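
To check the iteration count yourself, here is a minimal sketch on a toy DataFrame (the data and the helper name count_iterations are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data with two balanced categories; not the housing dataset.
toy = pd.DataFrame({"x": range(20), "cat": [1, 2] * 10})

def count_iterations(n_splits):
    """Return how many (train, test) index pairs the splitter yields."""
    split = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)
    count = 0
    for train_index, test_index in split.split(toy, toy["cat"]):
        count += 1  # any variable assigned here is overwritten on each pass
    return count

one = count_iterations(1)
three = count_iterations(3)
print(one, three)
```

If you ever need every split rather than just the last one, you would have to collect the index pairs yourself, for example with list(split.split(...)).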