Open FritzPeleke opened 5 years ago
Hi Fritz,
Thanks for your question!
There are several possible causes for randomness in Machine Learning programs. The first is simply that many algorithms rely on stochasticity. For example, RandomForestRegressor obviously relies on randomness. To make sure its outputs are reproducible, you need to use the same random seed every time you run it. That's the purpose of the random_state argument. So first, make sure you set this argument (e.g., to random_state=42) for every class or function that has this argument.
For example, the DecisionTreeRegressor and RandomForestRegressor classes use randomness, so you should set their random_state argument when creating an instance.
In this case, this will probably suffice.
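Here is a minimal sketch of that advice, not the code from the book: the synthetic data, the fit_and_predict() helper and the n_estimators=10 setting are assumptions made up for illustration. It trains the same RandomForestRegressor twice with the same random_state and checks that the predictions match.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for this illustration (also seeded so it is reproducible)
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

def fit_and_predict(seed):
    """Train a small forest with the given seed and predict the first 5 rows."""
    model = RandomForestRegressor(n_estimators=10, random_state=seed)
    model.fit(X, y)
    return model.predict(X[:5])

same_a = fit_and_predict(42)
same_b = fit_and_predict(42)

# Same seed, same data: the two runs should produce identical predictions
print(np.array_equal(same_a, same_b))
```

If random_state is left unset, each run draws a fresh seed, so the predictions (and any score computed from them) would typically change from run to run.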
FYI, another source of randomness is Python's hash function, which determines the iteration order of a set (dictionaries preserve insertion order since Python 3.7, so they are no longer affected). For example, open a Python shell, type list(set("abcdef")), look at the output, then close the shell, open a new one and try again. If you use Python 3, you will most likely get two different outputs. This is a safety feature, to avoid an (unlikely) denial of service attack. To disable this feature and have reproducible outputs, you must set the PYTHONHASHSEED environment variable to 0 (before starting Python). For example:
$ PYTHONHASHSEED=0 python3
>>> list(set("abcdef")) # always returns the same order:
['b', 'a', 'd', 'c', 'f', 'e']
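The same experiment can be scripted instead of typed into two shells. The sketch below is only an illustration: the set_order() helper is a hypothetical name, and it simply launches a fresh interpreter with PYTHONHASHSEED set. The final line shows an alternative that is not part of the original advice: sorting the set gives a stable order regardless of the hash seed.

```python
import os
import subprocess
import sys

def set_order(hash_seed):
    """Run list(set("abcdef")) in a fresh interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(list(set('abcdef')))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

first = set_order(0)
second = set_order(0)
print(first == second)  # both interpreters used the same hash seed

# Alternative (an addition, not from the original answer): impose an order explicitly
stable = sorted(set("abcdef"))
print(stable)
```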
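Here are a few other sources of non-reproducibility: functions that return items in an arbitrary order (e.g., os.listdir()) and some TensorFlow operations (e.g., the tf.reduce_sum() operation).
To learn more, check out my video on this topic.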
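For os.listdir(), mentioned above, the Python documentation states that the list is returned in arbitrary order, so one simple way to get a stable order is to sort it. This is a small sketch with made-up file names in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty files (hypothetical names, for illustration only)
    for name in ["b.csv", "a.csv", "c.csv"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # os.listdir() makes no promise about order, so sort before relying on it
    filenames = sorted(os.listdir(tmp_dir))

print(filenames)  # ['a.csv', 'b.csv', 'c.csv']
```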
Hope this helps!
housing['income_cat'].where(housing['income_cat']<5.0,other=5.0,inplace=True)
Sir, what is happening here?
Hi @Samrat666 ,
Thanks for your question. The where() method can be confusing. It takes a condition as the first argument, and wherever the condition is True, the result keeps the corresponding value of the original DataFrame (or Series, as here), or else it uses the value from the other argument. With inplace=True, the original object is modified directly (otherwise a new object is created and returned).
So in this particular case, the line says "keep the income_cat column as it is when the value is <5.0, but for any value ≥5.0 use the other value, which is 5.0". In other words, it replaces all values greater than 5.0 with 5.0.
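A tiny example may make this concrete. The values below are made up, and the sketch assigns the result back to the column instead of using inplace=True, since newer pandas versions may not propagate an in-place change on a column selection back to the parent DataFrame. It also checks that clip(upper=5.0) gives the same result here.

```python
import pandas as pd

# Made-up values standing in for the income_cat column
housing = pd.DataFrame({"income_cat": [1.0, 3.0, 5.0, 6.0, 11.0]})

# Keep values where the condition is True (< 5.0), otherwise use other=5.0
capped = housing["income_cat"].where(housing["income_cat"] < 5.0, other=5.0)

# For this particular condition, capping with clip() is equivalent
same_as_clip = capped.equals(housing["income_cat"].clip(upper=5.0))

# Assign back rather than modifying in place
housing["income_cat"] = capped

print(housing["income_cat"].tolist())  # [1.0, 3.0, 5.0, 5.0, 5.0]
print(same_as_clip)                    # True
```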
Hope this helps.
Really thanks sir for the reply sir your efforts are laudable sir..
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
Sir, could you please explain where the lambda parameter, i.e. id_, gets its value from, what value it would get, and why? Please sir, if you could help.
Hi @Samrat666 ,
When you call apply() on a Pandas Series object, you must pass it a function with a single argument, and Pandas will call this function for each and every item in the Series. For example:
import pandas as pd

s = pd.Series([1, 2, 3])

def triple(x):
    print("I will triple", x)
    return x * 3

result = s.apply(triple)
print("----")
print(result)
When you run this code, you get this output:
I will triple 1
I will triple 2
I will triple 3
----
0    3
1    6
2    9
dtype: int64
As you can see, the triple() function was called once per element in the Series.
Now back to the split_train_test_by_id() function: its data argument is a Pandas DataFrame. The first line of the function gets the column whose name is specified in the id_column argument. This returns a Pandas Series object, so that's what ids is.
Next we call the apply() method on this Series, so it calls the given lambda once for each id in the column.
In the notebook, I assume that the ids are 64-bit integers, so id_ would just be an integer.
Side note: I chose to name the argument id_ rather than id because id is the name of a built-in function in Python. In fact, I should probably have called the hash argument hash_, since hash is a built-in function as well (this is why syntax highlighting displays it in blue).
I hope this helps.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Please, could you help me understand how the variables strat_train_set and strat_test_set, instead of getting overwritten every time, get updated with new values as if by an append operation, which seems to be nowhere in the code above?
Hi @Samrat666 ,
Actually the for loop only runs once since n_splits=1. If we set n_splits=2 or more, then indeed the strat_train_set and strat_test_set variables would keep getting overwritten.
Perhaps I should have used the following code, which might have been clearer:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
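To check the behaviour described above, here is a small sketch on a made-up DataFrame (the column values and n_splits=3 are assumptions for illustration). The loop body runs once per split, so the sketch collects each train/test pair in a list rather than overwriting two variables. It uses iloc because split() yields positional indices; loc gives the same rows when the DataFrame has the default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up data: 20 rows, two equally frequent income categories
housing = pd.DataFrame({"value": np.arange(20), "income_cat": [1, 2] * 10})

split = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

folds = []
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # One iteration per split: keep every pair instead of overwriting
    folds.append((housing.iloc[train_index], housing.iloc[test_index]))

print(len(folds))                          # 3
print(len(folds[0][0]), len(folds[0][1]))  # 16 4
```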
Hello Aurélien, it's Fritz. I left a tweet for you and you redirected me to GitHub. I started working with the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, using the PyCharm IDE. Whenever I run the code, I get a totally new output. Sometimes the output (e.g., the root mean squared error for RandomForestRegressor) looks better, and when I run it again it gets worse or better than before. The same thing happens when I use GridSearchCV to tune parameters: I can obtain two different outputs for what the best estimator is. Below is my code for Chapter 2: