ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

New_outputs after each run of my code #469

Open FritzPeleke opened 5 years ago

FritzPeleke commented 5 years ago

Hello Aurelien, it's Fritz. I left a tweet for you and you redirected me to GitHub. I started working with the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, using the PyCharm IDE. Whenever I run the code, I get a totally new output. Sometimes the output (e.g. the root mean squared error for RandomForestRegressor) looks better, and when I run it again it is worse than before, or better than before. The same thing happens when I use GridSearchCV to tune parameters: I can obtain two different answers for what the best estimator is. Below is my code for Chapter 2:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import numpy as np
pd.options.display.width = 0

housing = pd.read_csv(r'https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv')
print(housing.head())
print(housing.info())

#Finding how districts are divided based on ocean_proximity (one could use the groupby method too)
print(housing['ocean_proximity'].value_counts())
print(housing['median_house_value'].value_counts().sort_values(ascending=False))

#showing a summary of numerical attributes
print(housing.describe())

#plotting histograms of the numerical attributes
#housing.hist(bins=50)
#plt.show()

##creating a test sample

train_set,test_set =  train_test_split(housing,test_size=0.2,random_state=42)
print(len(train_set))
print(len(test_set))

##Stratified sampling
housing['income_cat'] = np.ceil(housing['median_income']/1.5)
housing['income_cat'].where(housing['income_cat']<5.0,other=5.0,inplace=True)
print(housing['income_cat'].head())
#housing['income_cat'].hist()
#plt.show()
print(housing.head())
splitter = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in splitter.split(housing,housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
#proportions for each strata
#print(strat_test_set['income_cat'].value_counts()/len(strat_test_set))

#deleting the created column "income_cat" so the data looks like the original again
for set_ in (strat_train_set,strat_test_set):
    set_.drop('income_cat',axis=1,inplace=True)

##visualise data for more insight (make a copy of the training set to avoid harming the original)
housing_copy = strat_train_set.copy()
housing_copy.plot(kind='scatter',x='longitude',y='latitude',alpha=0.1)#alpha command shows areas of high density
#plt.show()
#let's visualize with more style (s = radius of circles, c = color)
housing_copy.plot(kind='scatter',x='longitude',y='latitude',s=housing_copy['population']/100,label='population',c='median_house_value',cmap=plt.get_cmap('jet'),colorbar=True,)
plt.legend()
#plt.show()

##Looking at correlations in data
cor_matrix = housing_copy.corr()
print(cor_matrix)
print(cor_matrix['median_house_value'].sort_values(ascending=False))
#checking correlation with pandas scatter_matrix_function
attributes = ['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(housing_copy[attributes],figsize=(12,8))
#plt.show()
#since median_income shows a strong correlation to median_house_value, let's focus on it
housing_copy.plot(kind='scatter',x='median_income',y='median_house_value',alpha=0.1)
#plt.show()
#attribute combinations, e.g. bedrooms per room, are more reasonable than just looking at the number of bedrooms
housing_copy['bedrooms_per_rooms'] = housing_copy['total_bedrooms']/housing_copy['total_rooms']
housing_copy['rooms_per_household']=housing_copy['total_rooms']/housing_copy['households']
print(housing_copy.corr().median_house_value.sort_values(ascending=False))

##Data cleaning: we make a new copy of the training set
train_set_copy = strat_train_set.copy().drop('median_house_value',axis=1)
train_set_labels = strat_train_set['median_house_value'].copy()
#print(train_set_copy.info())
#print(strat_train_set.info())
#cleaning
#there are 3 ways to deal with missing values: delete the rows, delete the whole column, or fill them in (mean, median or zero)
#here we use sklearn's SimpleImputer because it can compute the median or mean for all numerical attributes
imputer = SimpleImputer(strategy='median')
train_numeric_set = train_set_copy.drop('ocean_proximity',axis=1)
imputer.fit(train_numeric_set)
#print(imputer.statistics_)

X = imputer.transform(train_numeric_set)
#print(X)

##Handling attributes with text in two ways
housing_text_cat = housing_copy['ocean_proximity']
housing_text_cat_encoded, housing_categories = housing_text_cat.factorize()
#print(housing_text_cat_encoded)
#doing one hot encoding
encoder = OneHotEncoder(categories='auto')
housing_1hot = encoder.fit_transform(housing_text_cat_encoded.reshape(-1,1))

#one can use the OneHotEncoder directly without the need for pd.factorize()
tes = housing_copy['ocean_proximity'].copy()
enc = OneHotEncoder()
hs_cat_encoded = enc.fit_transform(tes.values.reshape(-1,1))
#print(hs_cat_encoded[:10])

#Building custom transformer

room_ix, bedrooms_ix,population_ix,household_ix = [list(housing.columns).index(col)
                                                   for col in ('total_rooms','total_bedrooms','population','households')]
class AttributeCombiner(BaseEstimator,TransformerMixin):

    def __init__(self,add_bedroom_per_room = True):
        self.add_bedroom_per_room = add_bedroom_per_room
    def fit(self,X,y = None):
        return self
    def transform(self,X,y = None):
        room_per_household = X[:,room_ix]/X[:,household_ix]
        population_per_household = X[:,population_ix]/X[:,household_ix]
        if self.add_bedroom_per_room:
            bedrooms_per_room = X[:,bedrooms_ix]/X[:,room_ix]
            return np.c_[X,room_per_household,population_per_household,bedrooms_per_room]
        else:
            return np.c_[X,room_per_household,population_per_household]
ComAtt = AttributeCombiner(add_bedroom_per_room=True)
new_att = ComAtt.fit_transform(housing.values)
#can create new df with new attributes and explore further
'''a = ['room_per_household','population_per_household','bedrooms_per_room']
columns = list(housing.columns)
print(columns)
columns = columns + a
new_df = pd.DataFrame(data=new_att,columns=columns)
print(new_df)'''

##Transformation Pipeline
#our pipeline requires numpy arrays but our data is a pandas DataFrame, so we create a transformer to convert it

class ArrayProducer(BaseEstimator,TransformerMixin):

    def __init__(self,sel_attributes):
        self.num_attributes = sel_attributes
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        return X[self.num_attributes].values

num_attributes = list(train_numeric_set.columns)
cat_attribute = ['ocean_proximity']
num_pipeline = Pipeline(steps=[('ArrayProducer',ArrayProducer(sel_attributes=num_attributes)),
                               ('imputer',SimpleImputer(strategy='median')),
                               ('att_combiner',AttributeCombiner()),
                               ('scaler',StandardScaler())
                               ])
cat_pipeline = Pipeline(steps=[('ArrayProducer',ArrayProducer(sel_attributes=cat_attribute)),
                               ('hotencoder',OneHotEncoder(sparse=False))])

full_pipeline = FeatureUnion(transformer_list=[('num_pipeline',num_pipeline),
                                               ('cat_pipeline',cat_pipeline)])

housing_processed = full_pipeline.fit_transform(housing_copy)

print(housing_processed)

##selecting a model
# train a Linear Regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
lin_reg = LinearRegression()
lin_reg.fit(housing_processed,train_set_labels)
lin_predictions = lin_reg.predict(housing_processed)
#calculating RMSE
lin_mse = mean_squared_error(train_set_labels,lin_predictions)
lin_rmse = np.sqrt(lin_mse)
print('lin_rmse=',lin_rmse)#rmse shows linear regression underfits the data

#train a decisiontreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_processed,train_set_labels)
tree_predictions = tree_reg.predict(housing_processed)
tree_mse = mean_squared_error(train_set_labels,tree_predictions)
tree_rmse = np.sqrt(tree_mse)
print('tree_rmse',tree_rmse)# 0 rmse is too good to be true. Model surely overfits the data

##Evaluation and cross validation
#cross validation for Decision Tree
tree_scores = cross_val_score(tree_reg,housing_processed,train_set_labels,scoring='neg_mean_squared_error',cv=10)
tree_rmse_cv = np.sqrt(-tree_scores)

def display_scores(scores):
    print('scores\n',scores)
    print('mean:',scores.mean())
    print('standard_deviation',scores.std())
display_scores(tree_rmse_cv)
#cross validation for Linear Regression
lin_scores = cross_val_score(lin_reg,housing_processed,train_set_labels,scoring='neg_mean_squared_error',cv=10)
lin_rmse_cv = np.sqrt(-lin_scores)
display_scores(lin_rmse_cv)
#cross validation on RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg = forest_reg.fit(housing_processed,train_set_labels)
predictions = forest_reg.predict(housing_processed)
for_mse = mean_squared_error(train_set_labels,predictions)
forest_rmse = np.sqrt(for_mse)
forest_scores = cross_val_score(forest_reg,housing_processed,train_set_labels,scoring='neg_mean_squared_error',cv=10)
forest_rmse_cv = np.sqrt(-forest_scores)
print(forest_rmse)
print('RandomforestRegressor')
display_scores(forest_rmse_cv)

'''# saving models
import joblib
joblib.dump(lin_reg,'lin_reg.pkl')
my_model_loaded = joblib.load('lin_reg.pkl')'''

##Fine tuning a model
from sklearn.model_selection import GridSearchCV
param_grid = [{'n_estimators':[3,10,30],'max_features':[2,4,6,8]},{'bootstrap':[False],'n_estimators':[3,10],'max_features':[2,3,4]}]
Gridsearch = GridSearchCV(forest_reg,cv=5,param_grid=param_grid,scoring='neg_mean_squared_error',refit=True)
Gridsearch.fit(housing_processed,train_set_labels)
#If you want to use best_estimator_ you need to set refit=True
#print(Gridsearch.best_estimator_)

'''cvres=Gridsearch.cv_results_
for mean_score,params in zip(cvres['mean_test_score'],cvres['params']):
    print(np.sqrt(-mean_score),params)'''
'''from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
par_dist = {'n_estimators': randint(low=1, high=200),
            'max_features': randint(low=1, high=10)}
rnd_search = RandomizedSearchCV(forest_reg,param_distributions=par_dist,n_iter=10,scoring='neg_mean_squared_error',cv=5,random_state=42,refit=False)
rnd_search.fit(housing_processed,train_set_labels)
rncvres = rnd_search.cv_results_
#df = pd.DataFrame(rncvres)'''

#Testing the model on the test set
final_model = Gridsearch.best_estimator_
x_test = strat_test_set.drop('median_house_value',axis=1)
y_test = strat_test_set['median_house_value'].copy()
x_test_prepared = full_pipeline.transform(x_test)
y_pred = final_model.predict(x_test_prepared)
mse_test = mean_squared_error(y_test,y_pred)
rmse_test = np.sqrt(mse_test)
print('Root_mean_squared_error:\n',rmse_test)
ageron commented 5 years ago

Hi Fritz, Thanks for your question! There are several possible causes for randomness in Machine Learning programs. The first is simply that many algorithms rely on stochasticity. For example, RandomForestRegressor obviously relies on randomness. To make sure its outputs are reproducible, you need to use the same random seed every time you run it. That's the purpose of the random_state arguments. So first, make sure you set this argument (e.g., to random_state=42) for every class or function that has this argument. For example, the DecisionTreeRegressor and RandomForestRegressor classes use randomness, so you should set their random_state argument when creating an instance. In this case, this will probably suffice.
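
For example, with the classes used in your code above, a minimal fix could look like this (a sketch only, reusing the variable names from your script):

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Fix the seed of every estimator that accepts a random_state argument,
# so repeated runs of the script produce identical results.
tree_reg = DecisionTreeRegressor(random_state=42)
forest_reg = RandomForestRegressor(random_state=42)

param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
Gridsearch = GridSearchCV(forest_reg, param_grid=param_grid, cv=5,
                          scoring='neg_mean_squared_error', refit=True)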

FYI, another source of randomness is Python's hash function, which is used every time you rely on the order of items in a dictionary or a set. For example, open a Python shell, type list(set("abcdef")), look at the output, then close the shell, open a new one and try again. If you use Python 3, you will get two different outputs. This is a safety feature, to avoid an (unlikely) denial of service attack. To ignore this feature and have reproducible outputs, you must set the PYTHONHASHSEED environment variable to 0 (before starting Python). For example:

$ PYTHONHASHSEED=0 python3
>>> list(set("abcdef")) # always returns the same order:
['b', 'a', 'd', 'c', 'f', 'e']

Here are a few other sources of non-reproducibility:

To learn more, check out my video on this topic.

Hope this helps!

Samrat666 commented 5 years ago

housing['income_cat'].where(housing['income_cat']<5.0,other=5.0,inplace=True) Sir, what is happening in this line?

ageron commented 5 years ago

Hi @Samrat666 ,

Thanks for your question. The where() method can be confusing. It takes a condition as the first argument; wherever the condition is True, the result keeps the DataFrame's corresponding value, and wherever it is False it uses the other value. With inplace=True, the original DataFrame is modified directly (otherwise a new DataFrame is created and returned).

So in this particular case, the line says "keep the income_cat column as it is when the value is <5.0, but for any value ≥5.0 use the other value, which is 5.0". In other words, it replaces all values greater than 5.0 with 5.0.
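
Here is a tiny, self-contained example (with made-up values, just to illustrate the behaviour):

import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 8.0])
# Keep values where the condition is True, replace the others with 5.0
print(s.where(s < 5.0, other=5.0))
# 0    1.0
# 1    3.0
# 2    5.0
# 3    5.0
# dtype: float64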

Hope this helps.

Samrat666 commented 5 years ago

Really, thanks for the reply, sir. Your efforts are laudable.


Samrat666 commented 4 years ago
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))

Sir, could you please specify where the lambda parameter (i.e. id_) gets its value from, what value it would get, and why? Please help if you can, sir.

ageron commented 4 years ago

Hi @Samrat666 ,

When you call apply() on a Pandas Series object, you must pass it a function with a single argument, and Pandas will call this function for each and every item in the Series. For example:

import pandas as pd
s = pd.Series([1,2,3])

def triple(x):
    print("I will triple", x)
    return x * 3

result = s.apply(triple)
print("----")
print(result)

When you run this code, you get this output:

I will triple 1
I will triple 2
I will triple 3
----
0    3
1    6
2    9
dtype: int64

As you can see, the triple() function was called once per element in the Series.

Now back to the split_train_test_by_id() function: its data argument is a Pandas DataFrame. The first line of the function gets the column whose name is specified in the id_column argument. This returns a Pandas Series object, so that's what ids is. Next we call the apply() method on this Series, so it calls the given lambda once for each id in the column.

In the notebook, I assume that the ids are 64-bit integers, so id_ would just be an integer.
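
For context, here is roughly how the two functions fit together in the chapter 2 notebook (a sketch from memory, so the exact code in the notebook may differ slightly):

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # An instance goes into the test set when the last byte of the hash of its id
    # is below test_ratio * 256, so roughly test_ratio of the ids end up there.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]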

Side note: I chose to name the argument id_ rather than id because id is the name of a built-in function in Python. In fact, I should probably have called the hash argument hash_, since hash is a built-in function as well (this is why syntax highlighting displays it in blue).

I hope this helps.

Samrat666 commented 4 years ago
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Please, could you help me understand how the variables strat_train_set and strat_test_set get their values here? It looks as if they would simply be overwritten on every iteration of the loop, yet there is no append anywhere in the code above...

ageron commented 4 years ago

Hi @Samrat666 , Actually the for loop only runs once, since n_splits=1. If we set n_splits=2 or more, then indeed the strat_train_set and strat_test_set variables would keep getting overwritten.

Perhaps I should have used the following code, it might have been clearer:

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]