Open Samrat666 opened 4 years ago
what does this line of code do? rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
could you please help me out with this line of code sir
rooms_ix
(room index) is a python variable holding the value of 3.
household_ix
(household index) is a python variable holding the value of 6.
X
is a numpy 2D array.
X[:, rooms_ix]
selects column 3 of array X, and it is a 1D array.
X[:, household_ix]
selects column 6 of array X, and it is a 1D array as well.
Then, the expression X[:, rooms_ix] / X[:, household_ix]
performs element-wise division of these two arrays, and the result is assigned to the Python variable rooms_per_household.
Read a bit about NumPy array indexing and arithmetic.
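To make the column selection and division above concrete, here is a minimal sketch. The array X below is made-up toy data (3 rows, 7 columns), not the real housing data; only the index values 3 and 6 come from the explanation above.

```python
import numpy as np

# Toy stand-in for the housing array (assumption): 3 rows, 7 columns.
X = np.arange(1, 22, dtype=float).reshape(3, 7)

rooms_ix, household_ix = 3, 6

rooms = X[:, rooms_ix]           # column 3 as a 1D array
households = X[:, household_ix]  # column 6 as a 1D array

# Element-wise division: rooms[i] / households[i] for every row i.
rooms_per_household = rooms / households
```

The result has one value per row of X, which is why it can later be attached to the data as a new feature column.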
Sir, could you please explain how does the code work:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
:-Could you please explain why we always use split along with the StratifiedShuffleSplit() i.e. as StratifiedShuffleSplit(...).split(...) and could you please explain why we always keep making the random_state = 42 where we could just make it equal to any other constant please sir if u could help me....
Please if you could also explain how the values are loaded to the variables train_set and test_set each time the loop executes.... Please sir sorry for the inconvenience but I would be highly grateful for the support
The line split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
creates an instance of the class StratifiedShuffleSplit, which is defined in the module sklearn.model_selection. You can consult the scikit-learn documentation to verify that StratifiedShuffleSplit is a class: the first line of its documentation page begins with class sklearn.model_selection.StratifiedShuffleSplit(...).
The line for train_index, test_index in split.split(housing, housing["income_cat"]):
is the for-loop statement, and it calls the split method (i.e. the second split in split.split(...)
) of the split object (i.e. the first split in split.split(...)
) created previously. Here, you can think of this split method as returning a generator. You don't have to know the exact details of how the split method is implemented. On each iteration of the loop, the generator yields a tuple of two arrays, and these are assigned to the Python variables train_index
and test_index
respectively.
Read about how the for-loop executes, and also read about what is a generator.
[Some hints: the for-loop stops iterating when StopIteration
is raised. And a generator is coded with the keyword yield
. You can investigate these to get a better understanding, but there is no need to overwhelm yourself with too many details at this stage. With some practice this becomes easier to grasp.]
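The hints above can be tried out on a tiny generator. This is only an illustrative sketch: make_pairs is a made-up function, not anything from scikit-learn.

```python
def make_pairs(n_pairs):
    """Hypothetical generator: yields n_pairs tuples, one per request."""
    for i in range(n_pairs):
        yield (i, i * 10)

# Driving the generator by hand with next().
gen = make_pairs(2)
first = next(gen)
second = next(gen)
try:
    next(gen)            # nothing left to yield
    exhausted = False
except StopIteration:
    exhausted = True     # this is the signal a for-loop waits for

# A for-loop does the same thing, but handles StopIteration for you.
collected = []
for a, b in make_pairs(2):
    collected.append((a, b))
```

Calling next() by hand is what the for-loop does behind the scenes; the loop simply ends quietly when StopIteration is raised.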
The line strat_train_set = housing.loc[train_index]
selects rows of housing,
a DataFrame object from the pandas module. The rows whose index labels appear in the NumPy array train_index
are selected. Here, you should read about DataFrame indexing. It is similar to NumPy array indexing but with some notable differences. If you have studied NumPy indexing already, this should not be too hard now.
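Here is a small sketch of that row selection. The DataFrame and the index array are made up for illustration. One assumption worth stating: .loc selects by index label while the split method yields integer positions, so the two only agree when the DataFrame has the default 0, 1, 2, ... index, as in this toy example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (assumption), default index 0..4.
housing = pd.DataFrame({"median_income": [1.5, 3.0, 4.5, 6.0, 7.5]})

# Made-up array of row indices, like the ones produced by a split.
train_index = np.array([4, 0, 2])

# .loc picks the rows whose index labels are 4, 0 and 2, in that order.
strat_train_set = housing.loc[train_index]

# With the default index, label-based and position-based selection agree.
same_as_iloc = strat_train_set.equals(housing.iloc[train_index])
```

If the DataFrame had been filtered or re-indexed before splitting, housing.iloc[train_index] would be the position-based way to select the same rows.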
:-Could you please explain why we always use split along with the StratifiedShuffleSplit() i.e. as StratifiedShuffleSplit(...).split(...)
StratifiedShuffleSplit is a class, so it has to be instantiated before it can be used. The instance created this way is then used to call the split method. This is a feature of object-oriented programming, and Python is an object-oriented language. This model of programming simplifies a lot of the coding process. Conceptually, it simplifies the design process as well. You should read some more about classes and instances in Python.
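The instantiate-then-call pattern can be seen on a toy class. ToySplitter below is purely hypothetical and much simpler than the real StratifiedShuffleSplit (no shuffling, no stratification); it only mirrors the shape of the API: settings go to the constructor, data goes to the split method.

```python
class ToySplitter:
    """Hypothetical stand-in for a splitter class, not scikit-learn's code."""

    def __init__(self, n_splits, test_size):
        # The constructor only stores the settings.
        self.n_splits = n_splits
        self.test_size = test_size

    def split(self, n_samples):
        # The method does the work, yielding one (train, test) tuple per split.
        n_test = int(n_samples * self.test_size)
        for _ in range(self.n_splits):
            indices = list(range(n_samples))
            yield indices[n_test:], indices[:n_test]

splitter = ToySplitter(n_splits=1, test_size=0.2)  # step 1: create an instance
pairs = list(splitter.split(10))                   # step 2: call its method
train_index, test_index = pairs[0]
```

Because the settings live in the instance, the same splitter object could be reused on several datasets without repeating them.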
Please if you could also explain how the values are loaded to the variables train_set and test_set each time the loop executes....
Consider the following simpler cases to understand what is happening:
x,y = (3,4)
When the above code is executed in Python, x
will hold the value 3 and y
will hold the value 4.
x,y = ([1,2,3,4],[5,6,7,8])
When the latter is executed, x
will hold the list [1,2,3,4] and y
will hold the list [5,6,7,8].
Now, on each iteration of the loop, a new tuple is produced and the variables train_index
and test_index
are reassigned each time. And then these variables are used in the body of the for-loop. This process repeats until the loop exits.
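The same unpacking inside a for-loop looks like this. The list of tuples below is made up for illustration; in the real code the tuples come from split.split(...).

```python
# Made-up tuples of (train indices, test indices), two "splits" in total.
pairs = [
    ([0, 1, 2, 3], [4]),
    ([1, 2, 3, 4], [0]),
]

seen = []
for train_index, test_index in pairs:
    # Each variable receives one whole list, whatever its length.
    seen.append((len(train_index), len(test_index)))
```

Note that the two lists in a tuple do not need to have the same length: each one is assigned as a single object.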
and could you please explain why we always keep making the random_state = 42 where we could just make it equal to any other constant
There is no particular significance to the number 42 in programming. You can use another number if you want to, but then a different split would be produced. By setting random_state to a fixed number, the sequence of pseudo-random numbers generated can be reproduced every time the code is run.
The usual explanation for why 42 is so popular is a book, The Hitchhiker's Guide to the Galaxy, written by Douglas Adams. In that book, a supercomputer is asked for the answer to the ultimate question of life, the universe, and everything, and it answers 42! Since then, the number 42 has become somewhat of a geek thing used to initialize random number generators! Take it humorously! You can set your random_state to another number, it does not matter.
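The reproducibility point can be checked with NumPy's own random number generator. This sketch uses np.random.RandomState directly rather than scikit-learn; the seed 7 is just another arbitrary number chosen for comparison.

```python
import numpy as np

# Same seed: the same "random" shuffle of 0..9 every time.
a = np.random.RandomState(42).permutation(10)
b = np.random.RandomState(42).permutation(10)

# Another seed: still a valid shuffle, but generally a different one.
c = np.random.RandomState(7).permutation(10)

same_seed_matches = np.array_equal(a, b)
```

So the particular seed does not matter; what matters is that it stays the same between runs if you want the same train and test sets each time.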
Really Really highly obliged sir...
Sir, but in the for loop, i.e. for train_index, test_index in split.split(housing, housing["income_cat"]): , you said that each time a tuple is assigned to the two variables and each value is stored in the corresponding variable. But train_index is 80 percent of the data and test_index is just 20 percent, so after 20 percent of the loading the values assigned to test_index will be exhausted. What would the tuple assigned to both carry then, sir?? I remember u told me not to get into this stuff much but out of curiosity please if u could help
I remember u told me not to get into this stuff much but out of curiosity please if u could help
Yeah, I told you that, but obviously you did not listen to me!!
Your question is in relation to how for-loops work in Python. When you run the following:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
a StratifiedShuffleSplit object is created.
Notice that you also specify the number of splits in the constructor, namely n_splits, which is here set to 1. The split method will therefore yield exactly 1 tuple. If you had specified more than 1, say 13, then 13 tuples would have been yielded, one per iteration. Each tuple holds two complete arrays: the training indices (about 80% of the rows) and the test indices (the remaining 20% or so). The arrays are assigned whole, not element by element, so the shorter one cannot run out before the longer one.
Then in the for-loop, these tuples are iterated over and assigned as I had explained to you.
for train_index, test_index in split.split(housing, housing["income_cat"]):
When there are no more tuples, a StopIteration exception is raised and the for-loop exits. When the for-loop exits, the body of the loop is not executed anymore; instead, the code right after the loop and its body executes.
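To see how many times the loop body runs and what each tuple contains, here is a small sketch with n_splits set to 3. The housing DataFrame is a made-up stand-in (100 rows, 5 income categories of 20 rows each), so the sizes below apply to this toy data only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the housing data (assumption): 100 rows, 5 categories.
housing = pd.DataFrame({
    "median_income": np.linspace(0.5, 10.0, 100),
    "income_cat": np.repeat([1, 2, 3, 4, 5], 20),
})

split = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

sizes = []
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # Each iteration receives two complete index arrays.
    sizes.append((len(train_index), len(test_index)))
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Number of test rows taken from each income category in the last split.
test_counts = strat_test_set["income_cat"].value_counts()
```

The loop body runs three times, once per split, and every time train_index holds 80 row indices and test_index holds 20. After the loop, strat_train_set and strat_test_set keep the values from the last iteration, which is why n_splits = 1 is enough when you only need one train/test split.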
cmap=plt.cm.gray, cmap=mpl.cm.binary and cmap=plt.get_cmap("gray")
I just wanted to ask what the difference is between these three and the other functions. Please help me out sir, I searched a lot on the internet about this issue but could not resolve it. My questions are: 1> Are these three almost the same function, or are they different, with different sets of color maps? 2> Are the color maps the same but behave differently when used through these three? 3> If not, are these functions specific to some data or dataset, and if not, what is the use of these three statements? You can understand my breadth of confusion sir please help...