Hello hello Nicole 🙋🏻♀️, here we go with the review of your project!
Wow Nicole! Really good work in the readme!! Just a few details:
The Structure of the Data part of the readme lists all the variables we are going to work with. It would be perfect if you briefly described each of them. Maybe you could also include a little description of the dataset.
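For example, a small table in the readme could look like this (the column names come from your code; the descriptions are just a sketch, adapt them to the real dataset):

| variable | description |
|----------|-------------|
| price | sale price of the house (our target) |
| sqft_living | living area in square feet |
| yr_renovated | year of the last renovation |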
You have a lot of files, which means your repo can look a bit "disorganised".
How could we organise this?
We could create different folders according to the type of file we store in them. For example:
- A `data` folder where we store all the data files we have, both the original ones and the ones we create ourselves.
- A `SQL` folder where we store all the SQL files with the queries.
- A `Notebooks` folder where we have all the Jupyter notebooks. If we have several, ideally we number them so the working order is clear.
- A `src` folder where we have the `.py` file or files with all the functions we will use in the project.
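Putting it together, the repo could look something like this (the file names are just placeholders):

```
mid-bootcamp-project/
├── data/
│   ├── regression_data.csv
│   └── cleaned_data.csv
├── SQL/
│   └── queries.sql
├── Notebooks/
│   ├── 1_cleaning.ipynb
│   └── 2_model.ipynb
├── src/
│   └── functions.py
└── README.md
```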
Then, if you have duplicate files like `regression_project_tableau_backup.twb`, delete one of them.
You also have two note files and a PowerPoint; if they are personal notes, add them to the `.gitignore` (that's the main use of the `.gitignore`: files that are important to us but that we don't want to publish in the repo).
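A minimal sketch of what the `.gitignore` could contain (the patterns are just placeholders for your note files and the PowerPoint):

```
# personal notes and slides we don't want to publish
notes*.txt
*.pptx

# Tableau backups
*_backup.twb
```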
Finally, you have the folder `mid-bootcamp-project-1`, which cannot be selected or accessed. Try to remove it in the future.
Well Nicole, I've left you a good rant here, although it's true that we've never done much work on the organisation of the repos. I hope this helps you in the future.
Let's go with the code!
`Regression Project - Housing Prices` file:
In the *use categories from Tableau exercise* cells for the `yr_renovated` and the other year column: if you look at the code, it is the same but applied to different columns. What you could do is create a function that receives the column name as a parameter, so you can use it for both columns. The code would look something like this:
```python
import numpy as np

def categories(col):
    # bin the years of the given column into four labelled periods
    conditions = [
        ((data[col] >= 1900) & (data[col] < 1930)),
        ((data[col] >= 1930) & (data[col] < 1960)),
        ((data[col] >= 1960) & (data[col] < 1990)),
        ((data[col] >= 1990) & (data[col] <= 2015))
    ]
    values = ['A', 'B', 'C', 'D']
    data[col] = np.select(conditions, values)
    return "Change done"

# when you call the function, pass the column name;
# the function already overwrites the column in place
categories("yr_renovated")
```
🔴 It is also true that you make this change on these columns and then drop them because they are of no interest to our model, so I wouldn't do this part here. In the end it is not necessary to clean all the columns, only the ones we are interested in 😉!
Add more explanation in some parts, for example for the correlation matrix: what do you conclude from this matrix? Do you do something with the values that are highly correlated?
Let's go with this code:
```python
num_df_win['sqft_living_99%'] = winsorize(num_df['sqft_living'], limits=(0, 0.01))
num_df_win['sqft_lot_99%'] = winsorize(num_df['sqft_lot'], limits=(0, 0.01))
num_df_win['grade_99%'] = winsorize(num_df['grade'], limits=(0, 0.01))
num_df_win['sqft_above_99%'] = winsorize(num_df['sqft_above'], limits=(0, 0.01))
num_df_win['sqft_basement_99%'] = winsorize(num_df['sqft_basement'], limits=(0, 0.01))
num_df_win['sqft_living15_99%'] = winsorize(num_df['sqft_living15'], limits=(0, 0.01))
num_df_win['sqft_lot15_92.5%'] = winsorize(num_df['sqft_lot15'], limits=(0, 0.075))
num_df_win['price_wins_97.5%'] = winsorize(num_df['price'], limits=(0, 0.025))
```
You repeated the same code several times. The solution? Create a for loop. Here are some lines of code:
```python
from scipy.stats.mstats import winsorize

# create different lists depending on the limits
limit_01 = ['sqft_living', 'sqft_lot', 'grade', 'sqft_above', 'sqft_basement', 'sqft_living15']
limit_075 = ['sqft_lot15']
limit_025 = ['price']

for col in num_df.columns:
    if col in limit_01:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.01))
    elif col in limit_075:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.075))
    # check the last list explicitly: a bare else would
    # winsorize every remaining column too
    elif col in limit_025:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.025))
```

If you have doubts about this code, tell me and I'll try to help you.
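Another option, since each column needs its own limits, is a dictionary mapping column to limits; a sketch with the values from your code:

```python
from scipy.stats.mstats import winsorize

# column -> winsorize limits, taken from the values above
limits_per_col = {
    'sqft_living': (0, 0.01),
    'sqft_lot': (0, 0.01),
    'grade': (0, 0.01),
    'sqft_above': (0, 0.01),
    'sqft_basement': (0, 0.01),
    'sqft_living15': (0, 0.01),
    'sqft_lot15': (0, 0.075),
    'price': (0, 0.025),
}

for col, lims in limits_per_col.items():
    num_df_win[col] = winsorize(num_df[col], limits=lims)
```

This way each column's limits live in one place, which is easier to tweak later.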
Regarding the KNN model, when you chose the best value of k you did it perfectly. As a detail, in Python we have the `KElbowVisualizer` class (from the yellowbrick library), which allows us to select the optimal number of clusters in a simple way by fitting the model over a range of k values.
Here is some documentation.
```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# here is an example of what the code would look like
# set the model
model = KMeans()

# initialize the visualizer; k is the range of k values we want to test
visualizer = KElbowVisualizer(model, k=(2, 15), metric='silhouette')

# fit the model for every k in the range
visualizer.fit(X)

# show a plot highlighting the optimal number of clusters
visualizer.show()
```
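After fitting, the chosen k is also available programmatically, so you don't have to read it off the plot:

```python
# the k yellowbrick picked, and the score it achieved there
best_k = visualizer.elbow_value_
best_score = visualizer.elbow_score_
```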
Regarding the functions: when we write a function, it is mandatory to create a docstring.
What is it?
It is a string that describes what the code does. What should we put in the docstring of a function?
```python
import numpy as np

# taking one of your functions as an example:
def remove_outliers(data, threshold=1.25, in_columns=None, skip_columns=[]):
    '''
    Remove dataset outliers.

    Args:
        data (DataFrame): the target data set
        threshold (float): IQR multiplier, 1.25 by default
        in_columns (list): names of the columns we are interested in;
            by default all the numeric columns
        skip_columns (list): columns to leave untouched

    Returns:
        The same data set without the outliers
    '''
    # a default value cannot reference another parameter,
    # so we resolve the numeric columns inside the function
    if in_columns is None:
        in_columns = data.select_dtypes(np.number).columns
    for column in in_columns:
        if column not in skip_columns:
            upper = np.percentile(data[column], 75)
            lower = np.percentile(data[column], 25)
            iqr = upper - lower
            upper_limit = upper + (threshold * iqr)
            lower_limit = lower - (threshold * iqr)
            data = data[(data[column] > lower_limit) & (data[column] < upper_limit)]
    return data
```
Here is some documentation about docstrings in Python.
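Once the docstring is there, anyone (including future you) can read it straight from the notebook:

```python
# both show the docstring we just wrote
help(remove_outliers)
print(remove_outliers.__doc__)
```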
SQL questions - regression
Really good job in this part of the project Nicole!!! Nothing to say 🔝💪
As a recap of the correction:
- describe the dataset and its variables in the readme
- organise the repo into folders (`data`, `SQL`, `Notebooks`, `src`)
- put the `.py` file with all the functions in `src`
- avoid repeating code: use functions and for loops
- write a docstring for every function

Even with all that I said, very good job Nicole! 💪🔥
https://github.com/nicolerichter1989/mid-bootcamp-project