Hello hello Nicole 🙋🏻♀️, here we go with the review of your project!
Wow Nicole! Really good work in the readme!! Just a few details:
The Structure of the Data part of the readme lists all the variables we are going to work with. It would be perfect if you briefly described each of them. Maybe you could also include a little description of the dataset.
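For example, a small table in the readme could look like this (the column names come from your code; the descriptions are just a sketch, adapt them to the real dataset):

| variable | description |
|----------|-------------|
| price | sale price of the house (our target) |
| sqft_living | living area in square feet |
| yr_renovated | year of the last renovation |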
You have a lot of files, which means your repo can look a bit "disorganised".
How could we organise this?
We could create different folders according to the type of file we store in them. For example:
- A `data` folder where we store all the data files we have, both the original ones and the ones we create ourselves.
- A `SQL` folder where we store all the SQL files with the queries.
- A `Notebooks` folder where we have all the Jupyter notebooks. If we have several, ideally we number them so the working order is clear.
- A `src` folder where we have the `.py` file or files with all the functions we will use in the project.
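Putting it together, the repo could look something like this (the file names are just placeholders):

```
mid-bootcamp-project/
├── data/
│   ├── regression_data.csv
│   └── cleaned_data.csv
├── SQL/
│   └── queries.sql
├── Notebooks/
│   ├── 1_cleaning.ipynb
│   └── 2_model.ipynb
├── src/
│   └── functions.py
└── README.md
```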
Then, if you have duplicate files like `regression_project_tableau_backup.twb`, delete one of them.
You also have two note files and a PowerPoint; if they are personal notes, add them to the `.gitignore` (that's the main use of the `.gitignore`: files that are important to us but that we don't want to publish in the repo).
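A minimal sketch of what the `.gitignore` could contain (the patterns are just placeholders for your note files and the PowerPoint):

```
# personal notes and slides we don't want to publish
notes*.txt
*.pptx

# Tableau backups
*_backup.twb
```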
Finally, you have the folder `mid-bootcamp-project-1`, which cannot be selected or accessed. Try to remove it in the future.
Well Nicole, I've left you a good rant here, although it's true that we've never done much work on the organisation of the repos. I hope this helps you in the future.
Let's go with the code!
`Regression Project - Housing Prices` file:
In the *use categories from Tableau exercise* cells for the `yr_renovated` and the other year column: if you look at the code, it is the same but applied to different columns. What you could do is create a function that receives the column name as a parameter, so you can use it for both columns. The code would look something like this:
```python
import numpy as np

def categories(col):
    # bin the years of the given column into four labelled periods
    conditions = [
        ((data[col] >= 1900) & (data[col] < 1930)),
        ((data[col] >= 1930) & (data[col] < 1960)),
        ((data[col] >= 1960) & (data[col] < 1990)),
        ((data[col] >= 1990) & (data[col] <= 2015))
    ]
    values = ['A', 'B', 'C', 'D']
    data[col] = np.select(conditions, values)
    return "Change done"

# when you call the function, pass the column name;
# the function already overwrites the column in place
categories("yr_renovated")
```
🔴 It is also true that you make this change on these columns and then drop them because they are of no interest to our model, so I wouldn't do this part here. In the end it is not necessary to clean all the columns, only the ones we are interested in 😉!
Add more explanation in some parts, for example for the correlation matrix: what do you conclude from this matrix? Do you do something with the values that are highly correlated?
Let's go with this code:
```python
num_df_win['sqft_living_99%'] = winsorize(num_df['sqft_living'], limits=(0, 0.01))
num_df_win['sqft_lot_99%'] = winsorize(num_df['sqft_lot'], limits=(0, 0.01))
num_df_win['grade_99%'] = winsorize(num_df['grade'], limits=(0, 0.01))
num_df_win['sqft_above_99%'] = winsorize(num_df['sqft_above'], limits=(0, 0.01))
num_df_win['sqft_basement_99%'] = winsorize(num_df['sqft_basement'], limits=(0, 0.01))
num_df_win['sqft_living15_99%'] = winsorize(num_df['sqft_living15'], limits=(0, 0.01))
num_df_win['sqft_lot15_92.5%'] = winsorize(num_df['sqft_lot15'], limits=(0, 0.075))
num_df_win['price_wins_97.5%'] = winsorize(num_df['price'], limits=(0, 0.025))
```
You repeated the same code several times. The solution? Create a for loop. Here are some lines of code:
```python
from scipy.stats.mstats import winsorize

# create different lists depending on the limits
limit_01 = ['sqft_living', 'sqft_lot', 'grade', 'sqft_above', 'sqft_basement', 'sqft_living15']
limit_075 = ['sqft_lot15']
limit_025 = ['price']

for col in num_df.columns:
    if col in limit_01:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.01))
    elif col in limit_075:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.075))
    # check the last list explicitly: a bare else would
    # winsorize every remaining column too
    elif col in limit_025:
        num_df_win[col] = winsorize(num_df[col], limits=(0, 0.025))
```

If you have doubts about this code, tell me and I'll try to help you.
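Another option, since each column needs its own limits, is a dictionary mapping column to limits; a sketch with the values from your code:

```python
from scipy.stats.mstats import winsorize

# column -> winsorize limits, taken from the values above
limits_per_col = {
    'sqft_living': (0, 0.01),
    'sqft_lot': (0, 0.01),
    'grade': (0, 0.01),
    'sqft_above': (0, 0.01),
    'sqft_basement': (0, 0.01),
    'sqft_living15': (0, 0.01),
    'sqft_lot15': (0, 0.075),
    'price': (0, 0.025),
}

for col, lims in limits_per_col.items():
    num_df_win[col] = winsorize(num_df[col], limits=lims)
```

This way each column's limits live in one place, which is easier to tweak later.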
Regarding the KNN model, when you chose the best value of k you did it perfectly. As a detail, in Python we have the `KElbowVisualizer` class (from the yellowbrick library), which allows us to select the optimal number of clusters in a simple way by fitting the model over a range of k values.
Here is some documentation.
```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# here is an example of what the code would look like
# set the model
model = KMeans()

# initialize the visualizer; k is the range of k values we want to test
visualizer = KElbowVisualizer(model, k=(2, 15), metric='silhouette')

# fit the model for every k in the range
visualizer.fit(X)

# show a plot highlighting the optimal number of clusters
visualizer.show()
```
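After fitting, the chosen k is also available programmatically, so you don't have to read it off the plot:

```python
# the k yellowbrick picked, and the score it achieved there
best_k = visualizer.elbow_value_
best_score = visualizer.elbow_score_
```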
Regarding the functions: when we write a function, it is mandatory to create a docstring.
What is it?
It is a string that describes what the code does. What should we put in the docstring of a function?
```python
import numpy as np

# taking one of your functions as an example:
def remove_outliers(data, threshold=1.25, in_columns=None, skip_columns=[]):
    '''
    Remove dataset outliers.

    Args:
        data (DataFrame): the target data set
        threshold (float): IQR multiplier, 1.25 by default
        in_columns (list): names of the columns we are interested in;
            by default all the numeric columns
        skip_columns (list): columns to leave untouched

    Returns:
        The same data set without the outliers
    '''
    # a default value cannot reference another parameter,
    # so we resolve the numeric columns inside the function
    if in_columns is None:
        in_columns = data.select_dtypes(np.number).columns
    for column in in_columns:
        if column not in skip_columns:
            upper = np.percentile(data[column], 75)
            lower = np.percentile(data[column], 25)
            iqr = upper - lower
            upper_limit = upper + (threshold * iqr)
            lower_limit = lower - (threshold * iqr)
            data = data[(data[column] > lower_limit) & (data[column] < upper_limit)]
    return data
```
Here is some documentation about docstrings in Python.
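Once the docstring is there, anyone (including future you) can read it straight from the notebook:

```python
# both show the docstring we just wrote
help(remove_outliers)
print(remove_outliers.__doc__)
```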
SQL questions - regression
Really good job in this part of the project Nicole!!! Nothing to say 🔝💪
As a recap of the correction:
- describe the dataset and its variables in the readme
- organise the repo into folders (`data`, `SQL`, `Notebooks`, `src`)
- put the `.py` file with all the functions in `src`
- avoid repeating code: use functions and for loops
- write a docstring for every function

Even with all that I said, very good job Nicole! 💪🔥
https://github.com/nicolerichter1989/mid-bootcamp-project