Closed: Elissadejong closed this 3 years ago
Hi Elissa, here I am with the first project review.
First of all, suuper specifying the objective and the tools used in this project 🔝
What did I miss in your readme? You have left some sections empty (such as `visualization` or `List of libraries`); avoid doing this. It is better not to include them and add more content later than to leave things empty. Regarding the libraries, put them in a list to make the readme more user friendly. If you add a link to each library's official website, it would be perfect. For example:
How can we include a link in markdown?
[pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
If you use this structure you create a link like the ones we create for you when we send you documentation.
Example 👇🏽👇🏽
Here is a markdown cheat-sheet in case it helps you in the future.
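For instance, a `Libraries` section of the readme could look like this (the library names and doc links below are just illustrative examples):

```markdown
## Libraries
- [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
- [numpy](https://numpy.org/doc/stable/)
```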
Overall, the structure of the repo is perfect: you have all the required files and, most importantly, no temporary files such as `.vscode` or `.DS_Store` have been left out of the `.gitignore`.
I only have two questions here:
You include the file `P1_fifa_money_ball.sql`. However, in this file you only have one line of code:

```sql
USE P1_FIFA_money_ball;
```

What have you used this file for? It is important not to include material within our repo that does not add value to it or that we will not use.
Why did you do this?

```python
data['height'] = data['height'].str.replace('"',"")
data['height'] = pd.to_numeric(data['height'].map(lambda x: int(x.split("'")[0])*30.48 + int(x.split("'")[1])*2.54))
```
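For context, those lines appear to convert heights stored as feet-and-inches strings into centimetres (1 ft = 30.48 cm, 1 in = 2.54 cm). A minimal, self-contained sketch of the same transformation on made-up sample data:

```python
import pandas as pd

# hypothetical sample data: heights stored as feet'inches" strings
data = pd.DataFrame({'height': ['5\'11"', '6\'2"']})

# drop the trailing inch mark, then feet -> 30.48 cm and inches -> 2.54 cm
data['height'] = data['height'].str.replace('"', '')
data['height'] = pd.to_numeric(
    data['height'].map(lambda x: int(x.split("'")[0]) * 30.48
                       + int(x.split("'")[1]) * 2.54)
)
# 5'11" becomes roughly 180.34 cm, 6'2" roughly 187.96 cm
```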
I love 😍 how you have divided your jupyter into the different working steps, very nice work here 💪!!!
Suuper the management you did with the `weight`, `height`, `release_clause` and `value` columns.
You have the same line of code to clean two different columns, `value` and `wage`. To avoid having duplicate code we can create functions, for example:
```python
def cleaning_symbols(data, col):
    # strip the euro symbol and expand the K/M suffixes
    data[col] = data[col].map(lambda x: x.lstrip('€'))
    data[col] = data[col].str.replace('K', '000')
    data[col] = data[col].str.replace('M', '00000')
    data[col] = data[col].str.replace('.', '', regex=False)
    return data

# important: here we pass column names as strings
data = cleaning_symbols(data, "wage")
data = cleaning_symbols(data, "value")
```
This will make our work much cleaner and easier to follow.
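To sanity-check the idea, here is a self-contained, lightly cleaned-up version of that helper run on a toy dataframe (the sample wages and values are made up):

```python
import pandas as pd

def cleaning_symbols(data, col):
    # strip the euro symbol and expand the K/M suffixes
    data[col] = data[col].map(lambda x: x.lstrip('€'))
    data[col] = data[col].str.replace('K', '000')
    data[col] = data[col].str.replace('M', '00000')
    data[col] = data[col].str.replace('.', '', regex=False)
    return data

data = pd.DataFrame({'wage': ['€150K', '€90K'],
                     'value': ['€110.5M', '€75.5M']})
data = cleaning_symbols(data, 'wage')
data = cleaning_symbols(data, 'value')
# data['wage']  -> ['150000', '90000']
# data['value'] -> ['110500000', '75500000']
```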
Suuper this `# dropping columns with more than 75% of NaN values`: you have made a decision based on data 🚀, this is the objective 👏🏽
Wow Elissa!!! You include SQL in your project, this is amazing!
When you merge the different dataframes, you have merged them one by one, which is not bad, but again you are repeating the same line of code many times. What can we do to avoid this?
```python
from functools import reduce

# compile the list of dataframes you want to merge
data_frames = [data_mentalities, data_aggression, data_interceptions,
               data_positioning, data_vision, data_penalties, data_composure]
# I think the right `how` here is 'outer', but 'inner' might be better 🤔 (I'm not sure)
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['club'], how='outer'),
                   data_frames)
```
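A minimal, self-contained sketch of the same `reduce` pattern on toy dataframes (club names and values are made up): `how='outer'` keeps every club seen in any dataframe (missing values become NaN), while `how='inner'` would keep only clubs present in all of them.

```python
from functools import reduce

import pandas as pd

df_a = pd.DataFrame({'club': ['Ajax', 'PSV'], 'aggression': [60, 70]})
df_b = pd.DataFrame({'club': ['Ajax', 'Feyenoord'], 'vision': [80, 75]})
df_c = pd.DataFrame({'club': ['Ajax', 'PSV'], 'composure': [85, 65]})

data_frames = [df_a, df_b, df_c]

# outer join keeps all three clubs; an inner join would keep only Ajax
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['club'], how='outer'),
                   data_frames)
```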
Regarding the model:

- You have included a lot of variables in the `heatmap`. This has led you to have a heatmap that is quite difficult to interpret. To avoid this situation we can remove those variables that we know are not important to our model, or select only those that we know are important (with our a priori knowledge).
- With the `describe` method, it is not necessary to first create a "sub-dataframe" with the numerical columns; by default the `describe` method only takes the numerical columns into account.
- In `fill_missing_n`, only as a detail: as a convention, the imports in Python usually go at the beginning of the jupyter.
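As an illustration of the heatmap point, correlating only a hand-picked subset of columns keeps the matrix small and readable (the column names and random data below are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the project's dataframe
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 4)),
                    columns=['overall', 'potential', 'value', 'wage'])

# correlate only the variables we believe matter a priori,
# instead of every numeric column in the dataframe
key_cols = ['overall', 'value', 'wage']
corr = data[key_cols].corr()
# corr is a small 3x3 matrix that can then be passed to e.g. seaborn's heatmap
```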
Overall your project is super complete Elissa, super good work, seriously!!!! Here are a few things you could work on in the future:
Visualization: when we are doing machine learning models, visualisation is super important, it can become our best friend 🤣! Not only will it help us simplify our data, but it will also help us understand the relationships between them and better communicate our results.
Storytelling: You have raised the questions and you get some results with some very good tables and tools (very, very good use of everything pandas gives us), but I missed a bit of storytelling in the jupyter, something that ties it all together a little more. In short, tell me a story from the beginning: why you ask yourself that question, how you plan to solve it, what your conclusions are, and what decision you would take based on all the data you have extracted.
All in all, good work Elissa, in this project you have reinforced the knowledge acquired so far in the bootcamp. You have explored the data, familiarised yourself with it, cleaned it and created a machine learning model with very good results. Congratulations 🔥!
https://github.com/Elissadejong/P1_FIFA_money_ball.git