W2 Project - Data cleaning & wrangling
The goal of this project is to combine everything you have learned about data wrangling, cleaning, and manipulation with Pandas so you can see how it all works together. You will start with the messy Shark Attack data set: download it, import it, clean it up with your data wrangling skills, prepare it for analysis, and export it as a clean CSV file. A few graphs that help you understand the data will be useful too!
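As a rough sketch of the import and export steps, the round trip could look like the snippet below. It uses a tiny in-memory CSV so it runs on its own; the column names and values are made up for illustration and are not guaranteed to match the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file; real columns may differ.
raw_csv = io.StringIO(
    "Case Number,Country,Activity\n"
    "2018.06.25,USA,Paddling\n"
    "2018.06.18,AUSTRALIA,Surfing\n"
)

# With the real file you would pass its path to pd.read_csv instead,
# possibly with an explicit encoding= argument if the default fails.
df = pd.read_csv(raw_csv)

# ... your cleaning steps go here ...

# Export the cleaned frame; index=False keeps the row index out of the CSV.
clean_buffer = io.StringIO()
df.to_csv(clean_buffer, index=False)
clean_text = clean_buffer.getvalue()
```

In the project itself you would write to a file path instead of a buffer, and keep that file out of the repo.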
TO DO's
- Decide on research question (or research questions)
- Explore the data and write down what you have found
- you can use: df.describe(), df["column"], etc.
- Draw graphs that are insightful.
- Use at least 5 data cleaning techniques inside a file named clean.ipynb
- for example: handling null values, dropping columns, removing duplicated data, string manipulation, applying a function, categorizing, regex, etc.
- Show data that validates the conclusions based on your research questions in a file named analysis.ipynb
- Build a compelling story around your findings. Think of your stakeholders and convince them with your conclusions! (A few slides with little text and clear plots are usually helpful.)
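To make the list of cleaning techniques more concrete, here is a minimal sketch on a small made-up DataFrame. The column names, the values, and the age cut-off of 18 are all assumptions for illustration; the real data set will need its own decisions.

```python
import numpy as np
import pandas as pd

# Hypothetical messy sample; the real data set has other columns and values.
df = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", "Reported 18-Jun-2018", "25-Jun-2018", None],
        "Country ": ["usa", " AUSTRALIA", "usa", None],
        "Age": ["57", "teen", "57", None],
        "Unnamed: 22": [np.nan, np.nan, np.nan, np.nan],
    }
)

# 1. String manipulation on the column names.
df.columns = df.columns.str.strip().str.lower()

# 2. Drop columns that are entirely null.
df = df.dropna(axis="columns", how="all")

# 3. Drop rows where every remaining value is null.
df = df.dropna(axis="index", how="all")

# 4. Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 5. Regex: pull a four-digit year out of free-text dates.
df["year"] = pd.to_numeric(
    df["date"].str.extract(r"(\d{4})", expand=False), errors="coerce"
)

# 6. String cleanup on the values themselves.
df["country"] = df["country"].str.strip().str.upper()

# 7. Coerce to numeric; anything that is not a number becomes NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")


def age_group(age):
    """Return a coarse age label; the cut-off of 18 is an arbitrary example."""
    if pd.isna(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


# 8. Categorize by applying a function to a column.
df["age_group"] = df["age"].apply(age_group)
```

Each numbered step is independent, so you can reorder or drop steps depending on what your exploration turns up.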
Bonus (but...bonus?)
- Encapsulate your code into functions and save them into .py files: make sure you have docstrings
- Import those functions into your Jupyter notebooks and call them (you will replace your inline code with calls to your own functions)
- Work on titles and comments to have a well presented and cohesive story in your notebook
- Include a slide-based presentation where you present your findings/conclusions/insights.
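As one possible shape for such a function, the sketch below shows what a small helper with a docstring might look like. The function name standardize_columns is hypothetical, and the module name src.py is taken from the delivery notes further down.

```python
# Contents of a hypothetical module, e.g. src.py
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of ``df`` with tidy column names.

    Column names are stripped of surrounding whitespace, lower-cased,
    and runs of inner whitespace are replaced with a single underscore.
    The input DataFrame is not modified.
    """
    result = df.copy()
    result.columns = (
        result.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return result


# In the notebook, the call would look something like:
#   from src import standardize_columns
#   df = standardize_columns(df)
example = standardize_columns(
    pd.DataFrame({" Case Number": [1], "Fatal (Y/N) ": ["N"]})
)
```

Returning a copy instead of modifying the input keeps notebook cells safe to re-run in any order.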
Suggested Ways to Get Started
- Examine the data and try to understand what the fields mean before diving into data cleaning and manipulation methods.
- Break the project down into different steps - use the topics covered in the lessons to form a check list, add anything else you can think of that may be wrong with your data set, and then work through the check list.
- Use the tools in your tool kit - your knowledge of Python, data structures, Pandas, and data wrangling.
- Work through the lessons in class & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... procrastinating.
- Commit early and commit often. Don’t be afraid of doing something incorrectly, because you can always roll back to a previous version.
- Consult documentation and resources provided to better understand the tools you are using and how to accomplish what you want.
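For the first suggestion, examining the data before cleaning it, a quick per-column overview can help you build your check list. The snippet below is only a sketch on made-up data; the column names Type and Age are assumptions.

```python
import pandas as pd

# Hypothetical sample standing in for the real data set.
df = pd.DataFrame(
    {
        "Type": ["Unprovoked", "Provoked", "Unprovoked", None],
        "Age": ["57", "11", None, "teen"],
    }
)

# One row per column: data type, number of missing values, distinct values.
summary = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    }
)

# Frequency of each category, keeping missing values visible.
type_counts = df["Type"].value_counts(dropna=False)

# Number of fully duplicated rows.
duplicate_rows = int(df.duplicated().sum())
```

Columns with many missing values, unexpected dtypes, or suspiciously many distinct values are good candidates for your check list.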
How to deliver the project
- Create a new repo with the name data-cleaning-pandas on your GitHub account (or another name)
- Create a README.md file in the repo root with the project documentation. Make sure to include as much useful information as possible. Someone who finds the README.md should be able to get the gist of the project without browsing your files.
- Include a .gitignore
- At least 1 jupyter notebook is required
- Including your functions in a src.py is very, very highly recommended (maybe even mandatory, check with your instructors)
- DO NOT UPLOAD THE SHARK ATTACK DATASET TO GITHUB
- Make sure that your README.md is as detailed as possible. The goal is that everyone (knowledgeable on the topic or not) can understand it.
- Open an Issue on this repo and paste your own repo's link.
Links & Resources