The goal of this project is for you to practice what you have learned so far in this program. For this project, you will start with a dataset of your choice. You will need to import it and use your newly acquired skills to build a data pipeline that processes the data and produces a result. You should demonstrate your proficiency with the tools we covered (functions, classes, list comprehensions, string operations, pandas, error handling, etc.).
You will be working individually on this project, but we'll be guiding you through the process and helping you as you go. Show us what you've got!
A data pipeline is a series of data processes in which the output of each one is the input of the next, forming a chain. As a BONUS step, you should automate the whole process: extraction, transformation, merging and visualizing. You can do this by modularizing your code and running a `main.py`, as sketched below.
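A minimal sketch of what that `main.py` orchestration could look like — every module and function name here (`cleaning`, `api`, `visualizing`, `download_and_clean`, ...) is a hypothetical placeholder for your own modules:

```python
# main.py -- a sketch of the bonus automation step.
# All module/function names below are placeholders; swap in your own.
import cleaning
import api
import visualizing

def main():
    df = cleaning.download_and_clean("my-dataset.csv")  # extraction + cleaning
    df = api.enrich(df)                                 # transformation / merging
    visualizing.plot_report(df)                         # visualization

if __name__ == "__main__":
    main()
```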
The technical requirements for this project are as follows:
A) Find a dataset to start your work! Great places to start looking are Awesome Public Data Sets and Kaggle Data Sets.
B) Clean and wrangle your dataset; prepare the data for your needs and intentions.
C) Enrich the dataset with external data. You have to choose at least one of the following:

- API calls
- Web scraping with the `beautifulsoup` module

D) The data you bring in to enrich the dataset must be related to it and complement it! Figure out how the two fit together and how you prepare the data from both sources for your report. For example, you could add new columns to your cleaned dataset via API calls, or scrape a second, related dataset and join the two.
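If you go the scraping route, a minimal sketch with `requests` + `beautifulsoup` could look like the following. The URL and the CSS selector are placeholders, not a real page:

```python
# A scraping sketch using requests + BeautifulSoup.
# The URL and the CSS selector are placeholders -- point them
# at the page and elements relevant to your own dataset.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/restaurants")
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select("h2.name")]
print(names)
```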
E) Create some reports containing valuable data from the dataset + enrichment. Some of the things you may do are:

- Summary statistics (`mean`, `max`, `min`, `std`, etc.)
- Aggregations with `groupby()`
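For instance, a small pandas sketch of such a report — the columns (`country`, `price`, `rating`) are made up for illustration:

```python
# A report sketch: one row per group, several statistics per column.
# The columns are invented -- use your own dataset's columns.
import pandas as pd

df = pd.DataFrame({
    "country": ["ES", "ES", "FR", "FR"],
    "price":   [12.0, 20.0, 35.0, 28.0],
    "rating":  [4.2, 3.8, 4.9, 4.4],
})

report = df.groupby("country").agg(
    mean_price=("price", "mean"),
    max_price=("price", "max"),
    std_rating=("rating", "std"),
)
print(report)
```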
F) The finished report must be a very pretty jupyter notebook, with text, clean code, meaningful outputs, plots and charts. Try telling a story with your data; that is, guide us (the readers) through your findings and lead us to your conclusions.
G) Store your code in `.py` modules. Be not afraid to modulate 🎶

You will be working with both jupyter notebooks and python scripts. The goals of this project are:

1. Enriching your dataset and visualizing the results.
2. (BONUS) Automating the pipeline with modular, executable python scripts.
For this first goal, you can either make API calls on your cleaned dataset and add new columns to it, or you can do web scraping to generate a new dataset. Then, you'll have to plot graphs that show the relations within the dataset (downloaded and enriched with API calls) or between the two datasets (the downloaded one and the scraped one).
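As a sketch of the plotting part (again, `price` and `rating` are hypothetical columns standing in for your enriched data):

```python
# A plotting sketch with pandas' built-in matplotlib wrapper.
# The columns are hypothetical; plot whatever relation matters to you.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [12, 20, 35, 28], "rating": [4.2, 3.8, 4.9, 4.4]})

ax = df.plot.scatter(x="price", y="rating", title="Price vs. rating")
ax.figure.savefig("price_vs_rating.png")
```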
BONUS
E.g.: you tested your cleaning functions in your jupyter notebook. Now that they work, you move them to your cleaning.py file. Remember that you'll also have to call those functions for them to be executed:
```python
def sum(a, b):  # defining
    return a + b

sum(3, 4)  # calling
```
You should be able to run:

```
python3 cleaning.py
```

in your terminal, and it should prompt you to enter a dataset to download. Then the code within your file will download it, clean it and export it.
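A minimal sketch of such an executable (the prompt and the cleaning step are placeholders for your own logic):

```python
# cleaning.py -- a sketch of the executable script.
# The cleaning step is a placeholder; replace it with your functions.
import pandas as pd

def download_and_clean(url: str) -> pd.DataFrame:
    df = pd.read_csv(url)  # pandas can read a CSV straight from a URL
    return df.dropna()     # placeholder cleaning step

if __name__ == "__main__":
    url = input("Enter the URL of the dataset to download: ")
    df = download_and_clean(url)
    df.to_csv("clean_dataset.csv", index=False)
    print("Exported clean_dataset.csv")
```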
After that's done, the rest of your code (enrichment and visualization) can live in jupyter notebooks.
So, basically, your repo structure should look something like:
```
1-downloading-and-cleaning.py   # executable
2-enriching-and-cleaning.ipynb
3-visualizing.ipynb
```
However, even though the only executable file will be `cleaning.py`, that doesn't mean there are no other `.py` files. All of the functions that you use for enriching the dataset (API calls, web scraping, cleaning the second dataset, etc.) should also be stored in other `.py` files. E.g.:
```
4-api.py                                # not necessarily executable, but can be
5-scraping.py
6-other-functions-you-can-think-of.py
```
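For example, a `4-api.py` module could hold something like this sketch — the endpoint, the query parameter and the `name`/`population` columns are all hypothetical:

```python
# 4-api.py -- a sketch of an enrichment module. The endpoint URL and
# the JSON field are invented; adapt them to the API you actually pick.
import requests
import pandas as pd

def enrich_with_api(df: pd.DataFrame) -> pd.DataFrame:
    values = []
    for name in df["name"]:
        response = requests.get("https://api.example.com/info", params={"q": name})
        response.raise_for_status()
        values.append(response.json().get("population"))
    df["population"] = values
    return df
```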
We recommend that on the first day of the project kick-off, you find a theme to base your project on. You can start from the areas you like; here are some examples:
Then, within each area there are different topics, for example:
Within gastronomy, we can find topics such as the evolution of gastronomy in Europe, new trends and how they influence the business; or the best gastronomies in the world and what to consider before setting up a restaurant, etc…
Choose the data sources ASAP and try to stick to the plan. Don't switch datasets/APIs/websites halfway.
Examine the data.
Break the project down into different steps - a hundred simple tasks are better than a single complicated one
Use the tools in your tool kit - your knowledge of intermediate Python as well as some of the things you've learned in the bootcamp. This is a great way to start tying everything you've learned together!
Work through the lessons in class & ask questions when you need to!
Think about adding relevant code to your project each day, instead of, you know... procrastinating.
Commit early, commit often, don’t be afraid of doing something incorrectly because you can always roll back to a previous version. Name your commits well.
Consult documentation and resources provided to better understand the tools you are using and how to accomplish what you want. GIYF.
Have fun! Never give up! Be proud of your work!