
Project: Data Pipeline

Overview

The goal of this project is to practice what you have learned so far in this program. You will start with a dataset of your choice, import it, and use your newly acquired skills to build a data pipeline that processes the data and produces a result. Your pipeline should demonstrate proficiency with the tools we covered: functions, classes, list comprehensions, string operations, pandas, error handling, etc.

You will be working individually for this project, but we'll be guiding you throughout the process and helping you as you go. Show us what you've got!

What is a pipeline?

A data pipeline is a series of data processes in which the output of each one is the input of the next, forming a chain.
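To make the chaining concrete, here is a minimal, runnable sketch; the step names (extract, transform, load) are illustrative, not part of the brief:

```python
# Each step's output is the next step's input.
def extract():
    return [1, 2, None, 4]                # pretend this came from a CSV or an API

def transform(rows):
    return [r * 10 for r in rows if r is not None]   # drop nulls, rescale

def load(rows):
    print(rows)                           # in a real pipeline: write to disk or a DB

load(transform(extract()))                # the chain: extract -> transform -> load
```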


Technical Requirements

The technical requirements for this project are as follows:

To-Dos

Summing up

You will be working with both Jupyter notebooks and Python scripts. The goals of this project are:

  1. To enrich a given dataset, either using APIs or web scraping

For this first goal, you can either make API calls based on your cleaned dataset and add new columns to it, or you can scrape the web to generate a second dataset. Then you'll have to plot graphs that show the relations within the dataset (downloaded and enriched with API calls) or between the two datasets (the downloaded and the scraped one). A minimal sketch of the API-enrichment option follows.
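As a hedged illustration of the first option (the endpoint, the `city` and `population` names, and the JSON field below are placeholders, not a real API; swap in whichever API you actually use):

```python
import pandas as pd
import requests

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    """Add a column by calling an external API once per row."""
    def population(city):
        # Placeholder endpoint -- replace with the API you actually use.
        resp = requests.get(f"https://api.example.com/cities/{city}")
        if resp.status_code != 200:
            return None                   # keep the row; mark the value as missing
        return resp.json().get("population")

    df["population"] = df["city"].apply(population)
    return df
```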

  2. To create executable Python files.

E.g.: you tested your cleaning functions in your Jupyter notebook. Now that they work, you move them into your cleaning.py file. Remember that you'll also have to call those functions for them to be executed:

```python
def add(a, b):  # defining (named `add` rather than `sum`, which would shadow the built-in)
    return a + b

add(3, 4)  # calling
```

You should be able to run:

python3 cleaning.py

in your terminal so that it prompts you to enter a dataset to download. The code within your file will then download it, clean it, and export it, as in the sketch below.
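A minimal sketch of what cleaning.py could look like; the prompt text and helper names are assumptions, and pd.read_csv stands in for whatever download step fits your dataset:

```python
# cleaning.py -- illustrative sketch, not the required implementation
import pandas as pd

def download(url):
    return pd.read_csv(url)               # works for CSV URLs; adapt to your source

def clean(df):
    df = df.dropna()                      # your real cleaning logic goes here
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def export(df, path="cleaned.csv"):
    df.to_csv(path, index=False)

if __name__ == "__main__":                # runs when executed, not when imported
    url = input("Enter the URL of the dataset to download: ")
    export(clean(download(url)))
```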

After that's done, the rest of your code (enrichment and visualization) can live in Jupyter notebooks.

So, basically, your repo structure should look something like:

1-downloading-and-cleaning.py #executable
2-enriching-and-cleaning.ipynb
3-visualizing.ipynb

However, even though the only executable file will be cleaning.py, that doesn't mean there are no other .py files. All of the functions that you use for enriching the dataset (API calls, web scraping, cleaning the second dataset, etc.) should also be stored in separate .py files, e.g.:

4-api.py #not necessarily executable but can be
5-scrapping.py
6-other-functions-you-can-think-of.py
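One caveat if you plan to import those helpers into your notebooks: Python module names must be valid identifiers, so files named like 4-api.py can't be imported directly. A sketch of what the notebook side could look like, assuming you rename the helpers to importable names such as api.py and scrapping.py (the function names are hypothetical):

```python
from api import enrich
from scrapping import scrape_table
```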

Super Ultra Mega Blaster Tips

Useful Resources