DE-2410-A / web-scraping-vedi-hamza

de-2410-a-challenges-web-scraping-web-scraping-activity created by GitHub Classroom

Digital Futures

Python For Extracting

Introduction

This is a practical activity in which you create a set of Python scripts to perform web scraping. Data will be scraped from Books to Scrape, a website intended for that purpose. The data scraped includes each book's title, price, rating, and availability. Once the data has been scraped, the activities show how to extract it from the HTML, ready for cleansing and transformation.
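As a rough illustration of the extraction step, the sketch below pulls the four fields out of a single product card. It uses only the standard-library `html.parser` so that it runs without extra packages; the activity itself may use a different library. The sample markup, the class names in it, and the names `BookParser` and `extract_books` are assumptions made for this example and should be checked against the live page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modelled on one Books to Scrape product card.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability"><i class="icon-ok"></i> In stock</p>
  </div>
</article>
"""


class BookParser(HTMLParser):
    """Collect title, price, rating and availability for each product card."""

    def __init__(self):
        super().__init__()
        self.books = []        # one dict per <article class="product_pod">
        self._current = None   # the book being built while inside an article
        self._field = None     # the text field the next data chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and "product_pod" in classes:
            self._current = dict.fromkeys(
                ["title", "price", "rating", "availability"]
            )
        elif self._current is None:
            return
        elif tag == "p" and "star-rating" in classes:
            # The rating is stored as the second class name, e.g. "Three".
            self._current["rating"] = next(
                (c for c in classes if c != "star-rating"), None
            )
        elif tag == "a" and "title" in attrs:
            # The title attribute holds the full, untruncated book title.
            self._current["title"] = attrs["title"]
        elif tag == "p" and "price_color" in classes:
            self._field = "price"
        elif tag == "p" and "availability" in classes:
            self._field = "availability"

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and self._field and text:
            self._current[self._field] = text

    def handle_endtag(self, tag):
        if tag == "p":
            self._field = None
        elif tag == "article" and self._current is not None:
            self.books.append(self._current)
            self._current = None


def extract_books(html):
    """Return a list of dicts, one per product card found in the HTML."""
    parser = BookParser()
    parser.feed(html)
    parser.close()
    return parser.books
```

Calling `extract_books(SAMPLE_HTML)` returns a list with one dictionary per book. The price is still a string with its currency symbol at this point, which is the kind of value the later cleansing step would convert.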

The activities cover the production Python scripts required and the unit tests that ensure all scripts are reliable and working correctly.


Webscraping with Python - Learner Stories

In completing this activity, you will be working on the following user stories from the Data Engineering SKU backlog:

As a DATA PROFESSIONAL,  
I want to be able to SCRAPE data from websites,  
so that I can COLLECT data from them ready for TRANSFORMATION and INTEGRATION into my DATA PIPELINE

As a DATA PROFESSIONAL,  
I want to be able to write UNIT TESTS using the unittest and/or pytest module,  
so that I can ENSURE my CODE is RELIABLE and WORKS correctly
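To give a sense of what such unit tests can look like, here is a minimal sketch using the `unittest` module. The function `clean_price` is a hypothetical helper invented for this example, not part of the activity's scripts; it stands in for any small transformation you might want to test.

```python
import unittest


def clean_price(raw):
    """Hypothetical helper: turn a scraped price such as '£51.77' into a float."""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    if not digits:
        raise ValueError(f"no numeric price found in {raw!r}")
    return float(digits)


class TestCleanPrice(unittest.TestCase):
    def test_strips_currency_symbol(self):
        self.assertAlmostEqual(clean_price("£51.77"), 51.77)

    def test_ignores_surrounding_whitespace(self):
        self.assertAlmostEqual(clean_price("  £9.99\n"), 9.99)

    def test_rejects_text_without_a_number(self):
        with self.assertRaises(ValueError):
            clean_price("In stock")
```

Saved in a file whose name starts with `test_`, this can be run with `python -m unittest`. pytest also discovers `unittest.TestCase` classes, so the same file works with either runner.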

Definitions of Done

Webscraping


Unit Tests


What is Web Scraping? - IMPORTANT

Read this information page on Web Scraping before continuing with the activity.

Here are some further resources to help you understand the concepts of web scraping:


Getting Started

  1. After accepting the assignment from GitHub Classroom, both you and your partner should clone the repository to your local machines.

  2. One of the pair should then do the following locally:

# Create and switch to a new branch
# Replace <YOUR-BRANCH-NAME> with anything suitable, e.g. webscraping gives the branch feature/webscraping
git checkout -b feature/<YOUR-BRANCH-NAME>

# Create a new Python environment
python3 -m venv .venv

# Activate the Python environment
source .venv/bin/activate

# Install the required Python packages
pip install ipykernel

# Create a requirements.txt file
pip freeze > requirements.txt

# Commit the setup and push the new branch
git add .
git commit -m "CHORE: set up Python environment"
git push -u origin feature/<YOUR-BRANCH-NAME>

  3. The other person should then pull the new branch to their local machine and switch to it:

# Move to the root of your local clone
cd /path/to/your/project/root

# Fetch the new branch and switch to it
git pull
git switch feature/<YOUR-BRANCH-NAME>

# Create a new Python environment
python3 -m venv .venv

# Activate the Python environment
source .venv/bin/activate

# Install the packages listed in requirements.txt
pip install -r requirements.txt

  4. Both of you should then ensure that you select the correct Python interpreter for the Jupyter Notebook:
    • In VSCode, set the Kernel from the top right of the Notebook window and choose the .venv environment
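
One way to confirm that the notebook is running on the `.venv` interpreter is to run a short check in the first cell. This is a sketch rather than a required step; `in_virtualenv` is a name chosen for this example.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

If the kernel is set correctly, the interpreter path should sit inside the project's `.venv` directory.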

|---> Next ---> Understanding the Problem