Task 2: Adding Dependencies and Web Scraping

Objective:

In this task, you will initialize a new Spring Boot project, add the JSoup dependency, and write a simple Java program to scrape data from a website of your choice. The goal is to get hands-on experience with setting up a Spring Boot project, using an external library (JSoup), and applying web scraping techniques.

Instructions:

Initialize a New Spring Boot Project:
- Visit [Spring Initializr](https://start.spring.io/) and generate a new Spring Boot project with the following settings:
  - Project: Maven
  - Language: Java
  - Spring Boot Version: 3.x (the latest stable version)
  - Project Metadata:
    - Group: com.yourname
    - Artifact: web-scraper
    - Name: Web Scraper
    - Package Name: com.yourname.webscraper
  - Dependencies: Add the Spring Web dependency (to allow adding more features later).
- Click on "Generate" to download the project as a ZIP file.
- Unzip the downloaded file and open the project in IntelliJ IDEA.
Add the JSoup Dependency:
- Open the pom.xml file in the root directory of your project.
- Add the JSoup dependency (groupId org.jsoup, artifactId jsoup, plus the version you want to use) within the <dependencies> tag.
- Save the pom.xml file and allow IntelliJ to update the Maven project to download the JSoup library.
Choose a Website for Scraping:
- Select a website that you find interesting or relevant. It could be an e-commerce site, a news website, a blog, or any other public web page with data you'd like to extract.
- Identify the specific data you want to scrape from the website (e.g., product names, prices, article titles, etc.).
Implement a CommandLineRunner Class:
- Create a new Java class in the com.yourname.webscraper package that implements the CommandLineRunner interface.
- In the run method, use JSoup to connect to the website you chose and scrape the data.
- Print the scraped data to the console.
Run Your Application:
- Run the Spring Boot application and observe the output in the console.
- Ensure that the scraped data is displayed correctly.

Submit Your Work:

Once you’ve completed the task, submit the following:
- A brief description of the website you chose and what data you scraped.
- The Java code you wrote for the CommandLineRunner.
- A screenshot of the console output showing the scraped data.

Resources that can help:

- Related discussion: https://github.com/dgPadBootcamps/Java-Bootcamp-2024/discussions/64
- What web scraping is: https://www.parsehub.com/blog/what-is-web-scraping/
- What JSoup is: https://jsoup.org/

Movies Data Scraping

My application extracts movie data from IMDb and displays each movie's title, year, duration, age rating, star rating, image URL, and link to the movie page. Below are the files behind this application.

Movie.java

package com.hodroj.webscraper;

public class Movie {
    private String title;
    private String year;
    private String duration;
    private String ageRating;
    private String starRating;
    private String imgUrl;
    private String movieUrl;

    public Movie() {
    }

    public Movie(String title, String year, String duration, String ageRating, String starRating, String imgUrl, String movieUrl) {
        this.title = title;
        this.year = year;
        this.duration = duration;
        this.ageRating = ageRating;
        this.starRating = starRating;
        this.imgUrl = imgUrl;
        this.movieUrl = movieUrl;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getYear() {
        return year;
    }

    public void setYear(String year) {
        this.year = year;
    }

    public String getDuration() {
        return duration;
    }

    public void setDuration(String duration) {
        this.duration = duration;
    }

    public String getAgeRating() {
        return ageRating;
    }

    public void setAgeRating(String ageRating) {
        this.ageRating = ageRating;
    }

    public String getStarRating() {
        return starRating;
    }

    public void setStarRating(String starRating) {
        this.starRating = starRating;
    }

    public String getImgUrl() {
        return imgUrl;
    }

    public void setImgUrl(String imgUrl) {
        this.imgUrl = imgUrl;
    }

    public String getMovieUrl() {
        return movieUrl;
    }

    public void setMovieUrl(String movieUrl) {
        this.movieUrl = movieUrl;
    }

    @Override
    public String toString() {
        return "Title: " + title + ", Year: " + year + ", Duration: " + duration + ", Age Rating: " + ageRating + ", Rating: " + starRating + ", Poster: " + imgUrl + ", Link: " + movieUrl;
    }
}

ScrapingService.java

package com.hodroj.webscraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ScrapingService {

    /** Fetches the page at the given URL and returns one Movie per list entry found. */
    public static List<Movie> scrapedMovies(String url) throws IOException {
        List<Movie> movieList = new ArrayList<>();
        Document doc = Jsoup.connect(url).get();
        Elements movies = doc.select("li.cli-parent");

        for (Element movieElement : movies) {
            String title = movieElement.select("h3.ipc-title__text").text();

            // The metadata spans are read by position: year, duration, age rating.
            // An entry may lack some of them, so guard each index instead of
            // letting get(i) throw IndexOutOfBoundsException.
            Elements metadata = movieElement.select("div.cli-title-metadata span");
            String year = metadata.size() > 0 ? metadata.get(0).text() : "";
            String duration = metadata.size() > 1 ? metadata.get(1).text() : "";
            String ageRating = metadata.size() > 2 ? metadata.get(2).text() : "";

            String starRating = movieElement.select("span.ipc-rating-star--rating").text();

            String imgUrl = movieElement.select("div.ipc-media img").attr("src");

            // "abs:href" asks JSoup to resolve the link against the page URL, which
            // avoids the double slash produced by "https://www.imdb.com/" + "/title/...".
            String movieUrl = movieElement.select("a.ipc-lockup-overlay").attr("abs:href");

            movieList.add(new Movie(title, year, duration, ageRating, starRating, imgUrl, movieUrl));
        }
        return movieList;
    }
}
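
Two steps in the scraping loop are easy to get wrong: reading a value by position when the page may not contain it, and turning a relative href into an absolute URL. The following is a minimal sketch of both using only the Java standard library, so it can be tried without JSoup or a network connection. The class name `ScrapeHelpers`, its method names, and the sample values are hypothetical and not part of the submission above.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical helpers for illustration; not part of the original submission. */
final class ScrapeHelpers {

    private ScrapeHelpers() {
    }

    /** Returns the value at index, or the fallback when the list is too short. */
    static String textAt(List<String> values, int index, String fallback) {
        return (index >= 0 && index < values.size()) ? values.get(index) : fallback;
    }

    /** Resolves an href (relative or absolute) against the URL of the page it came from. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Made-up sample: an entry that lists a year and a duration but no age rating.
        List<String> metadata = List.of("1994", "2h 22m");
        System.out.println(textAt(metadata, 2, "N/A"));

        // A site-relative link resolved against the site root.
        System.out.println(resolve("https://www.imdb.com/", "/title/tt0111161/"));
    }
}
```

Running `main` prints `N/A` followed by `https://www.imdb.com/title/tt0111161/`. Note that `URI.create` throws `IllegalArgumentException` for strings that are not valid URIs (for example, ones containing spaces), so real page data may need extra handling. When the elements come from a JSoup `Document` fetched with `Jsoup.connect`, the `abs:` attribute prefix or `absUrl` performs the same resolution.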

ScrapingApplication.java

package com.hodroj.webscraper;

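import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.util.List;
import java.util.Scanner;

@Component
public class ScrapingApplication implements CommandLineRunner {

    @Override
    public void run(String... args) throws Exception {
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.println("Enter a link to scrape, or type exit:");
            // Stop cleanly if standard input is closed and there is nothing left to read.
            if (!scanner.hasNextLine()) {
                break;
            }
            String link = scanner.nextLine().trim();
            if (link.equalsIgnoreCase("exit")) {
                break;
            }

            try {
                // scrapedMovies is static, so call it on the class rather than on an instance.
                List<Movie> movieList = ScrapingService.scrapedMovies(link);
                for (Movie movie : movieList) {
                    System.out.println(movie);
                }
            } catch (IOException | IllegalArgumentException e) {
                // Keep prompting after a malformed URL or a failed request.
                System.out.println("Could not scrape " + link + ": " + e.getMessage());
            }
        }
    }
}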
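
The prompt loop in the runner is tied to `System.in`, which makes it awkward to try out without typing into a console. As a sketch of one possible alternative, the loop can be written against a `Scanner` passed in as a parameter together with a callback, so the same logic runs on a fixed string. The class `LinkPrompt` and the method `readLinks` are hypothetical names introduced here; in the application the callback would be the scraping call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

/** Hypothetical illustration of the prompt loop; not part of the original submission. */
final class LinkPrompt {

    private LinkPrompt() {
    }

    /**
     * Passes each non-blank line to the handler until "exit" is read or the
     * input ends, and returns how many lines were handled.
     */
    static int readLinks(Scanner scanner, Consumer<String> handler) {
        int handled = 0;
        while (scanner.hasNextLine()) {
            String link = scanner.nextLine().trim();
            if (link.equalsIgnoreCase("exit")) {
                break;
            }
            if (link.isEmpty()) {
                continue; // ignore blank lines instead of trying to scrape them
            }
            handler.accept(link);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        // Simulated console input: one link, a blank line, then EXIT; the last line is never read.
        String input = "https://www.imdb.com/chart/top/\n\nEXIT\nignored\n";
        List<String> seen = new ArrayList<>();
        int count = readLinks(new Scanner(input), seen::add);
        System.out.println(count + " link(s): " + seen);
    }
}
```

Running `main` prints `1 link(s): [https://www.imdb.com/chart/top/]`. In the runner, the same method could be called with `new Scanner(System.in)` and a handler that scrapes and prints the movies.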

Output

Any IMDb page that contains a list of movies should work. I used the following link as a test: https://www.imdb.com/chart/top/?sort=rank%2Casc
