Alhajras / webscraper

Configurable search engine written in Python and Angular. It supports indexing as well.

Guide #30




Scriburg Search Engine

Master thesis project at the University of Freiburg
Read thesis »

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contact

About The Project

If you are looking for a free, scalable tool to collect and index information from a specific set of domains on the Internet, Scriburg is the right tool for you.

Examples of valid use cases are walked through in the Usage section below.

Screenshots

Crawlers Dashboard · Crawler Configuration · Indexers Dashboard · Search Result

(back to top)

Built With

Django · Angular · Docker · Postgres

(back to top)

Getting Started

To simplify the installation process, Docker Compose is used and recommended.

Installation

Installing Docker Compose (recommended)

The supported Docker Compose version is v2.16.0; run `docker compose version` to print your local version.

If you do not have Docker Compose, install it by following the official Docker Compose documentation.
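
As a quick start, and assuming the repository provides a docker-compose.yml at its root (an assumption; check the repo if in doubt), a typical first run looks like this:

```
# Print your local Compose version (v2.16.0 is the supported version)
docker compose version

# From the repository root: build the images and start all services in the background
docker compose up --build -d

# Follow the service logs; stop and remove everything when done
docker compose logs -f
docker compose down
```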

(back to top)

Usage

Use Case - 1: World University Rankings 2023

You are a university professor who would like to maintain a local, searchable list of universities worldwide and their rankings. To do this, we will use the timeshighereducation.com website to download and index the information. The Times Higher Education World University Rankings 2023 include 1,799 universities across 104 countries and regions, making them the largest and most diverse university rankings to date.

This is a screenshot of the table we want to extract information from:

![image](https://github.com/Alhajras/webscraper/assets/36598060/47decc10-d491-4ff8-b1dc-4ec156f30d18)

**Goal:** We would like to extract the following fields: **University ranking, University name** and **University location**.

To do so, we will follow the next steps:

------------------------

## 1 - Templates

![image](https://github.com/Alhajras/webscraper/assets/36598060/ce1b5e88-b483-4232-a020-8c8fd5bcff6c)

We start by creating a _Template_, which is the blueprint defining the fields to be downloaded as a document.

### Steps:

- Go to the Templates page
- Click on the `Create a template` button
- Name the template `{yourname}-uni-ranking` and click on save.
- Expand the template you created.
- Now we want to create the fields we want to capture from the page: `Uni name`, `Uni location` and `Uni ranking`.
- Click on the `Create an inspector` button and create the following inspectors (a quick way to sanity-check these selectors locally is sketched after step 4 below):

```
Name: Uni name {yourname}
Selector: //*[contains(@class, 'ranking-institution-title')]
Type: text

Name: Uni location {yourname}
Selector: //*[contains(concat(' ', normalize-space(@class), ' '), ' location ')]
Type: text

Name: Uni ranking {yourname}
Selector: //*[contains(@class, 'rank') and contains(@class, 'sorting_1') and contains(@class, 'sorting_2')]
Type: text
```

This is how your list of inspectors should look:

![image](https://github.com/Alhajras/webscraper/assets/36598060/7f5127fc-95a0-4d2e-ad4e-9dbc6e2e7c8e)

------------------------

## 2 - Crawlers

After creating a _Template_, we want to create and configure a _Crawler_.

### Steps:

- Navigate to the Crawlers page
- Click on `Create a crawler` and expand `Advanced options`
- Fill in the following values:

```
Name: {yourname}-uni-crawler
Template: {yourname}-uni-ranking
Max pages: 10000
Max collected docs: 300000
Seed URL: https://www.timeshighereducation.com/world-university-rankings/2023/world-ranking
Allow multi elements crawling: Enable
Links Scope (Pagination): //*[contains(@class, 'pagination')]
Threads: 4
Max depth: 10000
```

This is how it should look:

![image](https://github.com/Alhajras/webscraper/assets/36598060/45c8f3e0-b5a0-4860-bf73-024c56367cdf)

- Click on the `Create` button

------------------------

## 3 - Runners

Runners are jobs that run the crawling process in a cluster or locally. After creating the _Crawler_, we create a _Runner_. We can then run or stop the crawlers from the Runners page.

### Steps:

- Navigate to the Runners page
- Click on `Create a Runner`
- Fill in the following:

```
Name: {yourname}-uni-runner
Crawler: {yourname}-uni-crawler
Machine: localhost
```

- Click on Create

Find your runner in the list, click on the burger menu, and click on `Start` to start crawling.

![image](https://github.com/Alhajras/webscraper/assets/36598060/df26e6f5-3b3c-45de-988d-f0fb83ee76ad)

The list keeps refreshing on its own, so you do not have to reload the page. You can monitor the progress by looking at the `progress` and `status` columns. You can see the log and statistics by clicking on:

![image](https://github.com/Alhajras/webscraper/assets/36598060/ee956f61-817a-46f5-ab09-07e50eff5e26)

------------------------

## 4 - Indexing

After the runner has completed, we can start indexing the results.

### Steps:

- Navigate to Indexers
- Click on `Create an indexer`
- Fill in the following:

```
Name: {yourname}-uni-indexer
Inspectors: {yourname}-uni-crawler (Uni name {yourname})
```

- Click on Create.
- Find your indexer in the list and click on `Start indexing`.
- Watch the indexing go from status `New` to `Completed`.
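
Before moving on to searching, here is the selector sanity check mentioned in step 1. This is a standalone sketch, not part of Scriburg: it fetches the seed URL with `requests` and applies the three inspector XPaths with `lxml`. Note that the rankings table on this site is largely rendered by JavaScript, so a plain HTTP fetch may match few or no rows; the sketch only shows what the selectors target.

```
import requests
from lxml import html

# Seed URL from the crawler configuration in step 2
URL = ("https://www.timeshighereducation.com/"
       "world-university-rankings/2023/world-ranking")

# The three inspector selectors from step 1
SELECTORS = {
    "Uni name": "//*[contains(@class, 'ranking-institution-title')]",
    "Uni location": "//*[contains(concat(' ', normalize-space(@class), ' '), ' location ')]",
    "Uni ranking": ("//*[contains(@class, 'rank') and contains(@class, 'sorting_1')"
                    " and contains(@class, 'sorting_2')]"),
}

response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
tree = html.fromstring(response.content)

# Report how many nodes each selector matches and a few sample values
for field, selector in SELECTORS.items():
    nodes = tree.xpath(selector)
    sample = [node.text_content().strip() for node in nodes[:3]]
    print(f"{field}: {len(nodes)} matches, e.g. {sample}")
```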
------------------------

## 5 - Searching

After crawling (collecting data) and indexing (preparing it for searching), we can test whether searching returns the right results.

### Steps:

- Navigate to Search
- Select your indexer
- Search for:
  - `university` (covering a normal query case)
  - `what is freiburg` (covering a case where one word should matter more than the others)
  - `show me hamburg unis` (a long query)
  - `berlin` (covering a normal query case)
  - `Humboldt Berlin` (covering a normal query case)
  - `Electronic` (covering a normal query case)
  - `college` (covering a normal query case)
- Test the suggestions list:
  - Enter `Universi`; it should correctly suggest `university`
  - Enter `univsrity`; the misspelling should be forgiven, and the result should be `university`
  - Enter `university oxford`; it should show results including `university of oxford`
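
The suggestion tests above expect misspellings to be forgiven. As a rough standalone illustration of this kind of fuzzy matching (not necessarily how Scriburg implements its suggestions), Python's standard `difflib` reproduces the expected behavior on a toy vocabulary:

```
import difflib

# A toy vocabulary, standing in for terms extracted during indexing
vocabulary = ["university", "universities", "ranking", "freiburg", "oxford"]

def suggest(query: str) -> list[str]:
    """Return the vocabulary terms closest to a possibly misspelled query."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=3, cutoff=0.7)

print(suggest("univsrity"))  # ['university']: the misspelling is forgiven
print(suggest("Universi"))   # ['university', 'universities']
```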

Use Case - 2: Comparing Product Prices

You are a small business owner who would like to monitor and track competitors. You can create more than one crawler to monitor different websites; for this use case, we will focus on Douglas.

This is a screenshot of the products we want to extract information from:

![image](https://github.com/Alhajras/webscraper/assets/36598060/489d206a-e444-40dc-a007-280fd938c03d)

**Goal:** We would like to extract the following fields: **Brand, Image, Name** and **Price**.

To do so, we will follow the next steps:

------------------------

## 1 - Templates

![image](https://github.com/Alhajras/webscraper/assets/36598060/ce1b5e88-b483-4232-a020-8c8fd5bcff6c)

We start by creating a _Template_, which is the blueprint defining the fields to be downloaded as a document.

### Steps:

- Go to the Templates page
- Click on the `Create a template` button
- Name the template `{yourname}-douglas` and click on save.
- Expand the template you created.
- Now we want to create the fields we want to capture from the page: `Product brand`, `Product image`, `Product name` and `Product price`.
- Click on the `Create an inspector` button and create the following inspectors (the class-matching idiom used by the price selector is explained after this walkthrough):

```
Name: product-name-{yourname}
Selector: //*[contains(@class, 'text')][contains(@class, 'name')]
Type: text

Name: product-brand-{yourname}
Selector: //*[contains(@class, 'top-brand')]
Type: text

Name: product-image-{yourname}
Selector: //a[contains(@class, 'product-tile__main-link')]/div[1]/div/img
Type: image

Name: product-price-{yourname}
Selector: //div[contains(concat(' ', normalize-space(@class), ' '), ' price-row ')]
Type: text
```

This is how your list of inspectors should look (with your own names):

![image](https://github.com/Alhajras/webscraper/assets/36598060/1be60cba-5e65-40e5-a363-8cadc3a6d512)

------------------------

## 2 - Crawlers

After creating a _Template_, we want to create and configure a _Crawler_.

### Steps:

- Navigate to the Crawlers page
- Click on `Create a crawler` and expand `Advanced options`
- Fill in the following values (Links Scope is a list field: press Enter after each entry):

```
Name: {yourname}-douglas
Template: {yourname}-douglas
Max pages: 20000
Max collected docs: 200000
Seed URL: https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
Allow multi elements crawling: Enable
Links Scope (Pagination):
  - //*[contains(@class, 'pagination')]
  - //*[contains(@class, 'left-content-slot')]
  - //*[contains(@class, 'navigation-main__container')]
  - //*[contains(@class, 'header')]
Threads: 4
Max depth: 100
```

This is how it should look:

![image](https://github.com/Alhajras/webscraper/assets/36598060/1c6aeaf6-704b-4dc2-8999-31ac8b4ab718)

- Click on the `Create` button

------------------------

## 3 - Runners

Runners are jobs that run the crawling process in a cluster or locally. After creating the _Crawler_, we create a _Runner_. We can then run or stop the crawlers from the Runners page.

### Steps:

- Navigate to the Runners page
- Click on `Create a Runner`
- Fill in the following:

```
Name: {yourname}-douglas
Crawler: {yourname}-douglas
Machine: localhost
```

- Click on Create

Find your runner in the list, click on the burger menu, and click on `Start` to start crawling.

![image](https://github.com/Alhajras/webscraper/assets/36598060/df26e6f5-3b3c-45de-988d-f0fb83ee76ad)

The list keeps refreshing on its own, so you do not have to reload the page. You can monitor the progress by looking at the `progress` and `status` columns.
You can see the log and statistics by clicking on:

![image](https://github.com/Alhajras/webscraper/assets/36598060/ee956f61-817a-46f5-ab09-07e50eff5e26)

------------------------

## 4 - Indexing

After the runner has completed, we can start indexing the results.

### Steps:

- Navigate to Indexers
- Click on `Create an indexer`
- Fill in the following:

```
Name: {yourname}-douglas
Inspectors: product-name-{yourname} ({yourname}-douglas)
```

- Click on Create.
- Find your indexer in the list and click on `Start indexing`.
- Watch the indexing go from status `New` to `Completed`.

------------------------

## 5 - Searching

After crawling (collecting data) and indexing (preparing it for searching), we can test whether searching returns the right results.

### Steps:

- Navigate to Search
- Select your indexer
- Search for:
  - `set` (covering a short query)
  - `water` (covering a short query)
  - `Micellar Water` (covering an exact product name)
  - `black` (covering a normal query)
  - `black in` (covering a case where the tokens are in the wrong order)
  - `set spring dadadadadad` (covering a random word)
- Test the suggestions list:
  - Enter `Micellar`; it should correctly suggest terms from the indexed product names, such as `micellar water`
  - Enter `Micelar`; the misspelling should be forgiven, and `micellar` results should still appear
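
As promised in step 1, a note on the selector idiom: a plain `contains(@class, 'price-row')` matches any class attribute that merely contains that substring, while wrapping the attribute as `concat(' ', normalize-space(@class), ' ')` and searching for `' price-row '` matches the whole class name only. A minimal, self-contained check using `lxml` (the second class name is made up for illustration):

```
from lxml import html

# Two elements: only the first carries the exact class 'price-row'
doc = html.fromstring(
    '<div><span class="price-row">9,99</span>'
    '<span class="price-rows-legacy">0,00</span></div>'
)

naive = "//*[contains(@class, 'price-row')]"
exact = "//*[contains(concat(' ', normalize-space(@class), ' '), ' price-row ')]"

print(len(doc.xpath(naive)))  # 2: also matches 'price-rows-legacy'
print(len(doc.xpath(exact)))  # 1: whole-word class match only
```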

(back to top)


Contact

Alhajras Algdairy - LinkedIn - alhajras.algdairy@gmail.com

(back to top)