Alhajras/webscraper - Githubissues

Scriburg Search Engine

Master thesis project at the University of Freiburg
Read thesis »

Report Bug · Request Feature

About The Project
- Built With
- Screenshots
Getting Started
- Installation
Usage
Contact

About The Project

If you are looking for a free scalable tool to collect and index information from a specific set of domains on the Internet, Scriburg is the right tool for you.

Examples of valid use cases:

Compare prices and find the cheapest product in the market
Track job positions with a keyword
Find an affordable apartment to rent
Follow news
Monitor social media content

Screenshots

Crawlers Dashboard	Crawler Configuration

Indexers Dashboard	Search Result

(back to top)

Built With

(back to top)

Getting Started

To simplify the installations process, docker compose is used and recommended.

Installation

Installing docker compose (Recommended) The docker compose version supported is: v2.16.0 run docker compose version to print your local version.

If you do not have docker compose you can install it from here Docker compose

Building all images:
```
docker compose build
```
Run containers:
```
docker compose up -d
```

If you want to only run the pbs cluster run:

docker compose up -d pbs-head-node pbs-sim-node

To register the simulation node in the head node, first you have to invlike the head container and run the following:
```
. /etc/profile.d/pbs.sh
qmgr -c "create node pbs-sim-node"
```

(back to top)

Usage

Use Case - 1: Stack Overflow Questions

You want to create a small search engine that can answer any question related to Python. To do so you want to collect all the questions on Stack Overflow related to Python topic.

This is a screenshot on how the table we want to extract information from looks like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/cce29dfc-92de-40db-83eb-e425dcac70e8) **Goal:** We would like to extract the following Fields: **Title, Description** and **Votes** To do so, we will follow the next steps: ------------------------ ## 1 - Templates ![image](https://github.com/Alhajras/webscraper/assets/36598060/ce1b5e88-b483-4232-a020-8c8fd5bcff6c) We start by creating a _**Template**_, which is the blueprint that maps to the fields that want to be downloaded as a document. ### Steps: - Go to the [Templates](http://localhost:4200/templates) page - Click on `Create a template` button - Give a name of the template `stack-overflow-template` and click on `save`. - Expand the template you created by clicking on the `>` button. - Now we want to create the fields we want to capture from the page: `title`, `description` and `votes`. - Click on `Create an inspector` button and create the following inspectors: ``` Name: title Selector: //*[contains(@class, 's-post-summary--content-title')] Type: text Name: description Selector: //*[contains(@class, 's-post-summary--content-excerpt')] Type: text Name: votes Selector: //*[contains(@class, 's-post-summary--stats-item__emphasized')] Type: text ``` This is how your inspector's list should look like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/7d63ef96-6ee6-4a2d-8445-f3589b909910) ------------------------ ## 2 - Crawlers After creating a _**Template**_, we want to create and configure a _**Crawler**_. ### Steps: - Navigate to the [Crawlers](http://localhost:4200/crawlers) page - Click on `Create a crawler` and expand `Advanced options` - Fill the next values: **Note**: After filling the `Links Scope (Pagination)` press enter to apply changes to the field because it is a list of text. ``` Name: stack-overflow-crawler Template: stack-overflow-template Max pages: 1000 Max collected docs: 1000 Seed URL: https://stackoverflow.com/questions/tagged/python Allow multi elements crawling: Enable Links Scope (Pagination): //*[contains(@class, 's-pagination')] Threads: 1 Max depth: 10000 ``` This is how it should look like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/3e1967ee-b195-4f2d-a76b-56ce1f4e7549) - Click on `Create` button ------------------------ ## 3 - Runners Runners are jobs that run the crawling process in a cluster or locally. After creating the _Crawler_, we create a _Runner_. Now, we can run/stop the crawlers from the Runners page. ### Steps: - Navigate to the [Runners](http://localhost:4200/runners) page: - Click on `Create a Runner` - Fill the following: ``` Name: stack-overflow-runner Crawler: stack-overflow-crawler Machine: localhost ``` - Click on Create Find your runner in the list. Click on the burger menu to start crawling and click on `Start`. ![image](https://github.com/Alhajras/webscraper/assets/36598060/df26e6f5-3b3c-45de-988d-f0fb83ee76ad) The list will keep refreshing. You don't have to keep reloading the page. You can monitor the progress by looking at the `progress` column and `status` column. You can see the log and statistics by clicking on: ![image](https://github.com/Alhajras/webscraper/assets/36598060/ee956f61-817a-46f5-ab09-07e50eff5e26) ------------------------ ## 4 - Indexing After the runner is completed, we can start indexing the results. ### Steps: - Navigate to Indexers - Click on `create an indexer` - Fill the following: ``` name: stack-overflow-indexer Inspectors: stack-overflow (Title), stack-overflow (Description) ``` - Click on Create. - Find your indexer from the list and click on `Start indexing`. - Watch the indexing going from status `New` to `Completed` ------------------------ ## 5 - Searching After crawling (Collecting data) and Indexing (Preparing them for searching), we can test if searching returns the right results. ### Steps: - Navigate to Search - Select your indexer - Search for: - `python` (Covering a normal query case) - `pythen` (Fuzzy search test) - `what is ruby` (Covering a case where only one word should be more important than others) - `django` (Covering a normal query case) - `SQL database` (Covering a normal query case) - `Python 2.6` (Covering a normal query case with a number)

Use Case - 2: World University Rankings 2023

You are a university professor and would like to maintain a local version list of the universities worldwide and their ranking. To do this, we will use the website (www.timeshighereducation.com) to download and index the information. The Times Higher Education World University Rankings 2023 include 1,799 universities across 104 countries and regions, making them the largest and most diverse university rankings to date.

This is a screenshot on how the table we want to extract information from looks like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/47decc10-d491-4ff8-b1dc-4ec156f30d18) **Goal:** We would like to extract the following Fields: **University ranking, University name** and **University location** To do so, we will follow the next steps: ------------------------ ## 1 - Templates ![image](https://github.com/Alhajras/webscraper/assets/36598060/ce1b5e88-b483-4232-a020-8c8fd5bcff6c) We start by creating a _**Template**_, which is the blueprint that maps to the fields that want to be downloaded as a document. ### Steps: - Go to the [Templates](http://localhost:4200/templates) page - Click on `Create a template` button - Give a name of the template `uni-ranking-template` and click on `save`. - Expand the template you created by clicking on the `>` button. - Now we want to create the fields we want to capture from the page: `uni-name`, `uni-location` and `uni-ranking`. - Click on `Create an inspector` button and create the following inspectors: ``` Name: uni-name Selector: //*[contains(@class, 'ranking-institution-title')] Type: text Name: uni-location Selector: //*[contains(concat(' ', normalize-space(@class), ' '), ' location ')] Type: text Name: uni-ranking Selector: //*[contains(@class, 'rank') and contains(@class, 'sorting_1') and contains(@class, 'sorting_2')] Type: text ``` This is how your inspector's list should look like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/6609c955-dd8e-435e-9f10-f29888384045) ------------------------ ## 2 - Crawlers After creating a _**Template**_, we want to create and configure a _**Crawler**_. ### Steps: - Navigate to the [Crawlers](http://localhost:4200/crawlers) page - Click on `Create a crawler` and expand `Advanced options` - Fill the next values: **Note**: After filling the `Links Scope (Pagination)` press enter to apply changes to the field because it is a list of text. ``` Name: uni-crawler Template: uni-ranking-template Max pages: 10000 Max collected docs: 300000 Seed URL: https://www.timeshighereducation.com/world-university-rankings/2023/world-ranking Allow multi elements crawling: Enable Links Scope (Pagination): //*[contains(@class, 'pagination')] Threads: 4 Max depth: 10000 ``` This is how it should look like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/3e1967ee-b195-4f2d-a76b-56ce1f4e7549) - Click on `Create` button ------------------------ ## 3 - Runners Runners are jobs that run the crawling process in a cluster or locally. After creating the _Crawler_, we create a _Runner_. Now, we can run/stop the crawlers from the Runners page. ### Steps: - Navigate to the [Runners](http://localhost:4200/runners) page: - Click on `Create a Runner` - Fill the following: ``` Name: uni-runner Crawler: uni-crawler Machine: localhost ``` - Click on Create Find your runner in the list. Click on the burger menu to start crawling and click on `Start`. ![image](https://github.com/Alhajras/webscraper/assets/36598060/df26e6f5-3b3c-45de-988d-f0fb83ee76ad) The list will keep refreshing. You don't have to keep reloading the page. You can monitor the progress by looking at the `progress` column and `status` column. You can see the log and statistics by clicking on: ![image](https://github.com/Alhajras/webscraper/assets/36598060/ee956f61-817a-46f5-ab09-07e50eff5e26) ------------------------ ## 4 - Indexing After the runner is completed, we can start indexing the results. ### Steps: - Navigate to Indexers - Click on `create an indexer` - Fill the following: ``` name: {yourname}-uni-indexer Inspectors: {yourname}-uni-crawler (Uni name {yourname}) ``` - Click on Create. - Find your indexer from the list and click on `Start indexing`. - Watch the indexing going from status `New` to `Completed` ------------------------ ## 5 - Searching After crawling (Collecting data) and Indexing (Preparing them for searching), we can test if searching returns the right results. ### Steps: - Navigate to Search - Select your indexer - Search for: - `university` (Covering a normal query case) - `what is freiburg` (Covering a case where only one word should be more important than others) - `show me hamburg unis` (Long query) - `berlin` (Covering a normal query case) - `Humboldt Berlin` (Covering a normal query case) - `Electronic` (Covering a normal query case) - `college` (Covering a normal query case) - Testing the suggestions list: - Enter `Universi`, Should correctly suggest `university` - Enter `univsrity`, Misspelling should be forgiven, and the result should be `university` - Enter `university oxford`, should show results including `university of oxford`

Use Case - 3: Comparing Products Prices

You are a small business owner who would like to monitor and track the competitors. You can create more than one crawler to monitor different websites and for this use case, we will focus on Douglas.

This is a screenshot on how the products we want to extract information from looks like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/489d206a-e444-40dc-a007-280fd938c03d) **Goal:** We would like to extract the following Fields: **Brand, Image, name** and **Price** To do so, we will follow the next steps: ------------------------ ## 1 - Templates ![image](https://github.com/Alhajras/webscraper/assets/36598060/ce1b5e88-b483-4232-a020-8c8fd5bcff6c) We start by creating a _Template_, which is the blueprint that maps to the fields that want to be downloaded as a document. ### Steps: - Go to the Templates page - Click on `Create a template` button - Give a name of the template `{your-name}-douglas` and click on save. - Expand the template you created. - Now we want to create the fields we want to capture from the page: `Product brand`, `Product image`, `Product name` and `Product price`. - Click on `Create an inspector` button and create the following inspectors: ``` Name: product-name-{yourname} Selector: //*[contains(@class, 'text')][contains(@class, 'name')] Type: text Name: product-brand-{yourname} Selector: //*[contains(@class, 'top-brand')] Type: text Name: product-image-{yourname} Selector: //a[contains(@class, 'product-tile__main-link')]/div[1]/div/img Type: image Name: product-price-{yourname} Selector: //div[contains(concat(' ', normalize-space(@class), ' '), ' price-row ')] Type: text ``` This is how your inspector's list should look like with different names: ![image](https://github.com/Alhajras/webscraper/assets/36598060/1be60cba-5e65-40e5-a363-8cadc3a6d512) ------------------------ ## 2 - Crawlers After creating a _Template_, we want to create and configure a _Crawler_. ### Steps: - Navigate to the Crawlers page - Click on `Create a crawler` and expand `Advanced options` - Fill the next values: ``` Name: {yourname}-douglas Template: {your-name}-douglas Max pages: 20000 Max collected docs: 200000 Seed URL: https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111 Allow multi elements crawling: Enable Links Scope (Pagination): This is a list field, meaning after each entry press enter button - //*[contains(@class, 'pagination')] - //*[contains(@class, 'left-content-slot')] - //*[contains(@class, 'navigation-main__container')] - //*[contains(@class, 'header')] Threads: 4 Max depth: 100 ``` This is how it should look like: ![image](https://github.com/Alhajras/webscraper/assets/36598060/1c6aeaf6-704b-4dc2-8999-31ac8b4ab718) - Click on `Create` button ------------------------ ## 3 - Runners Runners are jobs that run the crawling process in a cluster or locally. After creating the _Crawler_, we create a _Runner_. Now, we can run/stop the crawlers from the Runners page. ### Steps: - Navigate to the Runners page: - Click on `Create a Runner` - Fill the following: ``` Name: {yourname}-douglas Crawler: {yourname}-douglas Machine: localhost ``` - Click on Create Find your runner in the list. Click on the burger menu to start crawling and click on `Start`. ![image](https://github.com/Alhajras/webscraper/assets/36598060/df26e6f5-3b3c-45de-988d-f0fb83ee76ad) The list will keep refreshing. You don't have to keep reloading the page. You can monitor the progress by looking at the `progress` column and `status` column. You can see the log and statistics by clicking on: ![image](https://github.com/Alhajras/webscraper/assets/36598060/ee956f61-817a-46f5-ab09-07e50eff5e26) ------------------------ ## 4 - Indexing After the runner is completed, we can start indexing the results. ### Steps: - Navigate to Indexers - Click on `create an indexer` - Fill the following: ``` name: {yourname}-douglas Inspectors: product-name-{yourname} ({your-name}-douglas) ``` - Click on Create. - Find your indexer from the list and click on `Start indexing`. - Watch the indexing going from status `New` to `Completed` ------------------------ ## 5 - Searching After crawling (Collecting data) and Indexing (Preparing them for searching), we can test if searching returns the right results. ### Steps: - Navigate to Search - Select your indexer - Search for: - `set` (Covering short query) - `water` (Covering short query) - `Micellar Water` (Covering exact product name) - `black` (Covering normal query) - `black in` (Covering a case were the tokens are in the wrong order) - `set spring dadadadadad` (Covering a random word) - Testing the suggestions list: - Enter `Universi`, Should correctly suggest `university` - Enter `univsrity`, Misspelling should be forgiven, and the result should be `university` - Enter `university oxford`, should show results including `university of oxford`

(back to top)

Contact

Alhajras Algdairy - Linkedin - alhajras.algdairy@gmail.com

(back to top)