Chapter 4: Approach - Githubissues

Alhajras / webscraper

Configurable search engine written in Python and Angular. It supports indexing as well.


Chapter 4: Approach #27

Open Alhajras opened 10 months ago

Alhajras commented 10 months ago

- System Overview and the stack
- Implementation Details
  - Explain the pipeline of how to crawl
- The User Interface
  - Explain each page and what it does

Alhajras commented 10 months ago

get robots.txt file
get seed_url
create thread pool to share the found links
add link to the pool
init_queue
while queue not empty and not all_threads_completed:
    if queue empty:
        find links from other threads
    else:
        get crawler configurations
        if link visited: return
        selenium -> get link
        execute_all_before_actions
        find links in the page
        exclude links that are:
            out of domain
            disallowed by the robots.txt file
        get all documents and save them
        exclude duplicated documents
        clean up documents before saving
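A rough Python sketch of the loop above, for discussion only. It is single-threaded, so the thread pool and link sharing are left out, and the Selenium step and before-actions are replaced by a hypothetical `fetch_page(url)` callable that returns `(hrefs, documents)`. The names `crawl`, `allowed_links` and `fetch_page` are assumptions for illustration, not the names used in the repo. Only the standard library is used.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_links(page_url, hrefs, seed_url, robots, user_agent="*"):
    """Resolve hrefs against the page URL and drop links that are
    out of domain or disallowed by robots.txt."""
    seed_host = urlparse(seed_url).netloc
    kept = []
    for href in hrefs:
        link = urljoin(page_url, href)
        if urlparse(link).netloc != seed_host:
            continue  # out of domain
        if not robots.can_fetch(user_agent, link):
            continue  # disallowed by robots.txt
        kept.append(link)
    return kept


def crawl(seed_url, robots_lines, fetch_page):
    """Breadth-first crawl starting at seed_url.

    robots_lines: the lines of robots.txt, already downloaded.
    fetch_page:   hypothetical stand-in for the Selenium step; it takes a
                  URL and returns (hrefs, documents) for that page.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)

    queue = deque([seed_url])
    visited = set()
    seen_docs = set()
    saved = []

    while queue:
        link = queue.popleft()
        if link in visited:
            continue  # already crawled
        visited.add(link)

        hrefs, documents = fetch_page(link)

        # Queue only in-domain, robots-allowed links not yet visited.
        for new_link in allowed_links(link, hrefs, seed_url, robots):
            if new_link not in visited:
                queue.append(new_link)

        # Clean up each document (collapse whitespace), then skip duplicates.
        for doc in documents:
            cleaned = " ".join(doc.split())
            if cleaned and cleaned not in seen_docs:
                seen_docs.add(cleaned)
                saved.append(cleaned)

    return saved
```

To try it without a browser, pass a `fetch_page` that looks pages up in a dict, e.g. `crawl("https://example.com/", ["User-agent: *", "Disallow: /private"], lambda url: site.get(url, ([], [])))`. In the real crawler the same loop would run in each worker thread against a shared queue.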