Alhajras opened 10 months ago
    get robots.txt file
    get seed_url
    create thread pool to share the found links
    add link to the pool
    init_queue
    while queue not empty and not all_threads_completed:
        if queue empty:
            find links from other threads
        else:
            get crawler configurations
            if link visited: return
            selenium -> get link
            execute_all_before_actions
            find links in the page
            exclude links that are:
                out of domain
                disallowed by the robots.txt file
            get all documents and save them
            exclude duplicated documents
            clean up documents before saving
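The loop above could be sketched roughly as follows. This is a minimal single-threaded version: `fetch_page` is a hypothetical stand-in for the Selenium-backed "get link" and "find links in the page" steps, `robots_lines` stands in for the downloaded robots.txt content, and "clean up documents" is reduced to whitespace trimming for illustration.

```python
# Minimal sketch of the crawl loop; fetch_page and robots_lines are
# hypothetical stand-ins for the Selenium fetch and the robots.txt download.
import hashlib
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def crawl(seed_url, fetch_page, robots_lines, max_pages=100):
    """Crawl from seed_url, staying in-domain, honoring robots.txt,
    and returning deduplicated documents."""
    robots = RobotFileParser()
    robots.parse(robots_lines)          # "get robots.txt file"
    domain = urlparse(seed_url).netloc  # used for the out-of-domain filter
    queue = deque([seed_url])           # "init_queue"
    visited = set()                     # "if link visited: return"
    seen_hashes = set()                 # "exclude duplicated documents"
    documents = []
    while queue and len(visited) < max_pages:
        link = queue.popleft()
        if link in visited:
            continue
        visited.add(link)
        html, links = fetch_page(link)  # Selenium-backed in the real crawler
        for href in links:
            url = urljoin(link, href)
            # exclude out-of-domain and robots-disallowed links
            if urlparse(url).netloc != domain:
                continue
            if not robots.can_fetch("*", url):
                continue
            queue.append(url)
        # "clean up documents before saving" (just whitespace trimming here)
        doc = html.strip()
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            documents.append((link, doc))
    return documents
```

The multi-threaded variant in the pseudocode would replace the `deque` with a shared, locked queue (e.g. `queue.Queue`) so that an idle worker can "find links from other threads".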