PacktPublishing / Learning-Concurrency-in-Python

Learning Concurrency in Python by Packt
MIT License

Chapter 5 - webCrawler.py not working properly #6

Open xemage opened 4 years ago

xemage commented 4 years ago

I think this code is not working properly. The result depends heavily on the number of threads you start: the more threads, the more pages get crawled. My guess is that the crawler threads exit when the queue is empty and do not go back to work when new work is added to the queue.
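
To illustrate the behaviour I would expect, here is a minimal sketch of a worker pool in which threads block on an empty queue instead of exiting, and shut down only once all work is done. This is not the book's webCrawler.py; `run_workers` and `handle` are hypothetical names, and the sketch assumes each item is only ever produced once.

```python
import queue
import threading


def run_workers(initial_items, handle, num_workers=4):
    """Process items until no work is left.

    handle(item) is a hypothetical callback that returns an iterable of
    new items discovered while processing `item`.
    """
    todo = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = todo.get()  # blocks while the queue is empty
            try:
                if item is None:  # sentinel: no more work will arrive
                    return
                with results_lock:
                    results.append(item)
                # Queue the follow-up work before marking this item done,
                # so the unfinished-task count never drops to zero early.
                for new_item in handle(item):
                    todo.put(new_item)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    for item in initial_items:
        todo.put(item)

    todo.join()  # returns only when every queued item has been handled

    for _ in threads:  # one sentinel per worker
        todo.put(None)
    for t in threads:
        t.join()
    return results
```

With this structure the set of processed items should not depend on `num_workers`, because no thread gives up just because the queue is momentarily empty.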

Some results shown by the crawler when crawling https://tutorialedge.net

1 Thread: Total Number of Pages Visited 35
5 Threads: Total Number of Pages Visited 35
10 Threads: Total Number of Pages Visited 36
50 Threads: Total Number of Pages Visited 67
100 Threads: Total Number of Pages Visited 78

rualark commented 2 years ago

Hi. You are close. The problem is not that threads finish due to an empty queue, but that duplicates are not handled properly. Here is a detailed explanation: https://stackoverflow.com/questions/70468915/problems-with-webcrawler-implementation
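
As a rough sketch of one common way to handle duplicates across threads (an assumption for illustration, not necessarily the fix described in the linked answer): check and record each URL in a single locked step, so that only the first thread to see a URL gets to crawl it. `SeenUrls` and `claim` are hypothetical names.

```python
import threading


class SeenUrls:
    """Thread-safe record of URLs that have already been claimed."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, url):
        """Return True only for the first caller that presents this url."""
        with self._lock:  # check and add happen atomically
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

    def __len__(self):
        with self._lock:
            return len(self._seen)
```

A crawler thread would call `claim(url)` before queueing a link and skip it when the result is `False`, so each page is queued and counted at most once regardless of how many threads discover it.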