c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

Add multithread option #49

Closed Garrett-R closed 6 years ago

Garrett-R commented 6 years ago

This package can be prohibitively slow for sites with many pages. I've added a command line option for multithreading. I tested it on our site (up.codes) and the results are:

Before: 36 URLs / minute
After (with `-n 16`): 444 URLs / minute

The default is still single-threaded.

There are two commits here; the first is just renaming some variables and minor formatting fixes, so you may want to review them separately.

c4software commented 6 years ago

Hi,

Nice. Thank you for this huge contribution. Before merging, I want to check something with you.

How did you check (or dedupe) URIs in the queue?

Again, thanks for this nice improvement.

Garrett-R commented 6 years ago

No prob, this repo has been super helpful, so I'm happy to give back.

The method for preventing dupes in the queue is similar to before (here), but slightly different.

How it worked before (and how it still works under the single-threaded default): you have a queue, and you pop one URI at a time. When adding new URIs to the queue, you check that each one is neither already in the queue nor already crawled.
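The single-threaded dedupe check can be sketched roughly like this (names like `queue` and `crawled` are illustrative, not the package's actual attributes):

```python
from collections import deque

def add_url(url, queue, crawled):
    """Enqueue `url` only if it is neither queued already nor crawled."""
    if url not in queue and url not in crawled:
        queue.append(url)

queue = deque(["https://example.com/a"])
crawled = {"https://example.com/"}
add_url("https://example.com/a", queue, crawled)  # already queued, skipped
add_url("https://example.com/", queue, crawled)   # already crawled, skipped
add_url("https://example.com/b", queue, crawled)  # new, enqueued
```

Note the membership test against a `deque` is O(n); the real code may use sets instead.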

With multithreading:

1) Initialize the queue.
2) The entire queue is converted into tasks, and these are saved into `self.crawled_or_crawling`.
3) The queue is cleared so it's empty.
4) The program splits into multiple threads to finish all remaining tasks.
   4a) Each thread can add to the queue. The same checks happen to make sure a URI is neither in the queue (potentially added by another thread) nor a current task that is or will be processed by a thread.
5) The main thread waits for all tasks to finish (here).
6) Once all tasks are finished, the program is basically back in "single-thread mode".
7) Go back to step (2).

So note that in step (4a), the queue does not get processed yet. All tasks have to finish, and then you go back to step (2), at which point a new batch of tasks is created (sometimes thousands of tasks).

Does that answer the question?

c4software commented 6 years ago

That perfectly answers the question, thank you!