The original URL scheduler was implemented as a list named 'pre_parse_urls'. URLs were taken from it in LIFO order, so item parsing had to wait until all entry pages had been parsed. That is not appropriate for a crawler.
My change replaces the list with a FIFO queue. An added benefit is that a queue makes it easier to refactor the framework into a distributed design in the future.
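
A minimal sketch of the ordering difference, not the actual framework code. It assumes the old list was consumed with `list.pop()`, and it uses the standard library's `queue.Queue` as one possible FIFO queue; the helper names `drain_lifo` and `drain_fifo` and the sample URLs are hypothetical.

```python
from queue import Queue


def drain_lifo(urls):
    """Consume URLs from a plain list used as a stack (last in, first out)."""
    pending = list(urls)  # stands in for the old 'pre_parse_urls' list
    order = []
    while pending:
        order.append(pending.pop())  # pop() takes the most recently added URL
    return order


def drain_fifo(urls):
    """Consume URLs from a FIFO queue (first in, first out)."""
    pending = Queue()
    for url in urls:
        pending.put(url)
    order = []
    while not pending.empty():
        order.append(pending.get())  # get() takes the oldest queued URL
    return order


# Hypothetical URLs, listed in the order they were scheduled.
urls = ["page/1", "page/2", "page/3"]
lifo_order = drain_lifo(urls)
fifo_order = drain_fifo(urls)
print(lifo_order)
print(fifo_order)
```

With the FIFO queue, URLs are handled in the order they were scheduled, whereas the list reverses that order. `queue.Queue` is also thread-safe, which is one reason a queue-based scheduler may be easier to swap for a shared or remote queue later.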