Change candidate urls implementation from list to asyncio queue and little fixes

elliotgao2 / gain

Web crawling framework based on asyncio.

GNU General Public License v3.0

2.04k stars, 207 forks

Change candidate urls implementation from list to asyncio queue and little fixes #33

Closed. babykick closed this 7 years ago

babykick commented 7 years ago

The original URL scheduling was implemented with a list named 'pre_parse_urls'. URLs were taken from it in LIFO order, so item parsing had to wait until all entry pages were parsed. That is not appropriate for a crawler. My change uses a queue with FIFO order instead. An added benefit is that it lets us refactor the framework into a distributed style in the future.
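
To illustrate the ordering difference, here is a minimal sketch. It does not reproduce gain's actual code: the function names `drain_list_lifo` and `drain_queue_fifo` and the example URLs are hypothetical, and the only assumption is that the list was consumed from its end while the queue is a plain `asyncio.Queue`.

```python
import asyncio


def drain_list_lifo(urls):
    """Consume URLs from the end of a list (hypothetical stand-in for the old behaviour)."""
    pending = list(urls)
    order = []
    while pending:
        # list.pop() removes the most recently added URL first (LIFO).
        order.append(pending.pop())
    return order


async def drain_queue_fifo(urls):
    """Consume URLs from an asyncio.Queue, which returns them in insertion order (FIFO)."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    order = []
    while not queue.empty():
        order.append(await queue.get())
        # Mark the item as processed so that queue.join() could be used by a caller.
        queue.task_done()
    return order


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the order in which they are visited.
    urls = ["/page/1", "/page/2", "/item/1"]
    print("list (LIFO): ", drain_list_lifo(urls))
    print("queue (FIFO):", asyncio.run(drain_queue_fifo(urls)))
```

With a queue, URLs are handled in the order they were scheduled. The same `put`/`get` interface could later be backed by a shared queue service, which is what makes the distributed refactoring easier to imagine.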