elliotgao2 / gain

Web crawling framework based on asyncio.
GNU General Public License v3.0

Queue timeout and Python 3.6 support. #40

Closed · yc0 closed this 6 years ago

yc0 commented 6 years ago

There are two commits. First of all, we have to pass some checks and carry the session over to the subsequent requests; therefore, I suggest that we take a cookie_jar to manage sessions.
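For illustration, a minimal sketch of the session-sharing idea using aiohttp (which gain builds on); the URLs are placeholders and this is not gain's actual code:

```python
import asyncio

import aiohttp


async def main():
    # One shared cookie jar keeps session cookies (e.g. those set by a
    # server-side check) available to every subsequent request.
    cookie_jar = aiohttp.CookieJar()
    async with aiohttp.ClientSession(cookie_jar=cookie_jar) as session:
        # The first response may set session cookies (placeholder URL).
        await session.get('https://example.com/check')
        # Later requests send the stored cookies back automatically.
        await session.get('https://example.com/data')


# Python 3.6 compatible (no asyncio.run).
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```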

Secondly, there is a bug that can block while dequeuing. The condition in spider.is_running() allows a state where the queue is empty but len(parser.parsing_urls) > 0; in that state the queue.get() coroutine blocks waiting for upcoming URLs. New URLs might still be enqueued somehow; nevertheless, if no new URL ever arrives, the event loop won't stop unless you interrupt it.
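A stripped-down, runnable sketch of the hang (the spider internals are simplified; parsing_urls stands in for parser.parsing_urls, and the outer wait_for exists only to bound the demo):

```python
import asyncio


async def dequeue(queue, parsing_urls):
    # Mirrors the is_running() condition: keep looping while the queue
    # has items OR urls are still being parsed.
    while not queue.empty() or parsing_urls:
        url = await queue.get()  # blocks forever if nothing is enqueued
        print('got', url)


loop = asyncio.get_event_loop()
queue = asyncio.Queue()
parsing_urls = ['https://example.com']  # non-empty, but no parser enqueues
try:
    # Bound the demo to 2 seconds; without this wrapper the bare
    # queue.get() above would block until interrupted.
    loop.run_until_complete(asyncio.wait_for(dequeue(queue, parsing_urls), 2))
except asyncio.TimeoutError:
    print('blocked: queue empty while parsing_urls is non-empty')
```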

I impose a timeout mechanism on the dequeue. By default, we wait up to 5 seconds for new URLs to arrive; otherwise queue.get() times out and the event loop can finish instead of hanging.
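A sketch of that mechanism with asyncio.wait_for; the function name and the 5-second default here are illustrative, not the exact patch:

```python
import asyncio


async def fetch_loop(queue, parsing_urls, timeout=5):
    while not queue.empty() or parsing_urls:
        try:
            # Wait at most `timeout` seconds for a new url instead of
            # blocking indefinitely on a bare queue.get().
            url = await asyncio.wait_for(queue.get(), timeout)
        except asyncio.TimeoutError:
            # No new url arrived in time; stop so the event loop can
            # finish instead of hanging until interrupted.
            break
        print('fetching', url)
```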