Chin-I opened this issue 8 years ago
There are several reasons the crawler can stop. The default case is when the workers have not received a URL from the frontier for `config.FrontierTimeOut` seconds. That can happen because the frontier has no more URLs left to crawl (i.e., the crawl is complete).
In either case, the crawler stops with a dump of what is left in the frontier. If that set was empty, there were no URLs left in the frontier to crawl. If it was not empty, something else killed the crawler, and the crawl can be resumed on restart.
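To illustrate the timeout behavior described above, here is a minimal sketch of a worker loop that stops after waiting too long on the frontier. The names (`worker_loop`, `FRONTIER_TIMEOUT`) are illustrative, not the crawler's actual identifiers; `FRONTIER_TIMEOUT` stands in for `config.FrontierTimeOut`.

```python
import queue

FRONTIER_TIMEOUT = 60  # seconds; stands in for config.FrontierTimeOut

def worker_loop(frontier, timeout=FRONTIER_TIMEOUT):
    """Pull URLs from the frontier until none arrive within `timeout` seconds."""
    crawled = []
    while True:
        try:
            url = frontier.get(timeout=timeout)
        except queue.Empty:
            # No URL arrived in time: either the crawl is complete or the
            # frontier stalled. Either way, the worker stops.
            break
        crawled.append(url)
    return crawled
```

In other words, a worker cannot tell "crawl complete" apart from "frontier stalled" on its own; that is why the frontier dump at shutdown is what distinguishes the two cases.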
I'm not sure what you mean by reading "text" from the shelve. The method you used to access the shelve file is the right way to do it.
The Persistent file is a dictionary, keyed by URL, with tuple values like those in the output you pasted.
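For completeness, here is a short sketch of iterating over that dictionary. Based on the output pasted below, each key is a URL and each value is a tuple like `(True, 3)`; my reading of the fields as a crawled-flag and a depth is an assumption, not something confirmed by the code.

```python
import shelve

def dump_shelve(path):
    """Return the shelve contents as a plain dict of URL -> value tuple."""
    entries = {}
    # shelve.open appends its own extension on some platforms, so pass
    # the base path (e.g. "Persistent.shelve"), not the .db filename.
    with shelve.open(path) as db:
        for url, value in db.items():
            # value looks like (True, 3): possibly (crawled_flag, depth),
            # but that interpretation is my assumption.
            entries[url] = value
    return entries
```

This just materializes the shelve into memory so you can inspect or pretty-print it; for a 300 MB shelve you may prefer to iterate lazily instead.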
Hello lordnahor. Thanks for sharing the crawler on GitHub.
Recently I tried using http://www.ics.uci.edu/ as the seed to crawl. The first time, I crawled for 10 hours and got a Persistent.shelve of about 150 MB. The second time, it stopped at 300 MB. I wonder: is there a designed "stop" condition, such as running out of URLs in the frontier, or did it just stop by accident?
One more question; I want to double-check: can I read "text" from the shelve file? Because when I execute

import shelve

d = shelve.open("Persistent.shelve.db")
print("Persistent.shelve.db", d)
what I get is just:

'http://www.ics.uci.edu/grad/courses/listing.php': (True, 3), 'http://www.ics.uci.edu/prospective/ko/degrees/business-information-management': (False, 4), 'http://hombao.ics.uci.edu?s=opportunities': (False, 4), 'http://asterix.ics.uci.edu/talks.html': (False, 4), ...

Thanks for your answer.