Mondego / crawler4py

A web crawler in Python

What's the designed stop situation? & Question about the shelve #11

Open Chin-I opened 8 years ago

Chin-I commented 8 years ago

Hello lordnahor. Thanks for sharing the crawler on GitHub.

Recently I tried using http://www.ics.uci.edu/ as the seed to crawl. The first time, I crawled for 10 hours and got a Persistent.shelve of about 150 MB. The second time, it stopped at 300 MB. But I wonder: is there a designed "stop" condition, such as the frontier running out of URLs, or did it just stop by accident?

One more question: I want to double-check whether I can read the page "text" from the shelve file. When I execute

    d = shelve.open("Persistent.shelve.db")
    print "Persistent.shelve.db", d

what I get is just:

    'http://www.ics.uci.edu/grad/courses/listing.php': (True, 3), 'http://www.ics.uci.edu/prospective/ko/degrees/business-information-management': (False, 4), 'http://hombao.ics.uci.edu?s=opportunities': (False, 4), 'http://asterix.ics.uci.edu/talks.html': (False, 4), ...

Thanks for your answer.

rohan-achar commented 8 years ago

There are several reasons the crawler can stop. The default case is when the workers have not received a URL from the frontier for config.FrontierTimeOut seconds. That can happen because the frontier has no more URLs left to crawl (in which case the crawl is complete).
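For illustration, here is a minimal Python 3 sketch of that timeout pattern. Only config.FrontierTimeOut is from the project; the queue-based frontier and the worker function are assumptions made up for the example, not the crawler's actual code:

    import queue

    def worker_loop(frontier, timeout_secs):
        # Pull URLs until none arrive for timeout_secs seconds, then stop.
        while True:
            try:
                url = frontier.get(timeout=timeout_secs)
            except queue.Empty:
                # No URL from the frontier for timeout_secs seconds:
                # this is the default stop condition described above.
                print("frontier quiet, worker stopping")
                break
            print("crawling", url)  # stand-in for fetch/parse/enqueue

    frontier = queue.Queue()
    frontier.put("http://www.ics.uci.edu/")
    worker_loop(frontier, timeout_secs=5.0)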

In any case, the crawler stops with a dump of what is left in the Frontier. If that set is empty, there were no URLs left in the frontier to crawl. If it is not empty, something else killed the crawler, and it can be resumed on restart.

Not sure what you mean by reading "text" from the shelve. The method you used to access the shelve file is the right way to do it. The Persistent file is a dictionary mapping each url to a (downloaded, depth) pair. The URLs that have downloaded set to False will be added to the Frontier on restart of the crawler (that is how the resume function works).
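To make that concrete, here is a small Python 3 sketch of walking the shelve as that dictionary. The filename follows the question above (adjust it to whatever your crawl produced; depending on the dbm backend, the on-disk file may carry a ".db" suffix). Note the shelve holds only the (downloaded, depth) metadata, not the page text:

    import shelve

    # Open the persistent shelve produced by the crawl and list each
    # URL with its depth and whether it has been downloaded.
    with shelve.open("Persistent.shelve") as db:
        for url, (downloaded, depth) in db.items():
            state = "done" if downloaded else "pending (re-queued on restart)"
            print(url, "-> depth", depth, state)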