Google1234 / Information_retrieva_Projectl-

News retrieval: a focused crawler harvests 3-4 target news sites and implements extraction, retrieval, and indexing of page content. No fewer than 10 pages are indexed; results can be ranked by time, relevance, popularity, and other attributes, with automatic clustering of similar topics. Optional features: related-search suggestions, snippet generation, and result preview (hovering over a result shows a preview).

The crawler does not stop promptly once enough pages have been crawled #4

Open Google1234 opened 8 years ago

Google1234 commented 8 years ago

http://www.sharejs.com/codes/python/8808

Google1234 commented 8 years ago
def parse(self, response):
    # Once the crawl budget is used up, close the output files exactly once.
    if self.web_id > self.crawl_number:
        if not self.has_terminated:
            self.write_block_data.close()
            self.write_block_crawlwd_weburl.close()
            del self.write_block_crawlwd_weburl, self.write_block_data
            # The original wrote `self.has_terminated==True`, a comparison
            # rather than an assignment, so the flag was never set and this
            # branch ran again on the next response, raising AttributeError
            # on the already-deleted attributes.
            self.has_terminated = True

2016-05-22 01:26:58 [scrapy] ERROR: Spider error processing <GET http://news.163.com/rank/> (referer: http://news.163.com/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Python\Information_retrieva_Projectl-\crawl\spiders\netease_spider.py", line 59, in parse
    del self.write_block_crawlwd_weburl, self.write_block_data
AttributeError: write_block_crawlwd_weburl
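For what it's worth, Scrapy already provides a clean way to stop a spider from inside a callback: raising CloseSpider. The sketch below is a minimal illustration, not the project's actual spider: web_id, crawl_number, and the write_block_* names come from the snippet above, while the class name, file paths, and everything else are placeholders. Moving the cleanup into the closed() hook guarantees it runs exactly once, whatever the shutdown reason.

import scrapy
from scrapy.exceptions import CloseSpider

class NeteaseRankSpider(scrapy.Spider):
    name = "netease_rank"                      # placeholder name
    start_urls = ["http://news.163.com/rank/"]
    crawl_number = 10                          # crawl budget, as in the snippet above

    def __init__(self, *args, **kwargs):
        super(NeteaseRankSpider, self).__init__(*args, **kwargs)
        self.web_id = 0
        self.write_block_data = open("data.txt", "w")            # placeholder path
        self.write_block_crawlwd_weburl = open("urls.txt", "w")  # placeholder path

    def parse(self, response):
        self.web_id += 1
        if self.web_id > self.crawl_number:
            # Tells the engine to stop scheduling and shut the spider down
            # once in-flight requests finish; no manual flag is needed.
            raise CloseSpider("crawl budget reached")
        self.write_block_crawlwd_weburl.write(response.url + "\n")
        # ... extract and write page data here ...

    def closed(self, reason):
        # Scrapy calls this once when the spider closes, for any reason,
        # so it is a safe single place for file cleanup.
        self.write_block_data.close()
        self.write_block_crawlwd_weburl.close()

Alternatively, the stock CloseSpider extension can enforce the budget with no spider code at all, e.g. CLOSESPIDER_PAGECOUNT = 10 in settings.py; note that it stops the crawl after roughly that many responses, so a few extra pages may still be processed.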