Content extraction sometimes misses content

Google1234 / Information_retrieva_Projectl-

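News retrieval: a crawler performs targeted collection from 3-4 websites and implements extraction, retrieval, and indexing of web page information. At least 10 web pages are collected. Results can be sorted by attributes such as time, relevance, and popularity, and similar topics are clustered automatically. Supported features: related-search suggestions, snippet generation, and result preview (hovering the mouse over a result shows a preview).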

MIT License

128 stars, 36 forks

Content extraction sometimes misses content #1

Closed Google1234 closed 8 years ago

Google1234 commented 8 years ago

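The rule item['content']=response.xpath('//div[@class="post_text"]/p').extract() is used to extract the text content of a news article. For example, when crawling http://news.163.com/16/0424/01/BLCOK6H400014AED.html, it extracts only:

"<p class=\"otitle\">\n （原标题：嫦娥三号拍出迄今最清晰月面照片(图)）\n <\/p>", "

<\/p>"

(The title reads: "Original title: Chang'e-3 takes the clearest photos of the lunar surface to date (photo)".)

Missing: the article body, which begins 新华社电自2013年12月14日月面软着陆以来，我国嫦娥三号月球探测器创造了全世界在..... ("Xinhua News Agency: Since its soft landing on the lunar surface on December 14, 2013, China's Chang'e-3 lunar probe has set the world .....")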

Google1234 commented 8 years ago

Fixed by relaxing the extraction rule.
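The thread does not say which relaxed rule was adopted. One plausible reading, offered only as a sketch, is that the body paragraphs are not direct children of div.post_text, so the child step /p misses them while the descendant step //p finds them. The example below uses the standard-library xml.etree.ElementTree in place of Scrapy's selector so that it runs without dependencies. The SAMPLE markup, the wrapper class, and the paragraph_texts helper are hypothetical and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page (an assumption, not the real 163.com markup):
# the body paragraph sits inside a nested <div>, so it is not a direct
# child of div.post_text.
SAMPLE = """
<html><body>
  <div class="post_text">
    <p class="otitle">original title</p>
    <div class="wrapper">
      <p>body paragraph</p>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)


def paragraph_texts(xpath):
    """Return the stripped text of every element matched by xpath."""
    return ["".join(p.itertext()).strip() for p in root.findall(xpath)]


# Strict rule: only <p> elements that are direct children of div.post_text.
strict = paragraph_texts(".//div[@class='post_text']/p")

# Relaxed rule: any <p> descendant of div.post_text, at any depth.
relaxed = paragraph_texts(".//div[@class='post_text']//p")

print(strict)
print(relaxed)
```

In a Scrapy spider the equivalent relaxed call would be response.xpath('//div[@class="post_text"]//p').extract(). Whether this matches the actual cause on the page above is an assumption; malformed HTML on the source page would need a different fix.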