code4craft / webmagic

A scalable web crawler framework for Java.
http://webmagic.io/
Apache License 2.0

In version 0.7.1 the downloader prints the page to the console, which makes reading the logs inconvenient #590

Closed lidaoyang closed 7 years ago

lidaoyang commented 7 years ago

In version 0.7.1 the downloader prints the page to the console. This generates a lot of log output and makes it very inconvenient to read other logs, but if I raise the log level I can't see my other logs at all. I don't think printing the page is necessary.

code4craft commented 7 years ago

Did you not configure a pipeline? The default ConsolePipeline does indeed print a lot of output.

lidaoyang commented 7 years ago

```java
Spider.create(new TongHSNewsPageProcessor(news_id, stockMap))
        .addUrl("http://stock.10jqka.com.cn/companynews_list/index_1.shtml")
        // .addPipeline(newsDBPipeline)
        .setExitWhenComplete(true)
        .thread(1).run();
newsDBPipeline.process(newslist, news_contentlist, "News_TongHS");
```

This is how I handle it, because saving inside the pipeline stores one record at a time. I want to save everything in one go after the whole crawl finishes, which saves resources, so I defined my own process method. That's also why I asked earlier whether there is an asynchronous way to process all the links first and then save them all at once.
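The batch-save pattern asked about above can be sketched as a pipeline that only queues results during the crawl and is drained once after `run()` returns. This is a minimal, self-contained illustration: the `ResultItems`, `Task`, and `Pipeline` types below are simplified stand-ins for webmagic's classes of the same name, and `BatchCollectorPipeline`/`drain` are hypothetical names, not webmagic API. (Recent webmagic versions also ship a `CollectorPipeline` abstraction that serves a similar purpose, if the version in use has it.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified stand-ins for the webmagic types of the same name.
interface Task { String getUUID(); }

class ResultItems {
    private final Map<String, Object> fields = new HashMap<>();
    public ResultItems put(String key, Object value) { fields.put(key, value); return this; }
    @SuppressWarnings("unchecked")
    public <T> T get(String key) { return (T) fields.get(key); }
}

interface Pipeline { void process(ResultItems resultItems, Task task); }

// Thread-safe collecting pipeline: per page it only queues the result,
// so no database write (and no console output) happens during the crawl.
class BatchCollectorPipeline implements Pipeline {
    private final ConcurrentLinkedQueue<ResultItems> collected = new ConcurrentLinkedQueue<>();

    @Override
    public void process(ResultItems resultItems, Task task) {
        collected.add(resultItems); // no I/O here
    }

    // Called once, after the spider has finished.
    public List<ResultItems> drain() {
        return new ArrayList<>(collected);
    }
}

public class BatchSaveDemo {
    public static void main(String[] args) {
        BatchCollectorPipeline pipeline = new BatchCollectorPipeline();
        // In real code: Spider.create(processor).addPipeline(pipeline)
        //               .setExitWhenComplete(true).thread(1).run();
        pipeline.process(new ResultItems().put("title", "news-1"), null);
        pipeline.process(new ResultItems().put("title", "news-2"), null);

        // One batch insert here instead of one insert per page.
        List<ResultItems> all = pipeline.drain();
        System.out.println("collected " + all.size() + " results");
    }
}
```

This keeps the per-page `process` call cheap and moves all persistence to a single point after the crawl, at the cost of holding every result in memory until the spider exits.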

lidaoyang commented 7 years ago

Hi, I did configure a pipeline, but even with a process method that does nothing it still prints the page. How can I turn off this page printing?

code4craft commented 7 years ago

Please paste your complete code. I double-checked: 0.7.1 only uses console output when no pipeline is configured.

lidaoyang commented 7 years ago

```java
Spider.create(new CnStockGsjjNewsPageProcessor(news_id, stockMap))
        .addUrl(new String[] { "http://company.cnstock.com/company/scp_gsxw/1" })
        .addPipeline(newsDBPipeline)
        .setExitWhenComplete(true)
        .thread(1).run();
newsDBPipeline.process(newslist, news_contentlist, "News_CnStockGsjj");
newslist = new ArrayList();
news_contentlist = new ArrayList();
```

Below is my `public class NewsDBPipeline implements Pipeline` class. I commented out everything in the middle, and it still prints the page content.

```java
public void process(ResultItems resultItems, Task task) {
    /*
    ArrayList newslist = resultItems.get("newslist");
    ArrayList news_contentlist = resultItems.get("news_contentlist");
    if (newslist != null && newslist.size() > 0) {
        Collections.reverse(newslist);
        Collections.reverse(news_contentlist);
        int ret = saveNews(newslist, news_contentlist);
        if (ret > 0) {
            // fetch the Trigger settings
            JSONObject tr_settings = getTriggerSetting();
            if (!tr_settings.isEmpty()) {
                // update the Trigger information flow
                updateTriggerInforFlowNS(newslist, tr_settings);
            }
            System.out.println("[" + DateUtils.DateToStr(new Date(), "")
                    + "]=================================================NewsDBPipeline fetched (" + ret + ") company news items");
        }
    }
    */
}
```

lidaoyang commented 7 years ago

The page is printed right after the "downloading page success Page..." message.

lidaoyang commented 7 years ago

Hi, does this issue actually exist, or is it something wrong on my end?

lidaoyang commented 7 years ago

I ran another test: the page gets printed whether or not a pipeline is added.

```java
public static void getSpider(NewsCjywDBPipeline newsCjywDBPipeline) {
    Spider.create(new GlobalNewsPageProcessor())
            .addUrl("http://live.sina.com.cn/zt/f/v/finance/globalnews1")
            .addPipeline(newsCjywDBPipeline) // persists the data
            .setExitWhenComplete(true)
            .thread(1).start();
}

public class NewsCjywDBPipeline implements Pipeline // persistence pipeline
```

```java
public void process(ResultItems resultItems, Task task) {
    String type = resultItems.get("type");
    ArrayList newslist = resultItems.get("newslist");
    if (newslist.size() > 0) {
        Collections.reverse(newslist);
        int count = globalnewsMapper.insertBatch(newslist);
        System.out.println("[" + DateUtils.DateToStr(new Date(), "")
                + "]=================================================NewsCjywDBPipeline fetched (" + count + ") finance news items");
    }
}
```

lidaoyang commented 7 years ago

I looked at the downloader source: line 89 of the HttpClientDownloader class has `logger.debug("downloading page success {}", page);`. In the logs printed on my side, the rawText field contains the page source:

```
HttpClientDownloader - downloading page success Page{request=Request{url='http://live.sina.com.cn/zt/f/v/finance/globalnews1', method='null', extras=null, priority=0, headers={}, cookies={}}, resultItems=ResultItems{fields={}, request=Request{url='http://live.sina.com.cn/zt/f/v/finance/globalnews1', method='null', extras=null, priority=0, headers={}, cookies={}}, skip=false}, html=null, json=null, rawText='<!DOCTYPE html>
```

code4craft commented 7 years ago

I see — you mean the log output. That line is logged at debug level, so it will indeed print; you can set the level to info.
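Since webmagic logs through slf4j, the level can be raised for webmagic's loggers only, so the page dump disappears without hiding application logs. A minimal sketch, assuming logback as the slf4j backend (the appender name and pattern are illustrative):

```xml
<!-- logback.xml: silence webmagic's debug output only -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- raise only webmagic's loggers (e.g. HttpClientDownloader) to INFO -->
  <logger name="us.codecraft.webmagic" level="INFO"/>

  <!-- application code keeps its own level -->
  <root level="DEBUG">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

With log4j or another backend the idea is the same: set the `us.codecraft.webmagic` logger category to INFO in that backend's configuration.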

code4craft commented 7 years ago

In version 0.7.2 that debug log has been removed as well.