CrawlScript / WebCollector

WebCollector is an open-source web crawler framework written in Java. It provides simple interfaces for crawling the web, so you can set up a multi-threaded web crawler in less than 5 minutes.
https://github.com/CrawlScript/WebCollector
GNU General Public License v3.0
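For a sense of those interfaces, here is a minimal sketch of a crawler along the lines of the project's BreadthCrawler demos; the package path, seed URL, and thread/depth numbers are illustrative, and method names may differ between WebCollector versions.

    import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

    // Minimal sketch, not an official demo: crawl data is stored under "crawl",
    // auto-parsing of links is enabled.
    public class MyCrawler extends BreadthCrawler {

        public MyCrawler(String crawlPath, boolean autoParse) {
            super(crawlPath, autoParse);
            addSeed("https://github.com/CrawlScript/WebCollector"); // illustrative seed
        }

        @Override
        public void visit(Page page, CrawlDatums next) {
            System.out.println(page.url()); // handle each fetched page here
        }

        public static void main(String[] args) throws Exception {
            MyCrawler crawler = new MyCrawler("crawl", true);
            crawler.setThreads(10);  // multi-threaded fetching
            crawler.start(2);        // crawl for 2 passes (depth)
        }
    }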

Config.MAX_EXECUTE_COUNT is set, but seeds that failed because of timeouts don't seem to be crawled again. What is going on? #80

Closed haixingmu closed 6 years ago

hujunxianligong commented 6 years ago

You can use the utilities that come with the library to read the entries in the Berkeley DB.
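If you prefer to inspect the crawl store directly, a rough sketch using the Berkeley DB JE API is shown below; the environment directory "crawl" and the database name "crawldb" are assumptions about how WebCollector lays out its store, and this is not the library's own inspection tool.

    import com.sleepycat.je.*;
    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class CrawlDbDump {
        public static void main(String[] args) {
            // Assumption: the crawler was started with crawl path "crawl" and the
            // entries live in a database named "crawldb"; adjust both as needed.
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setReadOnly(true);
            Environment env = new Environment(new File("crawl"), envConfig);
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setReadOnly(true);
            Database db = env.openDatabase(null, "crawldb", dbConfig);

            Cursor cursor = db.openCursor(null, null);
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                // Keys are printed as UTF-8; the value encoding is WebCollector-internal.
                System.out.println(new String(key.getData(), StandardCharsets.UTF_8));
            }
            cursor.close();
            db.close();
            env.close();
        }
    }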


hujunxianligong commented 6 years ago

Setting aside the producer-consumer question for a moment: if the request did not succeed, WC will not set the CrawlDatum's status in the DB to success. Only a CrawlDatum that completes the whole pipeline (request and parsing) successfully is marked as successful.
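In pseudocode terms, the rule described here looks roughly like the following; this is a simplified sketch, not the actual WebCollector execution path, and the helpers request, parse and saveBack are hypothetical.

    // Simplified sketch of the status rule described above (hypothetical helpers).
    void execute(CrawlDatum datum) {
        try {
            Page page = request(datum);                      // HTTP request
            parse(page);                                     // link extraction / visit()
            datum.setStatus(CrawlDatum.STATUS_DB_SUCCESS);   // only after the full pipeline succeeds
        } catch (Exception e) {
            datum.setStatus(CrawlDatum.STATUS_DB_FAILED);    // timeout, network or parse error
        }
        saveBack(datum);                                     // write the status back to the crawl DB
    }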


I have read through the code. QueueFeeder and FetcherThread use a producer-consumer pattern: QueueFeeder pulls seeds whose status is not STATUS_DB_SUCCESS from the DB once a second, and FetcherThread keeps taking seeds from fetchQueue and requesting them. QueueFeeder feeds up to 1000 seeds from the DB at a time, but suppose the task passes in fewer than 1000 seeds in total and many of them fail during the request:

    // QueueFeeder main loop
    while (hasMore && running) {
        int feed = size - queue.getSize();
        if (feed <= 0) {
            // queue is full enough: sleep a second, then check again
            try {
                Thread.sleep(1000);
            } catch (InterruptedException ex) {
            }
            continue;
        }
        while (feed > 0 && hasMore && running) {
            CrawlDatum datum = generator.next();
            // once the generator is exhausted, hasMore stays false and the feeder exits
            hasMore = (datum != null);
            if (hasMore) {
                queue.addFetchItem(new FetchItem(datum));
                feed--;
            }
        }
    }

At that point the feeder thread does not even go to sleep; it simply exits. Seeds that fail afterwards are never added back to the queue, so those failed seeds are never requested again. In my tests I found that the line generator = dbManager.createGenerator(generatorFilter); is executed only once.


haixingmu commented 6 years ago

I know that. The seeds that get pulled are the unexecuted ones and the ones whose requests failed, but I have tested this many times with the following filter:

    @Override
    public CrawlDatum filter(CrawlDatum datum) {
        if (datum.getStatus() == CrawlDatum.STATUS_DB_SUCCESS) {
            return null;
        } else {
            if (datum.getStatus() == CrawlDatum.STATUS_DB_UNEXECUTED) {
                LogUtils.writeLogo("extracting unexecuted seed: " + CrawlDatumFormater.datumToJsonStr(datum),
                        "crawlDatum/unexecutedSeed.txt");
            } else if (datum.getStatus() == CrawlDatum.STATUS_DB_FAILED) {
                LogUtils.writeLogo("extracting failed seed: " + CrawlDatumFormater.datumToJsonStr(datum),
                        "crawlDatum/failedSeed.txt");
            }
            return datum;
        }
    }

LogUtils is a file-logging utility class I wrote, and failedSeed.txt never appears. I think there is a problem with your producer-consumer implementation, and frankly I don't see the need for the pattern at all: isn't the seed-processing speed mainly determined by the number of threads you configure and the crawl interval? Why add an extra step? Couldn't you simply process all of a task's seeds with multiple threads directly?

hujunxianligong commented 6 years ago

First, to answer why the producer-consumer pattern is used: when you have data on the order of hundreds of millions of records, if you do not use a producer-consumer setup to keep the queue at only 1000 elements, the queue can blow up your memory.
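As an illustration of that memory argument, here is a minimal sketch using a plain java.util.concurrent bounded queue instead of WebCollector's own QueueFeeder/FetcherThread classes; the seed count and URLs are made up.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BoundedFeederSketch {
        public static void main(String[] args) {
            // The queue never holds more than 1000 seeds, no matter how many exist in the DB.
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

            Thread feeder = new Thread(() -> {
                // Stand-in for reading seeds from the crawl DB.
                for (long i = 0; i < 100_000_000L; i++) {
                    try {
                        queue.put("http://example.com/page/" + i); // blocks while the queue is full
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });

            Thread fetcher = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take(); // stand-in for fetching and parsing the URL
                        System.out.println("fetching " + url);
                    }
                } catch (InterruptedException e) {
                    // stopped
                }
            });

            feeder.start();
            fetcher.start();
        }
    }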


haixingmu commented 6 years ago

I did not add any filter. In your code, the Crawler class passes this by default:

    protected GeneratorFilter generatorFilter = new StatusGeneratorFilter();

and the filter method of the StatusGeneratorFilter class

    public class StatusGeneratorFilter extends DefaultConfigured implements GeneratorFilter {
        @Override
        public CrawlDatum filter(CrawlDatum datum) {
            if (datum.getStatus() == CrawlDatum.STATUS_DB_SUCCESS) {
                return null;
            } else {
                if (datum.getStatus() == CrawlDatum.STATUS_DB_UNEXECUTED) {
                    LogUtils.writeLogo("extracting unexecuted seed: " + CrawlDatumFormater.datumToJsonStr(datum),
                            "crawlDatum/unexecutedSeed.txt");
                } else if (datum.getStatus() == CrawlDatum.STATUS_DB_FAILED) {
                    LogUtils.writeLogo("extracting failed seed: " + CrawlDatumFormater.datumToJsonStr(datum),
                            "crawlDatum/failedSeed.txt");
                }
                return datum;
            }
        }
    }

by default extracts both the unexecuted seeds and the seeds whose requests failed (the LogUtils calls are the logging I added), yet the seeds whose requests failed are never executed.

hujunxianligong commented 6 years ago

The depth setting is too small, so the crawl finishes before the failed seeds get retried.
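In other words, the retries allowed by Config.MAX_EXECUTE_COUNT can only happen in later passes, so start(depth) has to leave room for them. A rough illustration, assuming the BreadthCrawler-style API from the project's demos; the numbers and the MyCrawler class are placeholders, and whether MAX_EXECUTE_COUNT is set as a static field or via a setter depends on the WebCollector version.

    // Illustrative only: leave enough passes for failed seeds to be re-generated.
    Config.MAX_EXECUTE_COUNT = 3;          // as referenced in the issue title; allow a few retries per seed
    MyCrawler crawler = new MyCrawler("crawl", true);
    crawler.setThreads(20);                // fetcher threads
    crawler.start(5);                      // with depth 1 the crawl ends before any retry can run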