ericvana / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

I can't crawl all the pages of a site #130

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I can't crawl an entire site with the default configuration

What is the expected output? What do you see instead?
I need to crawl all pages of a site, but only 60 of them are crawled.

Please provide any additional information below.

I use this code to crawl one domain:

            String domain = "http://foo.com";

            CrawlConfiguration crawlConfig = new CrawlConfiguration();
            crawlConfig.CrawlTimeoutSeconds = 100;
            crawlConfig.MaxConcurrentThreads = 10;
            crawlConfig.MaxPagesToCrawl = 0;
            crawlConfig.MaxPagesToCrawlPerDomain = 0;
            crawlConfig.MaxCrawlDepth = 1000;        
            crawlConfig.UserAgentString = firefoxAgent;

            MyCrawler crawler = new MyCrawler(crawlConfig);

            CrawlResult result = crawler.Crawl(new Uri(domain));

MyCrawler code:

public class MyCrawler : PoliteWebCrawler
{

    public MyCrawler(CrawlConfiguration crawlConfig)
    : base(crawlConfig, null, null, null, null, null, null, null, null)
    {
        PageCrawlCompleted += ProcessPageCrawlCompleted;
    }

    private void ProcessPageCrawlCompleted(object sender,
                                               PageCrawlCompletedArgs e)
    {
        CrawledPage crawledPage = e.CrawledPage;

        // File.AppendAllLines takes a path plus a collection of lines;
        // AppendAllText matches this usage ("crawl.log" is an illustrative path)
        System.IO.File.AppendAllText("crawl.log",
            crawledPage.Uri.AbsoluteUri + " processed." + Environment.NewLine);
    }
}
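
One thing to watch: with MaxConcurrentThreads = 10 this handler can fire on several threads at once, and concurrent appends to the same file can collide. A minimal guard, as a sketch (the lock object and log path are illustrative, not part of Abot):

    private static readonly object logLock = new object();

    private void ProcessPageCrawlCompleted(object sender,
                                               PageCrawlCompletedArgs e)
    {
        // Serialize file access across Abot's worker threads
        lock (logLock)
        {
            System.IO.File.AppendAllText("crawl.log",
                e.CrawledPage.Uri.AbsoluteUri + " processed." + Environment.NewLine);
        }
    }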

Not all of the pages are processed, and I don't know why.

Original issue reported on code.google.com by DLopezGo...@gmail.com on 29 Sep 2014 at 10:58

GoogleCodeExporter commented 9 years ago
This could be a lot of things. You will need to look through the logs to
see which pages are being found.
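
Abot also raises events that report why a page or its links were skipped, which is often the quickest way to see what is being filtered out. A minimal sketch using the PageCrawlDisallowed and PageLinksCrawlDisallowed events:

    crawler.PageCrawlDisallowed += (sender, e) =>
    {
        // e.DisallowedReason explains why the page itself was not crawled
        Console.WriteLine("Page {0} skipped: {1}",
            e.PageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
    };

    crawler.PageLinksCrawlDisallowed += (sender, e) =>
    {
        // e.DisallowedReason explains why links on a crawled page were not followed
        Console.WriteLine("Links on {0} skipped: {1}",
            e.CrawledPage.Uri.AbsoluteUri, e.DisallowedReason);
    };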

Steven

Original comment by sjdir...@gmail.com on 29 Sep 2014 at 1:55

GoogleCodeExporter commented 9 years ago
Hello

I've solved the issue. It seems that if you use the configuration object rather
than the configuration file, you need to set all the parameters in code. I took
the example from your quick start guide.

I needed to set these lines to get the crawler rolling:

            CrawlConfiguration crawlConfig = new CrawlConfiguration();
            crawlConfig.CrawlTimeoutSeconds = 0;
            crawlConfig.MaxConcurrentThreads = max_threads;
            crawlConfig.MaxPagesToCrawl = 0;
            crawlConfig.MaxPagesToCrawlPerDomain = 0;
            crawlConfig.MaxCrawlDepth = 1000;
            crawlConfig.UserAgentString = agent;
            crawlConfig.MaxPageSizeInBytes = 0;
            crawlConfig.DownloadableContentTypes = "text/html, text/plain";
            crawlConfig.IsUriRecrawlingEnabled = false;
            crawlConfig.IsExternalPageCrawlingEnabled = false;
            crawlConfig.IsExternalPageLinksCrawlingEnabled = false;
            crawlConfig.HttpServicePointConnectionLimit = 200;
            crawlConfig.HttpRequestTimeoutInSeconds = 15;
            crawlConfig.HttpRequestMaxAutoRedirects = 7;
            crawlConfig.IsHttpRequestAutoRedirectsEnabled = true;
            crawlConfig.IsHttpRequestAutomaticDecompressionEnabled = false;
            crawlConfig.MinAvailableMemoryRequiredInMb = 0;
            crawlConfig.MaxMemoryUsageInMb = 256;
            crawlConfig.MaxMemoryUsageCacheTimeInSeconds = 0;
            crawlConfig.IsForcedLinkParsingEnabled = true;
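
As an alternative to setting every property by hand, Abot can also build the same object from the abot section in app.config. A sketch, assuming that config section is present:

            CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();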

Congratulations on your crawler; it's an awesome piece of code. Thank you very
much.

Original comment by DLopezGo...@gmail.com on 30 Sep 2014 at 9:23