Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Getting service unavailable for many urls #713

Closed: Akhilabala closed this issue 3 years ago

Akhilabala commented 4 years ago

Hi, I'm getting a 503 error for most of the URLs while crawling the website, but if I load the same URLs in a browser they work fine. Any idea what the issue could be?

essiembre commented 4 years ago

The easiest may be to check your web server logs for the cause of the rejection. If that is not possible or does not help, I suggest you check the HTTP request headers your browser sends to the page and have the crawler supply the same ones. Those headers should look like those shown here. The config snippet to use is this one:

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  ...
  <headers>
    <header name="(header name)">(header value)</header>
    <!-- You can repeat this header tag as needed. -->
  </headers>
  ...
</httpClientFactory>
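
For example, if you want to mimic a typical browser request, a filled-in version could look like the snippet below. The header values are illustrative only; copy the actual ones your browser sends (e.g., from your browser's developer tools network tab):

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  ...
  <headers>
    <!-- Illustrative values; replace with the headers your browser actually sends. -->
    <header name="User-Agent">Mozilla/5.0 (Windows NT 10.0; Win64; x64)</header>
    <header name="Accept">text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</header>
    <header name="Accept-Language">en-US,en;q=0.5</header>
  </headers>
  ...
</httpClientFactory>
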
Akhilabala commented 4 years ago

Hi Pascal, I tried passing the same HTTP request headers, but I am still getting the error below for many URLs: (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=503, reasonPhrase=Service Unavailable])

essiembre commented 4 years ago

Can you share your config? How "aggressive" are you with your crawling? That error can mean the site cannot cope with many requests in a short time. Maybe increase the default "delay" in your config and reduce the number of threads and/or crawl during less busy periods.

Given this error is coming from the website, you may want to contact the site owner to report it.
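
As a sketch only (assuming the 2.x GenericDelayResolver and its documented schedule syntax), a gentler setup could keep a larger default delay and slow down further during busy hours:

<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="3000">
  <!-- Assumed schedule syntax: wait even longer during business hours (values in milliseconds). -->
  <schedule dayOfWeek="from Monday to Friday" time="from 8:00 to 18:00">10000</schedule>
</delay>

Pair that with a lower <numThreads> value in the crawler section.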

Akhilabala commented 4 years ago

Hi Pascal. Please find below the config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">
  #set($core = "com.norconex.collector.core")
  #set($http = "com.norconex.collector.http")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
   #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
   #set($metaChecksummer   = "${http}.checksum.impl.LastModifiedMetadataChecksummer")
   #set($metaFetcher       = "${http}.fetch.impl.GenericMetadataFetcher")
   #set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
#set($robotsTxt = "${http}.robot.impl.StandardRobotsTxtProvider")

  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>
  <maxParallelCrawlers>-1</maxParallelCrawlers>

  <crawlerDefaults>
      <referenceFilters>
        <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg,msg</filter>
        </referenceFilters>
        <robotsTxt class="$robotsTxt" ignore="false"/>
      <metadataFetcher class="$metaFetcher">
      <validStatusCodes>200</validStatusCodes> 
      </metadataFetcher>
      <metadataChecksummer disabled="false" keep="false" targetField="collector.checksum-metadata" class="$metaChecksummer" />
  </crawlerDefaults>
  <crawlers>
    <crawler id="Norconex Minimum Test Page">   
     <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">

</httpClientFactory>
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <sitemap>sitemapurl</sitemap>
      </startURLs>
      <workDir>./examples-output/minimum</workDir>
      <maxDepth>10</maxDepth>    
       <numThreads>10</numThreads>
       <keepOutOfScopeLinks>false</keepOutOfScopeLinks>
       <orphansStrategy>PROCESS</orphansStrategy>
      <sitemapResolverFactory class="$sitemapFactory" ignore="false" lenient="true">
      </sitemapResolverFactory>
      <delay default="50" />
      <crawlerListeners>
    <listener  
        class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
       <statusCodes>100-199,201-599</statusCodes>
       <outputDir>./indexing/</outputDir>
       <fileNamePrefix>brokenLinks</fileNamePrefix>
     </listener>
</crawlerListeners>
<crawlerListeners>
    <listener  
        class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
       <statusCodes>200</statusCodes>
       <outputDir>./indexing/</outputDir>
       <fileNamePrefix>crawled</fileNamePrefix>
     </listener>
</crawlerListeners>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <!-- <fields>title,keywords,description</fields> -->
            <!-- <copy fromField="Contact"   toField="contact" overwrite="true" /> -->
          </tagger>

        </postParseHandlers>
      </importer> 

      <delay default="50" />
       <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>nodeurl</nodes>
        <indexName>indexnam</indexName>

      </committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 4 years ago

As a test, I would try with a single thread and a delay of 3 seconds. If you no longer get the server error, it suggests the site can only handle so much. If you still get it, contact the site owner so they can investigate why the server could not respond at specific times (e.g., they can check their server logs).
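
For reference, that test would amount to something like this in the crawler section (the delay value is in milliseconds, so 3 seconds is 3000; your current config uses 50, i.e., only 50 ms between requests):

<numThreads>1</numThreads>
<delay default="3000" />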