The easiest approach may be to check your web server logs for the cause of the rejection. If that is not possible or does not help, I suggest you inspect the HTTP request your browser sends to the page and have the crawler send the same headers. Those headers should look like the ones shown here. The config snippet to use is this one:
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  ...
  <headers>
    <header name="(header name)">(header value)</header>
    <!-- You can repeat this header tag as needed. -->
  </headers>
  ...
</httpClientFactory>
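For instance, here is a minimal sketch assuming you copied the request headers a typical browser sends (the values below are hypothetical placeholders; take the real ones from your own browser's developer tools):

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <headers>
    <!-- Hypothetical values; replace with the actual headers your browser sends. -->
    <header name="User-Agent">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36</header>
    <header name="Accept">text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</header>
    <header name="Accept-Language">en-US,en;q=0.9</header>
  </headers>
</httpClientFactory>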
Hi Pascal, I tried passing the same HTTP request headers, but I am still getting the error below for many URLs. (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=503, reasonPhrase=Service Unavailable])
Can you share your config? How "aggressive" are you with your crawling? That error can mean the site cannot cope with many requests in a short time. Maybe increase the default "delay" in your config and reduce the number of threads and/or crawl during less busy periods.
Given this error is coming from the website itself, you may also want to contact the site owner to report it.
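As a sketch of what "less aggressive" settings could look like in the crawler config (the values below are placeholders to tune for your site, not recommendations):

<!-- Hypothetical values: fewer threads and a longer delay between requests. -->
<numThreads>2</numThreads>
<delay default="1000" />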
Hi Pascal. Please find the config below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">

  #set($core = "com.norconex.collector.core")
  #set($http = "com.norconex.collector.http")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
  #set($metaChecksummer = "${http}.checksum.impl.LastModifiedMetadataChecksummer")
  #set($metaFetcher = "${http}.fetch.impl.GenericMetadataFetcher")
  #set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
  #set($robotsTxt = "${http}.robot.impl.StandardRobotsTxtProvider")

  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>
  <maxParallelCrawlers>-1</maxParallelCrawlers>

  <crawlerDefaults>
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg,msg</filter>
    </referenceFilters>
    <robotsTxt class="$robotsTxt" ignore="false"/>
    <metadataFetcher class="$metaFetcher">
      <validStatusCodes>200</validStatusCodes>
    </metadataFetcher>
    <metadataChecksummer disabled="false" keep="false" targetField="collector.checksum-metadata" class="$metaChecksummer" />
  </crawlerDefaults>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      </httpClientFactory>
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <sitemap>sitemapurl</sitemap>
      </startURLs>
      <workDir>./examples-output/minimum</workDir>
      <maxDepth>10</maxDepth>
      <numThreads>10</numThreads>
      <keepOutOfScopeLinks>false</keepOutOfScopeLinks>
      <orphansStrategy>PROCESS</orphansStrategy>
      <sitemapResolverFactory class="$sitemapFactory" ignore="false" lenient="true">
      </sitemapResolverFactory>
      <delay default="50" />
      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>100-199,201-599</statusCodes>
          <outputDir>./indexing/</outputDir>
          <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
      </crawlerListeners>
      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>200</statusCodes>
          <outputDir>./indexing/</outputDir>
          <fileNamePrefix>crawled</fileNamePrefix>
        </listener>
      </crawlerListeners>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <!-- <fields>title,keywords,description</fields> -->
            <!-- <copy fromField="Contact" toField="contact" overwrite="true" /> -->
          </tagger>
        </postParseHandlers>
      </importer>
      <delay default="50" />
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>nodeurl</nodes>
        <indexName>indexnam</indexName>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
As a test, I would try with a single thread and a delay of 3 seconds. If you no longer get the server error, it suggests the site can only handle so much traffic. If you still get it, contact the site owner so they can investigate why the server could not respond at specific times (e.g., by checking their server logs).
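A sketch of that test in the crawler config, assuming the default delay value is expressed in milliseconds (so 3 seconds is 3000):

<!-- Test settings: one thread, 3-second delay between requests. -->
<numThreads>1</numThreads>
<delay default="3000" />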
Hi, I'm getting a 503 error for most of the URLs while crawling the website, but if I load the URLs in a browser they work fine. Any idea what the issue could be?