Open akumar251 opened 5 years ago
Collector Config file:
<httpcollector id="Collector1">
#set($http = "com.norconex.collector.http")
#set($core = "com.norconex.collector.core")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")
<crawlerDefaults>
<urlNormalizer class="$urlNormalizer" />
<numThreads>4</numThreads>
<maxDepth>1</maxDepth>
<maxDocuments>-1</maxDocuments>
<workDir>./norconexcollector</workDir>
<orphansStrategy>DELETE</orphansStrategy>
<delay default="0" />
<sitemapResolverFactory ignore="false" />
<robotsTxt ignore="true" />
<referenceFilters>
<filter class="$filterExtension" onMatch="exclude">jpg,jpeg,svg,gif,png,ico,css,js,xlsx,pdf,zip,xml</filter>
</referenceFilters>
</crawlerDefaults>
<crawlers>
<crawler id="CrawlerID">
<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
<sitemap>https://*******.com/sitemap.xml</sitemap>
</startURLs>
<importer>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>document.reference,title,description,content</fields>
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
<rename fromField="document.reference" toField="reference"/>
</tagger>
<transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
<!-- carriage return -->
<reduce>\r</reduce>
<!-- new line -->
<reduce>\n</reduce>
<!-- tab -->
<reduce>\t</reduce>
<!-- whitespaces -->
<reduce>\s</reduce>
</transformer>
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
<replace>
<fromValue>\n</fromValue>
<toValue></toValue>
</replace>
<replace>
<fromValue>\t</fromValue>
<toValue></toValue>
</replace>
</transformer>
</postParseHandlers>
</importer>
<!-- Azure committer setting -->
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
<endpoint>********</endpoint>
<apiKey>***********</apiKey>
<indexName>**********</indexName>
<maxRetries>3</maxRetries>
<targetContentField>content</targetContentField>
<queueDir>./queuedir</queueDir>
<queueSize>6000</queueSize>
</committer>
</crawler>
</crawlers>
</httpcollector>
This error is coming from Azure. Is it possible you have large documents? Online research suggests this error occurs when the uploaded request is too large. I would suggest you try adding
You can find many Azure/IIS users having this problem, and the upload limit seems to be configurable. For instance, this Microsoft thread gives you a few options: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/d729a842-8ed9-466e-9ba8-4256ea294548/http11-413-request-entity-too-large?forum=biztalkgeneral
An excerpt:
Check the IIS request Filtering and set the Maximum allowed content length to higher value. Also there is a setting present in the IIS, "UploadReadAheadSize", that prevents upload and download of data greater than 49KB. The value present by default is 49152 bytes and can be increased up to 4 GB.
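For reference, on a self-hosted IIS site the two settings mentioned in that excerpt are normally set in web.config. The fragment below is only a sketch: the byte values are illustrative placeholders, the serverRuntime section may need to be unlocked at the server level before it can be overridden, and none of this is likely to be editable on a managed Azure Search endpoint.

```xml
<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <!-- Maximum allowed content length, in bytes (illustrative: 100 MB) -->
        <requestLimits maxAllowedContentLength="104857600" />
      </requestFiltering>
    </security>
    <!-- UploadReadAheadSize, in bytes; the IIS default is 49152 (illustrative: 1 MB) -->
    <serverRuntime uploadReadAheadSize="1048576" />
  </system.webServer>
</configuration>
```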
Hopefully this can give you a few pointers, else, you will have to ask Azure support for how to increase the limit.
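If the limit turns out to apply to each commit request rather than to the index as a whole, one thing to try on the collector side is sending fewer documents per request. The fragment below is a minimal sketch, assuming your version of the AzureSearchCommitter supports the commitBatchSize option inherited from the Norconex committer base classes (please check the committer documentation for your version). The value 20 is purely illustrative.

```xml
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>********</endpoint>
  <apiKey>***********</apiKey>
  <indexName>**********</indexName>
  <maxRetries>3</maxRetries>
  <targetContentField>content</targetContentField>
  <!-- Assumption: fewer documents per batch means a smaller HTTP payload per request -->
  <commitBatchSize>20</commitBatchSize>
  <queueDir>./queuedir</queueDir>
  <queueSize>6000</queueSize>
</committer>
```

Note that a smaller batch only helps if the request is too large because of the number of documents in it. If a single document exceeds the limit on its own, its content would have to be reduced before committing.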
Hello @essiembre ,
Thanks for the quick update. I think the issue is related to size: I have many sitemap files that upload successfully, and according to the log files the largest one that uploaded is 72 MB, while the file causing the issue is larger than 90 MB.
I will check the setting and ask Azure support if that does not solve it.
Br, Akash
Hi, I am trying to crawl a sitemap XML file that contains a large number of URLs. The crawl reaches 100% completed (563 processed/563 total), but I get an error when committing to Azure. I have tried running Norconex many times. The command being used: collector-http.bat -a start -c collectorconfig.xml
Please find below the error details from the logs:
Can you please advise what needs to be done to resolve this?
Br, Akash