Norconex / committer-azuresearch

Implementation of Norconex Committer for Microsoft Azure Search.
https://opensource.norconex.com/committers/azuresearch/
Apache License 2.0

ERROR - Could not commit batched operations #4

Open · akumar251 opened 5 years ago

akumar251 commented 5 years ago

Hi, I am trying to crawl a sitemap XML file that includes a bulk list of URLs. The crawl reaches 100% completed (563 processed/563 total), but I get an error when committing to Azure. I have tried running Norconex many times; the command being used is: collector-http.bat -a start -c collectorconfig.xml

Please find below the error details from the logs:

Crawler : 2018-12-25 23:34:04 INFO - Azure Search REST API Http Client closed.
Crawler : 2018-12-25 23:34:04 INFO - Azure Search REST API Http Client closed.
Crawler : 2018-12-25 23:34:04 ERROR - Could not commit batched operations.
com.norconex.committer.core.CommitterException: Invalid HTTP response: "HTTP/1.1 413 Request Entity Too Large". Azure Response: The page was not displayed because the request entity is too large.
    at com.norconex.committer.azuresearch.AzureSearchCommitter.handleResponse(AzureSearchCommitter.java:509)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:478)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:405)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)

Can you please advise what needs to be done to resolve this?

Br, Akash

akumar251 commented 5 years ago

Collector Config file:

<httpcollector id="Collector1">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

  <crawlerDefaults>

    <urlNormalizer class="$urlNormalizer" />

    <numThreads>4</numThreads>
    <maxDepth>1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <workDir>./norconexcollector</workDir>
    <orphansStrategy>DELETE</orphansStrategy>

    <delay default="0" />
    <sitemapResolverFactory ignore="false" />
    <robotsTxt ignore="true" />
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,jpeg,svg,gif,png,ico,css,js,xlsx,pdf,zip,xml</filter>
    </referenceFilters>
  </crawlerDefaults>
  <crawlers>
    <crawler id="CrawlerID">
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <sitemap>https://*******.com/sitemap.xml</sitemap>
      </startURLs>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference,title,description,content</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.reference" toField="reference" />
          </tagger>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
            <!-- carriage return -->
            <reduce>\r</reduce>
            <!-- new line -->
            <reduce>\n</reduce>
            <!-- tab -->
            <reduce>\t</reduce>
            <!-- whitespaces -->
            <reduce>\s</reduce>
          </transformer>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
            <replace>
              <fromValue>\n</fromValue>
              <toValue></toValue>
            </replace>
            <replace>
              <fromValue>\t</fromValue>
              <toValue></toValue>
            </replace>
          </transformer>
        </postParseHandlers>
      </importer>
      <!-- Azure committer settings -->
      <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
        <endpoint>********</endpoint>
        <apiKey>***********</apiKey>
        <indexName>**********</indexName>
        <maxRetries>3</maxRetries>
        <targetContentField>content</targetContentField>
        <queueDir>./queuedir</queueDir>
        <queueSize>6000</queueSize>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

essiembre commented 5 years ago

This error is coming from Azure. Is it possible you have large documents? Online research suggests you get this when uploading something too big. I would suggest you try adding a commitBatchSize of 10 (or lower) to your committer to see if it makes a difference (the default is 100).
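
For example, here is a minimal sketch of the committer block from the config above with that setting added (masked values kept as in the original; commitBatchSize is the batch-size option on Norconex batch committers, so fewer documents are sent per request):

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>********</endpoint>
  <apiKey>***********</apiKey>
  <indexName>**********</indexName>
  <maxRetries>3</maxRetries>
  <targetContentField>content</targetContentField>
  <queueDir>./queuedir</queueDir>
  <queueSize>6000</queueSize>
  <!-- Send at most 10 documents per commit request instead of the
       default 100, keeping each request body smaller -->
  <commitBatchSize>10</commitBatchSize>
</committer>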

You can find many Azure/IIS users having this problem, and the upload limit seems configurable. For instance, this Microsoft thread gives you a few options: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/d729a842-8ed9-466e-9ba8-4256ea294548/http11-413-request-entity-too-large?forum=biztalkgeneral

An excerpt:

Check the IIS Request Filtering and set the Maximum allowed content length to a higher value. Also, there is a setting present in IIS – "UploadReadAheadSize" – that prevents upload and download of data greater than 49 KB. The value present by default is 49152 bytes and can be increased up to 4 GB.
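
For reference, both settings from that excerpt live under system.webServer in an IIS web.config. A sketch with illustrative byte values (whether these can be applied to a managed Azure endpoint is something Azure support would have to confirm):

<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <!-- "Maximum allowed content length" from the excerpt,
             in bytes (illustrative value: 100 MB) -->
        <requestLimits maxAllowedContentLength="104857600" />
      </requestFiltering>
    </security>
    <!-- "UploadReadAheadSize" from the excerpt
         (default 49152 bytes; illustrative value: 10 MB) -->
    <serverRuntime uploadReadAheadSize="10485760" />
  </system.webServer>
</configuration>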

Hopefully this gives you a few pointers; otherwise, you will have to ask Azure support how to increase the limit.

akumar251 commented 5 years ago

Hello @essiembre ,

Thanks for the quick update. I guess the issue is with size, because I have many sitemap files that get uploaded successfully. According to the log files, the largest one that uploaded is 72 MB, while the file causing the issue is more than 90 MB.

I will check the setting and will ask Azure support if that does not solve it.

Br, Akash