Unable to commit more than 200documents in one execution of norconex collector command. Norconex collector doesn’t react after 200 documents.

avi7777 commented 6 years ago

I am trying to crawl sitemap xml file which includes bulk urls and commit the documents to azure service. There will be more than 400 documents getting stored in the committer-queue directory. Norconex azure committer is only able to commit 200 documents to azure service in one go when the command executes, it doesn't react or throw any error once it tries to send the next set of documents.

command being used: collector-http.bat -a start -c collectorconfig.xml

As per azure committer configuration , By default, batch size will be 100. It succeeds sending first 200 commit operations.( 100 operations per batch) , but it doesn't react thereafter when it tries to commit the next set of 100 operations. To send the remaining files, i need to terminate the current job and rerun it again to commit next 200 documents to service & so on.. Help to find out the root cause, or any limitation on number of documents to be committed,

Committer setting : using the basic committer setting with committer class , endpoint ,apikey, indexname tags.

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter"> <endpoint>...</endpoint> <apiKey>...</apiKey> <indexName>...</indexName> <maxRetries>-1</maxRetries> <maxRetryWait>5</maxRetryWait> </committer>

essiembre commented 6 years ago

Do you have any errors/warnings showing in the HTTP Collector logs, or Azure logs? Are your documents relatively large? It may be having memory issues with batches that are too big. I suggest you try with a smaller batch size.

Also, do you have several committers possibly pointing to the same queudir. I recommend you explicitly put a unique path to each committer under the <queueDir>...</queueDir> setting.

avi7777 commented 6 years ago

There are no errors in the logs regarding the above issue. As you mentioned, i have reduced the batch size to 50 , also updated the queuedir & tried executing the job, i was able to get the similar issue, this time committer succeeded in sending first 2 sets of batch operations 1.e., total of 100 documents were committed and then it went to halt state(same case when tried with batchsize of 100).

Collector Config file: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE xml>

<httpcollector id="Collector1">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

  <crawlerDefaults>

    <urlNormalizer class="$urlNormalizer" />

    <numThreads>4</numThreads>  
    <maxDepth>1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <workDir>./norconexcollector</workDir>
    <orphansStrategy>DELETE</orphansStrategy>

    <delay default="0" />   
   <sitemapResolverFactory ignore="false" />      
    <robotsTxt ignore="true" /> 
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,jepg,svg,gif,png,ico,css,js,xlsx,pdf,zip,xml</filter>
    </referenceFilters>
  </crawlerDefaults>
  <crawlers>
    <crawler id="CrawlerID"> 
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">     
    <sitemap>https://*******.com/sitemap.xml</sitemap>
       </startURLs>
      <importer>
      <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference,title,description,content</fields>
          </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.reference" toField="reference"/>          
          </tagger>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
            <!-- carriage return -->
            <reduce>\r</reduce>
            <!-- new line -->
            <reduce>\n</reduce>
            <!-- tab -->
            <reduce>\t</reduce>
            <!-- whitespaces -->
            <reduce>\s</reduce>
          </transformer>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
            <replace>
                <fromValue>\n</fromValue>
                <toValue></toValue>
            </replace>
            <replace>
                <fromValue>\t</fromValue>
                <toValue></toValue>
            </replace>
          </transformer>
        </postParseHandlers>
      </importer>   
     <!-- Azure committer setting -->
     <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
       <endpoint>********</endpoint>
       <apiKey>***********</apiKey>
       <indexName>**********</indexName>
    <commitBatchSize>50</commitBatchSize>
    <queueDir>./queuedir</queueDir>
     </committer>
    </crawler>
  </crawlers>
</httpcollector>

Can you also update me is there any way to trace , how the json format is farmed for our documents and being sent to azure, since i am unable to trace the request through fiddler trace.

essiembre commented 6 years ago

I was able to reproduce. The problem was a bug in the Azure Committer where the HTTP connections were not always released properly.

A fix was released in a new Azure Committer snapshot. Please try it and confirm.

To answer other question about getting the JSON construct, you can add this to log4j:

log4j.logger.com.norconex.committer.azuresearch=TRACE

If you have more issues specific to the Azure Committer, please use https://github.com/norconex/committer-azuresearch/issues to report them.

avi7777 commented 6 years ago

Hi , thanks a lot for the fix. yes, it fixed my issue and now i am able to commit huge number of documents to azure index in one execution itself. and the log trace was also worked fine. I was able to trace down the JSON construct in the logs. thanks for your great support.:)

essiembre commented 6 years ago

Thanks for confirming. Version 1.1.1 stable of Azure Search Committer has just been released.

Norconex / crawlers

Unable to commit more than 200documents in one execution of norconex collector command. Norconex collector doesn’t react after 200 documents. #442