Norconex / committer-azuresearch

Implementation of Norconex Committer for Microsoft Azure Search.
https://opensource.norconex.com/committers/azuresearch/
Apache License 2.0

documents not getting deleted at azure index #2

Open avi7777 opened 6 years ago

avi7777 commented 6 years ago

Hi,

1) We have a scheduled task that re-runs the Norconex job every week. Each week, several URLs are removed from the sitemap.xml file and several new URLs are added to it. However, the removed URLs (documents) are not getting deleted from the index when we recrawl the sitemap.xml; they still show up in the index and on the search results page.

2) We have also observed that, when recrawling the sitemap.xml every week, URLs with updated content are not getting committed to the index on a direct job run. To pick up the updates, we have to delete the workdir folders and run the job from scratch; only then do we see our latest content changes in the search result header/description. Can we handle this without having to delete the work directories and queue directories again and again?

essiembre commented 6 years ago

Not sure why you are getting this. Have you set the orphansStrategy to IGNORE (in your collector config)? Can you share your config?
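For context, that setting is a crawler-level (or crawlerDefaults-level) element. A minimal sketch of the values it accepts in HTTP Collector 2.x; DELETE is what triggers deletion requests for URLs that disappear from the crawl:

  <crawlerDefaults>
    <!-- PROCESS: re-process orphan URLs no longer found.
         IGNORE: leave orphans untouched in the index.
         DELETE: send deletion requests for them to the committer. -->
    <orphansStrategy>DELETE</orphansStrategy>
  </crawlerDefaults>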

avi7777 commented 6 years ago

Hi, Please find the config info:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="Test Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")
  <progressDir>./progress</progressDir>
  <logsDir>./logs</logsDir>

  <crawlerDefaults>

    <urlNormalizer class="$urlNormalizer" />   
    <numThreads>4</numThreads>    
    <maxDepth>0</maxDepth>   
    <maxDocuments>-1</maxDocuments>
    <workDir>./workdir</workDir>
    <orphansStrategy>DELETE</orphansStrategy>
    <!-- 1s delay between crawling of the pages -->
    <delay default="1000" />

   <sitemapResolverFactory ignore="false" />

    <robotsTxt ignore="true" />

    <referenceFilters>      
      <filter class="$filterExtension" onMatch="exclude">jpg,jpeg,svg,gif,png,ico,css,js,xlsx,pdf,zip,xml</filter>
    </referenceFilters>
    <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <trustAllSSLCertificates>true</trustAllSSLCertificates>
    </httpClientFactory>
  </crawlerDefaults>

  <crawlers>  
    <crawler id="Test Crawler">      
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">  
      <sitemap>https://www.../sitemap.xml</sitemap>
      </startURLs>

      <!-- Importer settings per crawler -->
      <importer>
        <!-- Pre-parsing handlers and actions -->
        <preParseHandlers>

         <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="p" toField="paragraph" extract="text"/>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
            <singleValue field="title" action="keepFirst"/>
            <singleValue field="description" action="keepFirst"/>       
            <singleValue field="paragraph" action="mergeWith: "/>            
          </tagger>

        </preParseHandlers>

        <postParseHandlers>
          <!-- Keep only those tags and remove all others -->
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference,title,description,paragraph,content</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.reference" toField="reference"/>
          </tagger>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
            <!-- carriage return -->
            <reduce>\r</reduce>
            <!-- new line -->
            <reduce>\n</reduce>
            <!-- tab -->
            <reduce>\t</reduce>
            <!-- whitespaces -->
            <reduce>\s</reduce>
          </transformer>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
            <replace>
              <fromValue>\n</fromValue>
              <toValue></toValue>
            </replace>
            <replace>
              <fromValue>\t</fromValue>
              <toValue></toValue>
            </replace>
          </transformer>

        </postParseHandlers>
      </importer>
      <!-- Azure committer setting -->
      <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
        <endpoint>..</endpoint>
        <apiKey>..</apiKey>
        <indexName>testindex</indexName>
        <maxRetries>3</maxRetries>
        <targetContentField>content</targetContentField>
        <queueDir>test</queueDir>
        <queueSize>6000</queueSize>
      </committer>
    </crawler>     
  </crawlers>
</httpcollector>

2) Can you also give me feedback on point 2 above, where the updated content does not show up in the index unless the Norconex job is run fresh by deleting the queue, progress, and sitemap directories?

essiembre commented 6 years ago

On subsequent runs, if a document has not been modified it will not be sent again to your search engine. By default, an MD5 checksum is computed from the document content and compared every time the document is encountered.

Your logs should tell you if that's the case. Do you see REJECTED_UNMODIFIED?

Clearing the crawlstore folder also clears the checksums, so everything appears new; that's why the documents are sent again.

You can modify the document checksummer to use something other than the content, or you can actually disable it if you always want to send crawled documents to your index:

<documentChecksummer disabled="true"/>
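For context, the checksummer is configured at the crawler (or crawlerDefaults) level, so the line above would sit alongside the other crawler settings. A minimal sketch, assuming the same crawler layout as the config shared earlier:

  <crawler id="Test Crawler">
    ...
    <!-- Disable content-based change detection so every crawled
         document is sent to the committer on every run. -->
    <documentChecksummer disabled="true"/>
    ...
  </crawler>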

If you suspect something else going on, please share a URL with which you have the problem to help reproduce.

avi7777 commented 6 years ago

Hi,

Unfortunately, I cannot share the exact sitemap.xml URL, but here are the steps to reproduce:

1) Have a sitemap.xml file with more than 400 URLs.
2) Crawl it for the first time.
3) Change the content of a few of the URLs included in the xml file. Also delete a couple of URLs from sitemap.xml and add a couple of new URLs to it.
4) After making the above changes, recrawl the sitemap.xml.
5) Expected output: the crawler should store the latest updated documents and the newly added documents, and it should delete from the index the documents that were removed from sitemap.xml.
6) Do these steps without deleting any of the work directories, i.e. while recrawling we should get the "Previous Execution detected" message.

essiembre commented 6 years ago

I followed your steps to the letter and cannot reproduce. I modified my sitemap (removing, adding, etc.) and each time the proper calls were made to my Committer, all without deleting any crawler files (i.e., the workdir). So incremental crawling was working just fine for me.

Is it possible you have more than one crawler pointing to the same working directory or committer queue dir?

For troubleshooting, I suggest you use the XMLFileCommitter so you will know of every addition and deletion each time the crawler runs. You can use both with the MultiCommitter.
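A sketch of what that could look like, assuming Committer Core 2.x class names and keeping the Azure settings from your config (verify the class names against your installed version; the XML output directory is just an example):

      <committer class="com.norconex.committer.core.impl.MultiCommitter">
        <!-- Keeps sending documents to Azure Search as before. -->
        <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
          <endpoint>..</endpoint>
          <apiKey>..</apiKey>
          <indexName>testindex</indexName>
        </committer>
        <!-- Also writes every addition and deletion to XML files for inspection. -->
        <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
          <directory>./committed-xml</directory>
        </committer>
      </committer>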

Also, you may want to set your log level to DEBUG in the log4j.properties in case it gives you hints of what may be going on.

Can you share your sitemap URL via email (check my profile)?

avi7777 commented 6 years ago

Hi, as confirmed by you in the issue below, Norconex takes the crawlstore directory into consideration when recrawling: https://github.com/norconex/committer-azuresearch/issues/3

Thanks for the confirmation. When the crawlstore is kept as is, Norconex is able to delete the index entries.

But when I keep the crawlstore directory and the other directories as they are (without deleting them), make a change to one of the documents, say give a new title to one of the URLs present in sitemap.xml, and recrawl, the changes are not getting uploaded to the index. I only see "REJECTED_TOO_DEEP" warnings while recrawling.

I have also enabled DEBUG mode to see the JSON payload being sent to Azure, and no JSON is sent to Azure when recrawling with the same collector config a second time, even though the title field content has changed for a few URLs present in sitemap.xml.

essiembre commented 6 years ago

Have a look at MD5DocumentChecksummer.

That is the default checksummer used to figure out whether a document has changed or not. By default, it only compares checksums computed from the document "content". If you want specific fields to be taken into account as well, like title, you can tell it to do so. E.g.:

  <documentChecksummer 
      class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
      combineFieldsAndContent="true">
    <sourceFields>title</sourceFields>
  </documentChecksummer>

If you want to consider all fields, you can use this version:

...
    <sourceFieldsRegex>.*</sourceFieldsRegex> 
...
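For clarity, here is that variant written out in full (same class and attributes as the snippet above, with the regex matching every field):

  <documentChecksummer 
      class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
      combineFieldsAndContent="true">
    <sourceFieldsRegex>.*</sourceFieldsRegex>
  </documentChecksummer>
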
avi7777 commented 6 years ago

Hi, I tried the settings below, added to the collector config file. The updated content is still not committed when recrawling.

<documentChecksummer 
    class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
    combineFieldsAndContent="true">
  <sourceFields>title,description</sourceFields>
</documentChecksummer>

However, I can now see REJECTED_PREMATURE warnings for the URLs whose title content I updated.

essiembre commented 6 years ago

You are making progress! REJECTED_PREMATURE means not enough time has elapsed since you last crawled these documents. The minimum wait time is usually defined in your sitemap.xml as <changefreq>. To ignore these change frequency settings, you can disable sitemap support altogether:

 <sitemapResolverFactory ignore="true" />

In your case, since you use the sitemap to find out what to crawl, that is probably not the best idea. You can instead have a look at adding a GenericRecrawlableResolver entry to override the recrawl frequencies desired by the sitemap.

Given your issue is not specific to Azure Search, please consider using the HTTP Collector issues tracker for issues that relate to the HTTP Collector.

avi7777 commented 6 years ago

Hi essiembre,

Perfect. I now understand the significance of GenericRecrawlableResolver. Using it, I was able to verify that my changes get updated without having to delete any crawlstore directory. I used the setting below for verification.

<recrawlableResolver 
         class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
         sitemapSupport="last" >

    <minFrequency applyTo="reference" value="600000">.*xyz.com</minFrequency>
    <minFrequency applyTo="reference" value="600000">.*xyz.com/.*</minFrequency>
</recrawlableResolver>

But since we already have a <changefreq> setting in the sitemap.xml that we crawl, we won't be using GenericRecrawlableResolver in our case. Can you confirm that the crawler will take care of content changes based on the changefreq value set in sitemap.xml, and that it will only commit the documents modified in the previous week and reject the other, unmodified documents?

essiembre commented 6 years ago

Yes. If you honor your sitemap change frequency, it will not even attempt to crawl or verify whether a file has changed until enough time has passed. Every time you launch the crawl, it checks whether enough time has elapsed for each URL with a change frequency. Once enough time has passed, it will then check whether the file was modified, based on the document checksum it keeps.
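To illustrate, a sitemap entry along these lines (the URL and date are placeholders) is what drives that minimum recrawl interval:

  <url>
    <loc>https://www.example.com/some-page</loc>
    <!-- With sitemap support enabled, the crawler treats this as the
         minimum interval before it will revisit the URL. -->
    <changefreq>weekly</changefreq>
    <lastmod>2019-01-01</lastmod>
  </url>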

Does that answer your question?