Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Communicating Norconex Committer failures #647

Closed. arunakanaparthy closed this issue 3 years ago

arunakanaparthy commented 4 years ago

Hi, we have noticed that sometimes the Norconex Committer fails to index a few documents, for various reasons. These failures are not communicated back to the crawler, which updates the checksum anyway, so the documents are not picked up in the next incremental run. Even if we identify those documents and send them back to Norconex to crawl, they are still not picked up for indexing because their checksum has already been updated to the latest value.

Is there a way to address this problem and have Norconex pick up these failed documents for indexing in the next run?

Thanks, Aruna Kanaparthy

essiembre commented 4 years ago

The Committer will retry documents left in its queue on the next run if they could not be submitted. From what you are describing, it seems the failure occurs in your repository, after the documents have left the Committer (e.g., queued in Solr until a commit occurs). If so, the Committer has no way of knowing this. Here are a few suggestions to work around that:

  1. Once in a while, delete your "workdir" (or just the "crawlstore") so all documents are re-crawled. Other than processing more documents, the main drawback of this approach is that deletions may be missed (pages removed at the source will not be detected as deleted).

  2. Similar to the previous suggestion, you can set disabled="true" on both checksummers whenever you do not want them to take effect. E.g.:

    <metadataChecksummer disabled="true"/>
    <documentChecksummer disabled="true"/>
  3. If you know which page failed that way, modify its source content slightly, or modify the checksummer to rely on some other field you can change without impacting your content, such as a last-modified date (which you would update).

  4. Again, if you know which page failed, you can modify the MD5DocumentChecksummer to include a mix of content and a made-up field. That made-up field is normally blank, but for failed documents you can fill it with a constant, like this (a matching checksummer sketch follows this list):

  <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="replace" >
      <restrictTo field="document.reference">
          http://www.example.com/my/failing/page.html
      </restrictTo>
      <constant name="whateverSaltValue">1234</constant>
  </tagger>
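
For suggestions 3 and 4, the matching checksummer configuration could look like the sketch below. This is an approximation only: the sourceFields and combineFieldsAndContent options are recalled from the MD5DocumentChecksummer documentation, and their exact names and placement may differ between versions, so verify against the reference for your release.

    <!-- Sketch: hash the page content combined with the "salt" field set
         by the ConstantTagger above. Verify option names for your version. -->
    <documentChecksummer
        class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
        combineFieldsAndContent="true">
      <sourceFields>whateverSaltValue</sourceFields>
    </documentChecksummer>

With this in place, changing the value of whateverSaltValue for a failed page changes its checksum, forcing a re-index on the next run.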

We can also add a feature request to force re-crawling of specific documents, regardless of whether they were modified.

arunakanaparthy commented 4 years ago

Hi Pascal, Thanks for your response. These are great suggestions!

We are trying to automate the re-indexing, on the next incremental run, of documents that fail while being committed to our repositories. We should be able to capture the URLs that run into errors through our committer application. Could we make a feature request for a way to feed in a list of failed URLs that would be picked up for indexing irrespective of their checksums?

Thanks, Aruna

essiembre commented 3 years ago

This has been implemented in the version 3 stack. You can now add listeners at the collector or crawler level and be notified of any Committer events, including errors. Addition/deletion events also include a reference to the request being made (so you know what failed).
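
As a rough sketch of that listener approach (the IEventListener and Event types from Norconex Commons Lang, and the COMMITTER_*_ERROR event-name convention, are recalled from the v3 APIs and should be verified against the Javadoc), a listener capturing Committer failures could look like:

    import com.norconex.commons.lang.event.Event;
    import com.norconex.commons.lang.event.IEventListener;

    // Sketch only: reacts to committer error events so failing references
    // can be collected for later re-indexing.
    public class CommitterErrorListener implements IEventListener<Event> {
        @Override
        public void accept(Event event) {
            String name = event.getName();
            // Committer error events follow a COMMITTER_*_ERROR naming
            // convention (e.g., upsert/delete/batch errors).
            if (name != null && name.startsWith("COMMITTER_")
                    && name.endsWith("_ERROR")) {
                System.err.println("Committer failure: " + event);
            }
        }
    }

Such a class can then be registered in the collector or crawler configuration (e.g., under an eventListeners section).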

In addition, Committers supporting batch commits now have error-specific configuration options. For instance, they can automatically retry failing batches in smaller chunks until only the faulty document request(s) remain. Document requests in error are now stored in an "error" folder so they are easy to spot.
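
For illustration, the batch-failure options on the file system queue might be configured along these lines. The element names (onCommitFailure, splitBatch, etc.) are recalled from the v3 committer-core FSQueue documentation and should be double-checked against the reference for your version:

    <committer class="(your committer class)">
      <queue class="com.norconex.committer.core3.batch.queue.impl.FSQueue">
        <batchSize>100</batchSize>
        <onCommitFailure>
          <!-- Retry a failing batch in smaller chunks until only the
               faulty request(s) remain: -->
          <splitBatch>HALF</splitBatch>
          <maxRetries>2</maxRetries>
          <retryDelay>5000</retryDelay>
        </onCommitFailure>
      </queue>
    </committer>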