Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

A failed job causes the second job to fail as well #507

Closed · ohtwadi closed this issue 6 years ago

ohtwadi commented 6 years ago

I have two jobs, A and B. I already know B will fail during indexing (to Solr). When B inevitably fails, A fails as well, provided A finishes after B.

If I run A by itself, it completes successfully. If I set up A to finish before B (by reducing the number of start URLs), A completes successfully and B fails later.

Relevant bits from the log follow.

...
B: 2018-07-23 15:08:27 ERROR - Execution failed for job: B
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
...
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://mySolrHost:8984/solr/everything: ERROR: [doc=https://pub-calgary.escribemeetings.com/FileStream.ashx?DocumentId=30282] multiple values encountered for non multiValued field File Size: [25751 bytes, 4993 bytes, 8331 bytes, 5467 bytes, 31099 bytes, 398490 bytes, 47170 bytes]
...
B: 2018-07-23 15:08:27 INFO - Running B: END (Mon Jul 23 15:06:48 MDT 2018)
B: 2018-07-23 15:08:27 ERROR - B failed.
...
A: 2018-07-23 15:10:50 ERROR - Execution failed for job: A
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
...
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://mySolrHost:8984/solr/everything: ERROR: [doc=https://pub-calgary.escribemeetings.com/FileStream.ashx?DocumentId=30282] multiple values encountered for non multiValued field File Size: [25751 bytes, 4993 bytes, 8331 bytes, 5467 bytes, 31099 bytes, 398490 bytes, 47170 bytes]

Notice that the second failure logs the same doc id as the first one. It seems the Collector is retrying previously failed commits before it exits, even though I have explicitly configured the Solr committer with maxRetries=0.

I will attach the configuration you can use to reproduce.

Bugtest.txt

essiembre commented 6 years ago

My first guess is to have a look at the committer queue dir of your two crawlers and make sure they point to different locations. Otherwise, both committers (one for each crawler) will try to send the same files over to Solr.

In addition, if there are faulty documents, make sure you delete the committer queue (or the entire "workdir"), or the committer will try to send those faulty documents again on the next run.
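
For reference, the queue location is the <queueDir> option on the committer itself. Here is a minimal sketch of what to look for, assuming the usual 2.x Solr committer configuration (the path shown is just an example): if a block like this is defined only once and shared by both crawlers, they both queue to, and flush from, the same directory.

```xml
<!-- Sketch only: a committer shared by both crawlers means a shared queue. -->
<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://mySolrHost:8984/solr/everything</solrURL>
  <!-- Example path; if both crawlers resolve to the same queueDir,
       each one will also pick up and retry the other's queued documents. -->
  <queueDir>./committer-queue</queueDir>
</committer>
```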

essiembre commented 6 years ago

I just reviewed your config, and it confirms you have the same queue dir for both crawlers, since the committer is defined as a default. Copy the committer section under each crawler, each with a different path.

A future release will likely address this by having committers use crawler-specific paths by default.
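
Roughly like this, assuming the usual <crawlers>/<crawler> layout of the 2.x collector config; the queueDir paths below are just examples, and the important part is that each crawler gets its own:

```xml
<crawlers>
  <crawler id="A">
    <!-- ... crawler A settings (start URLs, importer, etc.) ... -->
    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>http://mySolrHost:8984/solr/everything</solrURL>
      <!-- Example path: a queue used only by crawler A -->
      <queueDir>./workdir/A/committer-queue</queueDir>
    </committer>
  </crawler>
  <crawler id="B">
    <!-- ... crawler B settings ... -->
    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>http://mySolrHost:8984/solr/everything</solrURL>
      <!-- Example path: a queue used only by crawler B -->
      <queueDir>./workdir/B/committer-queue</queueDir>
    </committer>
  </crawler>
</crawlers>
```

With separate queues, a failed batch left behind by B is no longer picked up and retried by A before it exits.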

ohtwadi commented 6 years ago

Having a different queue directory per job did the trick. Thanks much for the quick response, Pascal.