If QueryTraverser.runBatch throws exceptions, we can get infinite loop

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

See below.

What is the expected output? What do you see instead?

If QueryTraverser.runBatch throws exceptions, the batch is retried from the
previous checkpoint. The already processed PropertyMaps are recrawled.
Combined with issue 10, if a connector throws a repeatable exception, then
we're in an infinite loop. If it throws an intermittent exception, then we
may eventually make progress, but we'll reindex documents in doing so. When
I had a non-thread-safe implementation of the QTM, as I mentioned in an
earlier email, a crawl was on average reindexing each document 1000 times.

Original issue reported on code.google.com by donald.z...@gmail.com on 24 Jan 2007 at 12:14

GoogleCodeExporter commented 8 years ago

Brian, please add to backlog.

Original comment by donald.z...@gmail.com on 24 Jan 2007 at 12:14

GoogleCodeExporter commented 8 years ago

Google Bug #243984

Original comment by vjo...@gmail.com on 9 Feb 2007 at 4:19

GoogleCodeExporter commented 8 years ago

Issue 10 has already been resolved.  Do we have any specific Exceptions getting
thrown?  Do we know what call is throwing the Exception?

Original comment by danny....@gmail.com on 30 Apr 2007 at 10:48

GoogleCodeExporter commented 8 years ago

The reference to issue 10 is a typo. It should refer to issue 11. Basically,
anything that calls a connector might throw a RepositoryException.

There are also some issues with Subversion revision 334. First, it says that the
checkpoint is on the last document processed (in the log and in the code), but
in fact the checkpoint is on the current document, the one that threw an
exception in take. So the traversal will effectively skip the offending
document. That may be fine, but it's not the documented behavior.
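
To make the difference between the two checkpoint positions concrete, here is a
minimal sketch. It is not the r334 code: the class, the Policy enum, and the use
of a Consumer to stand in for take are all hypothetical names chosen for
illustration, and documents are reduced to list indexes.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names come from the Connector Manager SPI. */
class CheckpointPolicySketch {
    enum Policy { LAST_PROCESSED, CURRENT }

    /**
     * Feeds documents in order and returns the index that would be
     * checkpointed, or -1 if no document qualifies.
     */
    static int runBatch(List<String> docs, Consumer<String> take, Policy policy) {
        int lastProcessed = -1;
        for (int i = 0; i < docs.size(); i++) {
            try {
                take.accept(docs.get(i));
                lastProcessed = i;
            } catch (RuntimeException e) {
                // LAST_PROCESSED: the next batch resumes at the failing document,
                // so it is retried. CURRENT: the next batch resumes after the
                // failing document, so it is skipped.
                return policy == Policy.CURRENT ? i : lastProcessed;
            }
        }
        return lastProcessed;
    }
}
```

With LAST_PROCESSED a persistent failure is retried on every batch; with CURRENT
the traversal moves on but the failing document is never indexed.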

Of course, doing something other than skipping the offending document is tricky.
You have to give up at some point, because if the exception is persistent then
you have an infinite loop. See issue 32 for a specific example of that. Also, as
I've mentioned somewhere, probably the google-vizdom-tech-discuss list, the
Connector Manager doesn't actually have a mechanism for retrying individual
documents through the SPI because the semantics of checkpoints is not strict
enough.
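
The "give up at some point" idea can be sketched as a bounded retry. This is an
illustration only, assuming a caller that can rerun the failing operation; the
class name, the method name, and the maxAttempts parameter are hypothetical and
do not exist in the Connector Manager.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a bounded retry; not Connector Manager code. */
class BoundedRetrySketch {
    /**
     * Runs the operation up to maxAttempts times. Returns true on the first
     * success and false once the attempts are exhausted, so a persistent
     * exception cannot loop forever.
     */
    static boolean runWithRetries(Callable<Void> op, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.call();
                return true;
            } catch (Exception e) {
                // Swallowed here for brevity; real code would log the failure.
            }
        }
        return false;
    }
}
```

When this returns false, the caller still has to decide what to do with the
document, for example skip it and checkpoint past it.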

Second, if the original problem is not specific to the operation that threw the
exception, e.g., the remote server is being restarted, then the call to
checkpoint may well fail, too. The same is true even if you checkpoint every
document proactively, rather than in reaction to an exception. There's nothing
to say that checkpoint won't be the first call to fail. So the design needs to
handle cases where creating checkpoints fails, ensure the traversal makes
progress in such cases, and avoid redoing too much work, although some rework
may be acceptable.

Third, there are likely to be RepositoryExceptions hidden in the call to
iter.next, since that call may construct a PropertyMap. That means that such
exceptions will escape from runBatch unrecognized. I haven't looked to see what
happens next. This is more germane to issue 11, but I thought I'd mention it
since we're talking about that block of code.
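
One way an exception can hide in iter.next is sketched below. It rests on an
assumption the comment above does not state: that the failure is wrapped in an
unchecked exception, since java.util.Iterator.next cannot declare a checked one.
RepoException, WrappedRepoException, and lazyDocs are hypothetical stand-ins,
not Connector Manager types.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; the exception and iterator types are stand-ins. */
class HiddenExceptionSketch {
    /** Stand-in for a checked repository exception. */
    static class RepoException extends Exception {
        RepoException(String msg) { super(msg); }
    }

    /** Unchecked wrapper that lets the failure pass through Iterator.next(). */
    static class WrappedRepoException extends RuntimeException {
        WrappedRepoException(RepoException cause) { super(cause); }
    }

    /** Iterator whose next() builds each item lazily and may fail doing so. */
    static Iterator<String> lazyDocs(List<String> ids) {
        final Iterator<String> it = ids.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String next() {
                String id = it.next();
                if (id.startsWith("bad")) {
                    throw new WrappedRepoException(new RepoException("cannot load " + id));
                }
                return "doc:" + id;
            }
        };
    }

    /** Counts documents read before a repository failure, catching it explicitly. */
    static int countUntilFailure(Iterator<String> docs) {
        int n = 0;
        try {
            while (docs.hasNext()) {
                docs.next();
                n++;
            }
        } catch (WrappedRepoException e) {
            // Recognized here instead of escaping the loop as a generic error.
        }
        return n;
    }
}
```

A loop that only guards the call that feeds the document would miss this
failure, because it is raised while fetching the next item.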

Original comment by jl1615@gmail.com on 2 May 2007 at 8:28

GoogleCodeExporter commented 8 years ago

r334 addresses issues where we can recrawl docs if the "take" method throws
Exceptions.  We ensure that any forward progress we make will get checkpointed.

Since issue "11" has been solved, this issue is resolved.

Original comment by mgron...@gmail.com on 3 Oct 2007 at 9:52

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 3 Oct 2007 at 10:49

Added labels: Milestone-Release_1.0.0

PILLUTLAAVINASH / google-enterprise-connector-manager

If QueryTraverser.runBatch throws exceptions, we can get infinite loop #12