Multiple problems with exception handling in QueryTraverser.runBatch

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Traverse a repository where some documents throw an exception from
DocumentList.nextDocument.

or,

1. Read the source.

What is the expected output? What do you see instead?

If at least one document in the batch is returned successfully, and a later
call to resultSet.nextDocument throws a RepositoryException, the nextDocument
variable is still set to the previous document. The null check passes because
the variable is not null, so the previous document is handed to the DocPusher,
which pushes it into the feed again.
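
For readers without the source at hand, the duplicate push can be reproduced with a small self-contained sketch. Everything in it is a stand-in made up for illustration (`StaleDocumentDemo`, `RepoException`, `DocList`, `pushBuggy`); it is a simplified guess at the shape of the loop described above, not the actual runBatch code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-ins for illustration; not the real Connector Manager classes. */
class StaleDocumentDemo {
  /** Stand-in for RepositoryException. */
  static class RepoException extends Exception {
    RepoException(String message) { super(message); }
  }

  /** Stand-in for DocumentList: a null entry in the script means "throw". */
  static class DocList {
    private final Iterator<String> script;

    DocList(List<String> script) { this.script = script.iterator(); }

    /** Returns the next document, or null when the list is exhausted. */
    String nextDocument() throws RepoException {
      if (!script.hasNext()) return null;
      String doc = script.next();
      if (doc == null) throw new RepoException("cannot fetch document");
      return doc;
    }
  }

  /** Assumed shape of the buggy loop: the variable is not reset on failure. */
  static List<String> pushBuggy(DocList docs) {
    List<String> pushed = new ArrayList<>();
    String nextDocument = null;
    while (true) {
      try {
        nextDocument = docs.nextDocument();
      } catch (RepoException e) {
        // The assignment never happened, so nextDocument still holds
        // the document from the previous iteration.
      }
      if (nextDocument == null) break;  // null check passes on the stale value
      pushed.add(nextDocument);         // stands in for the push to the feed
    }
    return pushed;
  }
}
```

With a list that yields `a`, then throws, then yields `b`, `pushBuggy` returns `[a, a, b]`: the failed call leaves the variable pointing at `a`, so it is pushed twice.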

What version of the product are you using? On what operating system?

Connector Manager revision 756.

Please provide any additional information below.

One obvious workaround is to set nextDocument to null at the top of the
while (true) loop. That doesn't actually help, and in fact it might be worse.
Unlike an OutOfMemoryError, a RepositoryException does not force a call to
checkpoint, so if we exit the while loop here, we'll just redo the entire
batch. If the error is a persistent one, we still have an infinite loop that
is doing even more unnecessary work.

This raises the question of what should happen when a RepositoryException is
thrown here. The key is knowing whether the error is transient (e.g., the
repository server was restarted) or persistent (e.g., the document cannot be
retrieved from the repository). For the former, we want to retry the current
document, and for the latter we want to skip it.

Original issue reported on code.google.com by jl1615@gmail.com on 26 Mar 2008 at 11:36

GoogleCodeExporter commented 9 years ago

Original comment by donald.z...@gmail.com on 18 Apr 2008 at 10:34

Changed state: Accepted
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

I'm going to collect all of the individual issues related to exception handling 
in the runBatch method here.

What steps will reproduce the problem?
1. Traverse a repository where nextDocument might throw an exception.

What is the expected output? What do you see instead?

Depending on the state, the connector manager might skip the document, or go
into an infinite loop if no documents in a batch are returned successfully.
If counter is 0, we will keep retrying the same batch, because checkpoint
won't be called. If the exception is transient, we will eventually make
progress, after reindexing zero or more documents. If the exception is
persistent, we're in an infinite loop. If counter is greater than zero, we
will skip the document, no matter what kind of exception we got.

One possible workaround to the infinite loop is to throw an OutOfMemoryError
instead of a RepositoryException from the nextDocument method when no
documents have been successfully returned, since that currently also forces
checkpoint to be called. That interacts badly with recent changes to the batch
delay, since traversing zero documents in runBatch will lead to a five minute
delay, even though the root cause might just be an error retrieving a
particular document.

Original comment by jl1615@gmail.com on 8 May 2008 at 11:15

Changed title: Multiple problems with exception handling in QueryTraverser.runBatch

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Traverse a repository where nextDocument might throw an OutOfMemoryError.

What is the expected output? What do you see instead?

If the first call to nextDocument() throws an OutOfMemoryError, then
nextDocument will be null. (Note that if we fix the first problem reported in
this issue, nextDocument would always be null if nextDocument() throws an
exception of any kind.) This will then throw a NullPointerException in the
OutOfMemory catch block, trying to retrieve the docid. We will still force the
checkpoint to be saved in the finally block.
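
One way to picture a guard for this is the sketch below. The names (`OomLoggingSketch`, `Doc`, `describe`, `fetchAndReport`) are hypothetical and the real catch block may look quite different; the only point is that the error path should not dereference a document reference that may still be null.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; not the actual catch block in QueryTraverser. */
class OomLoggingSketch {
  /** Minimal stand-in for a document that carries a docid. */
  static class Doc {
    final String docid;
    Doc(String docid) { this.docid = docid; }
  }

  /** Describes the document for a log message without risking a NullPointerException. */
  static String describe(Doc nextDocument) {
    return (nextDocument == null) ? "(no document retrieved)" : nextDocument.docid;
  }

  /** Fetches one document and reports what happened as a log-style string. */
  static String fetchAndReport(Supplier<Doc> source) {
    Doc nextDocument = null;
    try {
      nextDocument = source.get();  // may throw before the assignment happens
      return "pushed " + describe(nextDocument);
    } catch (OutOfMemoryError e) {
      // nextDocument is still null if the very first fetch failed.
      return "out of memory on " + describe(nextDocument);
    }
  }
}
```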

Original comment by jl1615@gmail.com on 8 May 2008 at 11:24

GoogleCodeExporter commented 9 years ago

From:     johnl@vizdom.com
Subject:  Exception handling on QueryTraverser.runBatch
Date:     July 3, 2008 3:34:43 PM PDT

Here's a sketch of our proposal for fixing the exception handling.

Marty, I know we've been banging on you pretty hard and it's pretty
late, but if you have a chance to look at this before you go, that
would be great.

For some of the basic bugs, see CM issue 72:
 http://code.google.com/p/google-enterprise-connector-manager/issues/detail?id=72

One significant additional issue that we've wrestled with in the
Livelink connector is that there is no way to signal that progress has
been made, but there are no documents to return. This can happen when
the traversal user does not have permission for the entire batch, or
there are errors retrieving documents (we've had several support cases
where customers had development databases with integrity problems).

Here are the main pieces of the proposal:

  1. Require a connector to throw a RepositoryException only in the
     case of a transient error.  If a document cannot be retrieved, 
     now or ever, then it should be silently skipped.

  2. If a RepositoryException is thrown, the batch will be ended.
     The checkpoint method will be called.  A wait period will be
     used before the traversal is resumed (perhaps simply by calling
     the existing connectorFinishedTraversal method).  Currently, the
     single document is skipped (modulo bugs), and checkpoint is
     called at the end of the batch, if any documents were returned.

  3. If a PushException is thrown, the treatment is similar: the
     batch will be abandoned, checkpoint will not be called, and the
     traversal will be resumed after a short wait.  Currently, the 
     batch is abandoned with no checkpoint, but no wait.

  4. If an OutOfMemoryError is thrown, skip the document.  Currently,
     the batch is ended, and checkpoint is called.

  5. In the return values from startTraversal and resumeTraversal,
     distinguish between null and an empty DocumentList.  A null value
     signals the end of the traversal, and there will be a wait before
     the traversal is resumed.  An empty DocumentList signals some
     progress without documents, the checkpoint method will be called,
     and the next batch will be processed immediately, with no wait.
     Currently, null and an empty DocumentList are treated the same,
     as the end of the traversal, with no call to checkpoint, and a
     wait before resuming the traversal.

Issues with this approach:

  A. This does not eliminate the possibility of an infinite loop,
     although it reduces the likelihood, and slows it down with the 
     5 minute waits. If a connector throws a RepositoryException on 
     a repeatable error, no progress will be made. We're essentially
     choosing this over skipping documents, or more likely, sometimes
     skipping documents and sometimes entering an infinite loop. This
     behavior is at least predictable and easily noticed. This same
     behavior is almost inevitable if startTraversal or resumeTraversal 
     throws an exception. We could add some detection of loops making 
     no progress, from any of startTraversal, resumeTraversal, or
     nextDocument. Perhaps when we address CM issue 79, we could
     disable the connector if the traversal has failed n times in a row.

  B. The burden is on the connector to distinguish between documents
     with permanent errors, which should be skipped, and documents
     where transient errors are encountered (such as unreachable
     servers), which are retried after a delay to avoid spinning on
     repository, server, or network restarts.

  C. The burden is also on the connector to return the previous
     document's checkpoint after nextDocument throws an exception, or
     if it can't do that, it has to throw an exception. If the
     checkpoint is returned for the document that threw the exception,
     that document will be skipped.

  D. A connector can actually decide whether an error is worth waiting
     for before resuming the traversal. That is, it can either throw a
     RepositoryException from nextDocument, or return null, and in
     either case return the last successful document's position from
     checkpoint. The former scenario will force a wait, the latter
     will not.

  E. We have avoided requiring finer-grained exceptions, or a total
     ordering, or calling checkpoint after each document is pushed
     successfully. You could make an argument for any of them, but we
     haven't. We don't see any of them as a silver bullet or even an
     obvious improvement.

John Lacy

Original comment by Brett.Mi...@gmail.com on 9 Jul 2008 at 11:05

GoogleCodeExporter commented 9 years ago

Original comment by jl1615@gmail.com on 10 Jul 2008 at 10:01

Changed state: Started

GoogleCodeExporter commented 9 years ago

Recently, I've come across this problem while indexing a Documentum repository.

Even though a specific document does exist in the repository (it is returned
by a DQL query), we can't fetch or view it using Webtop, DA, or DFC, so the
issue is persistent.
When this exception is encountered, the connector stops feeding the remaining
documents to the GSA, so the GSA index is incomplete.

Enclosed file is dctm connector log.

Original comment by lightbends on 19 Aug 2008 at 9:18

Attachments:

[dctm connector log.txt](https://storage.googleapis.com/google-code-attachments/google-enterprise-connector-manager/issue-72/comment-6/dctm connector log.txt)

GoogleCodeExporter commented 9 years ago

A look at lightbends' log file shows the connector is looping, trying to feed
the same broken document over and over again.  This is the case where no new
checkpoint is created, causing the same batch to be traversed repeatedly,
failing each time.

Original comment by Brett.Mi...@gmail.com on 19 Aug 2008 at 6:46

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

This time, I have a problem with a huge (200+ MB) document.
The connector had already indexed the repository. Today, I added this file to
our DCTM repository.

But the CM fails to feed this document and goes into an infinite loop.
Documents that were updated or imported later are not getting updated in the
GSA index. This is a major problem, especially when the repository is in
production.

Please fix this bug ASAP.

Original comment by lightbends on 25 Aug 2008 at 1:32

Attachments:

[connector log.txt](https://storage.googleapis.com/google-code-attachments/google-enterprise-connector-manager/issue-72/comment-9/connector log.txt)

GoogleCodeExporter commented 9 years ago

lightbends' large file problem will be addressed by this fix (when
PushExceptions are handled properly), but it would also be fixed when Issue 62
(TraversalContext) is fixed - 200MB files should not be fed in the first place.

Original comment by Brett.Mi...@gmail.com on 25 Aug 2008 at 4:21

GoogleCodeExporter commented 9 years ago

Fixed in r953 | Brett.Michael.Johnson | 2008-09-18 13:41:49 -0700 (Thu, 18 Sep 2008) | 144 lines

Changes to fix Connector Manager Issue 72: Poor Exception Handling in runBatch().

Analysis:
--------

Adapted from an email from John Lacey July 3, 2008,
then updated after a Google in-house meeting a week later.

Here's a sketch of our proposal for fixing the exception handling.

One significant additional issue that we've wrestled with in the
Livelink connector is that there is no way to signal that progress has
been made, but there are no documents to return. This can happen when
the traversal user does not have permission for the entire batch, or
there are errors retrieving documents (we've had several support cases
where customers had development databases with integrity problems).

Here are the main pieces of the proposal:

  1. Allows the connector to differentiate between exceptions thrown
     by individual documents and those generated by issues of connector
     health.  If a specific document has a problem (i.e. corrupt
     document or repository record), the connector may throw a new
     RepositoryDocumentException.  The connector manager will skip
     over that document, proceeding to the next item in the
     DocumentList.  The connector should be aware that checkpoint()
     could be called immediately after RepositoryDocumentException
     is thrown.

  2. If a RepositoryException is thrown, the batch will be ended.
     The checkpoint method will be called.  A wait period will be
     used before the traversal is resumed (perhaps simply by calling
     the existing connectorFinishedTraversal method).  Currently, the
     single document is skipped (modulo bugs), and checkpoint is
     called at the end of the batch, if any documents were returned.

  3. If a PushException is thrown, the treatment is similar: the
     batch will be abandoned, checkpoint will not be called, and the
     traversal will be resumed after a short wait.  Currently, the 
     batch is terminated (and checkpoint may-or-may-not be called,
     depending on whether the first document fails), and no wait occurs.

  4. If an OutOfMemoryError is thrown, skip the document.  Currently,
     the batch is ended, and checkpoint is called.

  5. In the return values from startTraversal and resumeTraversal,
     distinguish between null and an empty DocumentList.  A null value
     signals the end of the traversal, and there will be a wait before
     the traversal is resumed.  An empty DocumentList signals some
     progress without documents, the checkpoint method will be called,
     and the next batch will be processed immediately, with no wait.
     Currently, null and an empty DocumentList are treated the same,
     as the end of the traversal, with no call to checkpoint, and a
     wait before resuming the traversal.
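
The five items above can be sketched as a single batch loop. This is only an illustration under stated assumptions: every type below is a local stand-in declared inside the example (the exception classes merely reuse the names from the proposal), `Result`, `Pusher`, and `scripted` are invented here, and the "wait" and "short wait" of items 2 and 3 are collapsed into one `forceWait` flag. It is not the code committed in r953.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical stand-ins for the SPI types; a sketch of the proposal only. */
class BatchPolicySketch {
  static class RepositoryException extends Exception {}
  static class RepositoryDocumentException extends RepositoryException {}
  static class PushException extends Exception {}

  interface DocumentList {
    String nextDocument() throws RepositoryException;  // null means "no more"
    String checkpoint() throws RepositoryException;
  }

  interface Pusher {
    void push(String doc) throws PushException;
  }

  /** Outcome of one batch. */
  static class Result {
    final int count;             // -1 signals a null DocumentList
    final boolean checkpointed;  // whether checkpoint() was called successfully
    final boolean forceWait;     // whether to wait before the next batch
    Result(int count, boolean checkpointed, boolean forceWait) {
      this.count = count;
      this.checkpointed = checkpointed;
      this.forceWait = forceWait;
    }
  }

  static Result runBatch(DocumentList docs, Pusher pusher) {
    if (docs == null) return new Result(-1, false, true);  // item 5: null, so wait
    int count = 0;
    boolean forceWait = false;
    while (true) {
      try {
        String doc = docs.nextDocument();
        if (doc == null) break;
        pusher.push(doc);
        count++;
      } catch (RepositoryDocumentException e) {
        continue;                               // item 1: skip this document only
      } catch (RepositoryException e) {
        forceWait = true;                       // item 2: end batch, checkpoint, wait
        break;
      } catch (PushException e) {
        return new Result(count, false, true);  // item 3: abandon, no checkpoint
      } catch (OutOfMemoryError e) {
        continue;                               // item 4: skip the document
      }
    }
    try {
      docs.checkpoint();                        // item 5: even when count is 0
      return new Result(count, true, forceWait);
    } catch (RepositoryException e) {
      return new Result(count, false, true);    // no new checkpoint, force a wait
    }
  }

  /** Test helper: Strings are documents, RepositoryException instances are thrown. */
  static DocumentList scripted(Object... steps) {
    final Iterator<Object> it = Arrays.asList(steps).iterator();
    return new DocumentList() {
      public String nextDocument() throws RepositoryException {
        if (!it.hasNext()) return null;
        Object step = it.next();
        if (step instanceof RepositoryException) throw (RepositoryException) step;
        return (String) step;
      }
      public String checkpoint() { return "checkpoint"; }
    };
  }
}
```

Note that a list which throws RepositoryDocumentException forever would still spin inside this loop, which is in line with issue A below: the sketch narrows the infinite-loop cases, it does not remove them.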

Issues with this approach:

  A. This does not eliminate the possibility of an infinite loop,
     although it reduces the likelihood, and slows it down with the 
     5 minute waits. If a connector throws a RepositoryException on 
     a repeatable error, no progress will be made. We're essentially
     choosing this over skipping documents, or more likely, sometimes
     skipping documents and sometimes entering an infinite loop. This
     behavior is at least predictable and easily noticed. This same
     behavior is almost inevitable if startTraversal or resumeTraversal 
     throws an exception. We could add some detection of loops making 
     no progress, from any of startTraversal, resumeTraversal, or
     nextDocument. Perhaps when we address CM issue 79, we could
     disable the connector if the traversal has failed n times in a row.

  B. The burden is on the connector to distinguish between documents
     with permanent errors, which should be skipped, and documents
     where transient errors are encountered (such as unreachable
     servers), which are retried after a delay to avoid spinning on
     repository, server, or network restarts.

  C. The burden is also on the connector to return the previous
     document's checkpoint after nextDocument throws an exception, or
     if it can't do that, it has to throw an exception. If the
     checkpoint is returned for the document that threw the exception,
     that document will be skipped.

  D. Although connectors do not need to be immediately updated to
     support these modifications, there are performance implications
     in not doing so.  Connectors that returned empty DocumentLists
     (rather than null) when there are no more documents to index, 
     will now induce a busy wait.  Also, checkpoint() will be called in
     more cases (specifically, empty DocumentLists and more Exceptions)
     and some existing connectors may not be expecting this, 
     failing upon having insufficient state to form a checkpoint.
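
From the connector's side, issues B and D might look something like the following. This is a hedged sketch: `ConnectorSideSketch`, `classify`, and `nextBatch` are hypothetical helpers, the two exception classes are local stand-ins that only borrow the SPI names, and using "server reachable" as the test for a transient error is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical connector-side helpers; the exception classes are local stand-ins. */
class ConnectorSideSketch {
  static class RepositoryException extends Exception {
    RepositoryException(String message) { super(message); }
  }

  static class RepositoryDocumentException extends RepositoryException {
    RepositoryDocumentException(String message) { super(message); }
  }

  /** Issue B: choose which exception to throw for a failed fetch. */
  static RepositoryException classify(String docid, boolean serverReachable) {
    if (!serverReachable) {
      // Transient: the manager ends the batch and waits before retrying.
      return new RepositoryException("repository unreachable while fetching " + docid);
    }
    // Permanent for this document only: the manager skips it.
    return new RepositoryDocumentException("cannot retrieve " + docid);
  }

  /**
   * Issue D: return null when there is nothing left to traverse, and an
   * empty list when candidates existed but none of them can be fed.
   */
  static List<String> nextBatch(List<String> candidates, Predicate<String> readable) {
    if (candidates.isEmpty()) return null;  // end of traversal: manager waits
    List<String> batch = new ArrayList<>();
    for (String id : candidates) {
      if (readable.test(id)) batch.add(id);
    }
    return batch;  // possibly empty: progress was made, no wait
  }
}
```

A connector that returns an empty list from the first branch instead of null is the one that would busy-wait after this change.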

Change Log:
----------

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
  - Differentiate between a null DocumentList and an empty DocumentList
    in the count of documents returned.  I chose -1 to represent a null
    DocumentList and 0 to represent a DocumentList with zero documents.
  - Checkpoint any batch that returned a DocumentList, even if that
    DocumentList has zero documents.
  - Replace several automated catch blocks with meaningful log messages.
  - OutOfMemory error only skips the offending document, rather than
    abandoning the whole batch.
  - Handle RepositoryExceptions in checkpoint meaningfully, specifically
    no new checkpoint, force a wait.

M  projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/TraversalScheduler.java
  - Differentiate between a null DocumentList and an empty DocumentList
    in the count of documents returned.  I chose -1 to represent a null
    DocumentList and 0 to represent a DocumentList with zero documents.
    In the case of -1, tell the hostLoadManager to suspend the connector's
    traversal for a few minutes.  In the case of 0, allow the connector
    to be rescheduled immediately.  This will cause existing connectors
    to spin looking for new content to index until they are fixed up to
    return null DocumentLists.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/Traverser.java
  - Document the new behaviours for runBatch.

M  projects/connector-manager/source/java/com/google/enterprise/connector/spi/DocumentList.java
  - Document the new behaviours for nextDocument and checkpoint.

M  projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalManager.java
  - Document the new behaviours for startTraversal, resumeTraversal, and checkpoint.

A  projects/connector-manager/source/java/com/google/enterprise/connector/spi/RepositoryDocumentException.java
  - Subclass of RepositoryException that indicates a single document failure.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/jcr/JcrTraversalManager.java
  - Now returns null DocumentList if no results available.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/pusher/MockPusher.java
  - Closes input streams after reading content.
  - Fixes minor problems detailed in Connector Manager Issue 9.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/jcr/JcrTraversalManagerTest.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/test/QueryTraversalUtil.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/QueryTraverserTest.java
   - fix tests.

D  projects/connector-manager/testdata/dynamicConnectorConfig
   - remove unused directory
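
The TraversalScheduler entry in the change log above comes down to one decision on the count returned by runBatch. A minimal sketch of that decision follows; the class, constant, and method names are invented for illustration, and treating every count other than -1 as "reschedule" is an assumption, since the change log only spells out the -1 and 0 cases.

```java
/** Hypothetical sketch of the scheduler's decision; names are not from the source. */
class BatchDelaySketch {
  /** Count the change log says represents a null DocumentList. */
  static final int NULL_DOCUMENT_LIST = -1;

  /**
   * Returns true when the connector's traversal should be suspended for a
   * while, and false when it may be rescheduled immediately.
   */
  static boolean shouldSuspend(int batchCount) {
    return batchCount == NULL_DOCUMENT_LIST;
  }
}
```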

Original comment by Brett.Mi...@gmail.com on 7 Nov 2008 at 12:39

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Original comment by jl1615@gmail.com on 12 Jan 2009 at 3:28

Added labels: Milestone-Release_1.3.0

GoogleCodeExporter commented 9 years ago

Hey lightbends, did you resolve the RPC issue? Could you please share the
solution? I have the same problem: for one particular object, it always throws
DM_SESSION_E_RPC_ERROR when I try to get it by ID, even though the object
exists:
workitem = (IDfWorkitem) getDfSession().getObject(new DfId(workitemID));

[DM_API_E_EXIST]error:  "Document/object specified by <workitemID> does not 
exist
[DM_SESSION_E_RPC_ERROR]error:  "RPC error 116 occurred: (116) Error performing 
send/receive.  errno: 0, message: Error 0"

Original comment by sayson...@gmail.com on 3 Sep 2010 at 3:09
