Closed GoogleCodeExporter closed 9 years ago
Original comment by donald.z...@gmail.com
on 18 Apr 2008 at 10:34
I'm going to collect all of the individual issues related to exception handling
in the runBatch method here.
What steps will reproduce the problem?
1. Traverse a repository where nextDocument might throw an exception.
What is the expected output? What do you see instead?
Depending on the state, the connector manager might skip the document or go
into an infinite loop if no documents in a batch are processed successfully.
If counter is 0, we will keep retrying the same batch, because checkpoint
won't be called. If the exception is transient, we will eventually make
progress, after reindexing zero or more documents. If the exception is
persistent, we're in an infinite loop. If counter is greater than zero, we
will skip the document, no matter what kind of exception we got.
One possible workaround to the infinite loop is to throw an OutOfMemoryError
instead of a RepositoryException from the nextDocument method when no
documents have been successfully returned, since that currently also forces
checkpoint to be called. That interacts badly with recent changes to the
batch delay, since traversing zero documents in runBatch will lead to a
five-minute delay, even though the root cause might just be an error
retrieving a particular document.
Original comment by jl1615@gmail.com
on 8 May 2008 at 11:15
What steps will reproduce the problem?
1. Traverse a repository where nextDocument might throw an OutOfMemoryError.
What is the expected output? What do you see instead?
If the first call to nextDocument() throws an OutOfMemoryError, then the
nextDocument variable will be null. (Note that if we fix the first problem
reported in this issue, nextDocument would always be null if nextDocument()
throws an exception of any kind.) This will then throw a NullPointerException
in the OutOfMemoryError catch block, when it tries to retrieve the docid. We
will still force the checkpoint to be saved in the finally block.
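The failure mode above can be sketched roughly as follows. This is only an illustration, not the actual QueryTraverser code: Doc, DocSource, OomHandlingSketch, and describe are hypothetical names, and the OutOfMemoryError would be simulated by the caller.

```java
// Hypothetical stand-ins for the SPI types; not the real connector manager API.
interface Doc {
  String getDocId();
}

interface DocSource {
  Doc nextDocument();
}

class OomHandlingSketch {
  /** Returns a log-friendly id, tolerating a document that was never assigned. */
  static String describe(Doc doc) {
    return (doc == null) ? "(no document retrieved)" : doc.getDocId();
  }

  /** Fetches one document and reports what happened. */
  static String fetchOne(DocSource source) {
    Doc doc = null;
    try {
      doc = source.nextDocument();  // may throw before doc is assigned
      return "fetched " + describe(doc);
    } catch (OutOfMemoryError e) {
      // Calling doc.getDocId() directly here would throw a
      // NullPointerException, because doc is still null.
      return "out of memory on " + describe(doc);
    }
  }
}
```

The null check in describe is the whole point: the catch block cannot assume that the assignment ever happened.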
Original comment by jl1615@gmail.com
on 8 May 2008 at 11:24
From: johnl@vizdom.com
Subject: Exception handling on QueryTraverser.runBatch
Date: July 3, 2008 3:34:43 PM PDT
Here's a sketch of our proposal for fixing the exception handling.
Marty, I know we've been banging on you pretty hard and it's pretty
late, but if you have a chance to look at this before you go, that
would be great.
For some of the basic bugs, see CM issue 72:
http://code.google.com/p/google-enterprise-connector-manager/issues/detail?id=72
One significant additional issue that we've wrestled with in the
Livelink connector is that there is no way to signal that progress has
been made, but there are no documents to return. This can happen when
the traversal user does not have permission for the entire batch, or
there are errors retrieving documents (we've had several support cases
where customers had development databases with integrity problems).
Here are the main pieces of the proposal:
1. Require a connector to throw a RepositoryException only in the
case of a transient error. If a document cannot be retrieved,
now or ever, then it should be silently skipped.
2. If a RepositoryException is thrown, the batch will be ended.
The checkpoint method will be called. A wait period will be
used before the traversal is resumed (perhaps simply by calling
the existing connectorFinishedTraversal method). Currently, the
single document is skipped (modulo bugs), and checkpoint is
called at the end of the batch, if any documents were returned.
3. If a PushException is thrown, the treatment is similar: the
batch will be abandoned, checkpoint will not be called, and the
traversal will be resumed after a short wait. Currently, the
batch is abandoned with no checkpoint, but no wait.
4. If an OutOfMemoryError is thrown, skip the document. Currently,
the batch is ended, and checkpoint is called.
5. In the return values from startTraversal and resumeTraversal,
distinguish between null and an empty DocumentList. A null value
signals the end of the traversal, and there will be a wait before
the traversal is resumed. An empty DocumentList signals some
progress without documents, the checkpoint method will be called,
and the next batch will be processed immediately, with no wait.
Currently, null and an empty DocumentList are treated the same,
as the end of the traversal, with no call to checkpoint, and a
wait before resuming the traversal.
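As a rough summary of points 2, 3, and 5, the proposed batch-level behavior could be tabulated like this. The enum and its field names are hypothetical, made up for illustration; the values are our reading of the proposal text. Point 4 (OutOfMemoryError) is left out because it skips a single document rather than ending the batch.

```java
/** Proposed reaction to a batch-level event (hypothetical names). */
enum BatchEvent {
  REPOSITORY_EXCEPTION(true, true),   // batch ended, checkpoint, then wait
  PUSH_EXCEPTION(false, true),        // batch abandoned, no checkpoint, short wait
  NULL_DOCUMENT_LIST(false, true),    // end of traversal; no list to checkpoint (our reading)
  EMPTY_DOCUMENT_LIST(true, false);   // progress without documents, continue at once

  final boolean callCheckpoint;
  final boolean waitBeforeResume;

  BatchEvent(boolean callCheckpoint, boolean waitBeforeResume) {
    this.callCheckpoint = callCheckpoint;
    this.waitBeforeResume = waitBeforeResume;
  }
}
```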
Issues with this approach:
A. This does not eliminate the possibility of an infinite loop,
although it reduces the likelihood, and slows it down with the
5 minute waits. If a connector throws a RepositoryException on
a repeatable error, no progress will be made. We're essentially
choosing this over skipping documents, or more likely, sometimes
skipping documents and sometimes entering an infinite loop. This
behavior is at least predictable and easily noticed. This same
behavior is almost inevitable if startTraversal or resumeTraversal
throws an exception. We could add some detection of loops making
no progress, from any of startTraversal, resumeTraversal, or
nextDocument. Perhaps when we address CM issue 79, we could
disable the connector if the traversal has failed n times in a row.
B. The burden is on the connector to distinguish between documents
with permanent errors, which should be skipped, and documents
where transient errors are encountered (such as unreachable
servers), which are retried after a delay to avoid spinning on
repository, server, or network restarts.
C. The burden is also on the connector to return the previous
document's checkpoint after nextDocument throws an exception, or
if it can't do that, it has to throw an exception. If the
checkpoint is returned for the document that threw the exception,
that document will be skipped.
D. A connector can actually decide whether an error is worth waiting
for before resuming the traversal. That is, it can either throw a
RepositoryException from nextDocument, or return null, and in
either case return the last successful document's position from
checkpoint. The former scenario will force a wait, the latter
will not.
E. We have avoided requiring finer-grained exceptions, or a total
ordering, or calling checkpoint after each document is pushed
successfully. You could make an argument for any of them, but we
haven't. We don't see any of them as a silver bullet or even an
obvious improvement.
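For the loop detection floated in issue A, one possible shape is a counter of consecutive failed traversals. This is a sketch only: ConsecutiveFailureTracker and its methods are hypothetical, nothing like it exists in the connector manager as far as this thread shows, and the threshold n is left to the caller.

```java
/** Counts traversal failures in a row (hypothetical helper). */
class ConsecutiveFailureTracker {
  private final int maxFailures;
  private int failuresInARow = 0;

  ConsecutiveFailureTracker(int maxFailures) {
    if (maxFailures < 1) {
      throw new IllegalArgumentException("maxFailures must be at least 1");
    }
    this.maxFailures = maxFailures;
  }

  /** Any batch that makes progress resets the count. */
  void recordSuccess() {
    failuresInARow = 0;
  }

  /** Call when startTraversal, resumeTraversal, or nextDocument fails. */
  void recordFailure() {
    failuresInARow++;
  }

  /** True once the traversal has failed maxFailures times in a row. */
  boolean shouldDisable() {
    return failuresInARow >= maxFailures;
  }
}
```

Resetting on success matters here: transient errors spread over a long traversal should not add up to a disabled connector.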
John Lacy
Original comment by Brett.Mi...@gmail.com
on 9 Jul 2008 at 11:05
Original comment by jl1615@gmail.com
on 10 Jul 2008 at 10:01
Recently, I've come across this problem while indexing a Documentum repository.
Even though a specific document does exist in the repository (retrieved
through a DQL query), we can't fetch or view it using Webtop, DA, or DFC;
here, the issue is persistent.
When this exception is encountered, the connector stops feeding the remaining
documents to the GSA, so the GSA index is not complete.
The enclosed file is the dctm connector log.
Original comment by lightbends
on 19 Aug 2008 at 9:18
A look at lightbends' log file shows the connector is looping, trying to feed
the same broken document over and over again. This is the case where no new
checkpoint is created, causing the same batch to be traversed repeatedly,
failing each time.
Original comment by Brett.Mi...@gmail.com
on 19 Aug 2008 at 6:46
This time, I have a problem with a huge (200+ MB) document.
The connector has already indexed the repository. Today, I injected this
file into our DCTM repository.
But the CM fails to feed this document and goes into an infinite loop.
Documents that were updated or imported later are not getting updated in the
GSA index. This is a major problem, especially when the repository is in
production.
Please fix this bug ASAP.
Original comment by lightbends
on 25 Aug 2008 at 1:32
lightbends' large file problem will be addressed by this fix (when
PushExceptions are handled properly), but it would also be fixed when
Issue 62 (TraversalContext) is fixed: 200MB files should not be fed in the
first place.
Original comment by Brett.Mi...@gmail.com
on 25 Aug 2008 at 4:21
Fixed in r953 | Brett.Michael.Johnson | 2008-09-18 13:41:49 -0700 (Thu, 18 Sep 2008) | 144 lines
Changes to fix Connector Manager Issue 72: Poor Exception Handling in runBatch().
Analysis:
--------
Adapted from an email from John Lacey July 3, 2008,
then updated after a Google in-house meeting a week later.
Here's a sketch of our proposal for fixing the exception handling.
One significant additional issue that we've wrestled with in the
Livelink connector is that there is no way to signal that progress has
been made, but there are no documents to return. This can happen when
the traversal user does not have permission for the entire batch, or
there are errors retrieving documents (we've had several support cases
where customers had development databases with integrity problems).
Here are the main pieces of the proposal:
1. Allows the connector to differentiate between exceptions thrown
by individual documents and those generated by issues of connector
health. If a specific document has a problem (e.g. a corrupt
document or repository record), the connector may throw a new
RepositoryDocumentException. The connector manager will skip
over that document, proceeding to the next item in the
DocumentList. The connector should be aware that checkpoint()
could be called immediately after RepositoryDocumentException
is thrown.
2. If a RepositoryException is thrown, the batch will be ended.
The checkpoint method will be called. A wait period will be
used before the traversal is resumed (perhaps simply by calling
the existing connectorFinishedTraversal method). Currently, the
single document is skipped (modulo bugs), and checkpoint is
called at the end of the batch, if any documents were returned.
3. If a PushException is thrown, the treatment is similar: the
batch will be abandoned, checkpoint will not be called, and the
traversal will be resumed after a short wait. Currently, the
batch is terminated (and checkpoint may or may not be called,
depending on whether the first document fails), and no wait occurs.
4. If an OutOfMemoryError is thrown, skip the document. Currently,
the batch is ended, and checkpoint is called.
5. In the return values from startTraversal and resumeTraversal,
distinguish between null and an empty DocumentList. A null value
signals the end of the traversal, and there will be a wait before
the traversal is resumed. An empty DocumentList signals some
progress without documents, the checkpoint method will be called,
and the next batch will be processed immediately, with no wait.
Currently, null and an empty DocumentList are treated the same,
as the end of the traversal, with no call to checkpoint, and a
wait before resuming the traversal.
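Points 1 and 2 together might look roughly like the loop below. This is a simplified sketch, not the code committed in r953: the two exception classes are stand-ins that only mirror the names used above (the real ones live in com.google.enterprise.connector.spi and may differ in detail), and SimpleDocumentList and SkipOnDocumentErrorSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the names in the change description.
class RepositoryException extends Exception {
  RepositoryException(String message) { super(message); }
}

class RepositoryDocumentException extends RepositoryException {
  RepositoryDocumentException(String message) { super(message); }
}

/** Hypothetical, simplified document list that yields document ids. */
interface SimpleDocumentList {
  /** Returns the next id, or null when the list is exhausted. */
  String nextDocument() throws RepositoryException;
}

class SkipOnDocumentErrorSketch {
  /** Collects ids until the list ends or a batch-level error occurs. */
  static List<String> runBatch(SimpleDocumentList list) {
    List<String> fed = new ArrayList<>();
    while (true) {
      try {
        String id = list.nextDocument();
        if (id == null) {
          break;  // end of this batch
        }
        fed.add(id);
      } catch (RepositoryDocumentException e) {
        // Problem with one document only: skip it and keep going.
        // This catch must come before the superclass catch below.
        continue;
      } catch (RepositoryException e) {
        // Connector or repository trouble: end the batch here.
        break;
      }
    }
    return fed;
  }
}
```

Because RepositoryDocumentException is a subclass, existing code that catches RepositoryException keeps compiling; it just treats a single-document failure as a batch failure until it is updated.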
Issues with this approach:
A. This does not eliminate the possibility of an infinite loop,
although it reduces the likelihood, and slows it down with the
5 minute waits. If a connector throws a RepositoryException on
a repeatable error, no progress will be made. We're essentially
choosing this over skipping documents, or more likely, sometimes
skipping documents and sometimes entering an infinite loop. This
behavior is at least predictable and easily noticed. This same
behavior is almost inevitable if startTraversal or resumeTraversal
throws an exception. We could add some detection of loops making
no progress, from any of startTraversal, resumeTraversal, or
nextDocument. Perhaps when we address CM issue 79, we could
disable the connector if the traversal has failed n times in a row.
B. The burden is on the connector to distinguish between documents
with permanent errors, which should be skipped, and documents
where transient errors are encountered (such as unreachable
servers), which are retried after a delay to avoid spinning on
repository, server, or network restarts.
C. The burden is also on the connector to return the previous
document's checkpoint after nextDocument throws an exception, or
if it can't do that, it has to throw an exception. If the
checkpoint is returned for the document that threw the exception,
that document will be skipped.
D. Although connectors do not need to be immediately updated to
support these modifications, there are performance implications
in not doing so. Connectors that return empty DocumentLists
(rather than null) when there are no more documents to index
will now induce a busy wait. Also, checkpoint() will be called in
more cases (specifically, empty DocumentLists and more Exceptions)
and some existing connectors may not be expecting this,
failing upon having insufficient state to form a checkpoint.
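On the connector side, the change implied by issue D could be as small as the following. This is a sketch under assumptions: NullVersusEmptySketch and batchFor are hypothetical, and plain lists of ids stand in for DocumentList.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class NullVersusEmptySketch {
  /**
   * Builds the result for one batch of candidate ids (hypothetical helper).
   * Returns null when there is nothing left to traverse, and a possibly
   * empty list when candidates were examined but none could be returned.
   */
  static List<String> batchFor(List<String> candidates, Predicate<String> readable) {
    if (candidates.isEmpty()) {
      return null;  // end of traversal: the manager waits before resuming
    }
    List<String> batch = new ArrayList<>();
    for (String id : candidates) {
      if (readable.test(id)) {
        batch.add(id);
      }
    }
    // May be empty: progress was made, so checkpoint is still called and
    // the next batch is requested without a wait.
    return batch;
  }
}
```

A connector that returns the empty list in the first case as well would be the one that spins.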
Change Log:
----------
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
- Differentiate between a null DocumentList and an empty DocumentList
in the count of documents returned. I chose -1 to represent a null
DocumentList and 0 to represent a DocumentList with zero documents.
- Checkpoint any batch that returned a DocumentList, even if that
DocumentList has zero documents.
- Replace several automated catch blocks with meaningful log messages.
- OutOfMemory error only skips the offending document, rather than
abandoning the whole batch.
- Handle RepositoryExceptions in checkpoint meaningfully, specifically
no new checkpoint, force a wait.
M projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/TraversalScheduler.java
- Differentiate between a null DocumentList and an empty DocumentList
in the count of documents returned. I chose -1 to represent a null
DocumentList and 0 to represent a DocumentList with zero documents.
In the case of -1, tell the hostLoadManager to suspend the connector's
traversal for a few minutes. In the case of 0, allow the connector
to be rescheduled immediately. This will cause existing connectors
to spin looking for new content to index until they are fixed up to
return null DocumentLists.
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/Traverser.java
- Document the new behaviours for runBatch.
M projects/connector-manager/source/java/com/google/enterprise/connector/spi/DocumentList.java
- Document the new behaviours for nextDocument and checkpoint.
M projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalManager.java
- Document the new behaviours for startTraversal, resumeTraversal, and checkpoint.
A projects/connector-manager/source/java/com/google/enterprise/connector/spi/RepositoryDocumentException.java
- Subclass of RepositoryException that indicates a single document failure.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/jcr/JcrTraversalManager.java
- Now returns null DocumentList if no results available.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/pusher/MockPusher.java
- Closes input streams after reading content.
- Fixes minor problems detailed in Connector Manager Issue 9.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/jcr/JcrTraversalManagerTest.java
M projects/connector-manager/source/javatests/com/google/enterprise/connector/test/QueryTraversalUtil.java
M projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/QueryTraverserTest.java
- fix tests.
D projects/connector-manager/testdata/dynamicConnectorConfig
- remove unused directory
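The -1 / 0 convention from the QueryTraverser and TraversalScheduler entries could be sketched as below. The class, method, and constant names are hypothetical, and the five-minute value is an assumption taken from the delays mentioned earlier in this issue; the change log itself only says "a few minutes".

```java
import java.util.List;

class BatchCountSketch {
  /** Marker for "the connector returned a null DocumentList". */
  static final int NO_DOCUMENT_LIST = -1;

  /** Assumed suspension time; the change log only says "a few minutes". */
  static final long SUSPEND_MILLIS = 5 * 60 * 1000L;

  /** Traverser side: -1 for null, otherwise the number of documents. */
  static int countFor(List<?> documentList) {
    return (documentList == null) ? NO_DOCUMENT_LIST : documentList.size();
  }

  /** Scheduler side: suspend on -1, reschedule immediately otherwise. */
  static long delayMillisFor(int batchCount) {
    return (batchCount == NO_DOCUMENT_LIST) ? SUSPEND_MILLIS : 0L;
  }
}
```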
Original comment by Brett.Mi...@gmail.com
on 7 Nov 2008 at 12:39
Original comment by jl1615@gmail.com
on 12 Jan 2009 at 3:28
Hey lightbends, did you resolve the RPC issue? Could you please share the
solution? I have the same problem: for a particular object, it always throws
the DM_SESSION_E_RPC_ERROR
when I try to get it by id, even though the object exists.
workitem = (IDfWorkitem) getDfSession().getObject(new DfId(workitemID));
[DM_API_E_EXIST]error: "Document/object specified by <workitemID> does not
exist
[DM_SESSION_E_RPC_ERROR]error: "RPC error 116 occurred: (116) Error performing
send/receive. errno: 0, message: Error 0"
Original comment by sayson...@gmail.com
on 3 Sep 2010 at 3:09
Original issue reported on code.google.com by
jl1615@gmail.com
on 26 Mar 2008 at 11:36