PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager
0 stars 0 forks source link

Fix exception handling in DocPusher #108

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. DocumentList.nextDocument() returns a Document object with no google:docid 
field (or 
docid is null)
2. DocPusher.getRequiredString() throws RuntimeException when failing to fetch 
google:docid 
value.
3. Uncaught RuntimeException halts traversal.

This problem was encountered by a customer using the Documentum connector and a 
pre- 
issue 72 version of the connector manager.

An excerpt from an email entitled "Re: [C#356549206] [google-connector-team] 
Re: 
Documentum Issue": 

[start of excerpt]

Then concerning the fact that the connector stops crawling after having 
encountered this 
problem, I caught the exception in the fetch and findproperty methods of 
DctmSysobjectDocument.

but in this case, even if the id problem exists only on one document, I get a 
warning from the 
DocPusher and then an exception is thrown by the WorkQueueThread because a 
required 
property is missing and the connector stops crawling :

INFO: r_object_id of the fetched object is 090004578001g915
31-Oct-2008 15:11:23 com.google.enterprise.connector.dctm.DctmSysobjectDocument 
fetch
SEVERE: Exception thrown while trying to get the object 090004578001f229 : 
[DM_API_E_EXIST]error:  "Document/object specified by 090004578001g915 does not 
exist."

[DM_SYSOBJECT_E_CANT_FETCH_INVALID_ID]error:  "Cannot fetch a sysobject - 
Invalid object ID 
090004578001g915"

31-Oct-2008 15:11:23 
com.google.enterprise.connector.dctm.dctmdfcwrap.DmSessionManager 
release
FINEST: session released
31-Oct-2008 15:11:23 com.google.enterprise.connector.dctm.DctmSysobjectDocument 
fetch
FINE: session released
31-Oct-2008 15:11:23 com.google.enterprise.connector.dctm.DctmSysobjectDocument 
findProperty
FINE: property google:docid has the value null
31-Oct-2008 15:11:23 com.google.enterprise.connector.pusher.DocPusher 
getRequiredString
WARNING: Document missing required property google:docid
31-Oct-2008 15:11:23 com.google.enterprise.connector.common.WorkQueueThread run
WARNING: WorkQueueThread work had problems:
java.lang.RuntimeException: Document missing required property google:docid
       at 
com.google.enterprise.connector.pusher.DocPusher.getRequiredString(DocPusher.jav
a:456)
       at com.google.enterprise.connector.pusher.DocPusher.buildXmlData(DocPusher.java:526)
       at com.google.enterprise.connector.pusher.DocPusher.take(DocPusher.java:690)
       at 
com.google.enterprise.connector.traversal.QueryTraverser.runBatch(QueryTraverser
.java:131)
       at 
com.google.enterprise.connector.scheduler.TraversalScheduler$TraversalWorkQueueI
tem.doWork
(TraversalScheduler.java:367)
       at 
com.google.enterprise.connector.common.WorkQueueThread.run(WorkQueueThread.java:
83)

Since the findProperty method is directly called by the DocPusher, I need to 
see perhaps with 
Marty and Jeff what we are supposed to do to prevent the connector from 
stopping when the 
DmSysobject object (from which we get all the properties) can't be created.

[end of excerpt]

What is the expected output? What do you see instead?

This is related to Connector Manager Issue 72 - Handling exceptions thrown in 
QueryTraverser.runBatch().   Although I addressed the connectors with regards
to throwing exceptions from DocumentList.nextDocument(), I did not look into 
what happens when exceptions are thrown from Document.findProperty() or
Document.getPropertyNames().  This problem also caused me to look more
closely at DocPusher.

Unfortunately we cannot work around the issue using the currently released
Connector Manager.  However even using the unreleased nightly build of
the Connector Manager wont fix the problem.   When a required property is null
(or missing) DocPusher throws a RuntimeException.   Although the exception
would get caught by the new QueryTraverser.runBatch() [rather than propagating
uncaught all the way up to WorkQueue, as seen below], runBatch will now 
drop the whole batch of documents on the floor.  Traversal will resume using
the old checkpoint.  The problem document will again be submitted, and it
will crash.  We would loop trying to fetch this one document repeatedly.

My suggested work around: Patch a post-Issue 72 snapshot of the Connector
Manager to throw RepositoryDocumentException, rather than RuntimeException
if a required Document property is missing or null.  QueryTraverser.runbatch()
will catch the RepositoryDocumentException and skip the problem document.

Additionally, some follow-on work to Issue 72 should be done before the 
release next month:
  - Document.findProperty() [and possibly Document.getPropertyNames()]
    interfaces should be enhanced to throw RepositoryDocumentException.

 - The Connector implementations of the Document Interface should try to
    distinguish transient problems from document-specific problems and throw
    RepositoryExceptions for the former, and RepositoryDocumentExceptions
    for the latter.

 - DocPusher should really not throw RuntimeExceptions.   It currently
   throws them for most errors it encounters.  It also catches most other
   exceptions and rethrows them as RuntimeException.   I don't know the
   reasoning behind this.  DocPusher should throw:
   - RepositoryException if it gets one from the Document methods.
   - RepositoryDocumentException if it gets one from the Document methods.
   - RepositoryDocumentException if it gets an invalid Document object
     (like one with missing required properties), or has problems extracting
     data from the DocumentObject, or even if DocPusher.buildXmlData()
     fails for most reasons.
   - FeedException thrown by the Feed API.
   - PushException for internal Pusher problems (like TeedFeedFile issues).

Original issue reported on code.google.com by Brett.Mi...@gmail.com on 31 Oct 2008 at 9:53

GoogleCodeExporter commented 8 years ago
Fixed in r1137 

This set of changes addresses Connector Manager Issue 108 -
Exception handling in DocPusher.take()

This is related to Connector Manager  Issue 72  - Handling exceptions thrown in
QueryTraverser.runBatch().   Although I addressed the connectors with regards
to throwing exceptions from DocumentList.nextDocument(), I did not look into
what happens when exceptions are thrown from Document.findProperty() or
Document.getPropertyNames().  This problem also caused me to look more
closely at DocPusher.

Unfortunately we cannot work around the issue using the currently released
Connector Manager.  However even using the unreleased nightly build of the
Connector Manager won't fix the problem.   When a required property is null
(or missing) DocPusher throws a RuntimeException.   Although the exception
would get caught by the new QueryTraverser.runBatch() [rather than propagating
uncaught all the way up to WorkQueue, as seen below], runBatch will now
drop the whole batch of documents on the floor.  Traversal will resume using
the old checkpoint.  The problem document will again be submitted, and it
will again crash.  We would loop trying to fetch this one document repeatedly.

My suggested work around: Patch a post- Issue 72  snapshot of the Connector
Manager to throw RepositoryDocumentException, rather than RuntimeException
if a required Document property is missing or null.  QueryTraverser.runbatch()
will catch the RepositoryDocumentException and skip the problem document.

Additionally, some follow-on work to  Issue 72  should be done before the
release next month:

Document.findProperty() [and possibly Document.getPropertyNames()]
interfaces should be enhanced to throw RepositoryDocumentException.

The Connector implementations of the Document Interface should try to
distinguish transient problems from document-specific problems and throw
RepositoryExceptions for the former, and RepositoryDocumentExceptions
for the latter.

DocPusher should really not throw RuntimeExceptions.   It currently
throws them for most errors it encounters.  It also catches most other
exceptions and rethrows them as RuntimeException.   I don't know the
reasoning behind this.  DocPusher should throw:
  - RepositoryException if it gets one from the Document methods.
  - RepositoryDocumentException if it gets one from the Document methods.
  - RepositoryDocumentException if it gets an invalid Document object
    (like one with missing required properties), or has problems extracting
    data from the Document object, or even if DocPusher.buildXmlData()
    fails for most reasons.
  - FeedException thrown by the Feed API.
  - PushException for internal Pusher problems (like TeedFeedFile issues).

Change Log:
----------
M  
projects/connector-manager/source/java/com/google/enterprise/connector/pusher/Pu
sher.java
   - Pusher.take() can now throw FeedException, PushException, RepositoryException.

M  
projects/connector-manager/source/java/com/google/enterprise/connector/pusher/Do
cPusher.java
   - DocPusher.take() now throws more kinds of Exceptions.
   - No longer catch Exceptions and rethrow as RuntimeExceptions.

M  projects/connector-
manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.jav
a
   - QueryTraverser.runBatch() handles FeedExceptions thrown by Pusher.take()
   - QueryTraverser.runBatch() handles RuntimeExceptions when processing a
     Document and skips that document.

M  projects/connector-
manager/source/java/com/google/enterprise/connector/spi/Document.java
   - Enhanced comments to include throwing RepositoryDocumentException
     for non-transient fatal document errors.

M  projects/connector-
manager/source/javatests/com/google/enterprise/connector/pusher/DocPusherTest.ja
va
   - Fix up tests to handle the new Pusher.take() Exceptions thrown
   - Add tests for throwing RepositoryDocumentExceptions in certain conditions

M  
projects/connectormanager/source/javatests/com/google/enterprise/connector/pushe
r/MockPusher.java
M  projects/connector-
manager/source/javatests/com/google/enterprise/connector/pusher/MockPusherTest.j
ava
   - Fix up tests to handle the new Pusher.take() Exceptions thrown

Original comment by Brett.Mi...@gmail.com on 15 Nov 2008 at 4:07

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 12 Jan 2009 at 3:31