PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Refactor the work queue threads #110

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
TraversalScheduler.run is essentially single-threaded because
TraversalWorkQueueItem has a wait/notify pair, so the scheduler can't start
another work queue thread until the active one finishes. There may be lots
of cleanup that could be done here.

Original issue reported on code.google.com by jl1615@gmail.com on 9 Dec 2008 at 1:39

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 28 Jan 2009 at 10:27

GoogleCodeExporter commented 8 years ago
Google bug #737300 is a duplicate of this issue.

From Danny Tom:

In TraversalScheduler.run() the method blocks on runnable.getNumDocsTraversed().
It shouldn't since this will block other connectors from getting scheduled.

Instead here are some thoughts on how to fix:

1.) Let TraversalWorkQueueItem call hostLoadManager.updateNumDocsTraversed().
2.) A connector that is currently running should not get rescheduled by
TraversalScheduler.  Instead, TraversalScheduler should just skip those items
that are "in progress".
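
Suggestion 2 can be sketched roughly as below. This is only an illustration, not Connector Manager code: InProgressTracker and its method names are hypothetical. The scheduler would call tryStart() before queueing a connector and skip it when the call returns false; the work item would call finish() when its batch ends.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers which connectors have a batch running. */
final class InProgressTracker {
  // Thread-safe set of connector names that are currently "in progress".
  private final Set<String> running = ConcurrentHashMap.newKeySet();

  /** Returns true if the connector was idle and is now marked as running. */
  boolean tryStart(String connectorName) {
    return running.add(connectorName);
  }

  /** Marks the connector as idle again so it can be rescheduled. */
  void finish(String connectorName) {
    running.remove(connectorName);
  }
}
```

Because Set.add() reports whether the element was newly added, the check and the mark happen in one atomic step, so two scheduler passes cannot both start the same connector.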

Original comment by jl1615@gmail.com on 3 Feb 2009 at 12:58

GoogleCodeExporter commented 8 years ago
We are working with another customer to get their SharePoint server crawled
using the SharePoint connector.

The attached log is the connector log for their production GSA. The
connector has been running for quite a few days and it got stuck yesterday.
Here is a snippet from the connector logs:

FINEST: Adding work:
TraversalWorkQueueItem[this=14207560;connectorName=sp_td_all;traverser=com.google.enterprise.connector.traversal.QueryTraverser@c8570c;numDocsTraversed=0;isFinished=false;numConsecutiveFailures=0;timeOfFirstFailure=0;traversalTimeout=2147483647]
Feb 3, 2009 1:29:38 AM com.google.enterprise.connector.common.WorkQueue removeWork
FINEST: Removing work:
TraversalWorkQueueItem[this=14207560;connectorName=sp_td_all;traverser=com.google.enterprise.connector.traversal.QueryTraverser@c8570c;numDocsTraversed=0;isFinished=false;numConsecutiveFailures=0;timeOfFirstFailure=0;traversalTimeout=2147483647]
Feb 3, 2009 1:29:38 AM
com.google.enterprise.connector.scheduler.HostLoadManager determineBatchHint
FINEST: maxDocsPerPeriod=100
Feb 3, 2009 1:29:38 AM
com.google.enterprise.connector.scheduler.HostLoadManager determineBatchHint
FINEST: docsTraversed=100
Feb 3, 2009 1:29:38 AM
com.google.enterprise.connector.scheduler.HostLoadManager determineBatchHint
FINEST: remainingDocsToTraverse=0
Feb 3, 2009 1:29:38 AM
com.google.enterprise.connector.scheduler.TraversalScheduler$TraversalWorkQueueItem waitTillFinishedOrTimeout
FINEST: Beginning wait (timeout=2147483647)...
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.servlet.ConnectorManagerServlet doPost
INFO: HEADER user-agent: Jakarta Commons-HttpClient/3.0.1
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.servlet.ConnectorManagerServlet doPost
INFO: HEADER host: susday7763.td.teradata.com:6002
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.servlet.ConnectorManagerServlet doPost
INFO: HEADER content-length: 114
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.servlet.ConnectorManagerServlet doPost
INFO: HEADER content-type: text/xml
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.sharepoint.SharepointConnector login
INFO: login()
Feb 3, 2009 10:16:20 AM
com.google.enterprise.connector.sharepoint.SharepointSession <init>
INFO: SharepointSession(SharepointConnector inConnector,SharepointClientContext inSharepointClientContext)

Why is there a gap between 1:29 AM and 10:16 AM? During that period there is
no activity, and after that the CM creates a new SharePointConnector session but
does not call doTraversal to start the traversal.

Can you have a look at the logs and help us out here? Is there something wrong
on the Connector Manager side, or do we have to look more on the connector front?

==============================
I've been looking into it.  It is not a hanging work thread.
The connector is cranking along.  Each call to runBatch() returns
only 10 documents, but it does so quite quickly, so it traverses
100 documents in less than a minute.  It then waits out the rest
of the minute spinning - checking once per second and adding 20
lines of log entries, the only meaningful one being
  "remainingDocsToTraverse=0"

The hostLoadManager is waiting for the minute to expire before
returning a non-zero batchHint().  For some reason, the check
for non-zero batchHint is done inside the workItem thread rather
than the scheduler thread.  This forces the workItem to get
scheduled once per second, only to say "I got nothin".

[In the work I'm doing for Issue 110, I moved that calculation to
the HostLoadManager.shouldDelay() method, so the workItem won't
be made runnable if it has exceeded its quota for the time period.
That also cut down the once-per-second spamming of the log file.]
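
As a rough illustration of doing the quota check in the scheduler thread, here is a minimal sketch. It is not the actual HostLoadManager: the LoadQuota class, its fields, and the explicit nowMillis parameter (used so the example is deterministic) are all assumptions.

```java
/** Hypothetical, simplified per-connector quota; not the real HostLoadManager. */
final class LoadQuota {
  private final int maxDocsPerPeriod;
  private final long periodMillis;
  private long periodStartMillis;
  private int docsThisPeriod;

  LoadQuota(int maxDocsPerPeriod, long periodMillis, long nowMillis) {
    this.maxDocsPerPeriod = maxDocsPerPeriod;
    this.periodMillis = periodMillis;
    this.periodStartMillis = nowMillis;
  }

  /** Called by the work item after a batch to report its document count. */
  synchronized void recordDocsTraversed(int count, long nowMillis) {
    rollPeriodIfExpired(nowMillis);
    docsThisPeriod += count;
  }

  /** Called by the scheduler: true means "do not make this item runnable yet". */
  synchronized boolean shouldDelay(long nowMillis) {
    rollPeriodIfExpired(nowMillis);
    return docsThisPeriod >= maxDocsPerPeriod;
  }

  // Starts a fresh period, with a zeroed counter, once the current one ends.
  private void rollPeriodIfExpired(long nowMillis) {
    if (nowMillis - periodStartMillis >= periodMillis) {
      periodStartMillis = nowMillis;
      docsThisPeriod = 0;
    }
  }
}
```

With this shape, a connector that has used up its 100 documents is simply not queued until the minute rolls over, instead of being woken once per second to discover a batch hint of zero.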

But the once-per-second spamming actually points to the failure.
While waiting for the minute to expire, each second we see:

key: [S] Scheduler Thread    [W] Worker Thread

[S]: Trying to add traversal work to workQueue
[S]: Adding work: TraversalWorkQueueItem
[S]: Beginning wait (timeout=MAX_INT)...
 - Scheduler Thread waits for signal from Worker
[W]: Removing work: TraversalWorkQueueItem
[W]: maxDocsPerPeriod=100, docsTraversed=100, remainingDocsToTraverse=0
 - Worker thread realizes that it can do no work
 - Worker thread finishes workItem and signals Scheduler
[S]: ...ending wait  (after receiving signal)

However in the case of the big gap, we see this:

[S]: Trying to add traversal work to workQueue
[S]: Adding work: TraversalWorkQueueItem
[W]: Removing work: TraversalWorkQueueItem
[W]: maxDocsPerPeriod=100, docsTraversed=100, remainingDocsToTraverse=0
 - Worker thread realizes that it can do no work
 - Worker thread finishes workItem and signals Scheduler
[S]: Beginning wait (timeout=MAX_INT)...
 - Scheduler Thread waits for signal from Worker
 - 8.75 hour delay
 - Servlet thread logging HTTP traffic

Basically, once made runnable, the Worker thread ran to
completion before the Scheduler thread started to wait().
The Worker sent the notifyAll() before the Scheduler
called wait(), so the notification appears to have been
missed.  If this is actually how Java Threads work,
I would consider it a bug, as multithreaded programming
on multi-core CPUs would become extremely fragile.

This could have happened even if the Sharepoint Connector
did not supply a huge timeout, as a timeout of 0
(wait forever) would be used instead to the same effect
(plus or minus a couple hundred years).
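
For context, this is documented behaviour of Object.wait() and notifyAll(): a notification is not remembered if no thread is waiting when it is sent. The usual remedy is to record the event in a flag guarded by the same lock and to wait in a loop that re-checks the flag. The sketch below shows that pattern under stated assumptions; FinishedSignal and its methods are hypothetical names, not the TraversalWorkQueueItem code.

```java
/** Hypothetical stand-in for a work item's "finished" signal. */
final class FinishedSignal {
  private final Object lock = new Object();
  private boolean finished = false;  // guarded by lock

  /** Called by the worker thread when the work item completes. */
  void markFinished() {
    synchronized (lock) {
      finished = true;   // remembered, so a waiter that arrives late still sees it
      lock.notifyAll();
    }
  }

  /**
   * Called by the scheduler thread. Returns true if the item has finished,
   * false if timeoutMillis (which must be positive) elapsed first.
   */
  boolean awaitFinished(long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    synchronized (lock) {
      while (!finished) {  // loop covers early notifications and spurious wake-ups
        long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMillis <= 0) {
          return false;    // never call wait(0), which would wait forever
        }
        try {
          lock.wait(remainingMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();  // preserve the interrupt status
          return finished;
        }
      }
      return true;
    }
  }
}
```

If the worker calls markFinished() before the scheduler reaches awaitFinished(), the scheduler finds the flag already set and returns immediately rather than blocking for hours.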

Original comment by jeffreyl...@gmail.com on 5 Feb 2009 at 6:43

GoogleCodeExporter commented 8 years ago
We should look into replacing the guts of TraversalScheduler and WorkQueue*
with Java 5 java.util.concurrent technology.
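
To give an idea of what a java.util.concurrent version might look like, here is a minimal sketch built on ExecutorService. It is not what was implemented for this issue; NonBlockingScheduler and its methods are hypothetical, and the lambda syntax needs Java 8 or later. The scheduler submits a batch and returns at once, and each task reports its own document count.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical scheduler that never blocks on a running batch. */
final class NonBlockingScheduler {
  private final ExecutorService pool;
  private final AtomicInteger docsTraversed = new AtomicInteger();

  NonBlockingScheduler(int threads) {
    this.pool = Executors.newFixedThreadPool(threads);
  }

  /** Queues the batch and returns immediately; the Future allows cancellation. */
  Future<?> schedule(Callable<Integer> batch) {
    return pool.submit(() -> {
      // The task, not the scheduler, records how many documents it traversed.
      docsTraversed.addAndGet(batch.call());
      return null;
    });
  }

  /** Stops accepting work and waits for queued batches; false on timeout. */
  boolean shutdownAndAwait(long timeoutMillis) {
    pool.shutdown();
    try {
      return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  int totalDocsTraversed() {
    return docsTraversed.get();
  }
}
```

The Future returned by schedule() also gives a standard handle for cancelling a long-running batch via Future.cancel(true), which interrupts the worker thread.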

Original comment by Brett.Mi...@gmail.com on 3 Mar 2009 at 12:48

GoogleCodeExporter commented 8 years ago
The prime motivator was to allow traversal WorkQueueItems
to run concurrently.  They were previously serialized by a
global lock in TraversalScheduler.

There are really 4 design modifications that are at the core
of this change set:

 1) TraversalScheduler no longer blocks waiting for a WorkQueueItem
    to return the number of documents traversed so it can update
    the HostLoadManager.  This was the block that forced
    serialization of the scheduled items.  The WorkQueueItems are
    now responsible for updating numDocsTraversed to the
    HostLoadManager.

 2) Improved cancellability of WorkQueueItems and Traversers.
    Added WorkQueueItem.cancelWork() and Traverser.cancelBatch()
    methods.  This allows better propagation of cancellation on
    configuration changes, work item timeouts, and shutdown.

 3) Improved configurability of timeouts for system administrators.
    This involved modifying the WorkQueue constructors slightly,
    and improved documentation in the WorkQueue bean definition
    in applicationContext.xml.

 4) Deprecated the HasTimeOut SPI Interface.  It didn't actually
    work and the only Connector that specified a timeout gave
    an absurd value of MAX_INT (196 years).  I also removed
    WorkQueueItem timeouts and Traverser timeouts.  Too many
    places to specify timeouts - and none of it really worked.
    The remaining timeout configuration is via 3) above.
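
The cancellation in 2) is cooperative: the traverser has to notice the request itself. A minimal sketch of that idea follows, assuming a volatile flag checked between documents. CancellableTraverser here is a hypothetical class, and the -1 return value is an assumption chosen for the example, not the real Traverser contract.

```java
/** Hypothetical traverser that stops cleanly when asked to cancel. */
final class CancellableTraverser {
  // volatile so a cancel request from another thread is seen promptly
  private volatile boolean cancelled = false;

  /** Requests that the current batch stop; safe to call from any thread. */
  void cancelBatch() {
    cancelled = true;
  }

  boolean isCancelled() {
    return cancelled;
  }

  /**
   * Processes up to batchHint documents. Returns the number processed, or -1
   * if the batch was cancelled, in which case the caller drops the batch.
   */
  int runBatch(int batchHint) {
    int processed = 0;
    while (processed < batchHint) {
      if (cancelled) {
        return -1;
      }
      processed++;  // placeholder for fetching and feeding one document
    }
    return cancelled ? -1 : processed;
  }
}
```

The flag is deliberately never reset in this sketch, so a cancelled traverser stays cancelled.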

TODO: Fix up WorkQueue bean definitions in the security manager???
      Merge in Java 5 fix before checking this in.

Change Log:
----------
M  projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/TraversalScheduler.java
   - Don't wait for WorkItems to return before scheduling the next
     available work.  This means that all available work items should
     be made runnable for concurrent execution.
   - Don't attempt to run WorkItems that are already running.
   - TraversalScheduler no longer makes runnable work items that
     have exceeded their load as specified by HostLoadManager.
   - TraversalScheduler no longer updates the HostLoadManager with
     the number of docs traversed.
   - TraversalWorkQueueItem implements cancelWork() which calls
     Traverser.cancelBatch(), allowing the traverser to stop cleanly.
   - TraversalWorkQueueItems now update the HostLoadManager with
     number of docs traversed.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/Traverser.java
   - Added cancelBatch() method to Interface
   - Removed getTimeoutMillis() method from Interface

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
   - Modified to support new Traverser Interface
   - cancelBatch() forces the traversal to halt and the batch to be dropped.
     (No checkpoint is taken, HostLoadManager is not updated.)

M  projects/connector-manager/source/java/com/google/enterprise/connector/common/WorkQueueThread.java
   - Added public isKilled() method to determine if the WorkQueueThread
     is exiting via interruptAndKill().

M  projects/connector-manager/source/java/com/google/enterprise/connector/common/WorkQueueItem.java
   - Added abstract cancelWork() method
   - Removed getTimeout() method

M  projects/connector-manager/source/java/com/google/enterprise/connector/common/WorkQueue.java
   - Changed constructor timeout parameters from milliseconds to seconds and
     permuted their order so they make more sense to administrators trying to
     configure timeouts.
   - WorkItemTimeout of 0 means no timeout.
   - Cleaned up the logic that determines when the next likely timeout will
     occur.

M  projects/connector-manager/source/java/com/google/enterprise/connector/spi/HasTimeout.java
   - Deprecated

M  projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/HostLoadManager.java
   - Enhanced shouldDelay() to consider numDocsTraversedThisPeriod.
     If the quota has been exceeded, shouldDelay() returns true.
     This avoids WorkItems that get scheduled to run, only to exit
     immediately when they can do no work (batchHint == 0).

M  projects/connector-manager/etc/applicationContext.xml
   - Fix WorkQueue bean definition to support changed WorkQueue constructors.
   - Enhanced documentation on the WorkQueue bean definition parameters.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/instantiator/MockInstantiator.java
   - Support new CancellableQueryTraverser.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/common/WorkQueueThreadTest.java
   - Test cancelWork() functionality.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/common/WorkQueueTest.java
   - Support changed WorkQueue constructor interface.
   - Test cancelWork() functionality.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/NoopQueryTraverser.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/LongRunningQueryTraverser.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/NeverEndingQueryTraverser.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/InterruptibleQueryTraverser.java
   - Modified to support new Traverser Interface

A  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/CancellableQueryTraverser.java
   - New Traverser, like NeverEnding traverser, but exits if cancelled.

M  projects/connector-manager/testdata/mocktestdata/applicationContext.xml
   - Fix WorkQueue bean definition to support changed WorkQueue constructors.

===== Unrelated changes ======

M  projects/connector-manager/source/java/com/google/enterprise/connector/instantiator/InstanceInfo.java
   - Improved logging during persistent store migration.

Original comment by Brett.Mi...@gmail.com on 17 Mar 2009 at 10:20

GoogleCodeExporter commented 8 years ago
I was concerned that the following scenario was still possible:
  - Long running Traversal gets cancelled.
  - Its WorkQueueThread gets replaced, and a new thread is assigned
      to the WorkQueueItem.
  - The WorkQueueItem gets rescheduled for work.
  - TraversalWorkQueueItem.doWork() clears the cancelFlag, and starts working.
  - Long running Traversal comes back alive, does not notice the cancelled state.

The problem is that although replaceHangingThread() disassociates the
hanging thread from the WorkQueueItem, it doesn't disassociate the WorkQueueItem
from the hanging thread.  So the thread does not know that it no longer belongs
to the WorkQueueItem.  [Actually, it does, but QueryTraverser.runBatch() does
not, because the Traverser doesn't know anything about WorkQueueItems,
WorkQueueThreads, etc.]

I needed to preserve the cancelled state of a QueryTraverser for long running
traversals.  A cancelled QueryTraverser could get reused previously because
ConnectorInterfaces cached a traverser.  I modified ConnectorInterfaces to
discard a cancelled cached traverser and construct a new one.  I also null
a cancelled QueryTraverser's TraversalStateStore, so that it won't be able
to save a checkpoint, if it does come back alive.
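
The "discard a cancelled cached traverser" step can be sketched as follows. This is an illustration only, not the ConnectorInterfaces code: TraverserCache, StickyTraverser, and the Supplier-based factory are hypothetical names introduced for the example.

```java
import java.util.function.Supplier;

/** Hypothetical cache that never hands out a cancelled traverser. */
final class TraverserCache {
  /** Hypothetical minimal traverser; only the sticky cancelled flag matters here. */
  static final class StickyTraverser {
    private volatile boolean cancelled = false;

    void cancel() {
      cancelled = true;  // never reset: a cancelled traverser stays cancelled
    }

    boolean isCancelled() {
      return cancelled;
    }
  }

  private final Supplier<StickyTraverser> factory;
  private StickyTraverser cached;  // guarded by this

  TraverserCache(Supplier<StickyTraverser> factory) {
    this.factory = factory;
  }

  /** Returns the cached traverser, replacing it first if it was cancelled. */
  synchronized StickyTraverser get() {
    if (cached == null || cached.isCancelled()) {
      cached = factory.get();
    }
    return cached;
  }
}
```

Because the old instance keeps its cancelled flag, a hung batch that wakes up later still sees that it was cancelled, while new work gets a fresh traverser.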

Upon request, I reverted the earlier changes to the WorkQueue constructors,
preserving the original (non-intuitive) parameter order and time units.

Lastly, we decided to change the default WorkQueue configuration.
We reduced the number of threads in the thread pool from 20 to 10.
We have seen a maximum of 8 or 9 concurrent connectors in the field, and too
many threads in the thread pool are a significant waste of resources.
We also lengthened the WorkQueueItem timeout and KillItem timeout significantly,
going from 5 + 1 minutes to 20 + 10 minutes.  In unusual, but not rare,
customer conditions, certain operations may take 10-20 minutes.  In the past,
we would repeatedly cancel these long running tasks.  Now that long running
traversals will no longer prevent other work from getting done, it seems OK
to give these guys an opportunity to run to completion.

Change Log:
-----------
M  projects/connector-manager/source/java/com/google/enterprise/connector/instantiator/ConnectorInterfaces.java
   - If cached QueryTraverser is cancelled, dump it and construct a new one.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
   - Cancelled QueryTraversers are not reusable.
   - Cancelling a QueryTraverser drops its TraversalStateStore.

M  projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/TraversalScheduler.java
   - If TraversalWorkQueueItem is cancelled, don't update hostLoadManager.

M  projects/connector-manager/etc/applicationContext.xml
M  projects/connector-manager/source/java/com/google/enterprise/connector/common/WorkQueue.java
   - Revert WorkQueue constructor interface.
   - Change default WorkQueue configuration - fewer threads, longer timeouts.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/common/WorkQueueTest.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/pusher/DocPusherTest.java
M  projects/connector-manager/testdata/mocktestdata/applicationContext.xml
   - Revert WorkQueue constructor interface.

Original comment by Brett.Mi...@gmail.com on 17 Mar 2009 at 10:20

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 28 Apr 2009 at 8:42