PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Scheduler runs traversals faster than configured traversal rate #67

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Configure a new connector instance.
2. Configure the traversal rate so that it is much slower than the maximum traversal rate.
3. Log the traversal in some way so that the actual traversal rate can be measured.
4. Save the connector configuration to start the traversal.

What is the expected output? What do you see instead?

I have a connector that can traverse maybe 400-500 documents per minute on my laptop. I created a new connector instance and left the traversal rate at the default value of 100 documents per minute. I stopped the traversal about 12 minutes later, and 3962 documents had been traversed. That's about 330 documents per minute, much higher than the configured limit.

What version of the product are you using? On what operating system?

Connector Manager 1.0.3.

Please provide any additional information below.

The crux of the problem is that the TraversalWorkQueueItem class is not thread-safe, but the instances are used by multiple threads. There are at least two issues that I can see. First, the doWork method is not synchronized. When determineBatchHint is allowing traversals, multiple threads grab the batchHint and then wait on the synchronization lock when they call QueryTraverser.runBatch. By the time determineBatchHint is returning 0 to wait for the minute to elapse, it's too late. The multiple threads will all execute resumeTraversal anyway.
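The interleaving above can be sketched deterministically. This is a minimal sketch, not the actual Connector Manager source: the class and field names below (HostLoadSketch, docsTraversedThisPeriod) are assumptions, and the point is only that every thread reads the batch hint before any completed work is reported back, so the per-minute limit never takes effect.

```java
// Sketch of the batch-hint race. Five "threads" each read the hint
// before any batch completes, so each gets the full 100-document hint.
class HostLoadSketch {
    static final int MAX_DOCS_PER_PERIOD = 100;
    int docsTraversedThisPeriod = 0;

    int determineBatchHint() {
        // Remaining budget for this period; 0 would mean "wait".
        return Math.max(MAX_DOCS_PER_PERIOD - docsTraversedThisPeriod, 0);
    }

    void updateNumDocsTraversed(int n) {
        docsTraversedThisPeriod += n;
    }
}

public class BatchHintRace {
    public static void main(String[] args) {
        HostLoadSketch load = new HostLoadSketch();
        // All five workers grab a hint before any batch finishes:
        int[] hints = new int[5];
        for (int i = 0; i < 5; i++) hints[i] = load.determineBatchHint();
        // Each then runs its full batch serially behind the runBatch lock:
        int total = 0;
        for (int hint : hints) {
            load.updateNumDocsTraversed(hint);
            total += hint;
        }
        // Prints 500 -- five times the configured per-period limit.
        System.out.println("documents traversed in one period: " + total);
    }
}
```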

Second, it's not as simple as just synchronizing doWork. The call to getNumDocsTraversed is happening asynchronously from the scheduler thread. It may count traversals multiple times, if multiple calls to getNumDocsTraversed happen between runBatch calls. It may miss traversals, if multiple calls to runBatch happen between getNumDocsTraversed calls.
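The second problem can also be shown with a small sketch (again with assumed names; the real getNumDocsTraversed lives on TraversalWorkQueueItem). If the item simply reports the size of its last completed batch, the scheduler's asynchronous polls both double-count and miss documents depending on the interleaving:

```java
// Sketch of the asynchronous-counting bug: the item overwrites its
// last batch size, and the scheduler adds whatever it sees on each poll.
class WorkItemSketch {
    int lastBatchSize = 0;

    void runBatch(int docs) { lastBatchSize = docs; }  // overwrites, no queue
    int getNumDocsTraversed() { return lastBatchSize; }
}

public class AsyncCountBug {
    public static void main(String[] args) {
        WorkItemSketch item = new WorkItemSketch();
        int counted = 0;

        // Two scheduler polls between batches: 61 docs counted twice.
        item.runBatch(61);
        counted += item.getNumDocsTraversed();
        counted += item.getNumDocsTraversed();
        System.out.println("after double poll: " + counted);   // 122, not 61

        // Two batches between polls: the 50-doc batch is never counted.
        item.runBatch(50);
        item.runBatch(40);
        counted += item.getNumDocsTraversed();
        System.out.println("after missed batch: " + counted);  // 162, not 151
    }
}
```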

The attached log shows both of these problems, at least. It includes several method calls in TraversalScheduler, HostLoadManager, and QueryTraverser, and two method calls in the connector: resumeTraversal, which records calls to that method, and listNodes, which records the number of returned documents. Each entry includes the thread making the call.

Note that the multiple threads problem only happens after the first call to doWork. For the first call to doWork, which will call startTraversal rather than resumeTraversal, the traversal rate is obeyed, modulo problem number 2. That is, it asks for 100 documents, gets 61, counts them twice, and stops, thinking that it has exceeded the 100 documents per minute rate. After the first minute has elapsed, you will see the first call to resumeTraversal. The resumeTraversal calls log the batchHint and the time elapsed since the call to startTraversal in milliseconds.

Starting just before this first call to resumeTraversal, note that WorkQueueThreads 4, 11, 7, 6, and 3 all call determineBatchHint and get 100 back. Those threads are then waiting on the lock for runBatch, but nothing else is going to stop them from running, which is why the traversal runs too fast. You can see that the calls to resumeTraversal for 100 documents are happening as often as every 15 seconds.

I stopped the traversal after about 12 minutes by configuring an empty schedule on the connector configuration page in the admin console. The only lines logged after that were the final two calls to resumeTraversal.

Original issue reported on code.google.com by jl1615@gmail.com on 22 Dec 2007 at 12:54

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 18 Apr 2008 at 10:32

GoogleCodeExporter commented 8 years ago
r781 | mgronber | 2008-05-02 18:15:41 -0700 (Fri, 02 May 2008) | 63 lines

Fix for Bug 689553 (Connector Manager should pause periodically and retry
interval should be configurable via the Admin Console), Issue 66
(TraversalWorkQueueItem.doWork calls connectorFinishedTraversal too often), and
Issue 67 (Scheduler runs traversals faster than configured traversal rate).

The default retry delay between the time when runBatch() returns no documents and the next time the schedule is tested has been changed from 0 to 5 minutes.

The fix for Issue 66 involved fixing the conditional that the TWQI.doWork() method was using to determine if the traversal was finished.

For Issue 67 it turned out that there was a problem with the TraversalScheduler's use of the TraversalWorkQueueItem. I was not involved with the design; however, it appears it was designed so that only one WorkQueueThread was supposed to operate on a given TWQI at a time. In the run() method, when the TWQI was added to the WorkQueue after the first run, its isFinished field was incorrectly left as true. This would cause the subsequent call to TWQI.getNumDocsTraversed() to return right away with 0 rather than waiting until doWork() was finished. This would be recorded as a failure() and the run() loop would continue. In the meantime, the TWQI added to the WorkQueue was being processed by a WorkQueueThread. After the failure in the run() method it would restart the loop and incorrectly add the TWQI back on the WorkQueue, causing it to be picked up by another thread.

This might also be related to the issue where crawl rates seem to decay after some time. The subsequent (false) failures will cause the TraversalScheduler to back off, thinking the Connector is having problems.

Finally, the TraversalScheduler was not able to differentiate between the case where the traversal batch was complete and the case where it was just throttled back because of host load. I have fixed this so it now stays aggressive when the host load was exceeded and only observes the retry delay when it seems the batch has been completed.

ChangeLog:

* projects/connector-manager/source/java/com/google/enterprise/connector/servlet/SetSchedule.java:
  - Changed default retryDelayMillis from 0 to (5 * 60 * 1000), 5 minutes.

* projects/connector-manager/source/java/com/google/enterprise/connector/common/WorkQueue.java:
  - Added some FINEST logging messages to monitor work being added to and
    removed from the WorkQueue, and created a toString() method on the TWQI
    to provide detail to the log.

* projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/TraversalScheduler.java:
  - Removed an unused local variable to remove a warning.
  - Fix for Issue 67, where the work item was being added with its state
    already finished, causing multiple threads to process it.
  - Related fix where an incorrect condition was used to determine the
    failure case.
  - Added a batchHint field to TraversalWorkQueueItem to support the correct
    conditional check for both the finished and failure cases.
  - Changed logging in doWork() from FINER to FINEST since it is used for
    debugging, added the batchHint to the log, and moved log statements
    around the actual call to runBatch().
  - Fix for Issue 66, where connectorFinishedTraversal() was being called
    incorrectly in those cases where the Connector was returning fewer
    documents than the recommended batchHint.

* projects/connector-manager/source/java/com/google/enterprise/connector/scheduler/HostLoadManager.java:
  - Added a Logger.
  - Added FINEST level logging to determineBatchHint() to aid in debugging.

Original comment by mgron...@gmail.com on 3 May 2008 at 1:21

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 19 Jun 2008 at 6:35

GoogleCodeExporter commented 8 years ago
Google bug #737291 is a duplicate of this issue.

Original comment by jl1615@gmail.com on 3 Feb 2009 at 12:41