AnantLabs / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

Return batchHint or little more docs from start/resumeTraversal and not 2* batchHint #116

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The SharePoint connector always returns (2 * batchHint) docs from its traversal. That should be changed to return batchHint docs, or a little more, and not (2 * batchHint).

For more details check:

http://code.google.com/p/google-enterprise-connector-manager/source/detail?r=2255

Original issue reported on code.google.com by rakeshs101981@gmail.com on 4 Nov 2009 at 1:29

GoogleCodeExporter commented 9 years ago
Comments from Rakesh on the email thread and feedback from John and Marty:
"I was reviewing the changes related to the batch size.

I am not inclined to remove the (2*batchHint) cap used by the SharePoint connector.
Reasons:

•         The SharePoint connector tries to fetch batchHint docs from each SharePoint List until the total number of docs >= (2*batchHint). This is important because the connector follows a greedy approach of discovering docs from as many Lists in SharePoint as possible in a single batch traversal, rather than being sequential (one List at a time). If the cap of (2*batchHint) is removed, the connector needs to be changed to fetch something like (0.5 * batchHint) docs from a given SharePoint List, so that it still looks for docs from multiple Lists rather than from a single List. This matters during the initial crawl, when the chances of batchHint docs being discovered from every List are high.
•         Even though the connector returns (2*batchHint) docs, BatchSize.maximum ensures that the host load is being honored. In fact, these changes make sense with the SharePoint connector’s implementation, since the chances of all (2*batchHint) docs making it through to the GSA in the same batch traversal are higher than in earlier CM versions.
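
The per-list fetch and the (2*batchHint) cap described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the connector's code: the class name, the method name, and the idea of modelling each SharePoint List as a count of pending docs are all assumptions made for the example.

```java
/** Hypothetical sketch of the capped traversal; not the connector's real classes. */
final class CappedTraversalSketch {

    /**
     * @param pendingDocsPerList undiscovered docs in each SharePoint List (illustrative input)
     * @param batchHint the batch hint given to the traversal
     * @return total docs gathered in one batch traversal
     */
    static int traverse(int[] pendingDocsPerList, int batchHint) {
        int cap = 2 * batchHint;
        int total = 0;
        for (int pending : pendingDocsPerList) {
            if (total >= cap) {
                break; // cap reached: stop visiting further Lists
            }
            total += Math.min(pending, batchHint); // at most batchHint docs per List
        }
        return total;
    }

    public static void main(String[] args) {
        // Initial crawl: every List has more than batchHint docs pending.
        System.out.println(traverse(new int[] {150, 150, 150}, 100)); // 200
        // Quiet period: few docs pending, so the cap is never reached.
        System.out.println(traverse(new int[] {10, 20}, 100)); // 30
    }
}
```

In the initial-crawl case two Lists are enough to reach the cap, which matches the report that the connector consistently returns (2*batchHint) docs.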

Although Marty has pointed out that “a good citizen would follow the 'hint' and not the 'max'”, looking at the above benefits I am inclined to retain the (2*batchHint) cap in the SharePoint connector.

"

Comments from John L:
"
The BatchSize maximum does not ensure the host load is being honored. If the SharePoint connector really returns 2 * batchHint documents in a DocumentList per minute, then the traversal will run at twice the host load.

(There are details there. It will run at twice the host load up to a host load of 500, and then linearly decline to match the host load at 1000. If it takes more than a minute, then the extra time decreases the host load overrun.)
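
One way to read the figures above is sketched below. It assumes that batchHint equals the host load (docs per minute) and that a batch is capped at 1000 docs; that cap is inferred from the quoted numbers (2 * 500 = 1000) and is not confirmed anywhere in this thread. The class and method names are hypothetical, and the sketch only covers host loads from 1 to 1000.

```java
/** Hypothetical model of the host load overrun; the 1000-doc cap is an inferred assumption. */
final class HostLoadOverrunSketch {

    // Assumed upper bound on docs per batch, inferred from "twice up to 500, matching at 1000".
    static final int ASSUMED_MAX_BATCH = 1000;

    /** Docs returned per minute if the connector always fills 2 * batchHint, subject to the assumed cap. */
    static int docsPerMinute(int hostLoad) {
        if (hostLoad < 1 || hostLoad > ASSUMED_MAX_BATCH) {
            throw new IllegalArgumentException("sketch only covers host loads 1.." + ASSUMED_MAX_BATCH);
        }
        return Math.min(2 * hostLoad, ASSUMED_MAX_BATCH);
    }

    /** Ratio of docs actually returned to the configured host load. */
    static double overrunFactor(int hostLoad) {
        return (double) docsPerMinute(hostLoad) / hostLoad;
    }

    public static void main(String[] args) {
        System.out.println(overrunFactor(100));  // 2.0
        System.out.println(overrunFactor(500));  // 2.0
        System.out.println(overrunFactor(1000)); // 1.0
    }
}
```

Under this reading the excess over the host load (docs returned minus host load) shrinks linearly from 500 docs at a host load of 500 to zero at a host load of 1000.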

We seem to all agree that a good citizen will aim for the batch hint, not the batch maximum, but the SharePoint connector is aimed at the batch maximum.

Perhaps a different change would make sense, so that the DocumentList returned to the CM does not consistently have 2 * batchHint documents in it, even if internally the SharePoint connector tries to get 2 * batchHint candidates. Or we can leave it as-is. That's not my first choice, but it's not the end of the world.

I did forget to say that if we implement Eric's suggestion for a delay when the connector exceeds the host load, then the SharePoint connector would not consistently exceed the host load over time, even if it did for every batch.

John L
"

Comments from Marty:
"Sounds like SP is not going to change its behavior, so when there's a lot of new content it will be exceeding the host load. Not great, but on the other hand it's only an issue during the initial crawl and after massive updates; both times, load on the SP servers should be expected."

Original comment by rakeshs101981@gmail.com on 4 Nov 2009 at 1:51

GoogleCodeExporter commented 9 years ago
Final approach:

The current thinking is to always query batchHint docs from each list and return as soon as the number of docs in the current batch traversal >= batchHint.

Even with this, there is a chance that the connector sends close to (2*batchHint) docs, but that is an extreme case. For example, with batchHint=100, if the first 99 lists have 1 doc each and the 100th list has 100 docs, the total docs returned will be 199.

This will be rare.
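
The final approach and the 199-doc extreme case can be checked with a small sketch. As before, this is only an illustration under assumptions: the class and method names are hypothetical, and each list is modelled simply as a count of pending docs rather than through the connector's real API.

```java
import java.util.Arrays;

/** Hypothetical sketch of the final approach; not the connector's real classes. */
final class BatchHintTraversalSketch {

    /** Queries up to batchHint docs per list and returns once the batch holds >= batchHint docs. */
    static int traverse(int[] pendingDocsPerList, int batchHint) {
        int total = 0;
        for (int pending : pendingDocsPerList) {
            if (total >= batchHint) {
                break; // hint satisfied: return the batch
            }
            total += Math.min(pending, batchHint); // at most batchHint docs per list
        }
        return total;
    }

    public static void main(String[] args) {
        // Extreme case from the comment: 99 lists with 1 doc each, then one list with 100 docs.
        int[] extreme = new int[100];
        Arrays.fill(extreme, 1);
        extreme[99] = 100;
        System.out.println(traverse(extreme, 100)); // 199

        // Typical initial crawl: the first list alone satisfies the hint.
        System.out.println(traverse(new int[] {150, 150, 150}, 100)); // 100
    }
}
```

The worst case is (2*batchHint - 1) docs, because the loop only continues while the batch holds fewer than batchHint docs and each list adds at most batchHint more.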

Original comment by rakeshs101981@gmail.com on 4 Nov 2009 at 1:52

GoogleCodeExporter commented 9 years ago
Fix details:

http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=420

Original comment by rakeshs101981@gmail.com on 4 Nov 2009 at 1:53

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 6 Nov 2009 at 11:34

GoogleCodeExporter commented 9 years ago
Verified in 2.4 Release

Original comment by ashwinip...@gmail.com on 14 Dec 2009 at 7:00