PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Use batch hint to control the size of feed, instead of 1 document per feed #106

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
-   The existing implementation of the connector manager processes individual documents from the crawl queue and sends each one to the GSA as an independent feed. There are a few concerns with this behavior:
o   It generates a lot of network traffic, since each document must be sent as a separate feed with its own XML envelope.
o   It is a lot of overhead for the connector manager to package the content of each document into XML and send it as a feed. The CM also needs to check delivery failures and fail over for each document.
o   It would be an overhead from the GSA's perspective as well to read and parse one XML feed per document. This adds processing load on the GSA when the connectors crawl a large number of documents.

What is the expected output? What do you see instead?
-   The connector manager could consolidate multiple documents into a single feed, based on the batch hint, the memory available, and the feed size limits.
-   The connector manager's processing logic could accommodate this feature by purging the processed documents only after the feed has succeeded; otherwise it could reduce the batch size on each retry.
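The batching and retry behavior described above could be sketched roughly as follows. This is a hypothetical illustration, not code from the connector manager: the class name `FeedBatcher`, the use of string length as a stand-in for document size, and the halve-on-failure policy are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of accumulating documents into one feed, bounded
// by both a batch hint and a feed size limit, with the batch size
// reduced on each failed delivery attempt.
public class FeedBatcher {
    private int batchSize;            // current batch hint
    private final long maxFeedBytes;  // feed size limit

    public FeedBatcher(int initialBatchSize, long maxFeedBytes) {
        this.batchSize = initialBatchSize;
        this.maxFeedBytes = maxFeedBytes;
    }

    // Collects up to batchSize documents from the head of the queue,
    // stopping early if the accumulated size would exceed the limit.
    // (String length stands in for real document size here.)
    public List<String> nextBatch(List<String> queue) {
        List<String> batch = new ArrayList<>();
        long bytes = 0;
        for (String doc : queue) {
            if (batch.size() >= batchSize || bytes + doc.length() > maxFeedBytes) {
                break;
            }
            batch.add(doc);
            bytes += doc.length();
        }
        return batch;
    }

    // On a failed feed, halve the batch size before retrying; the
    // documents stay in the queue and are purged only after success.
    public void onFeedFailure() {
        batchSize = Math.max(1, batchSize / 2);
    }

    public int getBatchSize() {
        return batchSize;
    }
}
```

A caller would keep the same documents queued across retries and only remove them once the GSA acknowledges the feed, matching the "purge only after the feed was successful" behavior requested above.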

Original issue reported on code.google.com by mwarti...@gmail.com on 4 Aug 2008 at 8:43

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 21 Oct 2008 at 7:06

GoogleCodeExporter commented 8 years ago
In order to do this, we need to replace the code in GsaFeedConnection that uses HttpURLConnection. The URL connection output stream is buffered in the java.net classes, so with multiple documents per feed we run a serious risk of an OutOfMemoryError.
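One standard way to avoid that buffering, if HttpURLConnection were kept, would be chunked streaming. The sketch below is illustrative only (the class name, the chunk size, and the `Iterable<byte[]>` shape of the payload are assumptions); `setChunkedStreamingMode` is a real `HttpURLConnection` method that sends data as it is written instead of buffering the whole request body.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: posting a multi-document feed without buffering
// the entire body in memory.
public class StreamingFeedPost {
    // Writes the feed chunks to the given stream; factored out so the
    // writing logic can be exercised without a live connection.
    static void writeFeed(OutputStream out, Iterable<byte[]> chunks)
            throws IOException {
        for (byte[] chunk : chunks) {
            out.write(chunk);
        }
    }

    public static int postFeed(URL feedUrl, Iterable<byte[]> chunks)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) feedUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Chunked streaming: the java.net layer forwards each chunk as
        // it is written, so the feed never has to fit in memory at once.
        conn.setChunkedStreamingMode(8192);
        try (OutputStream out = conn.getOutputStream()) {
            writeFeed(out, chunks);
        }
        return conn.getResponseCode();
    }
}
```

The tradeoff is that chunked streaming disables automatic retries on authentication challenges, which may be one reason the project chose to replace the HttpURLConnection code instead.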

Original comment by jl1615@gmail.com on 8 Dec 2008 at 11:43

GoogleCodeExporter commented 8 years ago
Google bug #699631 is a duplicate of this issue. This is under the umbrella of issue 111, and it is also related to issue 24. Finally, a note from Max in #699631: "It would be more efficient, but is it really a bottleneck? We should measure before fixing."

Original comment by jl1615@gmail.com on 3 Feb 2009 at 9:15

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 6 May 2009 at 10:07

GoogleCodeExporter commented 8 years ago
See also Issue 111.

Original comment by mgron...@gmail.com on 6 May 2009 at 10:07

GoogleCodeExporter commented 8 years ago

Original comment by Brett.Mi...@gmail.com on 29 May 2009 at 4:35

GoogleCodeExporter commented 8 years ago
Even with all these changes, my understanding is that the documents will still be sent sequentially. Could we make the connector manager's calls to "nextDocument()" non-blocking/multi-threaded, with a configurable degree of concurrency? Of course, this changes how checkpointing works.
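The concurrency suggested in this comment could look roughly like the sketch below. Everything here is hypothetical: `processDocument` is a placeholder for per-document feed packaging, and real checkpoint handling would need to change, as the comment notes, since results can complete out of order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: processing traversal documents in parallel with
// a configurable number of worker threads.
public class ParallelTraversal {
    public static List<String> processAll(List<String> docs, int concurrency)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                futures.add(pool.submit(() -> processDocument(doc)));
            }
            // Collect in submission order, so downstream checkpointing
            // still sees documents in traversal order even though the
            // work itself ran concurrently.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder for packaging one document into feed content.
    static String processDocument(String doc) {
        return "fed:" + doc;
    }
}
```

Collecting futures in submission order is one simple way to keep a sequential checkpoint semantics on top of concurrent processing, at the cost of head-of-line blocking on slow documents.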

Original comment by jeffreyl...@gmail.com on 17 Aug 2009 at 10:53

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 2 Sep 2009 at 8:02

GoogleCodeExporter commented 8 years ago
Support for multiple documents per feed was added to the 2.0.x branch and is available in the 2.0.2 release. For details see r2223, r2233, and r2239.

Work to integrate this into the trunk is ongoing.

Original comment by Brett.Mi...@gmail.com on 9 Sep 2009 at 7:22

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 19 Sep 2009 at 3:38

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 19 Sep 2009 at 4:03

GoogleCodeExporter commented 8 years ago
Fixed on 1 October 2009 in revision r2257.

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:04

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 24 Oct 2009 at 2:59