PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Feed performance improvements #111

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Now that the GSA can handle faster feeds, identify the bottlenecks in the
connector manager. See issue 106 for one idea. We could also compress the
feed, or implement the google:contenturl property (less likely).

Original issue reported on code.google.com by jl1615@gmail.com on 9 Dec 2008 at 1:50

GoogleCodeExporter commented 8 years ago
For larger files, the complete buffering of the feed content in the HttpURLConnection output stream is a likely source of delays. With Java 5, we might be able to use chunked transfer, if the feedergate on the appliance supports it.
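
A rough sketch of what that might look like (illustrative only; the feed URL and chunk size are placeholders, not values from the connector manager):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: stream the feed body with chunked transfer encoding instead of
// letting HttpURLConnection buffer the whole thing before sending.
public class ChunkedFeedPost {
  public static int post(InputStream feed) throws IOException {
    URL url = new URL("http://gsa:19900/xmlfeed");   // placeholder host/path
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    // Java 5+: send the body in 32KB chunks; nothing is buffered beyond one chunk.
    conn.setChunkedStreamingMode(32 * 1024);
    OutputStream out = conn.getOutputStream();
    byte[] buf = new byte[32 * 1024];
    int n;
    while ((n = feed.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    out.close();
    int status = conn.getResponseCode();             // requires feedergate support
    conn.disconnect();
    return status;
  }
}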

Original comment by jl1615@gmail.com on 12 Dec 2008 at 7:17

GoogleCodeExporter commented 8 years ago
Need further investigation on what the bottlenecks are.

Original comment by mgron...@gmail.com on 28 Jan 2009 at 10:54

GoogleCodeExporter commented 8 years ago
I replaced the Base64Encoder class with the iHarder Base64 implementation from the security-manager. I then added a new encoder method to the class that accepts a destination buffer. This allows Base64FilterInputStream to write Base64-encoded data directly into the reader's buffer without the intermediate byte[], char[], String, InputStreams, or Writers used in the previous implementation. This improved performance about 4.5x over the previous Base64FilterInputStream/Base64Encoder implementation. To better leverage that performance increase, I increased the size of the I/O buffer in GsaFeedConnection from 2KB to 32KB, allowing the Base64.encode() method more time in its tight encoding loop.

This was checked into the trunk as r1703 on 15 April 2009
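
For illustration only (this is not the r1703 code), a filter stream that encodes straight into the caller's buffer might look roughly like this:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: Base64-encode directly into the destination buffer passed to read(),
// avoiding the intermediate byte[]/char[]/String/Writer copies described above.
class DirectBase64InputStream extends FilterInputStream {
  private static final char[] ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

  DirectBase64InputStream(InputStream in) {
    super(in);
  }

  // Fills b[off..off+len) with Base64 text; assumes len >= 4.  A production
  // version must also override the single-byte read().
  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    int groups = len / 4;                // 4 output chars per 3 raw bytes
    if (groups == 0) {
      return 0;
    }
    byte[] raw = new byte[groups * 3];
    int n = 0;
    while (n < raw.length) {             // fill fully so padding only occurs at EOF
      int r = in.read(raw, n, raw.length - n);
      if (r < 0) {
        break;
      }
      n += r;
    }
    if (n == 0) {
      return -1;
    }
    int out = off;
    int i = 0;
    for (; i + 3 <= n; i += 3) {         // encode full 3-byte groups
      int bits = ((raw[i] & 0xFF) << 16) | ((raw[i + 1] & 0xFF) << 8) | (raw[i + 2] & 0xFF);
      b[out++] = (byte) ALPHABET[(bits >>> 18) & 0x3F];
      b[out++] = (byte) ALPHABET[(bits >>> 12) & 0x3F];
      b[out++] = (byte) ALPHABET[(bits >>> 6) & 0x3F];
      b[out++] = (byte) ALPHABET[bits & 0x3F];
    }
    if (i < n) {                         // 1 or 2 trailing bytes at end of stream
      int bits = (raw[i] & 0xFF) << 16;
      boolean two = (i + 1 < n);
      if (two) {
        bits |= (raw[i + 1] & 0xFF) << 8;
      }
      b[out++] = (byte) ALPHABET[(bits >>> 18) & 0x3F];
      b[out++] = (byte) ALPHABET[(bits >>> 12) & 0x3F];
      b[out++] = two ? (byte) ALPHABET[(bits >>> 6) & 0x3F] : (byte) '=';
      b[out++] = (byte) '=';
    }
    return out - off;
  }
}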

Original comment by Brett.Mi...@gmail.com on 16 Apr 2009 at 9:42

GoogleCodeExporter commented 8 years ago
We might also consider compressing the feed. This could be done prior to Base64 encoding, or perhaps the entire feed HTTP connection could be compressed. A discussion between Brett, Marty, and John follows:

On Wed, Apr 8, 2009 at 7:25 AM, Brett Johnson wrote:
Marty,
[....]

Second,  can the GSA accept gzipped content?  Newer versions of
the iHarder Base64 class support gzipping the data before Base64
encoding it.   Upon examination, their implementation wouldn't be
suitable for our Base64FilterInputStream, but Base64FilterInputStream
could easily wrap its own inputstream with a java.util.zip.GZIPInputStream
to get the same effect.

Brett

----

On Apr 8, 2009, at 6:23 PM, Marty Gronberg wrote:
[...]

On the GSA, the feed parser currently can't handle gzip content and the GSA will throw away zipped content. I just talked to the GSA lead and he said they could support gzip without much problem. We would have to add another attribute to the <content> element in the XML to indicate the content is also compressed and how (gzip/zip/rar/...). I believe they already have some gzip and zip support on-box. We would also have to agree on the order - sender: gzip then base64 encode // receiver: base64 decode then gunzip.

I'll put together a Design Proposal and we can iterate over it with the GSA team. Related to this, what do you think about sending multiple documents in a single feed? My main concern is we would need more than just a simple "success" response. We should get some type of status for all the documents in the feed. Thoughts?

    Marty

----

On Apr 9, 2009, at 8:45 AM, Brett Johnson wrote:

Marty,

We almost certainly want to compress, then Base64 encode the
compressed data.  Three reasons:
  1)  I have a gut (unproven) feeling that gzip might do a better 
       job compressing the raw data vs the encoded data.
  2) That is the order that the iHarder Base64 encoder uses.
  3) The resulting Base64 encoding has no XML special characters
       that would need to be escaped in the feed.

I prefer gzip over zip or rar for this purpose because gzip does 
not do archiving, so uncompressing would not result in the 
possibility of producing multiple files.

As far as sending multiple documents in the feed goes, it is my
understanding that the status return simply acknowledges the
receipt of the feed XML file.  The feed file is then queued for 
future processing.  Any failures processing the feed data are
not communicated back to the CM due to the asynchronous
feed model.

Brett
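
For illustration, the agreed ordering (compress first, then Base64) might reduce to something like the sketch below. It uses gzip per this email and java.util.Base64 from later JDKs; the connector manager actually used the iHarder encoder, and the thread later settled on zlib for the content encoding.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Sketch: compress the raw document bytes, then Base64-encode the result.
// The Base64 output contains no XML special characters, so it can be placed
// in the <content> element without escaping.
public class CompressThenEncode {
  static String gzipThenBase64(byte[] content) throws IOException {
    ByteArrayOutputStream gzipped = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(gzipped);
    gz.write(content);   // compress first, so less data reaches the encoder
    gz.close();          // flush the gzip trailer before encoding
    return Base64.getEncoder().encodeToString(gzipped.toByteArray());
  }
}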

----

On Apr 9, 2009, at 12:13 PM, John wrote:

I'm sure there are compression experts running around. 
Here's a quick test using the Perl MIME::Decoder::Base64 module:

Word file: 11028 KB
Base64-encoded, then gzipped: 4097 KB in 1.7 seconds
gzipped, then Base64-encoded: 4126 KB in 0.8 seconds

google-connectors log file: 49874 KB
Base64-encoded, then gzipped: 15870 KB in 10.9 seconds
gzipped, then Base64-encoded: 7538 KB in 3.5 seconds

I used a different Base64 encoder, so the timing is not directly relevant. It makes sense, though, because the gzip-first approach touches less data, unless the compression increases the size by more than a nominal amount.

If it's Base64-encoded then gzipped, we have to gzip the entire XML envelope of the feed, whereas if it's gzipped then Base64-encoded, that's just a variation on the content element as you mentioned.

I don't know what technology the feedergate uses, but we should check whether it supports Content-Encoding: gzip using Apache/mod_deflate or similar. That may be even easier than supporting gzip in the parser, although it may not perform as well.

I agree with Brett on the feed status. All we get back is that the server accepted the feed, which isn't different for multiple documents compared to a single document. I'm OK with that.

[...]

John L
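
If the feedergate did turn out to accept a gzip-compressed request body (the open question above), the client side might look roughly like this sketch; the URL is a placeholder and none of this is verified against the appliance.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: compress the entire feed envelope on the wire and
// advertise it with a Content-Encoding header.
public class CompressedEnvelopePost {
  static int post(InputStream feedXml) throws IOException {
    URL url = new URL("http://gsa:19900/xmlfeed");        // placeholder host/path
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Encoding", "gzip");  // only useful if the server honors it
    OutputStream out = new GZIPOutputStream(conn.getOutputStream());
    byte[] buf = new byte[8192];
    int n;
    while ((n = feedXml.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    out.close();                                          // finishes the gzip stream
    int status = conn.getResponseCode();
    conn.disconnect();
    return status;
  }
}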

Original comment by Brett.Mi...@gmail.com on 16 Apr 2009 at 9:53

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 6 May 2009 at 9:52

GoogleCodeExporter commented 8 years ago
From Mohit:

Sounds good. zlib and base64binarycompressed it is.

I will expose the dtd via port 19900 as well (0:19900/dtd). The mere presence of it will be sufficient to know if compression is accepted, or you can check for the attribute.

Original comment by mgron...@gmail.com on 22 May 2009 at 10:44

GoogleCodeExporter commented 8 years ago
In the latest 6.2 beta, the dtd is returned via

http://gsa:19900/getdtd

and the encodings supported are:

base64binary | base64compressed
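
A minimal sketch of the capability probe this enables, assuming a simple substring check on the DTD is good enough (hostname and error handling are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: fetch the feed DTD from the appliance and look for the
// compressed-content encoding token.
public class FeedDtdProbe {
  static boolean supportsCompressedFeeds(String gsaHost) throws IOException {
    URL url = new URL("http://" + gsaHost + ":19900/getdtd");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
      return false;   // older appliances do not expose the DTD at all
    }
    StringBuilder dtd = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      dtd.append(line).append('\n');
    }
    in.close();
    conn.disconnect();
    // The 6.2 beta DTD lists the content encodings as base64binary | base64compressed.
    return dtd.indexOf("base64compressed") >= 0;
  }
}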

Original comment by Brett.Mi...@gmail.com on 9 Sep 2009 at 9:24

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 19 Sep 2009 at 4:03

GoogleCodeExporter commented 8 years ago
Full support for compressed content feeds was added in revision r2246

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:07

GoogleCodeExporter commented 8 years ago
Multiple documents per feed (Issue 106) was added in revision r2257
Sending feeds to the GSA in a separate thread (Issue 184) was added in revision r2257

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:09

GoogleCodeExporter commented 8 years ago
Allow batchHint to be a hint - the Connector may return more.

Most of the Connectors, in startTraversal() or resumeTraversal()
actually collect more than batchHint candidate documents.  This
can be a result of dealing with inserts as well as deletes, or
simply as a matter of efficiency.  The connectors then build
a DocumentList with at most batchHint entries - discarding any
extraneous results.  This can be considered wasteful.

This change makes the batchHint a true hint.  If the connector
returns more documents than the hint, they may be processed.
This change also treats the host load as a target load, rather
than a hard ceiling.  The batchHints are calculated to aim for
the load, however the number of documents processed may exceed
the load.  To avoid ill-behaved connectors returning too many
results, a maximum batch size is established.  It could exceed
the load by, at most, batchHint.

This change adds a new class BatchSize, which includes the hint,
but also a maximum size.
HostLoadManager.determineBatchHint() has been replaced with
HostLoadManager.determineBatchSize(), which returns a BatchSize
object specifying the preferred and maximum batch size.

This BatchSize object is passed down, eventually reaching
QueryTraverser.runBatch(), which passes the hint on to the
Connector, and uses the maximum value to stop processing
documents from the list.

Since this change altered the Traverser.runBatch() interface,
I also took care of an outstanding TODO - have runBatch()
return a BatchResult object rather than an integer number
of docs processed.  This cleaned up batch result processing.
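
A minimal sketch of the shape described above (names follow this comment, not the checked-in change; the DocumentList interface here is a stand-in trimmed to the one method the loop needs):

// Sketch: a BatchSize carries both the preferred hint and a hard maximum,
// and a runBatch()-style loop honors the hint loosely but the maximum strictly.
public class BatchSizeSketch {

  static final class BatchSize {
    final int hint;      // target passed to the connector's traversal
    final int maximum;   // hard stop; may exceed the host load by at most one hint
    BatchSize(int hint, int maximum) {
      this.hint = hint;
      this.maximum = maximum;
    }
  }

  interface Document { /* placeholder for the SPI document */ }

  interface DocumentList {
    Document nextDocument();   // returns null when the list is exhausted
  }

  static int processBatch(DocumentList docs, BatchSize batch) {
    int processed = 0;
    Document doc;
    while (processed < batch.maximum && (doc = docs.nextDocument()) != null) {
      // feed the document (omitted); the connector may have returned more
      // than batch.hint documents, and those extras are now processed too
      processed++;
    }
    return processed;
  }
}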

-----

This new snapshot addresses most of your comments from the
initial snapshot. Specifically:
- The maximum number of documents per batch is no longer
  capped by the hostload.  The number of docs processed
  could exceed the load by, at most, batchHint.  This would
  occur if the batchHint was set to the remainingDocsToTraverse
  in the period, and the connector returned twice that.
  This addresses John's concerns while still handling runaway
  connectors that return millions of records.  This change
  makes both the batch hint and the host load "target values",
  rather than absolute limits, allowing the connector to
  miss the target by a standard deviation.

- I tweaked HostLoadManager.shouldDelay() to return true
  (do delay) if the remaining number of docs to traverse
  is less than 10% of the load.  This addresses John's
  problem of traversal requests with single-digit
  batchHints.  (See the sketch after this list.)

- I added several unit tests, to test these capabilities.
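
Illustrative only (not the actual HostLoadManager code), the 10% rule in the second bullet might reduce to:

// Sketch: delay the next traversal batch when the remaining allowance for
// this period would produce only a tiny (single-digit) batch hint.
public class LoadDelaySketch {
  static boolean shouldDelay(int remainingDocsToTraverse, int hostLoad) {
    return remainingDocsToTraverse < hostLoad / 10;
  }
}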

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:14

GoogleCodeExporter commented 8 years ago
INTRODUCTION:
=============

I did some informal performance testing of the Connector
Manager with the Documentum and Livelink Connectors,
comparing the performance of the 2.0.0 release with current
trunk or change branch versions (as of Friday Sept 18).

The tests were run against recent repositories, targeting
ones that have faster back-end servers.  These repositories
unfortunately contain small numbers of documents (ones or
tens of thousands).  The document sizes, too, tended to
be small (especially in the Documentum repository).
We should really load up 10-100 thousand real-world docs
into our repositories for more realistic tests.

The documents were fed to a 6.2 beta GSA that supports
compressed feeds.  No attempt was made to time feed
processing once the feed was handed off to the GSA.
These tests focused on the CM and the Connectors.

The tests were run on my development machine rather
than a dedicated test machine.  I made little effort
to minimize other running tasks or to reboot between
runs, although I did shut down memory-hogging web
browsers and noted that nothing else was using
significant computing or memory resources.

Logging levels were set to INFO.  Feed logs and
teedFeedFile were disabled.  John found that logging
incurs a 10-15% overhead.  We can address that issue
separately.

2.0.0 RELEASE RESULTS:
======================

Livelink Repository:
-------------------
Batch Hint: 100
Host Load: 1000
Documents Fed: 1148
Feed Count: 1148
Time: 3m 11s (191s)
Docs per Second: 6.0

Documentum Repository:
---------------------
Batch Hint: 100
Host Load: 1000
Documents Fed: 13,860
Feed Count: 13,860
Time: 74m 7s (4447s)
Docs per Second: 3.1

2.3.x DEVELOPMENT BRANCH RESULTS:
==================================

Connector Manager changes include:
- Multiple Documents per Feed.
- Increased batch size from 100 to 500.
- Compressed document content.
- Asynchronous feed submission*.
- Increased I/O buffer sizes.
- Reduction in I/O buffers and data copying.

Documentum Connector changes include:
- Cached type information.

Livelink Connector changes include:
- none**.

* Because of small document sizes,
 all of the tests managed to fit each
 traversal batch into a single feed file,
 so asynchronous feed submission was not
 leveraged.

** Performance profiling has shown that
   the Livelink Connector should focus on
  improving the retrieval of Category
  Attributes.

Livelink Repository:
-------------------
Batch Hint: 500
Host Load: 1000
Documents Fed: 1148
Feed Count: 3
Feed Size: 10,777,048 bytes
Time: 1m 59s (119s)
Docs per Second: 9.6

Documentum Repository:
---------------------
Batch Hint: 500
Host Load: 1000
Documents Fed: 13,860
Feed Count: 28
Feed Size: 34,776,925 bytes
Time: 22m 44s (1364s)
Docs per Second: 10.2

CONCLUSIONS:
============

The modifications cut wall-clock traversal time
from 191s to 119s (roughly 1.6x) for the Livelink
Connector, and from 4447s to 1364s (roughly 3.3x)
for the Documentum Connector.  We are nearing the
12 documents per second
rate we need to feed 1M docs per day in a single
connector instance.  Feeding from 2 or 3 connector
instances could easily meet that rate.

We need more real-world document data.  Our sample
sizes are too small.

We might consider lowering the default target feed
size to take better advantage of asynchronous feed
submission.  But perhaps real-world documents will
take care of this.

We should consider switching to log4j logging (at
least for the external connectors), which writes
log messages in a separate thread.

If the customer can segment their repository and
use multiple connector instances for each part,
they should.  Making it easier to create Collections
from Connector Content would help.

--
Brett M. Johnson

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:27

GoogleCodeExporter commented 8 years ago
Closing the lid on this issue with the compression and batch hint changes.

Original comment by jl1615@gmail.com on 21 Oct 2009 at 7:13

GoogleCodeExporter commented 8 years ago

Original comment by Brett.Mi...@gmail.com on 21 Oct 2009 at 10:19

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 27 Oct 2009 at 11:03

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 27 Oct 2009 at 11:05

GoogleCodeExporter commented 8 years ago
The doc impact of this issue is the loosening of the batch hint limits, so that the connector manager might read more than the batch hint number of documents (currently two times the batch hint) before stopping the processing of a DocumentList.

Original comment by jl1615@gmail.com on 27 Oct 2009 at 11:08