For larger files, the complete buffering of the feed content in the
HttpURLConnection output stream is a likely
source of delays. With Java 5, we might be able to use chunked transfer, if the
feedergate on the appliance
supports it.
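If the feedergate does accept chunked requests, the client side might look
roughly like the minimal sketch below; the method name, buffer size, and URL
handling are illustrative assumptions, not the Connector Manager's actual code.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: stream the feed body with chunked Transfer-Encoding (Java 5+)
// instead of letting HttpURLConnection buffer the entire request in memory.
static int postFeedChunked(InputStream feedData, String feedergateUrl)
    throws IOException {
  HttpURLConnection conn =
      (HttpURLConnection) new URL(feedergateUrl).openConnection();
  conn.setDoOutput(true);
  conn.setRequestMethod("POST");
  conn.setChunkedStreamingMode(32 * 1024);  // 0 lets the JDK pick a chunk size
  OutputStream out = conn.getOutputStream();
  byte[] buf = new byte[32 * 1024];
  for (int n; (n = feedData.read(buf)) != -1; ) {
    out.write(buf, 0, n);                   // bytes go out as they are read
  }
  out.close();
  return conn.getResponseCode();            // e.g. 200 when the feed is accepted
}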
Original comment by jl1615@gmail.com
on 12 Dec 2008 at 7:17
Need further investigation on what the bottlenecks are.
Original comment by mgron...@gmail.com
on 28 Jan 2009 at 10:54
I replaced the Base64Encoder class with the iHarder Base64 implementation from
the security-manager. I
then added a new encoder method to the class that accepts a destination buffer.
This allows
Base64FilterInputStream to write Base64 encoded data directly into the reader's
buffer without intermediate
byte[], char[], String, InputStreams, or Writers as was done in the previous
implementation. This improved
performance about 4.5x over the previous Base64FilterInputStream/Base64Encoder
implementation. To
even better leverage that performance increase, I increased the size of the I/O
buffer in GsaFeedConnection
from 2 KB to 32 KB, giving the Base64.encode() method more time in its tight
encoding loop.
This was checked into the trunk as r1703 on 15 April 2009
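The core of the change can be illustrated with the sketch below. This is not
the r1703 code itself (the real Base64FilterInputStream wraps the logic in the
InputStream contract and carries partial 3-byte groups across calls), but it
shows the "encode straight into the caller's buffer" idea.

import java.io.IOException;
import java.io.InputStream;

final class Base64Sketch {
  private static final char[] ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
          .toCharArray();

  // Reads raw bytes from src and writes Base64 bytes directly into
  // dest[off..off+len) -- no intermediate byte[], char[], String, or Writer.
  // Returns the number of encoded bytes written, or -1 at end of stream.
  static int encodeTo(InputStream src, byte[] dest, int off, int len)
      throws IOException {
    int groups = len / 4;                 // 3 raw bytes -> 4 encoded bytes
    if (groups == 0) return 0;
    byte[] raw = new byte[groups * 3];
    int n = 0;
    while (n < raw.length) {              // fill the raw buffer or hit EOF, so
      int r = src.read(raw, n, raw.length - n);  // padding only occurs at the end
      if (r < 0) break;
      n += r;
    }
    if (n == 0) return -1;
    int out = off;
    for (int i = 0; i < n; i += 3) {
      int b0 = raw[i] & 0xff;
      int b1 = (i + 1 < n) ? raw[i + 1] & 0xff : 0;
      int b2 = (i + 2 < n) ? raw[i + 2] & 0xff : 0;
      int v = (b0 << 16) | (b1 << 8) | b2;
      dest[out++] = (byte) ALPHABET[(v >>> 18) & 0x3f];
      dest[out++] = (byte) ALPHABET[(v >>> 12) & 0x3f];
      dest[out++] = (byte) ((i + 1 < n) ? ALPHABET[(v >>> 6) & 0x3f] : '=');
      dest[out++] = (byte) ((i + 2 < n) ? ALPHABET[v & 0x3f] : '=');
    }
    return out - off;
  }
}

A larger I/O buffer (the 2 KB to 32 KB change above) simply lets each call
spend more time in this tight loop and less in per-call overhead.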
Original comment by Brett.Mi...@gmail.com
on 16 Apr 2009 at 9:42
We might also consider compressing the feed. This can either be done prior to
Base64 encoding, or perhaps the
entire feed HTTP connection could be compressed. A discussion between Brett,
Marty, and John follows:
On Wed, Apr 8, 2009 at 7:25 AM, Brett Johnson wrote:
Marty,
[....]
Second, can the GSA accept gzipped content? Newer versions of
the iHarder Base64 class support gzipping the data before Base64
encoding it. Upon examination, their implementation wouldn't be
suitable for our Base64FilterInputStream, but Base64FilterInputStream
could easily wrap its own inputstream with a java.util.zip.GZIPInputStream
to get the same effect.
Brett
----
On Apr 8, 2009, at 6:23 PM, Marty Gronberg wrote:
[...]
On the GSA, the feed parser currently can't handle gzip content and the GSA
will throw away zipped content. I
just talked to the GSA lead and he said they could support gzip without much
problem. We would have to add
another attribute to the <content> element in the XML to indicate the content
is also compressed and how
(gzip/zip/rar/...). I believe they already have some gzip and zip support
on-box. We would also have to agree
on the order - sender: gzip then base64 encode // receiver: base64 decode then
gunzip.
I'll put together a Design Proposal and we can iterate over it with the GSA
team. Related to this, what do you
think about sending multiple documents in a single feed? My main concern is we
would need more than just a
simple "success" response. We should get some type of status for all the
documents in the feed. Thoughts?
Marty
----
On Apr 9, 2009, at 8:45 AM, Brett Johnson wrote:
Marty,
We almost certainly want to compress, then Base64 encode the
compressed data. Three reasons:
1) I have a gut (unproven) feeling that gzip might do a better
job compressing the raw data vs the encoded data.
2) That is the order that the iHarder Base64 encoder uses.
3) The resulting Base64 encoding has no XML special characters
that would need to be escaped in the feed.
I prefer gzip over zip or rar for this purpose because gzip does
not do archiving, so uncompressing would not result in the
possibility of producing multiple files.
As far as sending multiple documents in the feed goes, it is my
understanding that the status return simply acknowledges the
receipt of the feed XML file. The feed file is then queued for
future processing. Any failures processing the feed data are
not communicated back to the CM due to the asynchronous
feed model.
Brett
----
On April 9, 2009, at 12:13 PM, John wrote:
I'm sure there are compression experts running around.
Here's a quick test using the Perl MIME::Decoder::Base64 module:
Word file: 11028 KB
Base64-encoded, then gzipped: 4097 KB in 1.7 seconds
gzipped, then Base64-encoded: 4126 KB in 0.8 seconds
google-connectors log file: 49,874 KB
Base64-encoded, then gzipped: 15870 KB in 10.9 seconds
gzipped, then Base64-encoded: 7538 KB in 3.5 seconds
I used a different Base64 encoder, so the timing is not directly relevant. It
makes sense, though, because the
gzip-first approach touches less data, unless the compression increases the
size by more than a nominal
amount.
If it's Base64-encoded then gzipped, we have to gzip the entire XML envelope of
the feed, whereas if it's gzipped
then Base64-encoded, that's just a variation on the content element as you
mentioned.
I don't know what technology the feedergate uses, but we should check whether
it supports Content-Encoding:
gzip using Apache/mod_deflate or similar. That might even be easier than
supporting gzip in the parser,
although it may not perform as well.
I agree with Brett on the feed status. All we get back is that the server
accepted the feed, which isn't different for
multiple documents compared to a single document. I'm OK with that.
[...]
John L
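For reference, the sender/receiver order agreed on above (gzip, then Base64
encode; Base64 decode, then gunzip) looks roughly like this in Java. This is a
simplified, non-streaming sketch using the modern JDK's java.util.Base64 for
brevity; the method names are illustrative, not the Connector Manager's API.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sender: gzip the raw content, then Base64-encode the compressed bytes.
// The Base64 output contains no XML special characters, so it can go into
// the <content> element without escaping.
static String gzipThenBase64(byte[] rawContent) throws IOException {
  ByteArrayOutputStream compressed = new ByteArrayOutputStream();
  GZIPOutputStream gzip = new GZIPOutputStream(compressed);
  gzip.write(rawContent);
  gzip.finish();                       // flush remaining data and the gzip trailer
  return Base64.getEncoder().encodeToString(compressed.toByteArray());
}

// Receiver: Base64-decode, then gunzip.
static byte[] base64ThenGunzip(String encoded) throws IOException {
  GZIPInputStream in = new GZIPInputStream(
      new ByteArrayInputStream(Base64.getDecoder().decode(encoded)));
  ByteArrayOutputStream raw = new ByteArrayOutputStream();
  byte[] buf = new byte[8192];
  for (int n; (n = in.read(buf)) != -1; ) {
    raw.write(buf, 0, n);
  }
  return raw.toByteArray();
}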
Original comment by Brett.Mi...@gmail.com
on 16 Apr 2009 at 9:53
Original comment by mgron...@gmail.com
on 6 May 2009 at 9:52
From Mohit:
Sounds good. zlib and base64binarycompressed it is.
I will expose the dtd via port 19900 as well (0:19900/dtd). The mere presence
of it
will be sufficient to know if compression is accepted or you can check for the
attribute.
Original comment by mgron...@gmail.com
on 22 May 2009 at 10:44
In the latest 6.2 beta, the dtd is returned via
http://gsa:19900/getdtd
and the encodings supported are:
base64binary | base64compressed
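A client can therefore probe for compression support by fetching the DTD and
looking for that token. A minimal sketch follows; the method name and the
plain substring check are assumptions, not the Connector Manager's actual
detection code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: fetch the DTD from the GSA and check whether the base64compressed
// encoding is listed, i.e. whether compressed content feeds are accepted.
static boolean supportsCompressedFeeds(String gsaHost) throws IOException {
  URL dtdUrl = new URL("http://" + gsaHost + ":19900/getdtd");
  BufferedReader in = new BufferedReader(
      new InputStreamReader(dtdUrl.openStream(), "UTF-8"));
  StringBuilder dtd = new StringBuilder();
  for (String line; (line = in.readLine()) != null; ) {
    dtd.append(line).append('\n');
  }
  in.close();
  return dtd.indexOf("base64compressed") >= 0;
}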
Original comment by Brett.Mi...@gmail.com
on 9 Sep 2009 at 9:24
Original comment by jl1615@gmail.com
on 19 Sep 2009 at 4:03
Full support for compressed content feeds was added in revision r2246
Original comment by Brett.Mi...@gmail.com
on 6 Oct 2009 at 9:07
Multiple documents per feed (Issue 106) was added in revision r2257
Sending feeds to the GSA in a separate thread (Issue 184) was added in revision
r2257
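The separate-thread submission from Issue 184 amounts to queueing finished
feeds onto a dedicated sender thread so traversal does not block while the
previous feed is still being sent. A minimal sketch of the idea; class and
method names are assumptions, not the actual r2257 code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: hand completed feed XML to a single background thread so document
// traversal can continue while the previous feed is still in flight.
class AsyncFeedSenderSketch {
  private final ExecutorService sender = Executors.newSingleThreadExecutor();

  void submit(final byte[] feedXml) {
    sender.submit(new Runnable() {
      public void run() {
        try {
          postToGsa(feedXml);            // hypothetical HTTP POST of the feed
        } catch (Exception e) {
          // real code would log the failure and decide whether to retry
        }
      }
    });
  }

  private void postToGsa(byte[] feedXml) throws Exception {
    // ... send the feed over the feedergate connection (omitted) ...
  }
}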
Original comment by Brett.Mi...@gmail.com
on 6 Oct 2009 at 9:09
Allow batchHint to be a hint - the Connector may return more.
Most of the Connectors, in startTraversal() or resumeTraversal(),
actually collect more than batchHint candidate documents. This
can be a result of dealing with inserts as well as deletes, or
simply a matter of efficiency. The connectors then build
a DocumentList with at most batchHint entries - discarding any
extraneous results. This can be considered wasteful.
This change makes the batchHint a true hint. If the connector
returns more documents than the hint, they may be processed.
This change also treats the host load as a target load, rather
than a hard ceiling. The batchHints are calculated to aim for
the load, however the number of documents processed may exceed
the load. To avoid ill-behaved connectors returning too many
results, a maximum batch size is established. It could exceed
the load by, at most, batchHint.
This change adds a new class BatchSize, which includes the hint,
but also a maximum size.
HostLoadManager.determineBatchHint() has been replaced with
HostLoadManager.determineBatchSize(), which returns a BatchSize
object specifying the preferred and maximum batch size.
This BatchSize object is passed down, eventually reaching
QueryTraverser.runBatch(), which passes the hint on to the
Connector and uses the maximum value to stop processing
documents from the list.
Since this change altered the Traverser.runBatch() interface,
I also took care of an outstanding TODO - have runBatch()
return a BatchResult object rather than an integer number
of docs processed. This cleaned up batch result processing.
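In outline, the new value class looks something like the sketch below; the
field and method names follow the description above, but the bodies are
illustrative, not the committed code.

// Sketch of BatchSize: a preferred hint plus a hard cap on processing.
class BatchSize {
  private final int hint;      // handed to the connector as the traversal batch hint
  private final int maximum;   // QueryTraverser stops consuming the DocumentList here

  BatchSize(int hint, int maximum) {
    this.hint = hint;
    this.maximum = maximum;
  }

  int getHint() { return hint; }
  int getMaximum() { return maximum; }
}

In this shape, HostLoadManager.determineBatchSize() would return something
like new BatchSize(hint, 2 * hint), and QueryTraverser.runBatch() would pass
getHint() to the connector while cutting off DocumentList processing at
getMaximum() documents.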
-----
This new snapshot addresses most of your comments from the
initial snapshot. Specifically:
- The maximum number of documents per batch is no longer
capped by the hostload. The number of docs processed
could exceed the load by, at most, batchHint. This would
occur if the batchHint was set to the remainingDocsToTraverse
in the period, and the connector returned twice that.
This addresses John's concerns while still handling runaway
connectors that return millions of records. This change
makes both the batch hint and the host load "target values",
rather than absolute limits, allowing the connector to
miss the target by a standard deviation.
- I tweaked HostLoadManager.shouldDelay() to return true
(do delay) if the remaining number of docs to traverse
is less than 10% of the load. This addresses John's
problem of traversal requests with single-digit batchHints (see the
sketch after this list).
- I added several unit tests, to test these capabilities.
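The shouldDelay() tweak amounts to roughly the following; the parameter names
are assumptions.

// Sketch of the tweaked HostLoadManager.shouldDelay(): delay the next
// traversal when fewer than 10% of the period's host load remains, so we
// don't issue traversal requests with single-digit batch hints.
boolean shouldDelay(int remainingDocsToTraverse, int hostLoad) {
  return remainingDocsToTraverse < hostLoad / 10;
}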
Original comment by Brett.Mi...@gmail.com
on 6 Oct 2009 at 9:14
INTRODUCTION:
=============
I did some informal performance testing of the Connector
Manager with the Documentum and Livelink Connectors,
comparing the performance of the 2.0.0 release with current
trunk or change branch versions (as of Friday Sept 18).
The tests were run against recent repositories, targeting
ones that have faster back-end servers. These repositories
unfortunately contain small numbers of documents (ones or
tens of thousands). The document sizes, too, tended to
be small (especially in the Documentum repository).
We should really load up 10-100 thousand real-world docs
into our repositories for more realistic tests.
The documents were fed to a 6.2 beta GSA that supports
compressed feeds. No attempt was made to time feed
processing once the feed was handed off to the GSA.
These tests focused on the CM and the Connectors.
The tests were run on my development machine rather
than a dedicated test machine. I made little effort to
minimize other running tasks or to reboot between runs,
although I did shut down memory-hogging
web browsers and noted that nothing else was using
significant computing or memory resources.
Logging levels were set to INFO. Feed logs and
teedFeedFile were disabled. John found that logging
incurs a 10-15% overhead. We can address that issue
separately.
2.0.0 RELEASE RESULTS:
======================
Livelink Repository:
-------------------
Batch Hint: 100
Host Load: 1000
Documents Fed: 1148
Feed Count: 1148
Time: 3m 11s (191s)
Docs per Second: 6.0
Documentum Repository:
---------------------
Batch Hint: 100
Host Load: 1000
Documents Fed: 13,860
Feed Count: 13,860
Time: 74m 7s (4447s)
Docs per Second: 3.1
2.3.x DEVELOPMENT BRANCH RESULTS:
==================================
Connector Manager changes include:
- Multiple Documents per Feed.
- Increased batch size from 100 to 500.
- Compressed document content.
- Asynchronous feed submission*.
- Increased I/O buffer sizes.
- Reduction in I/O buffers and data copying.
Documentum Connector changes include:
- Cached type information.
Livelink Connector changes include:
- none**.
* Because of small document sizes,
all of the tests managed to fit each
traversal batch into a single feed file,
so asynchronous feed submission was not
leveraged.
** Performance profiling has shown that
the Livelink Connector should focus on
improving the retrieval of Category
Attributes.
Livelink Repository:
-------------------
Batch Hint: 500
Host Load: 1000
Documents Fed: 1148
Feed Count: 3
Feed Size: 10,777,048 bytes
Time: 1m 59s (119s)
Docs per Second: 9.6
Documentum Repository:
---------------------
Batch Hint: 500
Host Load: 1000
Documents Fed: 13,860
Feed Count: 28
Feed Size: 34,776,925 bytes
Time: 22m 44s (1364s)
Docs per Second: 10.2
CONCLUSIONS:
============
The modifications improved traversal throughput roughly
1.6x for the Livelink Connector (6.0 to 9.6 docs per
second) and roughly 3.3x for the Documentum Connector
(3.1 to 10.2 docs per second).
We are nearing the 12 documents per second
rate we need to feed 1M docs per day in a single
connector instance. Feeding from 2 or 3 connector
instances could easily meet that rate.
We need more real-world document data. Our sample
sizes are too small.
We might consider lowering the default target feed
size to take better advantage of asynchronous feed
submission. But perhaps real-world documents will
take care of this.
We should consider switching to log4j logging (at
least for the external connectors), which writes
log messages in a separate thread.
If the customer can segment their repository and
use multiple connector instances for each part,
they should. Making it easier to create Collections
from Connector Content would help.
--
Brett M. Johnson
Original comment by Brett.Mi...@gmail.com
on 6 Oct 2009 at 9:27
Closing the lid on this issue with the compression and batch hint changes.
Original comment by jl1615@gmail.com
on 21 Oct 2009 at 7:13
Original comment by Brett.Mi...@gmail.com
on 21 Oct 2009 at 10:19
Original comment by jl1615@gmail.com
on 27 Oct 2009 at 11:03
Original comment by jl1615@gmail.com
on 27 Oct 2009 at 11:05
The doc impact of this issue is the loosening of the batch hint limits, so that
the connector manager might read
more than the batch hint number of documents (currently two times the batch
hint) before stopping the
processing of a DocumentList.
Original comment by jl1615@gmail.com
on 27 Oct 2009 at 11:08
Original issue reported on code.google.com by
jl1615@gmail.com
on 9 Dec 2008 at 1:50