PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Enforce maxDocumentSize in feed. #182

Closed. GoogleCodeExporter closed this issue 8 years ago.

GoogleCodeExporter commented 8 years ago
Issue 106 adds the capability of creating a single feed containing
multiple documents. This made it much more difficult to handle
OutOfMemoryErrors when processing a single large document. Indeed,
there could be the [unlikely] situation where adding a relatively
small document to a relatively large feed could get an
OutOfMemoryError.

In the past, with one document per feed, we could treat an
OutOfMemoryError with the assumption that that single document was
too big. QueryTraverser would skip that document and go on to the
next one. Now that we can accumulate many documents into a single
feed, the earlier assumption is no longer valid. An OutOfMemoryError
represents a true resource problem, one that might require the
administrator to tune the JVM memory configuration, the
maxDocumentSize, the maxFeedSize, or the number of concurrently
traversing connector instances.

Early snapshots of the Issue 106 changes mirrored the old behavior:
large documents were skipped in their entirety. After much
discussion, we decided to mirror the behavior of
TraversalContext.maxDocumentSize support in the Connectors: skip the
content, but index the metadata.
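
A minimal sketch of that policy, assuming hypothetical Document and
FeedBuilder types (the real Connector Manager API may differ): oversized
content is dropped, but a metadata-only record is still fed.

```java
import java.io.InputStream;

/** Sketch only: Document and FeedBuilder are invented for illustration. */
class MaxDocumentSizeFilter {

    interface Document {
        long contentLength();   // raw (non-Base64-encoded) size in bytes
        String metadataXml();   // the document's metadata record
        InputStream content();  // the document's content stream
    }

    interface FeedBuilder {
        // A null content stream produces a metadata-only feed record.
        void addRecord(String metadataXml, InputStream contentOrNull);
    }

    private final long maxDocumentSize;

    MaxDocumentSizeFilter(long maxDocumentSize) {
        this.maxDocumentSize = maxDocumentSize;
    }

    /** Mirrors TraversalContext.maxDocumentSize handling in the
     *  Connectors: skip the content, but index the metadata. */
    void add(Document doc, FeedBuilder feed) {
        if (doc.contentLength() > maxDocumentSize) {
            feed.addRecord(doc.metadataXml(), null);  // content skipped
        } else {
            feed.addRecord(doc.metadataXml(), doc.content());
        }
    }
}
```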

Original issue reported on code.google.com by Brett.Mi...@gmail.com on 11 Sep 2009 at 11:49

GoogleCodeExporter commented 8 years ago
This is the email discussion:

-------

On Sep 2, 2009, at 6:51 PM, John Lacey wrote:

As part of fixing issue 106, we have added code to the connector
manager that explicitly skips documents (content and metadata) when
the content is larger than 30 MB.

For completeness, the old behavior was that if an OutOfMemoryError
was thrown, the document was skipped no matter how big it was. The
new behavior is that truly large documents are explicitly skipped,
and if an OutOfMemoryError is thrown the batch is aborted and
retried later. That is, an OutOfMemoryError is treated as a
situation in which we cannot do work, and not as an indication that
a particular document is too large. Previously, a small, or even
empty document could be skipped. But also, previously, a 40 MB or
100 MB or larger document that did not throw an OutOfMemoryError
would be sent to the GSA.
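
A rough sketch of that distinction, with invented types (not the actual
QueryTraverser API): an OutOfMemoryError now aborts the whole batch for
a later retry, rather than blaming and skipping the current document.

```java
/** Sketch only: Batch is invented for illustration. */
class BatchRunner {

    interface Batch {
        boolean hasNext();
        void feedNextDocument();  // may throw OutOfMemoryError under pressure
    }

    enum Result { COMPLETED, RETRY_LATER }

    Result run(Batch batch) {
        try {
            while (batch.hasNext()) {
                batch.feedNextDocument();
            }
            return Result.COMPLETED;
        } catch (OutOfMemoryError e) {
            // Treated as "cannot do work right now", not "this document
            // is too large": abort the batch and retry it later.
            return Result.RETRY_LATER;
        }
    }
}
```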

Then I saw this thread in the GSA group, which suggests that 30 MB
is not in fact a hard limit:

http://groups.google.com/group/Google-Search-Appliance-Help/browse_thread/thread/415557d5710a08b7?hl=en#

I don't know how to mesh that behavior with the feeder bug that
lifted the limit from 9.5 MB to 30 MB (I am disconnected now and
don't have the bug number). I thought that we should figure this out
before dropping the hard limit into the connector manager.

John L

-------

On Thu, Sep 3, 2009 at 10:08 AM, Brett Johnson wrote:

The new Connector Manager skips documents whose (non-Base64-encoded)
length exceeds the maxDocumentSize specified in connectorInstance.xml.
As shipped, that parameter is set to 30MB, however the connector
administrator can set maxDocumentSize to any value.

Unfortunately, the value setter method does not perform any input
validation, so the value could be negative or greater than 1GB.
I should fix that.
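
A sketch of the missing validation (the bounds are taken from the
discussion above, not from the shipped code, and the class name is
hypothetical):

```java
/** Sketch only: a validating setter for the maxDocumentSize property. */
class FeedProperties {
    private static final long MAX_ALLOWED = 1L << 30;  // 1GB cap, assumed

    private long maxDocumentSize = 30L * 1024 * 1024;  // shipped default: 30MB

    public void setMaxDocumentSize(long size) {
        if (size < 0 || size > MAX_ALLOWED) {
            throw new IllegalArgumentException(
                "maxDocumentSize out of range [0, 1GB]: " + size);
        }
        this.maxDocumentSize = size;
    }

    public long getMaxDocumentSize() {
        return maxDocumentSize;
    }
}
```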

Even though a connector administrator may change the maxDocumentSize,
one must remember that the ConnectorManager, the Connectors, and
the Tomcat server are running in a Java VM.  The GSA does not support
chunked HTTP transfers, so the Connector Manager builds up the feed
in memory before sending it to the GSA. This is when OutOfMemoryErrors
can happen.

I suppose I could spool a feed to disk if it gets too large, but
that goes against the feed performance improvement game plan I'm
working toward. The work required to spool a large feed certainly
won't make it into a 2.0.2 patch release.

Raising the maxDocumentSize may require adjusting the JVM memory
configuration options -Xms and -Xmx during Tomcat startup.  Note
that Tomcat deployments created by the Google Connectors Installer
(GCI) configure the JVM -Xms and -Xmx settings in a non-standard way.
GCI sets these parameters in $CATALINA_BASE/bin/catalina.sh (or .bat),
rather than in the conventional location $CATALINA_BASE/bin/setenv.sh
(or .bat).

Also keep in mind that a single Connector Manager can run multiple
Connector instances.  Theoretically you could configure 10 Connector
instances, each one trying to feed a 100MB file to the GSA.
Ten 150MB feed files in memory at once ==> OutOfMemoryErrors.
If this happens too frequently, the connector administrator could
adjust maxDocumentSize, JVM memory parameters, or stagger Connector
traversal schedules.
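
The ~150MB per feed presumably reflects Base64 inflation of the content
(about 4/3) plus the feed's XML overhead; a back-of-the-envelope sketch
with an assumed overhead figure (so it lands a bit below Brett's 150MB):

```java
/** Sketch only: rough in-memory cost of feeds held before sending. */
class FeedMemoryEstimate {
    /** Approximate in-memory size of one feed carrying one document. */
    static long feedBytes(long rawContentBytes) {
        long base64 = rawContentBytes * 4 / 3;  // Base64-encoded content
        long xmlOverhead = 1L << 20;            // ~1MB of feed XML, assumed
        return base64 + xmlOverhead;
    }

    public static void main(String[] args) {
        long oneFeed = feedBytes(100L * 1024 * 1024);
        // Prints roughly: one feed ~134MB, ten concurrent ~1343MB
        System.out.printf("one feed ~%dMB, ten concurrent ~%dMB%n",
                oneFeed >> 20, (10 * oneFeed) >> 20);
    }
}
```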

IIRC, before we implemented TraversalContext.maxDocumentSize() last
year, we were getting OutOfMemoryErrors when attempting to feed 200MB
and 300MB files.  However, I think the JVM -Xms and -Xmx settings
were also configured with lower values (if at all) back then.

--
Brett M. Johnson

-------

On Sep 3, 2009, at 1:03 PM, Jeff Ling wrote:

The GSA can't process docs larger than 30M as of today, period.

Let's not over-design this. Let's stick to what has been done for the
Dctm connector: send in metadata.

Thanks,
Jeff

-------

On Thu, Sep 3, 2009 at 1:47 PM, John Lacey wrote:
I'm happy with the conclusion, but the story is more subtle than that. 
I think the situation is this:

1. Incoming documents (crawled or fed) are truncated to 30 MB before 
   document conversion. (http://b/1384, http://b/1344929)
2. Documents that fail conversion are indexed, but without content or
   metadata; truncated binary documents are likely to fail conversion,
   but text files (including HTML) do not. (ibid, plus http://b/730954)
3. The first 2 MB (or maybe 2.5 MB; I've seen both) of the converted text
   is indexed, and up to 4 million bytes of the converted text is cached.
   (For XML, see also http://b/2025495)

This still doesn't entirely mesh with the reported behavior in the groups
thread, where a 100 MB Word doc was fed and was available in search results.
That part actually matches, but the OP Adam Burr further states that the
first part of the document is available in the cache, which contradicts
bug 1344929. That seems impossible given the well-documented limitations,
so I'm tempted to doubt the post.

Brett and I have been talking about dropping the content if it's over 30 MB.
We could also implement the content type filtering in the connector manager.
The use of TraversalContext in the connector will always be more efficient
and more flexible (renditions, etc.), but we could be defensive.

We still want chunked transfer in the future, which will make this harder.
With chunked transfer, we will at best be able to stop sending the content
at 30 MB (or whatever the latest and greatest limit is then), rather than
not sending it at all. If bug 730954 is fixed by then, however, the overall
system behavior will be the same.

The timing for 2.0.2 is troubling, but it seems like Brett can get started on
this, and if he makes it and Marty has time to do the code review, then good.
Otherwise, we'll put this into 2.4. I'm also fine with explicitly saying we'll
leave this out of 2.0.2.

John L

-------

On Thu, Sep 3, 2009 at 2:15 PM, Mohit Oberoi wrote:
John,

Since the feed doc is truncated at 30MB, some binary docs may be okay
with this when it comes to conversion (e.g., Word is okay; I just
tested with a 50 MB file), which is what Adam is seeing (he did mention
that only part of the document is searchable, not the whole).

I think for connectors, we should just stick to the 30 MB limit for
various reasons (performance, doc truncation, documented limits).
Post 6.2, there are plans to increase this limit, so we can fix it
across the board at that point.

thanks,
-mohit

-------

On Thu, Sep 3, 2009 at 2:18 PM, Jeff Ling wrote:
We have just promised Eli Lilly that we will make the behavior
consistent across connectors by sending metadata. Let's keep our
promise before the 30MB limit is lifted.

What's even better: the connector manager should record which files
are larger than 30M and have been skipped. Do I need to file an FR?

Thanks,
Jeff

-------

Original comment by Brett.Mi...@gmail.com on 12 Sep 2009 at 12:03

GoogleCodeExporter commented 8 years ago
This was added to the Connector Manager 2.0.x branch as revision r2239 and
was included in the 2.0.2 release. It has yet to be added to the trunk.

Original comment by Brett.Mi...@gmail.com on 12 Sep 2009 at 12:07

GoogleCodeExporter commented 8 years ago
Fixed on the trunk on 1 October 2009 in revision r2257.

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:23
