AnantLabs / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

Connector is unable to traverse the repository upon restart if the state file is very large #108

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Make sure the connector state file is very large (100MB+).
2. Start the connector.

What is the expected output? 
Connector should be able to crawl the repository successfully.

What do you see instead?
Connector fails to load the state file due to insufficient memory. An OutOfMemoryError is thrown while parsing the state XML.

Original issue reported on code.google.com by j.dars...@gmail.com on 23 Sep 2009 at 8:38

GoogleCodeExporter commented 9 years ago
Workaround: increase the memory allocated to Tomcat so that enough memory is available for parsing the state XML.

By default the connector is configured to use only up to 1GB of memory. Opening a large state file (say around 100MB) will consume at least 1.5GB of memory, hence the OutOfMemoryError. To overcome this problem, we must allocate around 2GB of memory to the connector.

To increase the memory allocated to the connector, follow the steps below:

This change will allocate 2GB (i.e. 2048MB) of memory. 

LINUX
Open the file: <Connector Installation Path>/<Connector Name>/Tomcat/bin/catalina.sh
Search for: JAVA_OPTS="$JAVA_OPTS -Xms256m -Xmx1024m
Change it to: JAVA_OPTS="$JAVA_OPTS -Xms256m -Xmx2048m

WINDOWS
Open the file: <Connector Installation Path>\<Connector Name>\Tomcat\bin\service.bat
Search for (towards the end of the file): --JvmMs 256 --JvmMx 1024
Change it to: --JvmMs 256 --JvmMx 2048

Note: for the allocation to succeed, the system must have at least twice as much physical memory as the heap being allocated.
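
As a quick sanity check after restarting Tomcat, the effective heap limit can be read from inside the JVM. The snippet below is only a minimal sketch (the class name is illustrative, not part of the connector); run it with the same JAVA_OPTS as the connector's Tomcat instance and it should report a value close to the configured -Xmx.

// HeapCheck.java - minimal sketch; reports roughly the configured -Xmx
// of the JVM it runs in.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Maximum heap available to this JVM: %d MB%n",
                maxBytes / (1024 * 1024));
    }
}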

Original comment by j.dars...@gmail.com on 23 Sep 2009 at 8:43

GoogleCodeExporter commented 9 years ago
In the long term we should consider reducing the memory used for parsing the XML. The current implementation is based on a DOM parser; using a SAX parser or even JAXB should help alleviate this problem. Moreover, if we decide to use some kind of persistent storage for the state file, the problem might be overcome completely, but we would need to assess the performance impact in that case.
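
For illustration only, a SAX-based pass over the state file could look roughly like the sketch below. The element names WebState and ListState are assumptions made for this example, not taken from the connector's actual state-file schema; the point is simply that a streaming parser keeps only a few counters in memory instead of materializing the whole document the way a DOM parser does.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Streaming (SAX) scan of a connector state file. Element names are
// illustrative assumptions; only two counters are held in memory.
public class StateFileSaxScan {
    public static void main(String[] args) throws Exception {
        final int[] webStates = {0};
        final int[] listStates = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) throws SAXException {
                if ("WebState".equals(qName)) {
                    webStates[0]++;
                } else if ("ListState".equals(qName)) {
                    listStates[0]++;
                }
            }
        });
        System.out.println("WebStates=" + webStates[0]
                + ", ListStates=" + listStates[0]);
    }
}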

Original comment by j.dars...@gmail.com on 23 Sep 2009 at 8:47

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 25 Sep 2009 at 2:11

GoogleCodeExporter commented 9 years ago

Original comment by j.dars...@gmail.com on 6 Nov 2009 at 12:11

GoogleCodeExporter commented 9 years ago
When I try to reset the connector, the state file is supposed to be deleted. It seems even that call fails, and hence the recrawl doesn't work.

Original comment by mwarti...@gmail.com on 6 Nov 2009 at 12:19

GoogleCodeExporter commented 9 years ago
We need an estimate of how much changing the parsers will actually help and whether it is worth the effort required.

Original comment by darsh...@google.com on 16 Nov 2009 at 10:12

GoogleCodeExporter commented 9 years ago
In some cases the connector seems to hang due to memory shortage after running for several days, with state files as small as 40MB. It logs a "SEVERE: Java Heap Space" message, stops sending feeds, and even the Connector Admin page in the GSA Admin console does not load.

We need to do some memory profiling and longevity testing of the connector to ensure that there are no memory leaks.
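
As a starting point for that kind of longevity run, heap usage can be sampled periodically with nothing more than the standard Runtime API. The sketch below is a rough, standalone sampler (not part of the connector) that shows the idea; to observe the connector itself it would have to run inside the same JVM, or one could point jstat or a JMX console at the Tomcat process instead.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough sketch of a heap-usage sampler for longevity tests: logs used/max
// heap of its own JVM once a minute so a slow upward trend (a possible
// leak) becomes visible over time.
public class HeapUsageMonitor {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                Runtime rt = Runtime.getRuntime();
                long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                long maxMb = rt.maxMemory() / (1024 * 1024);
                System.out.println("Heap used: " + usedMb + " MB of " + maxMb + " MB");
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}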

Original comment by j.dars...@gmail.com on 2 Dec 2009 at 12:21

GoogleCodeExporter commented 9 years ago
The problem was mainly because of the DOM-based parsing done by the connector for the state file. Changed this to SAX-based parsing. Refer to the fix at revision 574:
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=574

Original comment by th.nitendra on 21 Jan 2010 at 1:26

GoogleCodeExporter commented 9 years ago
Verified fix on Google SharePoint connector version 2.4.4

Performance Test Results for Metadata & URL Feed mode:

Test Environment Details:

OS: Linux 32 bit

Heap Size Settings: 256M-1024M (connector default)

No. of Web States: 9560

No. of List States: 56784

Feed Mode: metadata-and-URL

Test Results:

Memory Usage: 6 MB (average), 63.2 MB (max)

CPU Usage: 3.06% (average)

Size of State File: ~30 MB

Observations:

1. Memory consumption is low initially and increases gradually as the state file grows.

2. The traversal rate was set to 500 documents per minute.

3. Memory consumption gradually levels off at the end of the test cycle, as there is no activity on the SharePoint server.

4. During the test cycle, 0.6% of the readings were above 10% CPU usage; a few times CPU usage also crossed 50%, but it dropped back to normal immediately.

5. Average CPU usage is 3%.

Performance Test Results for Content Feed mode:

Test Environment Details:

OS: Windows 2003 Server (64 bit)

Heap Size Settings: 256M-1024M (connector default)

No. of Web States: 6610

No. of List States: 14252

Feed Mode: Content

Test Results:

Performance Parameters (average / max):

Memory Usage: 120 MB (average), 404 MB (max)

CPU Usage: 12% (average)

Size of State File: ~10 MB

Observations:

1. Memory utilization ranged from 200 to 300 MB with a traversal rate of 200 documents per minute.

2. Memory utilization increases in content feed mode because the connector also fetches document content.

3. Average CPU usage is 12%; a few times CPU usage also crossed 50%, but it dropped back to normal immediately. There was no steady increase in CPU usage.

Original comment by vishwas....@gmail.com on 12 Feb 2010 at 2:08