Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

java.lang.OutOfMemoryError: Java heap space #106

Closed: essiembre closed this issue 9 years ago

essiembre commented 9 years ago

Post from @csaezl, moved from https://github.com/Norconex/collector-http/issues/100#issuecomment-100172544:

I've got an error, perhaps not related to the issue itself. From the log:

INFO - Sending 10 documents to Solr for update/deletion.
INFO - Done sending documents to Solr for update/deletion.
INFO - MC (crawler): Crawler finishing: committing documents.
INFO - Committing 0 files
INFO - Sending 7 documents to Solr for update/deletion.
INFO - Done sending documents to Solr for update/deletion.
INFO - MC (crawler): 893 reference(s) processed.
INFO -          CRAWLER_FINISHED

From the console:

INFO - Sending 10 documents to Solr for update/deletion.
INFO - Done sending documents to Solr for update/deletion.
Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
INFO - MC (crawler): Crawler finishing: committing documents.
INFO - Committing 0 files
INFO - Sending 7 documents to Solr for update/deletion.
INFO - Done sending documents to Solr for update/deletion.
INFO - MC (crawler): 893 reference(s) processed.
INFO -          CRAWLER_FINISHED

An exception happened that is not shown in the log. I have more demanding crawlers running that don't get out-of-memory errors, so I'll check the initial conditions for the test and try again.

I've run the test again and got the same result.

essiembre commented 9 years ago

@csaezl, do you process many large documents by any chance? Are you using the snapshot release, and do you have many PDFs? There is a related memory issue that seems tied to PDFBox (see #85).

The memory error log entry occurs in between Solr postings, but without profiling it is hard to tell whether the error really comes from the Solr Committer, even if it looks like it. If it does occur in the Solr Committer, the likely cause is that the Committer uses SolrJ to create and send Solr documents, and in doing so it keeps the objects for the batch it is sending in memory. If you have big files, that can become too much at some point. Short of reducing your batch size (which already seems quite small), I recommend you increase the Java heap size by adding the proper JVM arguments to the startup scripts.
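To illustrate, the tuning could look something like this. This is only a sketch: the commitBatchSize/queueSize element names and values are from memory of the 2.x Committer schema and should be checked against the Committer documentation, and the Solr URL is a placeholder.

<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/mycollection</solrURL>
  <!-- Hypothetical values: keep fewer documents in memory per batch sent to Solr. -->
  <commitBatchSize>5</commitBatchSize>
  <queueSize>50</queueSize>
</committer>

The heap itself is raised with the standard -Xmx JVM option (for example -Xmx2g) wherever the startup script invokes java.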

csaezl commented 9 years ago

This can be reproduced if you have 1.5 hours (5 threads, 100 ms delay). After processing about 893 documents (start URL https://valitsus.ee/, <trustAllSSLCertificates>true</trustAllSSLCertificates>, no robots.txt, no sitemaps), the error arises. I've seen it twice.
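For reference, these are roughly the crawler settings I am describing, sketched against the 2.x configuration schema (element names are from memory and may differ slightly between versions):

<crawler id="MC">
  <startURLs>
    <url>https://valitsus.ee/</url>
  </startURLs>
  <numThreads>5</numThreads>
  <!-- 100 ms between requests -->
  <delay default="100" />
  <httpClientFactory>
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
  </httpClientFactory>
  <!-- robots.txt and sitemaps are ignored -->
  <robotsTxt ignore="true" />
  <sitemapResolverFactory ignore="true" />
</crawler>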

In the log you can see several repetitions for some URLs, that is, the group of events from DOCUMENT_FETCHED to DOCUMENT_COMMITTED_ADD repeats for the same URL. Some examples, with repetition counts:

DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/kontakt                     17
DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid          379

These documents are in <queuedir>. And yes, I have applied the 2.2.0 snapshot. No errors are shown in the Solr logs.

essiembre commented 9 years ago

It may be related to this importer issue: https://github.com/Norconex/importer/issues/9

You can check whether upgrading the Importer module to the latest Importer snapshot libraries fixes this issue as well.

csaezl commented 9 years ago

I have applied the Importer snapshot and still get the repetitions mentioned in my previous post. You only have to run it for about 10 minutes (not 1.5 hours) to see the repetition begin:

[non-job]: 2015-05-15 14:04:28 INFO - Starting execution.
MC (crawler): 2015-05-15 14:13:45 INFO -    DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:15:58 INFO -    DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:15:59 INFO -    DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:16:05 INFO -    DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid

Perhaps if you have a look at the page https://valitsus.ee/et/logofailid/grupid you'll get an idea of its content.

I've increased the Java heap to 2 GB. The crawler Java process uses 1,442.6 MB and Solr's uses 639.3 MB. The crawler seems to repeat the same sequence forever, from DOCUMENT_FETCHED to DOCUMENT_COMMITTED_ADD, for https://valitsus.ee/et/logofailid/grupid.

essiembre commented 9 years ago

2.2.0 has now been officially released. Can you confirm whether you still see this behavior?

csaezl commented 9 years ago

It doesn't happen anymore.

essiembre commented 9 years ago

Thank you for confirming. Closing.