@csaezl, do you process many large documents by any chance? Are you using the snapshot release, and do you have many PDFs? There is a related memory issue that seems tied to PDFBox (see #85).
The memory error log entry occurs in between Solr postings, but without profiling it is hard to tell whether the error really comes from the Solr Committer, even if it looks that way. If it does occur in the Solr Committer, the likely cause is that the Committer uses SolrJ to create and send Solr documents, and in doing so keeps all the objects of a batch in memory until that batch is sent. With big files, that can become too much at some point. Short of reducing your batch size (which seems quite small already), I recommend increasing the Java heap size by adding the proper JVM arguments to the startup scripts.
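For example, here is a minimal sketch of such a change, assuming a Unix startup script that launches the collector with a plain java command (the script excerpt, classpath, and main class below reflect a typical 2.x layout and may differ in your installation):

    #!/bin/sh
    # Hypothetical excerpt of the collector startup script.
    # -Xms sets the initial heap size, -Xmx the maximum (2 GB here).
    java -Xms512m -Xmx2g -cp "./lib/*:./classes" \
        com.norconex.collector.http.HttpCollector "$@"

The -Xmx value can be raised further if the batches still do not fit in memory.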
This can be reproduced if you have 1.5 hours (5 threads, 100 ms delay). After processing about 893 documents (start URL https://valitsus.ee/, <trustAllSSLCertificates>true</trustAllSSLCertificates>, no robots.txt, no sitemaps), the error arises. I've reproduced it twice.
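For reference, a minimal configuration sketch along those lines, assuming the HTTP Collector 2.x XML format (the collector and crawler ids are made up, and element names should be checked against the documentation for your exact version):

    <httpcollector id="Test Collector">
      <crawlers>
        <crawler id="MC (crawler)">
          <!-- Seed URL used for the test crawl. -->
          <startURLs>
            <url>https://valitsus.ee/</url>
          </startURLs>
          <numThreads>5</numThreads>
          <!-- Delay between downloads, in milliseconds. -->
          <delay default="100" />
          <httpClientFactory>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
          </httpClientFactory>
          <!-- Disable robots.txt and sitemap handling. -->
          <robotsTxt ignore="true" />
          <sitemapResolverFactory ignore="true" />
        </crawler>
      </crawlers>
    </httpcollector>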
In the log you can see several repetitions for some URLs, that is, the whole group of events from DOCUMENT_FETCHED to DOCUMENT_COMMITTED_ADD repeats. Some examples, with repetition counts:
DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/kontakt 17
DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid 379
These documents are in <queuedir>.
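As a side note, counts like the ones above can be produced from the crawler log with standard tools. A sketch, assuming the log file is named crawler.log and the URL is the last field of each matching line:

    # Count DOCUMENT_COMMITTED_ADD events per URL, most repeated first.
    grep "DOCUMENT_COMMITTED_ADD:" crawler.log | awk '{print $NF}' \
        | sort | uniq -c | sort -rn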
And, yes, I have applied the 2.2.0 snapshot. No errors are shown in the Solr logs.
It may be related to this importer issue: https://github.com/Norconex/importer/issues/9
You can check whether upgrading the Importer module to the latest snapshot libraries fixes this issue as well.
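In case it helps, upgrading usually amounts to swapping the Importer jar in the collector's lib folder. A sketch, where the install path and jar file names are assumptions to be adjusted to your actual files:

    # Back up the bundled Importer jar and drop in the snapshot build.
    cd /opt/collector-http/lib
    mv norconex-importer-*.jar /tmp/
    cp ~/Downloads/norconex-importer-2.2.0-SNAPSHOT.jar .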
I have applied the Importer snapshot and still get the repetitions mentioned in my previous post. You only have to run it for about 10 minutes (not 1.5 hours) to see the repetition begin:
[non-job]: 2015-05-15 14:04:28 INFO - Starting execution.
MC (crawler): 2015-05-15 14:13:45 INFO - DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:15:58 INFO - DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:15:59 INFO - DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
MC (crawler): 2015-05-15 14:16:05 INFO - DOCUMENT_COMMITTED_ADD: https://valitsus.ee/et/logofailid/grupid
Perhaps if you have a look at the page https://valitsus.ee/et/logofailid/grupid you'll get an idea of its content.
I've increased the Java heap to 2 GB. The crawler's Java process uses 1,442.6 MB and Solr's uses 639.3 MB.
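For what it's worth, heap usage of a running JVM can also be watched over time with the JDK's jstat tool; a sketch, where the process id is a placeholder:

    # Print heap and GC utilization of the crawler JVM every 5 seconds.
    jstat -gcutil <crawler-pid> 5000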
The crawler seems to be doing the same thing forever, going from DOCUMENT_FETCHED to DOCUMENT_COMMITTED_ADD for https://valitsus.ee/et/logofailid/grupid.
2.2.0 was officially released. Can you confirm if you still witness this behavior?
It doesn't happen anymore.
Thank you for confirming. Closing.
Post from @csaezl, moved from https://github.com/Norconex/collector-http/issues/100#issuecomment-100172544:
I've got an error, perhaps not related to the issue itself. From the log:
From the console:
An exception happened that is not shown in the log. I have more demanding crawlers running that don't get out-of-memory errors, so I'll check the initial conditions for the test and try again.
I've run the test again and got the same result.