Closed OkkeKlein closed 9 years ago
Although the exception was not properly caught, the code of AbstractStringTransformer hints at raising the heap size, which seems to solve the issue.
The collector and importer pieces are usually quite memory-conscious and we always try to improve that. Just to make sure there is some sort of logic to the OOM exception, can you answer the following:
If you have a way to reproduce, that would be best for attempting to fix this.
In the meantime, increasing the Java heap size like you did is the way to go.
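To confirm that a larger heap setting is actually picked up by the JVM, a small check like the one below can help. This is only a sketch: the class name `HeapReport` is made up for illustration and is not part of the collector or importer. It relies solely on the standard `Runtime` API, and the maximum it reports reflects the standard `-Xmx` JVM option when one is set.

```java
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied by objects, in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

Running it with the same JVM options as the crawler shows whether the new limit is in effect. The used figure is a snapshot and fluctuates with garbage collection.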
IIRC it was just HTML content with 2 threads and a 200ms delay. The system had 1G of memory left (1G in use). I only noticed it when using the snapshot release. I will try to reproduce.
With the new heap size I never see exceptions, but no more memory is used either.
I don't know why this one did not make the logs. I suspect it bypassed Log4J altogether to print the error directly to STDERR. BTW, if you do not want to duplicate the logging (file + console), you can change the "rootLogger" in the log4j.properties file to be FILE_ONLY instead of CONSOLE:
log4j.rootLogger=INFO, FILE_ONLY
I'll leave this issue open for a while even if there is no clear way to reproduce. If it ever comes up again and you detect a certain pattern, please report it here.
log4j.rootLogger=INFO,FILE_ONLY did not do the trick. The exception still only showed in the console. This happened once after crawling for 3 hours, so it is hard to reproduce :)
For the exception, if it is automatically thrown to the console, bypassing log4j, we may have to live with it for now. But for "normal" logging, did log entries stop appearing on the console and show up only in the log file with FILE_ONLY?
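For background on why such an error can skip the log file: when a task run by a thread pool throws and nothing catches it, the JVM hands the error to the thread's uncaught exception handler, and the default behavior prints `Exception in thread "..."` to STDERR. The sketch below shows one possible way to intercept that path with the standard `Thread.setDefaultUncaughtExceptionHandler`. The class `UncaughtErrorDemo` and its `LOGGED` list are hypothetical stand-ins for a real logger and are not how the collector does it. With a real OutOfMemoryError the handler itself may fail to allocate, so this is best effort only.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class UncaughtErrorDemo {
    // Hypothetical stand-in for a real logger (for example a log4j Logger).
    static final List<String> LOGGED = new CopyOnWriteArrayList<>();

    /** Runs a task that fails on a pool thread; returns true once the handler saw it. */
    static boolean runFailingTask() {
        CountDownLatch handled = new CountDownLatch(1);
        // Replace the default behavior (stack trace printed to STDERR) with our own.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            LOGGED.add("Uncaught in " + thread.getName() + ": " + error);
            handled.countDown();
        });
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // execute() (unlike submit()) lets the exception escape the worker thread.
            pool.execute(() -> {
                throw new IllegalStateException("simulated failure");
            });
            // Wait for the handler itself; pool termination can be signaled before it runs.
            return handled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        runFailingTask();
        LOGGED.forEach(System.out::println);
    }
}
```

In a real setup the handler body would call the logger instead of appending to a list, so the error would land in the same file as the rest of the logging.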
I had already removed CONSOLE, so yeah entries only showed in logs.
Will keep an eye on cause of issue.
The memory error reported here is believed to be tied to the Importer module. Progress is being tracked here.
Marking as fixed, since offline discussions established it was related to PDFs and that has been fixed in the latest stable release of the importer.
Only seeing this in the console, not the log file. Occurred since using the snapshot.
Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.norconex.importer.handler.transformer.AbstractStringTransformer.transformTextDocument(AbstractStringTransformer.java:82)
at com.norconex.importer.handler.transformer.AbstractCharStreamTransformer.transformApplicableDocument(AbstractCharStreamTransformer.java:66)
at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:59)
at com.norconex.importer.Importer.transformDocument(Importer.java:542)
at com.norconex.importer.Importer.executeHandlers(Importer.java:348)
at com.norconex.importer.Importer.importDocument(Importer.java:305)
at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)