Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

OOM with snapshot 2.2.0 #85

Closed · OkkeKlein closed this issue 9 years ago

OkkeKlein commented 9 years ago

Only seeing this in the console, not the log file. Occurred since using the snapshot.

Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
    at java.lang.StringBuilder.<init>(StringBuilder.java:97)
    at com.norconex.importer.handler.transformer.AbstractStringTransformer.transformTextDocument(AbstractStringTransformer.java:82)
    at com.norconex.importer.handler.transformer.AbstractCharStreamTransformer.transformApplicableDocument(AbstractCharStreamTransformer.java:66)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:59)
    at com.norconex.importer.Importer.transformDocument(Importer.java:542)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:348)
    at com.norconex.importer.Importer.importDocument(Importer.java:305)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

OkkeKlein commented 9 years ago

Although the exception was not properly caught, the code of AbstractStringTransformer hints at raising the heap size, which seems to solve the issue.
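
For context, here is a simplified sketch (hypothetical, not the actual Norconex source) of why a string-based transformer tends to be heap-hungry: it buffers the character stream into a StringBuilder before transforming it, so memory use grows with document size, producing allocations like the ones in the stack trace above.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    public class StringBuffering {
        // Reads the entire stream into memory; the StringBuilder growth is
        // the kind of allocation seen in the stack trace above.
        static String readFully(Reader reader) throws IOException {
            StringBuilder content = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n); // large docs => large heap use
            }
            return content.toString();
        }

        public static void main(String[] args) throws IOException {
            System.out.println(readFully(new StringReader("sample content")));
        }
    }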

essiembre commented 9 years ago

The collector and importer pieces are usually quite memory-conscious and we always try to improve that. Just to make sure there is some sort of logic to the OOM exception, can you answer the following:

If you have a way to reproduce, that would be best for attempting to fix this.

In the meantime, increasing the Java heap size like you did is the way to go.
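
For anyone else hitting this, a minimal sketch of what raising the heap looks like, assuming the collector is launched directly with a plain java command (the class name, flags, and paths below are illustrative and vary by install; if you use a shipped launch script, add the -Xms/-Xmx flags to the java command inside it):

    # Example: raise the maximum heap to 2 GB (JVM defaults are often much lower).
    java -Xms512m -Xmx2g -cp "./lib/*" com.norconex.collector.http.HttpCollector -a start -c myconfig.xml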

OkkeKlein commented 9 years ago

IIRC it was just HTML content with 2 threads and a 200ms delay; the system had 1G of memory left (1G in use). I only noticed it when using the snapshot release. I will try to reproduce.

With the new heap size I never see exceptions, but no more memory is used either.

essiembre commented 9 years ago

I don't know why this one did not make the logs. I suspect it bypassed Log4J altogether to print the error directly to STDERR. BTW, if you do not want to duplicate the logging (file + console), you can change the "rootLogger" in the log4j.properties file to be FILE_ONLY instead of CONSOLE:

log4j.rootLogger=INFO, FILE_ONLY
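
For completeness, a sketch of how a FILE_ONLY appender could be defined alongside that root logger (the appender settings below are illustrative log4j 1.x properties; the ones shipped with the collector may differ):

    log4j.appender.FILE_ONLY=org.apache.log4j.RollingFileAppender
    log4j.appender.FILE_ONLY.File=./logs/collector.log
    log4j.appender.FILE_ONLY.MaxFileSize=10MB
    log4j.appender.FILE_ONLY.MaxBackupIndex=5
    log4j.appender.FILE_ONLY.layout=org.apache.log4j.PatternLayout
    log4j.appender.FILE_ONLY.layout.ConversionPattern=%d{ISO8601} %p [%t] %c - %m%n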

I'll leave this issue open for a while even if there is no clear way to reproduce. If it ever comes up again and you detect a certain pattern, please report it here.

OkkeKlein commented 9 years ago

log4j.rootLogger=INFO,FILE_ONLY did not do the trick. The exception still only showed in the console. This happened once after crawling for 3 hours, so it is hard to reproduce :)

essiembre commented 9 years ago

For the exception, if it is printed directly to the console and bypasses Log4J, we may have to live with it for now. But for "normal" logging, did log entries stop appearing on the console and show up only in the log file with FILE_ONLY?
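
As an aside, here is a sketch (not Norconex code) of how uncaught exceptions from pool threads, which go to STDERR by default, could be routed through Log4J so they also reach the file appender:

    import org.apache.log4j.BasicConfigurator;
    import org.apache.log4j.Logger;

    public class UncaughtToLog4j {
        public static void main(String[] args) throws Exception {
            BasicConfigurator.configure(); // simple console appender for this demo
            final Logger log = Logger.getRootLogger();
            // By default, an exception that escapes a Runnable is printed to
            // STDERR by the thread's uncaught exception handler; installing a
            // default handler sends it through Log4J instead.
            Thread.setDefaultUncaughtExceptionHandler(
                    new Thread.UncaughtExceptionHandler() {
                public void uncaughtException(Thread t, Throwable e) {
                    log.error("Uncaught exception in " + t.getName(), e);
                }
            });
            Thread demo = new Thread(new Runnable() {
                public void run() {
                    throw new RuntimeException("boom"); // now logged via Log4J
                }
            });
            demo.start();
            demo.join();
        }
    }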

OkkeKlein commented 9 years ago

I had already removed CONSOLE, so yeah, entries only showed in the logs.

Will keep an eye out for the cause of the issue.

essiembre commented 9 years ago

The memory error reported here is believed to be tied to the Importer module. Progress is being tracked here.

essiembre commented 9 years ago

Marking as fixed, since offline discussions established it was related to PDFs and that has been fixed in the latest stable release of the importer.