Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

OOM with snapshot 2.2.0 #85

Closed · OkkeKlein closed this issue 9 years ago

OkkeKlein commented 9 years ago

Only seeing this in the console, not the log file. Occurred since using the snapshot.

Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
    at java.lang.StringBuilder.<init>(StringBuilder.java:97)
    at com.norconex.importer.handler.transformer.AbstractStringTransformer.transformTextDocument(AbstractStringTransformer.java:82)
    at com.norconex.importer.handler.transformer.AbstractCharStreamTransformer.transformApplicableDocument(AbstractCharStreamTransformer.java:66)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:59)
    at com.norconex.importer.Importer.transformDocument(Importer.java:542)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:348)
    at com.norconex.importer.Importer.importDocument(Importer.java:305)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

OkkeKlein commented 9 years ago

Although the exception was not properly caught, the code of AbstractStringTransformer hints at raising the heap size, which seems to solve the issue.
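
For context, here is a simplified sketch (hypothetical, not the actual Norconex source) of why a string-based transformer tends to be heap-hungry: it buffers the character stream into a StringBuilder before transforming it, so memory use grows with document size, producing allocations like the ones in the stack trace above.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    public class StringBuffering {
        // Reads the entire stream into memory; the StringBuilder growth is
        // the kind of allocation seen in the stack trace above.
        static String readFully(Reader reader) throws IOException {
            StringBuilder content = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n); // large docs => large heap use
            }
            return content.toString();
        }

        public static void main(String[] args) throws IOException {
            System.out.println(readFully(new StringReader("sample content")));
        }
    }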

essiembre commented 9 years ago

The collector and importer pieces are usually quite memory-conscious and we always try to improve that. Just to make sure there is some sort of logic to the OOM exception, can you answer the following:

If you have a way to reproduce, that would be best for attempting to fix this.

In the meantime, increasing the Java heap size like you did is the way to go.
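
For anyone else hitting this, a minimal sketch of what raising the heap looks like, assuming the collector is launched directly with a plain java command (the class name, flags, and paths below are illustrative and vary by install; if you use a shipped launch script, add the -Xms/-Xmx flags to the java command inside it):

    # Example: raise the maximum heap to 2 GB (JVM defaults are often much lower).
    java -Xms512m -Xmx2g -cp "./lib/*" com.norconex.collector.http.HttpCollector -a start -c myconfig.xml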

OkkeKlein commented 9 years ago

IIRC it was just HTML content with 2 threads and a 200ms delay; the system had 1G of memory left (1G in use). I only noticed it when using the snapshot release. I will try to reproduce.

With the new heap size I never see exceptions, but no more memory is used either.

essiembre commented 9 years ago

I don't know why this one did not make the logs. I suspect it bypassed Log4J altogether to print the error directly to STDERR. BTW, if you do not want to duplicate the logging (file + console), you can change the "rootLogger" in the log4j.properties file to be FILE_ONLY instead of CONSOLE:

log4j.rootLogger=INFO, FILE_ONLY
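
For completeness, a sketch of how a FILE_ONLY appender could be defined alongside that root logger (the appender settings below are illustrative log4j 1.x properties; the ones shipped with the collector may differ):

    log4j.appender.FILE_ONLY=org.apache.log4j.RollingFileAppender
    log4j.appender.FILE_ONLY.File=./logs/collector.log
    log4j.appender.FILE_ONLY.MaxFileSize=10MB
    log4j.appender.FILE_ONLY.MaxBackupIndex=5
    log4j.appender.FILE_ONLY.layout=org.apache.log4j.PatternLayout
    log4j.appender.FILE_ONLY.layout.ConversionPattern=%d{ISO8601} %p [%t] %c - %m%n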

I'll leave this issue open for a while even if there is no clear way to reproduce. If it ever comes up again and you detect a certain pattern, please report it here.

OkkeKlein commented 9 years ago

log4j.rootLogger=INFO,FILE_ONLY did not do the trick. The exception still only showed in the console. This happened once after crawling for 3 hours, so it is hard to reproduce :)

essiembre commented 9 years ago

For the exception, if it is printed directly to the console and bypasses Log4J, we may have to live with it for now. But for "normal" logging, did log entries stop appearing on the console and show up only in the log file with FILE_ONLY?
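
As an aside, here is a sketch (not Norconex code) of how uncaught exceptions from pool threads, which go to STDERR by default, could be routed through Log4J so they also reach the file appender:

    import org.apache.log4j.BasicConfigurator;
    import org.apache.log4j.Logger;

    public class UncaughtToLog4j {
        public static void main(String[] args) throws Exception {
            BasicConfigurator.configure(); // simple console appender for this demo
            final Logger log = Logger.getRootLogger();
            // By default, an exception that escapes a Runnable is printed to
            // STDERR by the thread's uncaught exception handler; installing a
            // default handler sends it through Log4J instead.
            Thread.setDefaultUncaughtExceptionHandler(
                    new Thread.UncaughtExceptionHandler() {
                public void uncaughtException(Thread t, Throwable e) {
                    log.error("Uncaught exception in " + t.getName(), e);
                }
            });
            Thread demo = new Thread(new Runnable() {
                public void run() {
                    throw new RuntimeException("boom"); // now logged via Log4J
                }
            });
            demo.start();
            demo.join();
        }
    }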

OkkeKlein commented 9 years ago

I had already removed CONSOLE, so yeah, entries only showed in the logs.

Will keep an eye out for the cause of the issue.

essiembre commented 9 years ago

The memory error reported here is believed to be tied to the Importer module. Progress is being tracked here.

essiembre commented 9 years ago

Marking as fixed, since offline discussions established it was related to PDFs and that has been fixed in the latest stable release of the importer.