Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Unfinishable crawler runs (could not execute close) #783

Closed: stfnwp closed this issue 2 years ago

stfnwp commented 2 years ago

We have several crawlers running in one collector. When a crawler run is about to finish (just after the execution summary is printed), the following situation sometimes occurs: Crawler A (here "Additional Website Crawler") tries to access a file from the workDir of Crawler B (here "DVO Crawler"), which leads to a NoSuchFileException. Immediately afterwards, the error "Could not execute \"close\" on committer" appears.

As a result, the whole collector cannot finish its run (possibly because some resources are still in use or locked).

All of this happens randomly, with different crawlers and with varying frequency.

See logs:

{"timestamp":"2022-04-20 01:15:51.429","level":"INFO","thread":"Additional Website Crawler","logger":"com.norconex.collector.core.crawler.Crawler","message":"Execution Summary:\nTotal processed:   6\nSince (re)start:\n  Crawl duration:  28 seconds\n  Avg. throughput: 0.2 processed/seconds\n  Event counts:\n    COMMITTER_ACCEPT_YES:      21\n    COMMITTER_UPSERT_BEGIN:    21\n    COMMITTER_UPSERT_END:      20\n    CRAWLER_RUN_BEGIN:         5\n    CRAWLER_RUN_END:           1\n    CRAWLER_RUN_THREAD_BEGIN:  30\n    CRAWLER_RUN_THREAD_END:    10\n    CREATED_ROBOTS_META:       13\n    DOCUMENT_COMMITTED_UPSERT: 20\n    DOCUMENT_FETCHED:          13\n    DOCUMENT_IMPORTED:         21\n    DOCUMENT_PROCESSED:        25\n    DOCUMENT_QUEUED:           2,014\n    IMPORTER_HANDLER_BEGIN:    412\n    IMPORTER_HANDLER_END:      412\n    IMPORTER_PARSER_BEGIN:     26\n    IMPORTER_PARSER_END:       26\n    REJECTED_FILTER:           803\n    REJECTED_IMPORT:           4\n    URLS_EXTRACTED:            5","context":"pbd1.suche"}
{"timestamp":"2022-04-20 01:15:51.872","level":"WARN","thread":"Additional Website Crawler","logger":"com.norconex.commons.lang.exec.Retrier","message":"Execution failed, retrying (1 of 3 maximum retries). Cause:\n  → NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004","context":"pbd1.suche"}
{"timestamp":"2022-04-20 01:15:51.873","level":"WARN","thread":"Additional Website Crawler","logger":"com.norconex.commons.lang.exec.Retrier","message":"Execution failed, retrying (2 of 3 maximum retries). Cause:\n  → UncheckedIOException: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n    → NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004","context":"pbd1.suche"}
{"timestamp":"2022-04-20 01:15:51.874","level":"WARN","thread":"Additional Website Crawler","logger":"com.norconex.commons.lang.exec.Retrier","message":"Execution failed, retrying (3 of 3 maximum retries). Cause:\n  → UncheckedIOException: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n    → NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004","context":"pbd1.suche"}
{"timestamp":"2022-04-20 01:15:51.928","level":"ERROR","thread":"Additional Website Crawler","logger":"CommitterEvent.COMMITTER_CLOSE_ERROR","message":"CommitterEvent[crawlerConfig=de.xxx.apok.crawler.CrawlerConfiguration$$EnhancerBySpringCGLIB$$68a34a22@1c91032,connectionTimeout=1000,credentials=Credentials[username=<null>,password=********,passwordKey=<null>],discoverNodes=false,dotReplacement=<null>,fixBadIds=false,ignoreResponseErrors=false,indexName=content_develop_1,jsonFieldsPattern=<null>,socketTimeout=30000,sourceIdField=<null>,targetContentField=content,typeName=<null>,queue=FSQueue[batchSize=20,commitLeftoversOnInit=false,ignoreErrors=false,maxPerFolder=500,retrier=Retrier[exceptionFilter=<null>,maxCauses=10,maxRetries=3,retryDelay=0],splitBatch=OFF],committerContext=CommitterContext[eventManager=com.norconex.commons.lang.event.EventManager@ddfe04db,streamFactory=com.norconex.commons.lang.io.CachedStreamFactory@ab258e69,workDir=/app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0],fieldMappings={},restrictions=[],request=<null>]","context":"pbd1.suche","exception":"com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not process one or more files form committer batch located at /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004 and could not copy it under /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/error\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.moveUnrecoverableBatchError(FSQueue.java:420)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:364)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeBatchDirectory(FSQueue.java:338)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRemainingBatches(FSQueue.java:497)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.close(FSQueue.java:487)\n\tat 
com.norconex.committer.core3.batch.AbstractBatchCommitter.doClose(AbstractBatchCommitter.java:97)\n\tat com.norconex.committer.core3.AbstractCommitter.close(AbstractCommitter.java:258)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService$$Lambda$861/0x0000000028ac2e30.accept(Unknown Source)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService.executeAll(CrawlerCommitterService.java:129)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService.close(CrawlerCommitterService.java:118)\n\tat com.norconex.collector.core.crawler.Crawler$$Lambda$859/0x0000000028ac0690.accept(Unknown Source)\n\tat java.base/java.util.Optional.ifPresent(Unknown Source)\n\tat com.norconex.collector.core.crawler.Crawler.destroyCrawler(Crawler.java:401)\n\tat com.norconex.collector.core.crawler.Crawler.start(Crawler.java:281)\n\tat com.norconex.collector.core.Collector.lambda$null$2(Collector.java:233)\n\tat com.norconex.collector.core.Collector$$Lambda$736/0x00000000d5a2fde0.run(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not consume batch. Number of attempts: 4\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:407)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356)\n\t... 17 common frames omitted\nCaused by: com.norconex.commons.lang.exec.RetriableException: Execution failed, maximum number of retries reached.\n\tat com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:204)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:395)\n\t... 
18 common frames omitted\nCaused by: java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.iterator(FSBatch.java:58)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.lambda$consumeRetriableBatch$1(FSQueue.java:396)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue$$Lambda$856/0x000000004400e960.execute(Unknown Source)\n\tat com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:177)\n\t... 19 common frames omitted\nCaused by: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n\tat java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(Unknown Source)\n\tat java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(Unknown Source)\n\tat java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(Unknown Source)\n\tat java.base/java.nio.file.Files.readAttributes(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.getAttributes(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.visit(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.walk(Unknown Source)\n\tat java.base/java.nio.file.FileTreeIterator.<init>(Unknown Source)\n\tat java.base/java.nio.file.Files.find(Unknown Source)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueueUtil.findZipFiles(FSQueueUtil.java:77)\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.zipIterator(FSBatch.java:88)\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.iterator(FSBatch.java:56)\n\t... 22 common frames omitted\n"}
{"timestamp":"2022-04-20 01:15:51.932","level":"ERROR","thread":"Additional Website Crawler","logger":"com.norconex.collector.core.crawler.CrawlerCommitterService","message":"Could not execute \"close\" on committer: CrawlerElasticsearchCommitter[crawlerConfig=de.xxx.apok.crawler.CrawlerConfiguration$$EnhancerBySpringCGLIB$$68a34a22@1c91032,connectionTimeout=1000,credentials=Credentials[username=<null>,password=********,passwordKey=<null>],discoverNodes=false,dotReplacement=<null>,fixBadIds=false,ignoreResponseErrors=false,indexName=content_develop_1,jsonFieldsPattern=<null>,socketTimeout=30000,sourceIdField=<null>,targetContentField=content,typeName=<null>,queue=FSQueue[batchSize=20,commitLeftoversOnInit=false,ignoreErrors=false,maxPerFolder=500,retrier=Retrier[exceptionFilter=<null>,maxCauses=10,maxRetries=3,retryDelay=0],splitBatch=OFF],committerContext=CommitterContext[eventManager=com.norconex.commons.lang.event.EventManager@ddfe04db,streamFactory=com.norconex.commons.lang.io.CachedStreamFactory@ab258e69,workDir=/app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0],fieldMappings={},restrictions=[]]","context":"pbd1.suche","exception":"com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not process one or more files form committer batch located at /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004 and could not copy it under /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/error\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.moveUnrecoverableBatchError(FSQueue.java:420)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:364)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeBatchDirectory(FSQueue.java:338)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRemainingBatches(FSQueue.java:497)\n\tat 
com.norconex.committer.core3.batch.queue.impl.FSQueue.close(FSQueue.java:487)\n\tat com.norconex.committer.core3.batch.AbstractBatchCommitter.doClose(AbstractBatchCommitter.java:97)\n\tat com.norconex.committer.core3.AbstractCommitter.close(AbstractCommitter.java:258)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService$$Lambda$861/0x0000000028ac2e30.accept(Unknown Source)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService.executeAll(CrawlerCommitterService.java:129)\n\tat com.norconex.collector.core.crawler.CrawlerCommitterService.close(CrawlerCommitterService.java:118)\n\tat com.norconex.collector.core.crawler.Crawler$$Lambda$859/0x0000000028ac0690.accept(Unknown Source)\n\tat java.base/java.util.Optional.ifPresent(Unknown Source)\n\tat com.norconex.collector.core.crawler.Crawler.destroyCrawler(Crawler.java:401)\n\tat com.norconex.collector.core.crawler.Crawler.start(Crawler.java:281)\n\tat com.norconex.collector.core.Collector.lambda$null$2(Collector.java:233)\n\tat com.norconex.collector.core.Collector$$Lambda$736/0x00000000d5a2fde0.run(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not consume batch. Number of attempts: 4\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:407)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356)\n\t... 17 common frames omitted\nCaused by: com.norconex.commons.lang.exec.RetriableException: Execution failed, maximum number of retries reached.\n\tat com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:204)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:395)\n\t... 
18 common frames omitted\nCaused by: java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.iterator(FSBatch.java:58)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue.lambda$consumeRetriableBatch$1(FSQueue.java:396)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueue$$Lambda$856/0x000000004400e960.execute(Unknown Source)\n\tat com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:177)\n\t... 19 common frames omitted\nCaused by: java.nio.file.NoSuchFileException: /app/work/APOK-Contentsuche-Crawler/DVO_32_Crawler/committer/0/queue/batch-1650410122859000004\n\tat java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)\n\tat java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(Unknown Source)\n\tat java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(Unknown Source)\n\tat java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(Unknown Source)\n\tat java.base/java.nio.file.Files.readAttributes(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.getAttributes(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.visit(Unknown Source)\n\tat java.base/java.nio.file.FileTreeWalker.walk(Unknown Source)\n\tat java.base/java.nio.file.FileTreeIterator.<init>(Unknown Source)\n\tat java.base/java.nio.file.Files.find(Unknown Source)\n\tat com.norconex.committer.core3.batch.queue.impl.FSQueueUtil.findZipFiles(FSQueueUtil.java:77)\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.zipIterator(FSBatch.java:88)\n\tat com.norconex.committer.core3.batch.queue.impl.FSBatch.iterator(FSBatch.java:56)\n\t... 22 common frames omitted\n"}
Exception in thread "Additional Website Crawler" com.norconex.collector.core.CollectorException: Could not execute "close" on 1 committer(s): "CrawlerElasticsearchCommitter". Check the logs for more details.
    at com.norconex.collector.core.crawler.CrawlerCommitterService.executeAll(CrawlerCommitterService.java:140)
    at com.norconex.collector.core.crawler.CrawlerCommitterService.close(CrawlerCommitterService.java:118)
    at com.norconex.collector.core.crawler.Crawler$$Lambda$859/0x0000000028ac0690.accept(Unknown Source)
    at java.base/java.util.Optional.ifPresent(Unknown Source)
    at com.norconex.collector.core.crawler.Crawler.destroyCrawler(Crawler.java:401)
    at com.norconex.collector.core.crawler.Crawler.start(Crawler.java:281)
    at com.norconex.collector.core.Collector.lambda$null$2(Collector.java:233)
    at com.norconex.collector.core.Collector$$Lambda$736/0x00000000d5a2fde0.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
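The bottom of the stack trace shows the failure originating in the JDK's directory walk (Files.find → FileTreeWalker → readAttributes). The symptom can be reproduced in isolation with a minimal standalone sketch (this is not Norconex code; the class name and paths are made up): a batch file that another thread has already consumed and deleted is gone by the time the current thread reads its attributes, so the JDK throws NoSuchFileException, which the caller then sees wrapped in an UncheckedIOException.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Standalone sketch reproducing the bottom of the stack trace above:
// reading attributes of a queue file that a concurrent consumer has
// already deleted throws NoSuchFileException.
public class QueueRaceSketch {
    public static void main(String[] args) throws Exception {
        Path queueDir = Files.createTempDirectory("committer-queue");
        Path batch = Files.createFile(queueDir.resolve("batch-0001.zip"));

        // Simulate the other crawler's thread winning the race and
        // consuming (deleting) the batch file first.
        Files.delete(batch);

        try {
            // This is what FileTreeWalker does per entry during Files.find():
            Files.readAttributes(batch, BasicFileAttributes.class);
        } catch (NoSuchFileException e) {
            System.out.println("NoSuchFileException: " + e.getFile());
        }
    }
}
```

Retrying such a batch (as the Retrier does three times in the logs) cannot succeed, since the file no longer exists; the interesting question is why a thread of one crawler is consuming the committer queue of another crawler's workDir at all.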

Thank you in advance for your analysis!

UtsavVanodiya7 commented 2 years ago

Hello there,

We recently fixed an issue related to crawlers not finishing properly. You can try v3.0.1, available from the download page: https://opensource.norconex.com/crawlers/web/download

If you still face this issue with that version, please share your config files here.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.