Here's the tail of the log:
2019-01-26 04:39:10,211 [Crawler-20190126043059-89-4] ERROR Crawling Exception at https://www.advanceddisposal.com/for-business/collection-services/recycling.aspx
java.lang.IllegalStateException: Future got interrupted
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.get(AbstractCrawlerService.java:326) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:86) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:40) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.getParentEncoding(FessTransformer.java:258) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.decodeUrlAsName(FessTransformer.java:231) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.getFileName(FessTransformer.java:206) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessXpathTransformer.putAdditionalData(FessXpathTransformer.java:424) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessXpathTransformer.storeData(FessXpathTransformer.java:181) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:120) ~[fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77) ~[fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330) [fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240) ~[classes/:?]
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176) [fess-crawler-2.4.4.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) ~[?:1.8.0_191]
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
... 16 more
2019-01-26 04:39:10,222 [WebFsCrawler] INFO [EXEC TIME] crawling time: 484436ms
2019-01-26 04:39:10,222 [main] INFO Finished Crawler
2019-01-26 04:39:10,254 [main] INFO [CRAWL INFO] DataCrawlEndTime=2019-01-26T04:31:05.742+0000,CrawlerEndTime=2019-01-26T04:39:10.223+0000,WebFsCrawlExecTime=484436,CrawlerStatus=true,CrawlerStartTime=2019-01-26T04:31:05.704+0000,WebFsCrawlEndTime=2019-01-26T04:39:10.222+0000,WebFsIndexExecTime=329841,WebFsIndexSize=1738,CrawlerExecTime=484519,DataCrawlStartTime=2019-01-26T04:31:05.729+0000,WebFsCrawlStartTime=2019-01-26T04:31:05.729+0000
2019-01-26 04:39:42,416 [main] INFO Disconnected to elasticsearch:localhost:9300
2019-01-26 04:39:42,418 [Crawler-20190126043059-4-12] INFO I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Socket closed
2019-01-26 04:39:42,418 [Crawler-20190126043059-4-12] INFO Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 04:39:42,418 [Crawler-20190126043059-33-1] INFO I/O exception (java.net.SocketException) caught when processing request to {}->http://www.arborrealtytrust.com:80: Socket closed
2019-01-26 04:39:42,418 [Crawler-20190126043059-33-1] INFO Retrying request to {}->http://www.arborrealtytrust.com:80
2019-01-26 04:39:42,419 [Crawler-20190126043059-4-12] INFO Could not process http://www.americanaddictioncenters.com/robots.txt. Connection pool shut down
2019-01-26 04:39:42,419 [Crawler-20190126043059-33-1] INFO Could not process http://www.arborrealtytrust.com/robots.txt. Connection pool shut down
2019-01-26 04:39:43,414 [main] INFO Destroyed LaContainer.
Why did you set max_access_count to 32? Also, please check the elasticsearch log file.
I don't want to store too many pages per site at this point - I just want to check how many are going to be accessible. The elasticsearch logs seem to be OK. Should I look for something specific? I don't see anything other than some exceptions that I trigger while trying to graph things in Kibana. When I restart the default crawler job, it will work and throw in additional documents, but it seems to start with the same crawlers, so not many new documents get added to the index.
Here's a sample of my crawler config:
{
"name": "AMZN",
"urls": "http://www.amazon.com/
https://www.amazon.com/",
"included_urls": "http://www.amazon.com/.*
https://www.amazon.com/.*",
"excluded_urls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"num_of_thread": 16,
"interval_time": 10,
"sort_order": 1,
"boost": 1.0,
"available": "True",
"permissions": "{role}guest",
"depth": 5,
"max_access_count": 32
}
After I started the default crawler again, it worked for some time and then ended with:
2019-01-26 05:07:45,813 [Crawler-20190126045933-89-7] ERROR Crawling Exception at https://www.advanceddisposal.com/for-business/disposal-recycling-services/special-waste/blackfoot-landfill-winslow,-in.aspx
java.lang.IllegalStateException: Future got interrupted
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.get(AbstractCrawlerService.java:326) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:86) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:40) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.getParentEncoding(FessTransformer.java:258) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.decodeUrlAsName(FessTransformer.java:231) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessTransformer.getFileName(FessTransformer.java:206) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessXpathTransformer.putAdditionalData(FessXpathTransformer.java:424) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.FessXpathTransformer.storeData(FessXpathTransformer.java:181) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:120) ~[fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77) ~[fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330) [fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240) ~[classes/:?]
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176) [fess-crawler-2.4.4.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) ~[?:1.8.0_191]
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
... 16 more
2019-01-26 05:07:46,306 [WebFsCrawler] INFO [EXEC TIME] crawling time: 487118ms
2019-01-26 05:07:46,307 [main] INFO Finished Crawler
2019-01-26 05:07:46,327 [main] INFO [CRAWL INFO] DataCrawlEndTime=2019-01-26T04:59:39.151+0000,CrawlerEndTime=2019-01-26T05:07:46.307+0000,WebFsCrawlExecTime=487118,CrawlerStatus=true,CrawlerStartTime=2019-01-26T04:59:39.114+0000,WebFsCrawlEndTime=2019-01-26T05:07:46.306+0000,WebFsIndexExecTime=320871,WebFsIndexSize=1482,CrawlerExecTime=487193,DataCrawlStartTime=2019-01-26T04:59:39.136+0000,WebFsCrawlStartTime=2019-01-26T04:59:39.135+0000
2019-01-26 05:08:28,267 [main] INFO Disconnected to elasticsearch:localhost:9300
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Socket closed
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO I/O exception (java.net.SocketException) caught when processing request to {}->http://www.arborrealtytrust.com:80: Socket closed
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO Retrying request to {}->http://www.arborrealtytrust.com:80
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO Could not process http://www.arborrealtytrust.com/robots.txt. Connection pool shut down
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO Could not process http://www.americanaddictioncenters.com/robots.txt. Connection pool shut down
2019-01-26 05:08:29,264 [main] INFO Destroyed LaContainer.
I see many exceptions of this sort:
2019-01-26 05:07:31,313 [Crawler-20190126045933-88-12] ERROR Crawling Exception at https://www.autodesk.com/sitemap-autodesk-dotcom.xml
org.codelibs.fess.crawler.exception.EsAccessException: Failed to insert [UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlc3RlZWwvbGVhcm5pbmctY2VudGVy, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advancesteel/learning-center, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlc3RlZWwvd2hpdGVwYXBlcg, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advancesteel/whitepaper, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLWluLXRyaWFsLW1hcmtldGluZy9jb250YWN0LW1l, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-in-trial-marketing/contact-me, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLXBsYW50LWtvLWNvbnRhY3QtbWU, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-plant-ko-contact-me, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLXRyaWFs, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-trial, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY2xvdWQtY29ubmVjdGVkLXByb21vdGlvbg, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-cloud-connected-promotion, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGFib3JhdGlvbi1wcm9tb3Rpb24, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collaboration-promotion, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGFib3JhdGlvbi1wcm9tb3Rpb24tdGNz, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collaboration-promotion-tcs, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGVjdGlvbi1jb250YWN0LW1lLWVuLXVz, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collection-contact-me-en-us, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGVjdGlvbi1tdWx0aS11c2Vy, sessionId=20190126045933-88, method=GET, 
url=https://www.autodesk.com/campaigns/aec-collection-multi-user, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487]]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:301) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.lambda$insertAll$6(AbstractCrawlerService.java:243) ~[fess-crawler-es-2.4.4.jar:?]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) ~[?:1.8.0_191]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_191]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.insertAll(AbstractCrawlerService.java:240) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.EsUrlQueueService.offerAll(EsUrlQueueService.java:179) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.CrawlerThread.storeChildUrls(CrawlerThread.java:357) [fess-crawler-2.4.4.jar:?]
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:196) [fess-crawler-2.4.4.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.IllegalStateException: Future got interrupted
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:288) ~[fess-crawler-es-2.4.4.jar:?]
... 8 more
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) ~[?:1.8.0_191]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) ~[?:1.8.0_191]
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:288) ~[fess-crawler-es-2.4.4.jar:?]
... 8 more
Does it have something to do with the problem?
Did you check the elasticsearch log file?
Yes, I don't see anything bad in it; it's mostly lots of this:
[2019-01-26T04:01:26,239][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,248][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,256][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,259][INFO ][o.e.c.m.MetaDataMappingService] [t8cOK8R] [.fess_config.access_token/o-csvKs0SMyFvBIr2_wCEA] create_mapping [access_token]
[2019-01-26T04:01:26,400][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,408][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,417][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,439][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,447][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,456][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,465][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,466][INFO ][o.e.c.m.MetaDataCreateIndexService] [t8cOK8R] [.fess_config.bad_word] creating index, cause [api], templates [], shards [1]/[0], mappings []
[2019-01-26T04:01:26,482][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,491][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,499][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,521][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,529][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,538][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,546][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,588][INFO ][o.e.c.r.a.AllocationService] [t8cOK8R] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.fess_config.bad_word][0]] ...]).
[2019-01-26T04:01:26,635][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,643][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,651][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,672][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,681][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
It seems to be a logging problem... I'll improve it in a future release.
Also, here is another scenario: the default crawler is running but not really capturing any pages; it just keeps reporting the system info and "Processing no docs", and that's all.
2019-01-26 06:47:44,194 [Crawler-20190126061518-99-2] INFO Crawling URL: https://www.ameren.com/business-partners/account-data-management
2019-01-26 06:47:44,414 [Crawler-20190126061518-99-2] INFO Crawling URL: https://www.ameren.com/business-partners/real-estate
2019-01-26 06:47:44,656 [Crawler-20190126061518-99-2] INFO Crawling URL: https://www.ameren.com/sustainability/employee-sustainability
2019-01-26 06:47:45,976 [Crawler-20190126061518-99-4] INFO META(robots=noindex,nofollow): https://www.ameren.com/account/forgot-userid
2019-01-26 06:47:47,615 [IndexUpdater] INFO Processing 20/31 docs (Doc:{access 7ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:47,700 [IndexUpdater] INFO Processing 1/11 docs (Doc:{access 3ms, cleanup 30ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:47,717 [IndexUpdater] INFO Processing 0/10 docs (Doc:{access 4ms, cleanup 10ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,720 [IndexUpdater] INFO Processing 10/10 docs (Doc:{access 3ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,765 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,920 [IndexUpdater] INFO Sent 32 docs (Doc:{process 84ms, send 155ms, size 1MB}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,922 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,949 [IndexUpdater] INFO Deleted completed document data. The execution time is 27ms.
2019-01-26 06:47:56,313 [Crawler-20190126061518-100-1] INFO Crawling URL: http://www.aberdeench.com/
2019-01-26 06:47:56,316 [Crawler-20190126061518-100-1] INFO Checking URL: http://www.aberdeench.com/robots.txt
2019-01-26 06:47:57,132 [Crawler-20190126061518-100-1] INFO Redirect to URL: http://www.aberdeenaef.com/
2019-01-26 06:48:02,921 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:48:07,813 [Crawler-20190126061518-4-1] INFO I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Connection timed out (Read failed)
2019-01-26 06:48:07,813 [Crawler-20190126061518-4-1] INFO Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 06:48:12,922 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 791MB, heap 1GB, max 31GB})
2019-01-26 06:48:22,922 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 792MB, heap 1GB, max 31GB})
2019-01-26 06:48:22,933 [IndexUpdater] INFO Deleted completed document data. The execution time is 11ms.
2019-01-26 06:48:25,904 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82318585856,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[1.08, 2.16, 2.67]},"process":{"file_descriptor":{"open":477,"max":65535},"cpu":{"percent":0,"total":314320},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":832015712,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136460680,"committed":144531456}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":1987680},"elasticsearch":null,"timestamp":1548485305904}
2019-01-26 06:48:32,921 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 795MB, heap 1GB, max 31GB})
2019-01-26 06:48:42,922 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 797MB, heap 1GB, max 31GB})
2019-01-26 06:48:52,921 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 799MB, heap 1GB, max 31GB})
2019-01-26 06:49:02,922 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 800MB, heap 1GB, max 31GB})
2019-01-26 06:49:12,921 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 802MB, heap 1GB, max 31GB})
2019-01-26 06:49:22,922 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 804MB, heap 1GB, max 31GB})
2019-01-26 06:49:25,954 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82327228416,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.4, 1.77, 2.5]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":315460},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":844083448,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136528264,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2047731},"elasticsearch":null,"timestamp":1548485365953}
2019-01-26 06:49:32,922 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 806MB, heap 1GB, max 31GB})
2019-01-26 06:49:42,923 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 808MB, heap 1GB, max 31GB})
2019-01-26 06:49:52,924 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 810MB, heap 1GB, max 31GB})
2019-01-26 06:50:02,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 812MB, heap 1GB, max 31GB})
2019-01-26 06:50:12,923 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 813MB, heap 1GB, max 31GB})
2019-01-26 06:50:22,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 815MB, heap 1GB, max 31GB})
2019-01-26 06:50:25,999 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82332979200,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.14, 1.44, 2.34]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":316070},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":856040960,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136556104,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2107776},"elasticsearch":null,"timestamp":1548485425999}
2019-01-26 06:50:32,923 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 818MB, heap 1GB, max 31GB})
2019-01-26 06:50:42,924 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 820MB, heap 1GB, max 31GB})
2019-01-26 06:50:52,923 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 821MB, heap 1GB, max 31GB})
2019-01-26 06:51:02,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 823MB, heap 1GB, max 31GB})
2019-01-26 06:51:12,925 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 825MB, heap 1GB, max 31GB})
2019-01-26 06:51:22,925 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 827MB, heap 1GB, max 31GB})
2019-01-26 06:51:26,044 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82317217792,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.05, 1.18, 2.2]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":316590},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":867934096,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136557432,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2167821},"elasticsearch":null,"timestamp":1548485486044}
2019-01-26 06:51:32,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 829MB, heap 1GB, max 31GB})
2019-01-26 06:51:42,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 831MB, heap 1GB, max 31GB})
2019-01-26 06:51:52,925 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 833MB, heap 1GB, max 31GB})
2019-01-26 06:52:02,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 834MB, heap 1GB, max 31GB})
2019-01-26 06:52:12,925 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 836MB, heap 1GB, max 31GB})
2019-01-26 06:52:22,924 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 838MB, heap 1GB, max 31GB})
The crawler process seems to be there but isn't doing anything:
sudo jps -v
43781 FessBoot -Xms512m -Xmx512m -Djava.awt.headless=true -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Dgroovy.use.classvalue=true -Dfess.home=/usr/share/fess -Dfess.context.path=/ -Dfess.port=8080 -Dfess.webapp.path=/usr/share/fess/app -Dfess.temp.path=/var/tmp/fess -Dfess.log.name=fess -Dfess.log.path=/var/log/fess -Dfess.log.level=warn -Dlasta.env=web -Dtomcat.config.path=tomcat_config.properties -Dfess.conf.path=/etc/fess -Dfess.var.path=/var/lib/fess -Dfess.es.http_address=http://localhost:9200 -Dfess.es.transport_addresses=localhost:9300 -Dfess.dictionary.path=/var/lib/elasticsearch/config/ -Dfess -Dfess.foreground=yes -Dfess.es.dir=/usr/share/elas
60166 Jps -Dapplication.home=/usr/lib/jvm/java-8-openjdk-amd64 -Xms8m
49863 Crawler -Dfess.es.transport_addresses=localhost:9300 -Dfess.es.cluster_name=elasticsearch -Dlasta.env=crawler -Dfess.conf.path=/etc/fess -Dfess.crawler.process=true -Dfess.log.path=/var/log/fess -Dfess.log.name=fess-crawler -Dfess.log.level=info -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Xmx32G -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=128m -XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -Djcifs.smb.client.responseTimeout=30000 -Djcifs.smb.client.soTimeout=35000 -Djcifs.smb.client.connTimeout=60000 -Djcifs.smb.client.sessionTimeout=60000 -Dgroovy.use.classvalue=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Dsun.java2d.cmm=sun.j
"no docs" means the indexing queue is empty.
Yeah, but I had 5K web crawlers with one site each, and it only collected documents (one or more) for 125 of them. Something is not working out for me.
If you can provide steps to reproduce it in my environment, I'll check it.
Well, I configure FESS per the instructions, then create 5K web crawlers with the following config via the admin API:
{
"name": "AMZN",
"urls": "http://www.amazon.com/
https://www.amazon.com/",
"included_urls": "http://www.amazon.com/.*
https://www.amazon.com/.*",
"excluded_urls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"num_of_thread": 6,
"interval_time": 10,
"sort_order": 1,
"boost": 1.0,
"available": "True",
"permissions": "{role}guest",
"depth": 5,
"max_access_count": 32
}
The only fields that differ between them are name, urls, and included_urls. I disable the thumbnail generator job and start the default crawler. After some time, the default crawler just stops, and it only leaves documents for about 100 config_ids or so.
I can get you the list of sites or maybe dump the config_web index - let me know how I can help to reproduce it.
I can generate a large json with all the admin api objects for creating the web crawlers if that helps.
Could you provide actual steps to do that? e.g. 1) ... 2) ... 3) ... Commands you executed or the like are better.
Sorry, maybe I don't understand, but this is what I do:
I PUT to the admin API to create the web crawlers. After the crawlers are created, I go to the FESS admin UI to check that they exist there (see the screenshot below for how it looks in the UI). Does this make sense?
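For concreteness, creating a single web crawler config comes down to a call like the one below (a sketch only: the PUT endpoint, Bearer header, and JSON body match what is shown later in this thread, while the host, the TOKEN variable, and webconfig.json are placeholders):
# Hypothetical example of creating one web crawler config via the admin API.
# Assumes Fess on localhost:8080, an admin API access token in $TOKEN, and a
# webconfig.json file containing a body like the config shown above.
curl -X PUT "http://localhost:8080/api/admin/webconfig/setting" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @webconfig.json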
I think I have a little more info. I tried again on a clean box. The symptoms are very similar: Default Crawler says it's running on the Scheduling page, but in fact it's not crawling anything after about the 100th site or so.
mkdir tmp && cd tmp
wget https://github.com/codelibs/fess/releases/download/fess-12.4.3/fess-12.4.3.deb
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.5.4-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.deb
sudo apt-get update
sudo apt-get install openjdk-8-jre
sudo apt install ./elasticsearch-6.5.4.deb
sudo apt-get install ./kibana-6.5.4-amd64.deb
sudo apt install ./fess-12.4.3.deb
sudo nano /etc/elasticsearch/elasticsearch.yml
^ added configsync.config_path: /var/lib/elasticsearch/config
sudo nano /etc/elasticsearch/jvm.options
^ added memory
sudo nano /etc/kibana/kibana.yml
^ added nonloopback to access externally
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-fess:6.5.0
/usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-configsync:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-dataformat:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-langfield:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-minhash:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-learning-to-rank:6.5.0
sudo service elasticsearch start
sudo service kibana start
sudo service fess start
sudo tail -f /var/log/fess/fess-crawler.log
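Before going further, a quick sanity check that both services answer can help (a minimal sketch, assuming the ports used in this setup: elasticsearch on 9200 and Fess on 8080):
# elasticsearch should answer with its cluster info JSON
curl -s http://localhost:9200/
# the Fess top page should return an HTTP 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/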
The next step is to go to Access token and create one; it looks like this in elasticsearch:
{
"_index": ".fess_config.access_token",
"_type": "access_token",
"_id": "S4iBlWgBnlDwesKS-gLp",
"_version": 1,
"_score": 1,
"_source": {
"updatedTime": 1548696550112,
"updatedBy": "admin",
"createdBy": "admin",
"permissions": [
"Radmin-api"
],
"name": "Bot",
"createdTime": 1548696550112,
"token": "klbmvQ4UJNXspPE1pOMT1mUqQPNGGk1jgNEaX#mO7RGkVz1EkIGL8sk8iZk1"
}
}
Now, having the access token, I create a lot of web crawlers; they look like this in elasticsearch:
{
"_index": ".fess_config.web_config",
"_type": "web_config",
"_id": "UYiDlWgBnlDwesKS2gLx",
"_version": 1,
"_score": 1,
"_source": {
"updatedTime": 1548696673008,
"includedUrls": "http://www.kroger.com/.* \n https://www.kroger.com/.*",
"virtualHosts": [],
"updatedBy": "guest",
"excludedUrls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
"available": true,
"maxAccessCount": 24,
"userAgent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"numOfThread": 8,
"urls": "http://www.kroger.com/ \n https://www.kroger.com/",
"depth": 5,
"createdBy": "guest",
"permissions": [
"Rguest"
],
"sortOrder": 1,
"name": "KR",
"createdTime": 1548696673008,
"boost": 1,
"intervalTime": 10
}
}
And then I change the Remove Documents Before param to 100 to avoid having FESS delete documents from the index. I also disable the thumbnail generation jobs and whatnot, and start the Default Crawler.
The crawler is reporting active, but for a long time now it hasn't crawled a single page. I have 1000 crawlers in the system, but it has only crawled about 100 of them and stopped crawling further pages.
I don't see anything in the logs that reports a problem. Let me know if you want me to look elsewhere.
This is how far it went with 1000 web crawlers and max_access_count=24 for each: it looks like after the 100th crawler or so, it will not crawl for further web crawlers. Has anyone tried using many crawlers?
I dumped the 1000 web crawlers in here: https://gist.github.com/abolotnov/5ac96731651300c248807b021c183718
let me know how else I can help nail this down.
full crawler log here: https://gist.github.com/abolotnov/09bc44869f0dc0672992f3491e52db15
It looks like at startup it only picks up a portion of the crawlers, not all 1000.
Change the following value in fess_config.properties:
page.web.config.max.fetch.size=100
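One way to apply that change could look like this (a sketch, assuming fess_config.properties lives under /etc/fess as in this install; 10000 is the value used further down in this thread):
# Drop any existing line for the property, append the new value, and restart Fess
# so the crawler job picks up the higher fetch size.
sudo sed -i '/^page\.web\.config\.max\.fetch\.size=/d' /etc/fess/fess_config.properties
echo 'page.web.config.max.fetch.size=10000' | sudo tee -a /etc/fess/fess_config.properties
sudo service fess restart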
this could be it, I will try and report back! Thanks a lot.
After changing the config and restarting FESS, is there anything else I need to do before I start the Default Crawler?
I changed page.web.config.max.fetch.size to 10000, and after I restarted FESS and started the crawler, I can see the config has picked up additional sites:
...
2019-01-28 20:51:05,110 [WebFsCrawler] INFO Target URL: http://www.gapinc.com/
2019-01-28 20:51:05,110 [WebFsCrawler] INFO Included URL: http://www.gapinc.com/.*
2019-01-28 20:51:05,110 [WebFsCrawler] INFO Included URL: https://www.gapinc.com/.*
2019-01-28 20:51:05,110 [WebFsCrawler] INFO Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
2019-01-28 20:51:05,121 [WebFsCrawler] INFO Target URL: http://www.grifols.com/
2019-01-28 20:51:05,121 [WebFsCrawler] INFO Included URL: http://www.grifols.com/.*
2019-01-28 20:51:05,122 [WebFsCrawler] INFO Included URL: https://www.grifols.com/.*
2019-01-28 20:51:05,122 [WebFsCrawler] INFO Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
2019-01-28 20:51:05,132 [WebFsCrawler] INFO Target URL: http://www.garmin.com/
2019-01-28 20:51:05,132 [WebFsCrawler] INFO Included URL: http://www.garmin.com/.*
2019-01-28 20:51:05,132 [WebFsCrawler] INFO Included URL: https://www.garmin.com/.*
2019-01-28 20:51:05,132 [WebFsCrawler] INFO Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
...
But after working for some time, it again stopped:
Default Crawler shows as Running in System > Scheduling, but there are no new crawling events or documents added to the index; this is all that's in the logs now:
2019-01-28 21:28:01,310 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":15566577664,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.16, 1.22]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":387190},"virtual_memory":{"total":5500514304}},"jvm":{"memory":{"heap":{"used":310984328,"committed":514654208,"max":518979584,"percent":59},"non_heap":{"used":144869488,"committed":158670848}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":422,"time":3118},"old":{"count":17,"time":882}},"threads":{"count":103,"peak":104},"classes":{"loaded":13002,"total_loaded":14178,"unloaded":1176},"uptime":2226970},"elasticsearch":null,"timestamp":1548710881310}
2019-01-28 21:28:08,734 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 11ms}, Mem:{used 300MB, heap 490MB, max 494MB})
2019-01-28 21:28:18,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 304MB, heap 490MB, max 494MB})
2019-01-28 21:28:28,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 308MB, heap 490MB, max 494MB})
2019-01-28 21:28:38,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 312MB, heap 490MB, max 494MB})
2019-01-28 21:28:48,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 316MB, heap 490MB, max 494MB})
2019-01-28 21:28:58,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 320MB, heap 490MB, max 494MB})
2019-01-28 21:29:01,337 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":15565189120,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.13, 1.14]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":388230},"virtual_memory":{"total":5500514304}},"jvm":{"memory":{"heap":{"used":337573088,"committed":514654208,"max":518979584,"percent":65},"non_heap":{"used":144861360,"committed":158670848}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":422,"time":3118},"old":{"count":17,"time":882}},"threads":{"count":103,"peak":104},"classes":{"loaded":13002,"total_loaded":14178,"unloaded":1176},"uptime":2286997},"elasticsearch":null,"timestamp":1548710941337}
2019-01-28 21:29:08,733 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 325MB, heap 490MB, max 494MB})
I appreciate your help so much, please help me fix this. Maybe there is something else small that needs to be fixed? Do you want me to remove all indexes, restart elasticsearch, and recreate the crawlers?
Disable the Check Last Modified flag.
OK, I disabled this in the General config and restarted the Default Crawler job. Will report.
It worked for a little bit and stopped crawling again:
2019-01-28 22:10:28,840 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":14454079488,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.21, 1.25]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":367530},"virtual_memory":{"total":5502623744}},"jvm":{"memory":{"heap":{"used":308023528,"committed":518979584,"max":518979584,"percent":59},"non_heap":{"used":145401592,"committed":159129600}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":417,"time":3188},"old":{"count":19,"time":1227}},"threads":{"count":103,"peak":104},"classes":{"loaded":13015,"total_loaded":14199,"unloaded":1184},"uptime":1986435},"elasticsearch":null,"timestamp":1548713428840}
2019-01-28 22:10:31,115 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 296MB, heap 494MB, max 494MB})
2019-01-28 22:10:41,116 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 9ms}, Mem:{used 299MB, heap 494MB, max 494MB})
2019-01-28 22:10:51,115 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 303MB, heap 494MB, max 494MB})
2019-01-28 22:11:01,115 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 307MB, heap 494MB, max 494MB})
2019-01-28 22:11:11,116 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 9ms}, Mem:{used 312MB, heap 494MB, max 494MB})
There are over 4K objects in the .crawler.queue index, too, and the number hasn't changed for some time. Thirty minutes ago it was over 7K, and now it has frozen at 4K.
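For watching the queue, a simple count query against elasticsearch works (a sketch, assuming elasticsearch on localhost:9200 and the .crawler.queue index name used above; the meta index names may differ between Fess versions):
# Number of queued URLs across all crawler sessions
curl -s 'http://localhost:9200/.crawler.queue/_count?pretty'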
I need more details... Fess has many settings in fess_config.properties and the like to adapt it to different situations. You can check the source code to change them, or contact Commercial Support if you need more support.
What kind of details do you want me to give you? At this point, I have tried three times with a clean install and enabled debug logging, so there is plenty of logging.
I think this is a defect, not a configuration issue: at this point, I only loaded 300 sites, they all got picked up by the crawler, but eventually, after some time, the crawling activity stopped. I have 87 items in the .crawler.queue that are just sitting there. The default crawler is running on the scheduler page, but there is no crawling or document indexing activity.
Re commercial support - I'd probably get it at some point to help with a few things that I am missing in FESS. But while I'm still evaluating whether I want to use FESS in the first place, not being able to solve a very fundamental issue (crawlers not crawling all of the sites registered with them) and having to go to commercial support for that is probably a bit of a stretch.
I can help further to find the issue and fix it if you are interested in making sure it's working properly. Am I the only guy trying to use FESS for a large number of sites? It may be helpful for the project to use my use case to fix an issue or two.
OK, so as I mentioned earlier, I did another clean install on AWS: a 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux instance with 8 vCPUs and 16G RAM.
Installed as per instructions:
mkdir tmp && cd tmp
wget https://github.com/codelibs/fess/releases/download/fess-12.4.3/fess-12.4.3.deb
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.5.4-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.deb
sudo apt-get update
sudo apt-get install openjdk-8-jre
sudo apt install ./elasticsearch-6.5.4.deb
sudo apt-get install ./kibana-6.5.4-amd64.deb
sudo apt install ./fess-12.4.3.deb
sudo nano /etc/elasticsearch/elasticsearch.yml
^ added configsync.config_path: /var/lib/elasticsearch/config
sudo nano /etc/elasticsearch/jvm.options
^ added memory
sudo nano /etc/kibana/kibana.yml
^ added nonloopback to access externally
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-fess:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-configsync:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-dataformat:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-langfield:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-minhash:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-learning-to-rank:6.5.0
sudo service elasticsearch start
sudo service kibana start
sudo service fess start
Then I change the page.web.config.max.fetch.size setting to 10000 and increase a couple of other figures. After that, I create the web crawlers via the REST API, like this:
import json
import requests

# "models", url_name_enhancer and url_included_enhancer come from the surrounding Django
# project: Organization is a model, and the two helpers build the JSON string values for
# the "urls"/"included_urls" fields.
data_template = """{
    \"name\": \"%name%\",
    \"urls\": %url%,
    \"included_urls\": %included%,
    \"excluded_urls\": \".*\\\\.gif .*\\\\.jpg .*\\\\.jpeg .*\\\\.jpe .*\\\\.pcx .*\\\\.png .*\\\\.tiff .*\\\\.bmp .*\\\\.ics .*\\\\.msg .*\\\\.css .*\\\\.js\",
    \"user_agent\": \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36\",
    \"num_of_thread\": 8,
    \"interval_time\": 10,
    \"sort_order\": 1,
    \"boost\": 1.0,
    \"available\": \"True\",
    \"permissions\": \"{role}guest\",
    \"depth\": 5,
    \"max_access_count\": 12
}"""

bearer_token = 'cXq6gk8tiq8xYJP7T7pBgq0K7tsv761Sbt544ElQomYrHkm7mzfCnjPn7GQv'
r_header = {'Authorization': 'Bearer {}'.format(bearer_token), 'Content-type': 'application/json'}
r_url = 'http://35.163.6.168:8080/api/admin/webconfig/setting'

orgs_to_index = models.Organization.objects.filter(website__startswith='http://')[0:300]
for o in orgs_to_index:
    # Fill in the per-site values and create the web config via the admin API
    payload = data_template.replace('%name%', o.symbol).replace('%url%', url_name_enhancer(o.website)).replace('%included%', url_included_enhancer(o.website))
    outcome = requests.put(url=r_url, data=payload, headers=r_header)
    try:
        crawler_id = json.loads(outcome.text)['response']['id']
        o.elastic_config_id = crawler_id  # remember the created web config id
        o.save()
    except Exception as e:
        print(e)
At this point, the web configs show up in the FESS UI, too. Here's one from the elasticsearch index:
{
"_index": ".fess_config.web_config",
"_type": "web_config",
"_id": "GIkHl2gBC0KNaBG5Jso_",
"_version": 1,
"_score": 1,
"_source": {
"updatedTime": 1548722054718,
"includedUrls": "http://www.berkshirehathaway.com/.* \n https://www.berkshirehathaway.com/.*",
"virtualHosts": [],
"updatedBy": "guest",
"excludedUrls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
"available": true,
"maxAccessCount": 12,
"userAgent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"numOfThread": 8,
"urls": "http://www.berkshirehathaway.com/ \n https://www.berkshirehathaway.com/",
"depth": 5,
"createdBy": "guest",
"permissions": [
"Rguest"
],
"sortOrder": 1,
"name": "BRK.B",
"createdTime": 1548722054718,
"boost": 1,
"intervalTime": 10
}
}
So at this point, the only couple of things I changed are debug logging in the general settings and the Check Last Modified flag.
Similar to the other times I did this, the symptoms are very much the same: FESS stops crawling at some point. The farthest I got was 162 sites; the crawling queue at this point has 72 sites. The Scheduler page is showing the default crawler as running. It basically just stops crawling without saying anything, and it looks like the crawling job is hanging there endlessly. I will leave it like this overnight, but I am pretty positive it's not going to finish.
I copied the logs and essential configs here: https://www.dropbox.com/sh/ci7yfu5pbv7xa8v/AACmREoBf9Cc9GLgA7dyGUDYa?dl=0
Let me know if there is anything else in terms of "more info" that I could supply. I am about to give up.
"more info" is steps to reproduce it.
2019-01-29 01:19:25,945 [Crawler-20190129003919-213-1] INFO Checking URL: http://www.oreillyauto.com/robots.txt
The above access seems to be blocked. To apply debug-level logging to the crawler, change logLevel to debug for Default Crawler in the Scheduler.
I gave you the exact steps, including commands and the list of crawlers that I use to reproduce this. I am happy to provide more data, let me know what exactly.
Overnight I see there are about 12,000 objects in the .crawler_queue (there were 72 six hours ago), so the list has grown. The Default crawler is shown as Active on the Scheduling page.
I have changed the logging for the default crawler to debug; do you want me to start it?
But I think the question is: Why does not the default crawler run when there are remaining items in the crawling queue?
Why does not the default crawler run when there are remaining items in the crawling queue?
You set maxAccessCount to 12 to stop crawling.
As for a Fess user I know, https://www.chizen.jp/ crawls massive numbers of sites. It's like Google.
The maxAccessCount in a WebCrawler config controls how many documents will be indexed for a particular site. It does not mean that it will only crawl 12 pages across all the crawlers. At least, this is how I understand it. Let me know if my understanding is wrong.
Now, when I crawl a single site or 10 sites, regardless of the maxAccessCount setting, it works just fine. I tried one site unlimited and it did that well. The difference from many other use cases is that I crawl many individual sites. I need to do 8K but can't get it to finish 300.
The crawling queue is just a queue of URLs to crawl. The .crawler* indices are meta indices used to crawl documents, not for document search. So, you should check the blocked access in fess-crawler.log with debug logging.
I understand this. My expectation was that if I created 300 web crawlers with maxAccessCount of 12 each, I would end up with 300 * 12 = 3,600 documents in the index. And that is roughly what it has done, but only for 162 sites out of 300.
My understanding was that the default crawler will not stop crawling until the crawl_queue is empty. As for failed URLs - once a URL fails, it removes that URL from the queue and proceeds with the other URLs, right?
Ok, I enabled debug mode and will start the crawler now.
OK, I started the default crawler; there are 14,000 URLs in the queue. Is my assertion right that the crawler should run until the queue is empty?
There are 150K items in the queue now.
Ok it worked for some time and again stopped collecting documents into the index.
I see a lot of stuff in the logs like this:
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,426 [Crawler-20190129142043-32-7] DEBUG The url is null. (16074)
2019-01-29 16:41:14,426 [Crawler-20190129142043-169-8] DEBUG The url is null. (12341)
2019-01-29 16:41:14,426 [Crawler-20190129142043-213-5] DEBUG The url is null. (10563)
2019-01-29 16:41:14,426 [Crawler-20190129142043-168-7] DEBUG The url is null. (12360)
2019-01-29 16:41:14,426 [Crawler-20190129142043-43-8] DEBUG The url is null. (15927)
2019-01-29 16:41:14,431 [Crawler-20190129142043-32-1] DEBUG The url is null. (16073)
2019-01-29 16:41:14,431 [Crawler-20190129142043-43-3] DEBUG The url is null. (15928)
2019-01-29 16:41:14,501 [Crawler-20190129142043-168-4] DEBUG The url is null. (12360)
.crawl_queue has over 12K items in it.
First, Crawler seems to be blocked, not stopped. So, you need to check last accesses in fess-crawler.log.
My understanding was that the default crawler will not stop crawling until the crawl_queue is empty.
Documents in meta indices will be removed when Crawler finishes all crawling. Therefore, it's not empty when crawling or blocked. The number of documents in the meta indices does not help to solve your problem.
OK, thank you for still helping me through this struggle. I have collected 2.1G worth of debug logs. It is not possible to review the whole thing. Can you recommend the keywords/phrases I should be looking for?
So, you need to check the last accesses in fess-crawler.log.
Like literally "last access"? cat fess-crawler.log | grep -i "last access" yields nothing. Grepping for "access" gives over 40K lines.
There are a few like this: 2019-01-29 15:09:51,796 [CoreLib-TimeoutManager] DEBUG Failed to access Elasticsearch stats.
But most are just messages like 2019-01-29 14:22:31,704 [Crawler-20190129142043-17-1] DEBUG Accessing https://www.antheminc.com/
or 2019-01-29 14:43:26,019 [Crawler-20190129142043-150-5] DEBUG Processing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-150, ruleId=webHtmlRule ...
or 2019-01-29 14:44:39,229 [Crawler-20190129142043-153-6] DEBUG Storing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-153, ruleId=webHtmlRule ...
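In case it helps, this is what I'm using to pull the last "Accessing" line per crawler thread (it assumes the log layout shown above, where the bracketed thread name is the third whitespace-separated field):
# keep only the newest "Accessing" line for each crawler thread, then sort by timestamp
grep 'DEBUG Accessing ' fess-crawler.log \
  | awk '{ last[$3] = $0 } END { for (t in last) print last[t] }' \
  | sort
Threads whose last access is much older than the rest should stand out in that list.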
No... As mentioned, the problem is that the access was blocked.
2019-01-29 01:19:25,945 [Crawler-20190129003919-213-1] INFO Checking URL: http://www.oreillyauto.com/robots.txt
I think it's better to put your log files somewhere...
Sure, here's the link to the log gz: https://www.dropbox.com/s/bxj61eyr05rswy6/crawler.log.gz?dl=0
Let me know if you want me to grep something out of it; it's big :(
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> GET /ro-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Host: www.ibm.com
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Connection: Keep-Alive
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Accept-Encoding: gzip,deflate
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "GET /ro-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "[\r][\n]"
The above log means there was no response from www.ibm.com.
So, I think it's a problem with your network or the like, not with Fess.
If you use a t?.* instance in AWS, you need to change to another instance type.
I use c* and m* instances and they are on 10G networks.
I wonder what exactly indicates that the request was blocked, though?
The same worker/thread also had this in the logs for the same host:
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> GET /uk-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Host: www.ibm.com
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Connection: Keep-Alive
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Accept-Encoding: gzip,deflate
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "GET /uk-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "HTTP/1.1 200 OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Backside-Transport: OK OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Encoding: gzip[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Type: text/xml; charset=utf-8[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "ETag: W/"721e6-2nmQElG0g2j8bIaNrL6mOty9wBU"[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Global-Transaction-ID: 3344377617[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Length: 27965[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Date: Tue, 29 Jan 2019 14:53:27 GMT[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Connection: keep-alive[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Vary: Accept-Encoding[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Robots-Tag: noindex[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Content-Type-Options: nosniff[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-XSS-Protection: 1; mode=block[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Security-Policy: upgrade-insecure-requests[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Strict-Transport-Security: max-age=31536000[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "[\r][\n]"
2019-01-29 14:53:2
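To rule out a plain connectivity problem from the crawler host, I can fetch the same URL with curl, copying the User-Agent from the request above (URL and scheme taken from the capture, so treat this as a sketch of the check rather than the exact request Fess makes):
# request the sitemap the crawler was stuck on, discarding the body but showing headers
curl -sv --compressed \
  -A 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36' \
  -o /dev/null \
  'http://www.ibm.com/uk-en/marketplace/sitemap.xml'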
Does it make sense to play with Excluded Failure Type?
I guess what I could try instead is to build a single EC2 image with a clean install of Fess configured against a remote Elasticsearch instance. For the Elasticsearch box I would get something big, maybe 96 CPUs or so, for a large index thread pool and all. Then spin up 100 instances of Fess and have each work on 100 sites at a time; Fess seems to process 100 sites fairly well. If I go with a1.large instances, it's a fair deal I guess. I only collect about 1,000 pages per site, so it should not take too long. What do you think?
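As a rough sketch of that fan-out (assuming a hypothetical sites.txt with one start URL per line, which is not something Fess gives you out of the box), I would just chunk the list and hand each instance its own slice:
# split the start URLs into 100-line chunks: site_chunk_aa, site_chunk_ab, ...
split -l 100 sites.txt site_chunk_
With roughly 8K sites at 100 per instance that's about 80 chunks, so 100 instances leaves some headroom.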
I have around 5K web crawler configs with max_access = 32. When I start the default crawler, it only seems to do a portion of these, maybe 100, and then stops. The only suspicious thing I see in the logs is Future got interrupted. Otherwise everything looks OK, but it doesn't even touch most of the sites.