codelibs / fess

Fess is a very powerful and easily deployable Enterprise Search Server.
https://fess.codelibs.org
Apache License 2.0

Default crawler does not run all Web Crawlers #1992

Closed - abolotnov closed this issue 4 years ago

abolotnov commented 5 years ago

I have around 5K web crawlers configured with max_access = 32. When I start the default crawler, it only seems to do a portion of these - maybe 100 - and then stops. The only suspicious thing I see in the logs is "Future got interrupted". Otherwise everything seems to look OK, but it doesn't even touch most of the sites.

abolotnov commented 5 years ago

Here's the tail of the log:

2019-01-26 04:39:10,211 [Crawler-20190126043059-89-4] ERROR Crawling Exception at https://www.advanceddisposal.com/for-business/collection-services/recycling.aspx
java.lang.IllegalStateException: Future got interrupted
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.get(AbstractCrawlerService.java:326) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:86) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:40) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.getParentEncoding(FessTransformer.java:258) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.decodeUrlAsName(FessTransformer.java:231) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.getFileName(FessTransformer.java:206) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessXpathTransformer.putAdditionalData(FessXpathTransformer.java:424) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessXpathTransformer.storeData(FessXpathTransformer.java:181) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:120) ~[fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77) ~[fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330) [fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240) ~[classes/:?]
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176) [fess-crawler-2.4.4.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) ~[?:1.8.0_191]
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
    ... 16 more
2019-01-26 04:39:10,222 [WebFsCrawler] INFO  [EXEC TIME] crawling time: 484436ms
2019-01-26 04:39:10,222 [main] INFO  Finished Crawler
2019-01-26 04:39:10,254 [main] INFO  [CRAWL INFO] DataCrawlEndTime=2019-01-26T04:31:05.742+0000,CrawlerEndTime=2019-01-26T04:39:10.223+0000,WebFsCrawlExecTime=484436,CrawlerStatus=true,CrawlerStartTime=2019-01-26T04:31:05.704+0000,WebFsCrawlEndTime=2019-01-26T04:39:10.222+0000,WebFsIndexExecTime=329841,WebFsIndexSize=1738,CrawlerExecTime=484519,DataCrawlStartTime=2019-01-26T04:31:05.729+0000,WebFsCrawlStartTime=2019-01-26T04:31:05.729+0000
2019-01-26 04:39:42,416 [main] INFO  Disconnected to elasticsearch:localhost:9300
2019-01-26 04:39:42,418 [Crawler-20190126043059-4-12] INFO  I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Socket closed
2019-01-26 04:39:42,418 [Crawler-20190126043059-4-12] INFO  Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 04:39:42,418 [Crawler-20190126043059-33-1] INFO  I/O exception (java.net.SocketException) caught when processing request to {}->http://www.arborrealtytrust.com:80: Socket closed
2019-01-26 04:39:42,418 [Crawler-20190126043059-33-1] INFO  Retrying request to {}->http://www.arborrealtytrust.com:80
2019-01-26 04:39:42,419 [Crawler-20190126043059-4-12] INFO  Could not process http://www.americanaddictioncenters.com/robots.txt. Connection pool shut down
2019-01-26 04:39:42,419 [Crawler-20190126043059-33-1] INFO  Could not process http://www.arborrealtytrust.com/robots.txt. Connection pool shut down
2019-01-26 04:39:43,414 [main] INFO  Destroyed LaContainer.
marevol commented 5 years ago

Why did you set max_access = 32? Also, please check the elasticsearch log file.

abolotnov commented 5 years ago

I don't want to store too many pages per site at this point - I just want to check how many are going to be accessible. The elasticsearch logs seem to be OK. Should I look for something? I don't see anything other than some exceptions that I trigger while trying to graph things in Kibana. When I restart the default crawler job, it works and throws in additional documents, but it seems to start with the same crawlers, so not many new documents get added to the index.

abolotnov commented 5 years ago

Here's a sample of my crawler config:

{
    "name": "AMZN",
    "urls": "http://www.amazon.com/ \n https://www.amazon.com/",
    "included_urls": "http://www.amazon.com/.* \n https://www.amazon.com/.*",
    "excluded_urls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "num_of_thread": 16,
    "interval_time": 10,
    "sort_order": 1,
    "boost": 1.0,
    "available": "True",
    "permissions": "{role}guest",
    "depth": 5,
    "max_access_count": 32
}
abolotnov commented 5 years ago

After I started the default crawler again, it worked for some time and then ended:

2019-01-26 05:07:45,813 [Crawler-20190126045933-89-7] ERROR Crawling Exception at https://www.advanceddisposal.com/for-business/disposal-recycling-services/special-waste/blackfoot-landfill-winslow,-in.aspx
java.lang.IllegalStateException: Future got interrupted
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.get(AbstractCrawlerService.java:326) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:86) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:40) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.getParentEncoding(FessTransformer.java:258) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.decodeUrlAsName(FessTransformer.java:231) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessTransformer.getFileName(FessTransformer.java:206) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessXpathTransformer.putAdditionalData(FessXpathTransformer.java:424) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.FessXpathTransformer.storeData(FessXpathTransformer.java:181) ~[classes/:?]
    at org.codelibs.fess.crawler.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:120) ~[fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77) ~[fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330) [fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240) ~[classes/:?]
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176) [fess-crawler-2.4.4.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) ~[?:1.8.0_191]
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
    ... 16 more
2019-01-26 05:07:46,306 [WebFsCrawler] INFO  [EXEC TIME] crawling time: 487118ms
2019-01-26 05:07:46,307 [main] INFO  Finished Crawler
2019-01-26 05:07:46,327 [main] INFO  [CRAWL INFO] DataCrawlEndTime=2019-01-26T04:59:39.151+0000,CrawlerEndTime=2019-01-26T05:07:46.307+0000,WebFsCrawlExecTime=487118,CrawlerStatus=true,CrawlerStartTime=2019-01-26T04:59:39.114+0000,WebFsCrawlEndTime=2019-01-26T05:07:46.306+0000,WebFsIndexExecTime=320871,WebFsIndexSize=1482,CrawlerExecTime=487193,DataCrawlStartTime=2019-01-26T04:59:39.136+0000,WebFsCrawlStartTime=2019-01-26T04:59:39.135+0000
2019-01-26 05:08:28,267 [main] INFO  Disconnected to elasticsearch:localhost:9300
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO  I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Socket closed
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO  Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO  I/O exception (java.net.SocketException) caught when processing request to {}->http://www.arborrealtytrust.com:80: Socket closed
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO  Retrying request to {}->http://www.arborrealtytrust.com:80
2019-01-26 05:08:28,268 [Crawler-20190126045933-33-1] INFO  Could not process http://www.arborrealtytrust.com/robots.txt. Connection pool shut down
2019-01-26 05:08:28,268 [Crawler-20190126045933-4-1] INFO  Could not process http://www.americanaddictioncenters.com/robots.txt. Connection pool shut down
2019-01-26 05:08:29,264 [main] INFO  Destroyed LaContainer.
abolotnov commented 5 years ago

I see many exceptions of this sort:

2019-01-26 05:07:31,313 [Crawler-20190126045933-88-12] ERROR Crawling Exception at https://www.autodesk.com/sitemap-autodesk-dotcom.xml
org.codelibs.fess.crawler.exception.EsAccessException: Failed to insert [UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlc3RlZWwvbGVhcm5pbmctY2VudGVy, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advancesteel/learning-center, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlc3RlZWwvd2hpdGVwYXBlcg, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advancesteel/whitepaper, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLWluLXRyaWFsLW1hcmtldGluZy9jb250YWN0LW1l, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-in-trial-marketing/contact-me, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLXBsYW50LWtvLWNvbnRhY3QtbWU, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-plant-ko-contact-me, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZHZhbmNlLXN0ZWVsLXRyaWFs, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/advance-steel-trial, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY2xvdWQtY29ubmVjdGVkLXByb21vdGlvbg, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-cloud-connected-promotion, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGFib3JhdGlvbi1wcm9tb3Rpb24, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collaboration-promotion, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGFib3JhdGlvbi1wcm9tb3Rpb24tdGNz, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collaboration-promotion-tcs, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGVjdGlvbi1jb250YWN0LW1lLWVuLXVz, sessionId=20190126045933-88, method=GET, url=https://www.autodesk.com/campaigns/aec-collection-contact-me-en-us, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487], UrlQueueImpl [id=20190126045933-88.aHR0cHM6Ly93d3cuYXV0b2Rlc2suY29tL2NhbXBhaWducy9hZWMtY29sbGVjdGlvbi1tdWx0aS11c2Vy, sessionId=20190126045933-88, method=GET, 
url=https://www.autodesk.com/campaigns/aec-collection-multi-user, encoding=null, parentUrl=https://www.autodesk.com/sitemap-autodesk-dotcom.xml, depth=3, lastModified=null, createTime=1548479224487]]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:301) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.lambda$insertAll$6(AbstractCrawlerService.java:243) ~[fess-crawler-es-2.4.4.jar:?]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) ~[?:1.8.0_191]
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_191]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.insertAll(AbstractCrawlerService.java:240) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.EsUrlQueueService.offerAll(EsUrlQueueService.java:179) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.CrawlerThread.storeChildUrls(CrawlerThread.java:357) [fess-crawler-2.4.4.jar:?]
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:196) [fess-crawler-2.4.4.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.lang.IllegalStateException: Future got interrupted
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:82) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:288) ~[fess-crawler-es-2.4.4.jar:?]
    ... 8 more
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) ~[?:1.8.0_191]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) ~[?:1.8.0_191]
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:234) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.5.4.jar:6.5.4]
    at org.codelibs.fess.crawler.client.EsClient.get(EsClient.java:207) ~[fess-crawler-es-2.4.4.jar:?]
    at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.doInsertAll(AbstractCrawlerService.java:288) ~[fess-crawler-es-2.4.4.jar:?]
    ... 8 more

Does it have something to do with the problem?

marevol commented 5 years ago

Did you check the elasticsearch log file?

abolotnov commented 5 years ago

Yes, I don't see anything bad in it; it's mostly lots of this:

[2019-01-26T04:01:26,239][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,248][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,256][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.access_token] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,259][INFO ][o.e.c.m.MetaDataMappingService] [t8cOK8R] [.fess_config.access_token/o-csvKs0SMyFvBIr2_wCEA] create_mapping [access_token]
[2019-01-26T04:01:26,400][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,408][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,417][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,439][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,447][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,456][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,465][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,466][INFO ][o.e.c.m.MetaDataCreateIndexService] [t8cOK8R] [.fess_config.bad_word] creating index, cause [api], templates [], shards [1]/[0], mappings []
[2019-01-26T04:01:26,482][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,491][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,499][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,521][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,529][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
[2019-01-26T04:01:26,538][INFO ][o.c.e.f.i.a.JapanesePartOfSpeechFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiPartOfSpeechFilterFactory is found.
[2019-01-26T04:01:26,546][INFO ][o.c.e.f.i.a.JapaneseReadingFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiReadingFormFilterFactory is found.
[2019-01-26T04:01:26,588][INFO ][o.e.c.r.a.AllocationService] [t8cOK8R] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.fess_config.bad_word][0]] ...]).
[2019-01-26T04:01:26,635][INFO ][o.c.e.f.i.a.JapaneseIterationMarkCharFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiIterationMarkCharFilterFactory is found.
[2019-01-26T04:01:26,643][INFO ][o.c.e.f.i.a.JapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,651][INFO ][o.c.e.f.i.a.ReloadableJapaneseTokenizerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.ReloadableKuromojiTokenizerFactory is found.
[2019-01-26T04:01:26,672][INFO ][o.c.e.f.i.a.JapaneseBaseFormFilterFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiBaseFormFilterFactory is found.
[2019-01-26T04:01:26,681][INFO ][o.c.e.f.i.a.JapaneseKatakanaStemmerFactory] [t8cOK8R] [.fess_config.bad_word] org.codelibs.elasticsearch.extension.analysis.KuromojiKatakanaStemmerFactory is found.
marevol commented 5 years ago

It seems to be a logging problem... I'll improve it in a future release.

abolotnov commented 5 years ago

Also, here is another scenario: the default crawler is running but not really capturing any pages; it just reports the system info and "Processing no docs", and that's all.

2019-01-26 06:47:44,194 [Crawler-20190126061518-99-2] INFO  Crawling URL: https://www.ameren.com/business-partners/account-data-management
2019-01-26 06:47:44,414 [Crawler-20190126061518-99-2] INFO  Crawling URL: https://www.ameren.com/business-partners/real-estate
2019-01-26 06:47:44,656 [Crawler-20190126061518-99-2] INFO  Crawling URL: https://www.ameren.com/sustainability/employee-sustainability
2019-01-26 06:47:45,976 [Crawler-20190126061518-99-4] INFO  META(robots=noindex,nofollow): https://www.ameren.com/account/forgot-userid
2019-01-26 06:47:47,615 [IndexUpdater] INFO  Processing 20/31 docs (Doc:{access 7ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:47,700 [IndexUpdater] INFO  Processing 1/11 docs (Doc:{access 3ms, cleanup 30ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:47,717 [IndexUpdater] INFO  Processing 0/10 docs (Doc:{access 4ms, cleanup 10ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,720 [IndexUpdater] INFO  Processing 10/10 docs (Doc:{access 3ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,765 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,920 [IndexUpdater] INFO  Sent 32 docs (Doc:{process 84ms, send 155ms, size 1MB}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:47:52,949 [IndexUpdater] INFO  Deleted completed document data. The execution time is 27ms.
2019-01-26 06:47:56,313 [Crawler-20190126061518-100-1] INFO  Crawling URL: http://www.aberdeench.com/
2019-01-26 06:47:56,316 [Crawler-20190126061518-100-1] INFO  Checking URL: http://www.aberdeench.com/robots.txt
2019-01-26 06:47:57,132 [Crawler-20190126061518-100-1] INFO  Redirect to URL: http://www.aberdeenaef.com/
2019-01-26 06:48:02,921 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 1GB, heap 1GB, max 31GB})
2019-01-26 06:48:07,813 [Crawler-20190126061518-4-1] INFO  I/O exception (java.net.SocketException) caught when processing request to {}->http://www.americanaddictioncenters.com:80: Connection timed out (Read failed)
2019-01-26 06:48:07,813 [Crawler-20190126061518-4-1] INFO  Retrying request to {}->http://www.americanaddictioncenters.com:80
2019-01-26 06:48:12,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 791MB, heap 1GB, max 31GB})
2019-01-26 06:48:22,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 792MB, heap 1GB, max 31GB})
2019-01-26 06:48:22,933 [IndexUpdater] INFO  Deleted completed document data. The execution time is 11ms.
2019-01-26 06:48:25,904 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82318585856,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[1.08, 2.16, 2.67]},"process":{"file_descriptor":{"open":477,"max":65535},"cpu":{"percent":0,"total":314320},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":832015712,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136460680,"committed":144531456}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":1987680},"elasticsearch":null,"timestamp":1548485305904}
2019-01-26 06:48:32,921 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 795MB, heap 1GB, max 31GB})
2019-01-26 06:48:42,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 797MB, heap 1GB, max 31GB})
2019-01-26 06:48:52,921 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 799MB, heap 1GB, max 31GB})
2019-01-26 06:49:02,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 800MB, heap 1GB, max 31GB})
2019-01-26 06:49:12,921 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 802MB, heap 1GB, max 31GB})
2019-01-26 06:49:22,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 804MB, heap 1GB, max 31GB})
2019-01-26 06:49:25,954 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82327228416,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.4, 1.77, 2.5]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":315460},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":844083448,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136528264,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2047731},"elasticsearch":null,"timestamp":1548485365953}
2019-01-26 06:49:32,922 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 806MB, heap 1GB, max 31GB})
2019-01-26 06:49:42,923 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 808MB, heap 1GB, max 31GB})
2019-01-26 06:49:52,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 810MB, heap 1GB, max 31GB})
2019-01-26 06:50:02,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 812MB, heap 1GB, max 31GB})
2019-01-26 06:50:12,923 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 813MB, heap 1GB, max 31GB})
2019-01-26 06:50:22,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 815MB, heap 1GB, max 31GB})
2019-01-26 06:50:25,999 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82332979200,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.14, 1.44, 2.34]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":316070},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":856040960,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136556104,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2107776},"elasticsearch":null,"timestamp":1548485425999}
2019-01-26 06:50:32,923 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 818MB, heap 1GB, max 31GB})
2019-01-26 06:50:42,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 820MB, heap 1GB, max 31GB})
2019-01-26 06:50:52,923 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 821MB, heap 1GB, max 31GB})
2019-01-26 06:51:02,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 823MB, heap 1GB, max 31GB})
2019-01-26 06:51:12,925 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 825MB, heap 1GB, max 31GB})
2019-01-26 06:51:22,925 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 827MB, heap 1GB, max 31GB})
2019-01-26 06:51:26,044 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":82317217792,"total":128920035328},"swap_space":{"free":0,"total":0}},"cpu":{"percent":0},"load_averages":[0.05, 1.18, 2.2]},"process":{"file_descriptor":{"open":464,"max":65535},"cpu":{"percent":0,"total":316590},"virtual_memory":{"total":44742717440}},"jvm":{"memory":{"heap":{"used":867934096,"committed":1948188672,"max":34246361088,"percent":2},"non_heap":{"used":136557432,"committed":144662528}},"pools":{"direct":{"count":99,"used":541081601,"capacity":541081600},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":88,"time":1038},"old":{"count":2,"time":157}},"threads":{"count":104,"peak":112},"classes":{"loaded":13391,"total_loaded":13525,"unloaded":134},"uptime":2167821},"elasticsearch":null,"timestamp":1548485486044}
2019-01-26 06:51:32,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 829MB, heap 1GB, max 31GB})
2019-01-26 06:51:42,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 831MB, heap 1GB, max 31GB})
2019-01-26 06:51:52,925 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 833MB, heap 1GB, max 31GB})
2019-01-26 06:52:02,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 834MB, heap 1GB, max 31GB})
2019-01-26 06:52:12,925 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 17ms}, Mem:{used 836MB, heap 1GB, max 31GB})
2019-01-26 06:52:22,924 [IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 17ms}, Mem:{used 838MB, heap 1GB, max 31GB})
abolotnov commented 5 years ago

The crawler process seems to be there but isn't doing anything:

 sudo jps -v
43781 FessBoot -Xms512m -Xmx512m -Djava.awt.headless=true -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Dgroovy.use.classvalue=true -Dfess.home=/usr/share/fess -Dfess.context.path=/ -Dfess.port=8080 -Dfess.webapp.path=/usr/share/fess/app -Dfess.temp.path=/var/tmp/fess -Dfess.log.name=fess -Dfess.log.path=/var/log/fess -Dfess.log.level=warn -Dlasta.env=web -Dtomcat.config.path=tomcat_config.properties -Dfess.conf.path=/etc/fess -Dfess.var.path=/var/lib/fess -Dfess.es.http_address=http://localhost:9200 -Dfess.es.transport_addresses=localhost:9300 -Dfess.dictionary.path=/var/lib/elasticsearch/config/ -Dfess -Dfess.foreground=yes -Dfess.es.dir=/usr/share/elas
60166 Jps -Dapplication.home=/usr/lib/jvm/java-8-openjdk-amd64 -Xms8m
49863 Crawler -Dfess.es.transport_addresses=localhost:9300 -Dfess.es.cluster_name=elasticsearch -Dlasta.env=crawler -Dfess.conf.path=/etc/fess -Dfess.crawler.process=true -Dfess.log.path=/var/log/fess -Dfess.log.name=fess-crawler -Dfess.log.level=info -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Xmx32G -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=128m -XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -Djcifs.smb.client.responseTimeout=30000 -Djcifs.smb.client.soTimeout=35000 -Djcifs.smb.client.connTimeout=60000 -Djcifs.smb.client.sessionTimeout=60000 -Dgroovy.use.classvalue=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Dsun.java2d.cmm=sun.j
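
If the crawler JVM looks alive but idle like this, a thread dump can show whether its crawler threads are actually blocked. A rough check, assuming the crawler runs as the fess user on the .deb install (the PID is from the jps output above):

sudo -u fess jstack 49863 | grep -A 5 "Crawler-"
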
marevol commented 5 years ago

"Processing no docs" means the indexing queue is empty.

abolotnov commented 5 years ago

Yeah, but I have 5K web crawlers with one site each, and it only collected documents (1 or more) for 125 of them. Something is not working out for me.

marevol commented 5 years ago

If you can provide steps to reproduce it in my environment, I'll check it.

abolotnov commented 5 years ago

Well, I configure FESS per the instructions, then create 5K web crawlers with the following config via the admin API:

{
    "name": "AMZN",
    "urls": "http://www.amazon.com/ \n https://www.amazon.com/",
    "included_urls": "http://www.amazon.com/.* \n https://www.amazon.com/.*",
    "excluded_urls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "num_of_thread": 6,
    "interval_time": 10,
    "sort_order": 1,
    "boost": 1.0,
    "available": "True",
    "permissions": "{role}guest",
    "depth": 5,
    "max_access_count": 32
}

The only fields that differ between them are name, urls, and included_urls.

I disable the thumbnail generator job and start the default crawler. After some time, the default crawler just stops, leaving documents for only about 100 config_ids or so.

I can get you the list of sites or maybe dump the config_web index - let me know how I can help to reproduce it.
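
For reference, creating one of these configs over the admin API boils down to a PUT like the following; the host, port, and token here are placeholders, and the JSON body is the one shown above saved to a file:

curl -X PUT "http://localhost:8080/api/admin/webconfig/setting" \
  -H "Authorization: Bearer <access-token>" \
  -H "Content-type: application/json" \
  -d @webconfig.json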

abolotnov commented 5 years ago

I can generate a large json with all the admin api objects for creating the web crawlers if that helps.

marevol commented 5 years ago

Could you provide the actual steps to do that? e.g. 1) ... 2) ... 3) ...

The commands you executed, or the like, would be better.

abolotnov commented 5 years ago

Sorry, maybe I don't understand, but this is what I do:

  1. Install/configure Elasticsearch and FESS on an AWS instance.
  2. Generate an access token in FESS.
  3. Use a Python script to generate admin API calls that create the web crawlers (around 5,000). I have a Jupyter notebook that does this - I have no problem sharing it, but it needs a database and ORM backend; let me know if you want the list of sites I use, or a JSON with all the admin API objects I PUT to the API to create the web crawlers.
  3a. After the crawlers are created, I go to the FESS admin UI to check that they exist there (see the screenshot below of how it looks in the UI).
  4. Manually go to System > Scheduling and start the Default Crawling job from there.

[screenshot: Web Crawler list in the FESS admin UI]

Does this make sense?

abolotnov commented 5 years ago

I think I have a little more info. I tried again on a clean box. The symptoms are very similar: the Default Crawler says it's running on the Scheduling page, but in fact it stops crawling anything after about the 100th site or so.

mkdir tmp && cd tmp
wget https://github.com/codelibs/fess/releases/download/fess-12.4.3/fess-12.4.3.deb
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.5.4-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.deb
sudo apt-get update
sudo apt-get install openjdk-8-jre
sudo apt install ./elasticsearch-6.5.4.deb
sudo apt-get install ./kibana-6.5.4-amd64.deb
sudo apt install ./fess-12.4.3.deb
sudo nano /etc/elasticsearch/elasticsearch.yml
   ^ added configsync.config_path: /var/lib/elasticsearch/config
sudo nano /etc/elasticsearch/jvm.options
   ^ added memory
sudo nano /etc/kibana/kibana.yml
   ^ added nonloopback to access externally
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-fess:6.5.0
/usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-configsync:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-dataformat:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-langfield:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-minhash:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-learning-to-rank:6.5.0
sudo service elasticsearch start
sudo service kibana start
sudo service fess start
sudo tail -f /var/log/fess/fess-crawler.log

Next, I generate an access token; it looks like this in elasticsearch:

{
  "_index": ".fess_config.access_token",
  "_type": "access_token",
  "_id": "S4iBlWgBnlDwesKS-gLp",
  "_version": 1,
  "_score": 1,
  "_source": {
    "updatedTime": 1548696550112,
    "updatedBy": "admin",
    "createdBy": "admin",
    "permissions": [
      "Radmin-api"
    ],
    "name": "Bot",
    "createdTime": 1548696550112,
    "token": "klbmvQ4UJNXspPE1pOMT1mUqQPNGGk1jgNEaX#mO7RGkVz1EkIGL8sk8iZk1"
  }
}

Now, with the access token, I create a lot of web crawlers; they look like this in elasticsearch:

{
  "_index": ".fess_config.web_config",
  "_type": "web_config",
  "_id": "UYiDlWgBnlDwesKS2gLx",
  "_version": 1,
  "_score": 1,
  "_source": {
    "updatedTime": 1548696673008,
    "includedUrls": "http://www.kroger.com/.* \n https://www.kroger.com/.*",
    "virtualHosts": [],
    "updatedBy": "guest",
    "excludedUrls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
    "available": true,
    "maxAccessCount": 24,
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "numOfThread": 8,
    "urls": "http://www.kroger.com/ \n https://www.kroger.com/",
    "depth": 5,
    "createdBy": "guest",
    "permissions": [
      "Rguest"
    ],
    "sortOrder": 1,
    "name": "KR",
    "createdTime": 1548696673008,
    "boost": 1,
    "intervalTime": 10
  }
}

Then I change the Remove Documents Before parameter to 100 to avoid having FESS delete documents from the index. I also disable the thumbnail generation job and the like, and start the Default Crawler.

[screenshot]

The crawler reports as active, but for a long time now it hasn't crawled a single page. I have 1,000 crawlers in the system, but it has only crawled about 100 of them and stopped crawling further pages.

I don't see anything in the logs that reports a problem. Let me know if you want me to look elsewhere.

abolotnov commented 5 years ago

This is how far it went with 1,000 web crawlers and max_access=24 for each: [screenshot] It looks like after the 100th crawler or so, it will not crawl any further web crawlers. Has anyone tried using this many crawlers?

abolotnov commented 5 years ago

I dumped the 1000 web crawlers in here: https://gist.github.com/abolotnov/5ac96731651300c248807b021c183718

Let me know how else I can help nail this down.

abolotnov commented 5 years ago

full crawler log here: https://gist.github.com/abolotnov/09bc44869f0dc0672992f3491e52db15

It looks like at start it only picks up a portion of the crawlers, not all 1,000.

marevol commented 5 years ago

Change the following value in fess_config.properties:

page.web.config.max.fetch.size=100
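
A rough sketch of that change on the .deb install described above (the config path /etc/fess comes from the jps output earlier in the thread), followed by a restart so the crawler picks up the new value:

sudo nano /etc/fess/fess_config.properties
   ^ set page.web.config.max.fetch.size=10000 (default is 100)
sudo service fess restart
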
abolotnov commented 5 years ago

This could be it; I will try and report back! Thanks a lot.

abolotnov commented 5 years ago

After changing the config and restarting FESS, is there anything else I need to do before I start the Default Crawler?

I changed page.web.config.max.fetch.size to 10000, and after I restarted FESS and started the crawler, I can see it has picked up additional sites:

...

2019-01-28 20:51:05,110 [WebFsCrawler] INFO  Target URL: http://www.gapinc.com/
2019-01-28 20:51:05,110 [WebFsCrawler] INFO  Included URL: http://www.gapinc.com/.*
2019-01-28 20:51:05,110 [WebFsCrawler] INFO  Included URL: https://www.gapinc.com/.*
2019-01-28 20:51:05,110 [WebFsCrawler] INFO  Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
2019-01-28 20:51:05,121 [WebFsCrawler] INFO  Target URL: http://www.grifols.com/
2019-01-28 20:51:05,121 [WebFsCrawler] INFO  Included URL: http://www.grifols.com/.*
2019-01-28 20:51:05,122 [WebFsCrawler] INFO  Included URL: https://www.grifols.com/.*
2019-01-28 20:51:05,122 [WebFsCrawler] INFO  Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
2019-01-28 20:51:05,132 [WebFsCrawler] INFO  Target URL: http://www.garmin.com/
2019-01-28 20:51:05,132 [WebFsCrawler] INFO  Included URL: http://www.garmin.com/.*
2019-01-28 20:51:05,132 [WebFsCrawler] INFO  Included URL: https://www.garmin.com/.*
2019-01-28 20:51:05,132 [WebFsCrawler] INFO  Excluded URL: .*\.gif .*\.jpg .*\.jpeg .*\.jpe .*\.pcx .*\.png .*\.tiff .*\.bmp .*\.ics .*\.msg .*\.css .*\.js
...

But after working for some time, it stopped again:

The Default Crawler shows as Running in System > Scheduling, but there are no new crawling events and no documents added to the index; only this in the logs now:

2019-01-28 21:28:01,310 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":15566577664,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.16, 1.22]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":387190},"virtual_memory":{"total":5500514304}},"jvm":{"memory":{"heap":{"used":310984328,"committed":514654208,"max":518979584,"percent":59},"non_heap":{"used":144869488,"committed":158670848}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":422,"time":3118},"old":{"count":17,"time":882}},"threads":{"count":103,"peak":104},"classes":{"loaded":13002,"total_loaded":14178,"unloaded":1176},"uptime":2226970},"elasticsearch":null,"timestamp":1548710881310}
2019-01-28 21:28:08,734 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms, cleanup 11ms}, Mem:{used 300MB, heap 490MB, max 494MB})
2019-01-28 21:28:18,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 304MB, heap 490MB, max 494MB})
2019-01-28 21:28:28,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 308MB, heap 490MB, max 494MB})
2019-01-28 21:28:38,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 312MB, heap 490MB, max 494MB})
2019-01-28 21:28:48,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 316MB, heap 490MB, max 494MB})
2019-01-28 21:28:58,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 320MB, heap 490MB, max 494MB})
2019-01-28 21:29:01,337 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":15565189120,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.13, 1.14]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":388230},"virtual_memory":{"total":5500514304}},"jvm":{"memory":{"heap":{"used":337573088,"committed":514654208,"max":518979584,"percent":65},"non_heap":{"used":144861360,"committed":158670848}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":422,"time":3118},"old":{"count":17,"time":882}},"threads":{"count":103,"peak":104},"classes":{"loaded":13002,"total_loaded":14178,"unloaded":1176},"uptime":2286997},"elasticsearch":null,"timestamp":1548710941337}
2019-01-28 21:29:08,733 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 11ms}, Mem:{used 325MB, heap 490MB, max 494MB})

I appreciate your help so much; please help me get this fixed. Maybe there is something else small that needs to be fixed? Do you want me to remove all indexes, restart elastic, and recreate the crawlers?

marevol commented 5 years ago

Disable the Check Last Modified flag.

abolotnov commented 5 years ago

OK, I disabled this in the General config and restarted the Default Crawler job. Will report back.

abolotnov commented 5 years ago

It worked for a little bit and stopped crawling again:

2019-01-28 22:10:28,840 [CoreLib-TimeoutManager] INFO  [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":14454079488,"total":33359687680},"swap_space":{"free":0,"total":0}},"cpu":{"percent":1},"load_averages":[0.0, 0.21, 1.25]},"process":{"file_descriptor":{"open":373,"max":65535},"cpu":{"percent":0,"total":367530},"virtual_memory":{"total":5502623744}},"jvm":{"memory":{"heap":{"used":308023528,"committed":518979584,"max":518979584,"percent":59},"non_heap":{"used":145401592,"committed":159129600}},"pools":{"direct":{"count":40,"used":85999617,"capacity":85999616},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":417,"time":3188},"old":{"count":19,"time":1227}},"threads":{"count":103,"peak":104},"classes":{"loaded":13015,"total_loaded":14199,"unloaded":1184},"uptime":1986435},"elasticsearch":null,"timestamp":1548713428840}
2019-01-28 22:10:31,115 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 296MB, heap 494MB, max 494MB})
2019-01-28 22:10:41,116 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms, cleanup 9ms}, Mem:{used 299MB, heap 494MB, max 494MB})
2019-01-28 22:10:51,115 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 303MB, heap 494MB, max 494MB})
2019-01-28 22:11:01,115 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms, cleanup 9ms}, Mem:{used 307MB, heap 494MB, max 494MB})
2019-01-28 22:11:11,116 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms, cleanup 9ms}, Mem:{used 312MB, heap 494MB, max 494MB})

There are over 4K objects in the .crawler.queue index, and the number hasn't changed for some time. 30 minutes ago it was over 7K, and now it is frozen at 4K.
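
If it helps, the queue size can be watched directly in elasticsearch; a rough check, assuming elasticsearch on its default port (the exact index name may differ between installs, hence the wildcard listing first):

curl -s "http://localhost:9200/_cat/indices/.crawler*?v"
curl -s "http://localhost:9200/.crawler.queue/_count?pretty"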

marevol commented 5 years ago

I need more details... Fess has many settings in fess_config.properties and the like to adapt it to different situations. You can check the source code to change them, or contact Commercial Support if you need more support.

abolotnov commented 5 years ago

What kind of details do you want me to give you? At this point, I have tried three times with a clean install and enabled debug logging, so there is plenty of logging.

I think this is a defect, not a configuration issue: at this point, I loaded only 300 sites; they all got picked up by the crawler, but eventually, after some time, the crawling activity stopped. I have 87 items in the .crawler.queue that are just sitting there. The default crawler shows as running on the scheduler page, but there is no crawling or document-indexing activity.

Re: commercial support - I'd probably get it at some point to help with a few things that I am missing in FESS. But while I'm evaluating whether I want to use FESS in the first place, not being able to solve a very fundamental issue (crawlers not crawling all of the sites registered with them) and having to go to commercial support for that is probably a bit of a stretch.

I can help further to find and fix the issue if you are interested in making sure it works properly. Am I the only one trying to use FESS for a large number of sites? It may be helpful for the project to use my use case to fix an issue or two.

abolotnov commented 5 years ago

OK, so as I mentioned earlier, I did another clean install on AWS (4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux) on an instance with 8 vCPUs and 16 GB RAM.

Installed as per instructions:

mkdir tmp && cd tmp
wget https://github.com/codelibs/fess/releases/download/fess-12.4.3/fess-12.4.3.deb
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.5.4-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.deb
sudo apt-get update
sudo apt-get install openjdk-8-jre
sudo apt install ./elasticsearch-6.5.4.deb 
sudo apt-get install ./kibana-6.5.4-amd64.deb 
sudo apt install ./fess-12.4.3.deb 
sudo nano /etc/elasticsearch/elasticsearch.yml
   ^ added configsync.config_path: /var/lib/elasticsearch/config
sudo nano /etc/elasticsearch/jvm.options
   ^ added memory
 sudo nano /etc/kibana/kibana.yml 
   ^ added nonloopback to access externally
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-fess:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-extension:6.5.1
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-configsync:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-dataformat:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-langfield:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-minhash:6.5.0
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install org.codelibs:elasticsearch-learning-to-rank:6.5.0
sudo service elasticsearch start
sudo service kibana start
sudo service fess start

Then I change the page.web.config.max.fetch.size setting to 10000 and increase a couple of other values.

After that, I create web crawlers via the REST API, like this:

import json

import requests

# JSON payload template for the Fess admin API; %name%, %url% and %included%
# are placeholders that get substituted per site below.
data_template = """{
    "name": "%name%",
    "urls": %url%,
    "included_urls": %included%,
    "excluded_urls": ".*\\\\.gif .*\\\\.jpg .*\\\\.jpeg .*\\\\.jpe .*\\\\.pcx .*\\\\.png .*\\\\.tiff .*\\\\.bmp .*\\\\.ics .*\\\\.msg .*\\\\.css .*\\\\.js",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "num_of_thread": 8,
    "interval_time": 10,
    "sort_order": 1,
    "boost": 1.0,
    "available": "True",
    "permissions": "{role}guest",
    "depth": 5,
    "max_access_count": 12
    }"""

bearer_token = 'cXq6gk8tiq8xYJP7T7pBgq0K7tsv761Sbt544ElQomYrHkm7mzfCnjPn7GQv'
r_header = {'Authorization': 'Bearer {}'.format(bearer_token), 'Content-type': 'application/json'}
r_url = 'http://35.163.6.168:8080/api/admin/webconfig/setting'

# models.Organization, url_name_enhancer and url_included_enhancer come from my
# notebook's ORM backend; they supply the site list and the quoted urls/included_urls values.
orgs_to_index = models.Organization.objects.filter(website__startswith='http://')[0:300]
for o in orgs_to_index:
    payload = data_template.replace('%name%', o.symbol).replace('%url%', url_name_enhancer(o.website)).replace('%included%', url_included_enhancer(o.website))
    outcome = requests.put(url=r_url, data=payload, headers=r_header)
    try:
        crawler_id = json.loads(outcome.text)['response']['id']
        o.elastic_config_id = crawler_id
        o.save()
    except Exception as e:
        print(e)

At this point, the web configs show up in the FESS UI, too. Here's one from the elasticsearch index:

{
  "_index": ".fess_config.web_config",
  "_type": "web_config",
  "_id": "GIkHl2gBC0KNaBG5Jso_",
  "_version": 1,
  "_score": 1,
  "_source": {
    "updatedTime": 1548722054718,
    "includedUrls": "http://www.berkshirehathaway.com/.* \n https://www.berkshirehathaway.com/.*",
    "virtualHosts": [],
    "updatedBy": "guest",
    "excludedUrls": ".*\\.gif .*\\.jpg .*\\.jpeg .*\\.jpe .*\\.pcx .*\\.png .*\\.tiff .*\\.bmp .*\\.ics .*\\.msg .*\\.css .*\\.js",
    "available": true,
    "maxAccessCount": 12,
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "numOfThread": 8,
    "urls": "http://www.berkshirehathaway.com/ \n https://www.berkshirehathaway.com/",
    "depth": 5,
    "createdBy": "guest",
    "permissions": [
      "Rguest"
    ],
    "sortOrder": 1,
    "name": "BRK.B",
    "createdTime": 1548722054718,
    "boost": 1,
    "intervalTime": 10
  }
}

So at this point, those are the only couple of things I do.

As with the few other times I did this, the symptoms are very similar:

FESS stops crawling at some point. The farthest I got was 162 sites; the crawling queue at this point has 72 sites. The Scheduler page shows the default crawler as running.

It basically just stops crawling without saying anything, and it looks like the crawling job is hanging there endlessly. I will leave it like this overnight, but I am pretty positive it's not going to finish.

I copied the logs and essential configs here: https://www.dropbox.com/sh/ci7yfu5pbv7xa8v/AACmREoBf9Cc9GLgA7dyGUDYa?dl=0

Let me know if there is anything else in terms of "more info" that I could supply. I am about to give up.
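
One more sanity check that might be worth doing here is comparing the number of stored web configs against page.web.config.max.fetch.size; a rough sketch, assuming elasticsearch on its default port and the index name shown in the documents above:

curl -s "http://localhost:9200/.fess_config.web_config/_count?pretty"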

marevol commented 5 years ago

"more info" is steps to reproduce it.

2019-01-29 01:19:25,945 [Crawler-20190129003919-213-1] INFO Checking URL: http://www.oreillyauto.com/robots.txt

The above access seems to be blocked. To apply debug-level logging on the crawler, change logLevel to debug for the Default Crawler in the Scheduler.

abolotnov commented 5 years ago

I gave you the exact steps, including the commands and the list of crawlers that I use to reproduce this. I am happy to provide more data; let me know exactly what you need.

Overnight I see there are about 12,000 objects in the .crawler_queue (there were 72 six hours ago), so the list has grown. The default crawler is shown as Active on the Scheduling page.

[screenshot]

I have changed the logging for the default crawler to debug; do you want me to start it?

[screenshot]

But I think the question is: why doesn't the default crawler run when there are remaining items in the crawling queue?

marevol commented 5 years ago

Why does not the default crawler run when there are remaining items in the crawling queue?

You set maxAccessCount to 12 to stop crawling.

As for Fess users I know of, https://www.chizen.jp/ crawls a massive number of sites. It's like Google.

abolotnov commented 5 years ago

The maxAccessCount in the web crawler config controls how many documents will be indexed for a particular site. It does not mean that it will only crawl 12 pages across all the crawlers. At least, this is how I understand it. Let me know if my understanding is wrong.

Now, when I crawl a single site or 10 sites, regardless of the maxAccessCount setting, it works just fine. I tried one site unlimited and it did well. What makes my case different from many other use cases is that I crawl many individual sites. I need to do 8K but can't get it to finish 300.

marevol commented 5 years ago

The crawling queue is just a queue of URLs to crawl. The .crawler* indices are meta indices used for crawling documents, not for document search.

So, you should check the blocked access in fess-crawler.log with debug logging.

abolotnov commented 5 years ago

I understand this. My expectation was that if I created 300 web crawlers with maxAccessCount of 12 each, I would end up with 300*12 documents in the index. And that is roughly how many it has done for the 162 sites it got to, out of 300.

image

My understanding was that the default crawler would not stop crawling until the crawl_queue is empty. As for failed URLs: once a URL fails, it removes that URL from the queue and proceeds with the other URLs, right?

Ok, I enabled debug mode and will start the crawler now.

abolotnov commented 5 years ago

OK, I started the default crawler; there are 14,000 URLs in the queue. Is my assumption right that the crawler should run until the queue is empty?

abolotnov commented 5 years ago

There are 150K items in the queue now.

abolotnov commented 5 years ago

OK, it worked for some time and then stopped collecting documents into the index again.

image

I see a lot of stuff in the logs like this:

2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,426 [Crawler-20190129142043-32-7] DEBUG The url is null. (16074)
2019-01-29 16:41:14,426 [Crawler-20190129142043-169-8] DEBUG The url is null. (12341)
2019-01-29 16:41:14,426 [Crawler-20190129142043-213-5] DEBUG The url is null. (10563)
2019-01-29 16:41:14,426 [Crawler-20190129142043-168-7] DEBUG The url is null. (12360)
2019-01-29 16:41:14,426 [Crawler-20190129142043-43-8] DEBUG The url is null. (15927)
2019-01-29 16:41:14,431 [Crawler-20190129142043-32-1] DEBUG The url is null. (16073)
2019-01-29 16:41:14,431 [Crawler-20190129142043-43-3] DEBUG The url is null. (15928)
2019-01-29 16:41:14,501 [Crawler-20190129142043-168-4] DEBUG The url is null. (12360)

.crawl_queue has over 12K items in it.

marevol commented 5 years ago

First, the Crawler seems to be blocked, not stopped. So, you need to check the last accesses in fess-crawler.log.

My understanding was that the default crawler would not stop crawling until the crawl_queue is empty.

Documents in the meta indices are removed when the Crawler finishes all crawling; therefore, they are not empty while crawling is running or blocked. The number of documents in the meta indices does not help solve your problem.

abolotnov commented 5 years ago

OK, thank you for still helping me through this struggle. I have collected 2.1G worth of debug logs, and it is not possible to review the whole thing. Can you recommend the keywords/phrases I should be looking for?

marevol commented 5 years ago

So, you need to check the last accesses in fess-crawler.log.

abolotnov commented 5 years ago

Like literally "last access"? cat fess-crawler.log | grep -i "last access" yields nothing.

Grepping for access gives over 40K lines.

There are a few like: 2019-01-29 15:09:51,796 [CoreLib-TimeoutManager] DEBUG Failed to access Elasticsearch stats.

But most are just messages like 2019-01-29 14:22:31,704 [Crawler-20190129142043-17-1] DEBUG Accessing https://www.antheminc.com/

or 2019-01-29 14:43:26,019 [Crawler-20190129142043-150-5] DEBUG Processing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-150, ruleId=webHtmlRule ...

or 2019-01-29 14:44:39,229 [Crawler-20190129142043-153-6] DEBUG Storing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-153, ruleId=webHtmlRule ...
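
Since grepping a 2.1G log by hand is rough, a small script could pull the last line logged by each Crawler-* thread, which is essentially what "check last accesses" amounts to. This is a minimal sketch, assuming the log layout shown in the excerpts above (timestamp, [thread], level, message) and a local fess-crawler.log:

import re

# Report the last log line seen per Crawler-* thread, to spot threads that
# stopped making progress. Thread names look like Crawler-20190129142043-17-1.
THREAD_RE = re.compile(r"^\S+ \S+ \[(Crawler-[\w-]+)\]")

last_line = {}
with open("fess-crawler.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = THREAD_RE.match(line)
        if m:
            last_line[m.group(1)] = line.rstrip()

# Lines start with the timestamp, so a plain lexicographic sort puts the
# threads that went quiet earliest at the top.
for line in sorted(last_line.values()):
    print(line)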

marevol commented 5 years ago

No... As mentioned, the problem is that the access was blocked.

2019-01-29 01:19:25,945 [Crawler-20190129003919-213-1] INFO Checking URL: http://www.oreillyauto.com/robots.txt

I think it's better to put your log files somewhere...

abolotnov commented 5 years ago

Sure, here's the link to the log gz: https://www.dropbox.com/s/bxj61eyr05rswy6/crawler.log.gz?dl=0

let me know if you want me to grep something out of it, it's big :(

marevol commented 5 years ago

2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> GET /ro-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Host: www.ibm.com
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Connection: Keep-Alive
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Accept-Encoding: gzip,deflate
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "GET /ro-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "[\r][\n]"

The above log means there was no response from www.ibm.com. So, I think it's a problem with your network or the like, not Fess. If you use a t?.* instance in AWS, you need to change to another instance type.
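
One way to separate a network problem from a Fess problem could be to fetch the same URL from the crawl host with a hard timeout, outside Fess. A minimal sketch; the URL is reconstructed from the request headers above and the user agent is the one the crawler sent:

import requests

url = "https://www.ibm.com/ro-en/marketplace/sitemap.xml"  # reconstructed from the log above
headers = {
    # Same user agent the crawler used, in case the site treats clients differently.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
}

try:
    # (connect timeout, read timeout) in seconds; a hang or timeout here points at the network.
    resp = requests.get(url, headers=headers, timeout=(10, 30))
    print(resp.status_code, resp.headers.get("Content-Type"), len(resp.content))
except requests.exceptions.RequestException as e:
    print("request failed:", e)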

abolotnov commented 5 years ago

I use c* and m* instances and they are on 10G networks.

I wonder what exactly indicates that the request was blocked though?

The same worker/thread also had this in the logs for the same host:

2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> GET /uk-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Host: www.ibm.com
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Connection: Keep-Alive
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Accept-Encoding: gzip,deflate
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "GET /uk-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "HTTP/1.1 200 OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Backside-Transport: OK OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Encoding: gzip[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Type: text/xml; charset=utf-8[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "ETag: W/"721e6-2nmQElG0g2j8bIaNrL6mOty9wBU"[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Global-Transaction-ID: 3344377617[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Length: 27965[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Date: Tue, 29 Jan 2019 14:53:27 GMT[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Connection: keep-alive[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Vary: Accept-Encoding[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Robots-Tag: noindex[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Content-Type-Options: nosniff[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-XSS-Protection: 1; mode=block[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Security-Policy: upgrade-insecure-requests[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Strict-Transport-Security: max-age=31536000[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "[\r][\n]"
2019-01-29 14:53:2

abolotnov commented 5 years ago

Does it make sense to play with the Excluded Failure Type setting?

abolotnov commented 5 years ago

I guess what I could try instead is building a single EC2 image with a clean install of FESS pointed at a remote Elasticsearch instance. I would get something big, maybe 96 CPUs, for a large index thread pool and all, then spin up 100 instances of FESS and have each work on 100 sites at a time; 100 sites is something FESS seems to process fairly well. If I go with a1.large, it's a fair deal I guess. I only collect about 1000 pages per site, so it should not take too long. What do you think?
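
If you go that route, something like the following could split the full site list into per-instance batches. A minimal sketch, assuming a sites.txt with one URL per line and the batch size of 100 from the plan above:

from pathlib import Path

CHUNK_SIZE = 100  # sites per FESS instance, per the plan above

# One URL per line; blank lines are skipped.
sites = [s.strip() for s in Path("sites.txt").read_text().splitlines() if s.strip()]

# Write one file per planned FESS instance: sites_batch_000.txt, sites_batch_001.txt, ...
for i in range(0, len(sites), CHUNK_SIZE):
    batch = sites[i:i + CHUNK_SIZE]
    out = Path(f"sites_batch_{i // CHUNK_SIZE:03d}.txt")
    out.write_text("\n".join(batch) + "\n")
    print(out, len(batch))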