Closed: timcreatewell closed this issue 10 years ago
Could you check the logs/elasticsearch.log file? I'd like to know whether any exceptions (or stack traces) from river-web exist.
Below are the contents of my log file after I create a new index (using the cURL request above). Only the "robot" index exists:
[2013-12-19 06:58:18,892][INFO ][cluster.metadata ] [Bloodlust] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2013-12-19 06:58:19,053][INFO ][cluster.metadata ] [Bloodlust] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 06:58:19,074][INFO ][river.routing ] [Bloodlust] no river _meta document found, retrying in 1000 ms
[2013-12-19 06:58:20,083][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: my_web
[2013-12-19 06:58:20,083][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2013-12-19 06:58:20,094][INFO ][cluster.metadata ] [Bloodlust] [_river] update_mapping [my_web] (dynamic)
Thanks!
Thank you for checking it!
I checked your curl command for creating the river configuration and found a missing property. Please add the schedule property:
"schedule" : {
"cron" : "0 0 6 * * ?"
}
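For reference, river-web schedules use Quartz cron syntax, whose expressions have six fields beginning with seconds (an optional seventh field is the year). The expression above fires daily at 06:00:00; annotated field by field:

```
0 0 6 * * ?
| | | | | +-- day-of-week  (? = no specific value)
| | | | +---- month        (* = every month)
| | | +------ day-of-month (* = every day)
| | +-------- hour         (6)
| +---------- minute       (0)
+------------ second       (0)
```

Note that either day-of-month or day-of-week must be `?` in Quartz, which is why the last field is not `*`.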
Hi there,
I've just tried the following:
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
  "type" : "web",
  "crawl" : {
    "index" : "compassion_test",
    "url" : ["http://uat.compassiondev.net.au/"],
    "includeFilter" : ["http://uat.compassiondev.net.au/.*"],
    "maxDepth" : 3,
    "maxAccessCount" : 100,
    "numOfThread" : 5,
    "interval" : 1000,
    "overwrite" : true,
    "target" : [
      {
        "pattern" : {
          "url" : "http://uat.compassiondev.net.au/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "div#page_content",
            "trimSpaces" : true
          }
        }
      }
    ]
  },
  "schedule" : {
    "cron" : "*/2 * * * * ?"
  }
}'
The log file then looks like this:
[2013-12-19 07:03:33,852][INFO ][cluster.metadata ] [Bloodlust] [_river] deleting index
[2013-12-19 07:03:33,859][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Unscheduling CrawlJob...
[2013-12-19 07:03:38,017][INFO ][cluster.metadata ] [Bloodlust] [robot] deleting index
[2013-12-19 07:03:52,203][INFO ][node ] [Bloodlust] stopping ...
[2013-12-19 07:03:52,223][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Bloodlust] Stopping Scheduler...
[2013-12-19 07:03:52,224][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Bloodlust] Stopping S2Container...
[2013-12-19 07:03:52,224][INFO ][node ] [Bloodlust] stopped
[2013-12-19 07:03:52,224][INFO ][node ] [Bloodlust] closing ...
[2013-12-19 07:03:52,232][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Bloodlust] Closing Scheduler...
[2013-12-19 07:03:52,232][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED shutting down.
[2013-12-19 07:03:52,232][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED paused.
[2013-12-19 07:03:52,250][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED shutdown complete.
[2013-12-19 07:03:52,250][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Bloodlust] Closing S2Container...
[2013-12-19 07:03:52,262][INFO ][node ] [Bloodlust] closed
[2013-12-19 07:03:52,821][INFO ][node ] [Wildboys] version[0.90.7], pid[5355], build[36897d0/2013-11-13T12:06:54Z]
[2013-12-19 07:03:52,822][INFO ][node ] [Wildboys] initializing ...
[2013-12-19 07:03:52,921][INFO ][plugins ] [Wildboys] loaded [QuartzPlugin, WebPlugin], sites [head]
[2013-12-19 07:03:54,017][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Wildboys] Creating Scheduler...
[2013-12-19 07:03:54,047][INFO ][org.quartz.impl.StdSchedulerFactory] Using default implementation for ThreadExecutor
[2013-12-19 07:03:54,050][INFO ][org.quartz.simpl.SimpleThreadPool] Job execution threads will use class loader of thread: main
[2013-12-19 07:03:54,061][INFO ][org.quartz.core.SchedulerSignalerImpl] Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
[2013-12-19 07:03:54,062][INFO ][org.quartz.core.QuartzScheduler] Quartz Scheduler v.2.2.0 created.
[2013-12-19 07:03:54,062][INFO ][org.quartz.simpl.RAMJobStore] RAMJobStore initialized.
[2013-12-19 07:03:54,063][INFO ][org.quartz.core.QuartzScheduler] Scheduler meta-data: Quartz Scheduler (v2.2.0) 'DefaultQuartzScheduler' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 10 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
[2013-12-19 07:03:54,063][INFO ][org.quartz.impl.StdSchedulerFactory] Quartz scheduler 'DefaultQuartzScheduler' initialized from default resource file in Quartz package: 'quartz.properties'
[2013-12-19 07:03:54,063][INFO ][org.quartz.impl.StdSchedulerFactory] Quartz scheduler version: 2.2.0
[2013-12-19 07:03:55,083][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Wildboys] Creating S2Container...
[2013-12-19 07:03:55,136][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-framework is 2.4.46.
[2013-12-19 07:03:55,137][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-extension is 2.4.46.
[2013-12-19 07:03:55,137][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-tiger is 2.4.46.
[2013-12-19 07:03:56,022][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(client) of org.codelibs.elasticsearch.web.config.RiverConfig not found
[2013-12-19 07:03:56,176][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Running on [ENV]product, [DEPLOY MODE]Cool Deploy
[2013-12-19 07:03:56,252][INFO ][node ] [Wildboys] initialized
[2013-12-19 07:03:56,253][INFO ][node ] [Wildboys] starting ...
[2013-12-19 07:03:56,253][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Wildboys] Starting Scheduler...
[2013-12-19 07:03:56,253][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED started.
[2013-12-19 07:03:56,253][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Wildboys] Starting S2Container...
[2013-12-19 07:03:56,307][INFO ][transport ] [Wildboys] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.26:9300]}
[2013-12-19 07:03:59,331][INFO ][cluster.service ] [Wildboys] new_master [Wildboys][pj_04k2DTTql4pdmMEnNKA][inet[/192.168.1.26:9300]], reason: zen-disco-join (elected_as_master)
[2013-12-19 07:03:59,356][INFO ][discovery ] [Wildboys] elasticsearchtim/pj_04k2DTTql4pdmMEnNKA
[2013-12-19 07:03:59,376][INFO ][http ] [Wildboys] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.26:9200]}
[2013-12-19 07:03:59,376][INFO ][node ] [Wildboys] started
[2013-12-19 07:03:59,380][INFO ][gateway ] [Wildboys] recovered [0] indices into cluster_state
[2013-12-19 07:04:18,238][INFO ][cluster.metadata ] [Wildboys] [robot] creating index, cause [api], shards [5]/[1], mappings []
[2013-12-19 07:04:37,355][INFO ][cluster.metadata ] [Wildboys] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2013-12-19 07:04:37,606][INFO ][cluster.metadata ] [Wildboys] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 07:04:37,628][INFO ][river.routing ] [Wildboys] no river _meta document found, retrying in 1000 ms
[2013-12-19 07:04:38,640][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: my_web
[2013-12-19 07:04:38,640][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2013-12-19 07:04:38,679][INFO ][cluster.metadata ] [Wildboys] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 07:04:38,698][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2013-12-19 07:04:38,704][INFO ][cluster.metadata ] [Wildboys] [robot] update_mapping [queue] (dynamic)
[2013-12-19 07:04:38,727][INFO ][cluster.metadata ] [Wildboys] [robot] update_mapping [filter] (dynamic)
[2013-12-19 07:04:38,846][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://uat.compassiondev.net.au/
[2013-12-19 07:04:38,870][INFO ][org.seasar.robot.client.http.HcHttpClient] Checking URL: http://uat.compassiondev.net.au/robots.txt
[2013-12-19 07:04:40,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
(the line above repeats every 2 seconds; repeats trimmed)
[2013-12-19 07:05:01,207][ERROR][org.seasar.robot.helper.impl.LogHelperImpl] Crawling Exception at http://uat.compassiondev.net.au/
org.seasar.robot.RobotSystemException: Could not store data.
at org.seasar.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:157)
at org.seasar.robot.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:73)
at org.seasar.robot.S2RobotThread.processResponse(S2RobotThread.java:384)
at org.seasar.robot.S2RobotThread.run(S2RobotThread.java:183)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.indices.IndexMissingException: [compassion_test] missing
at org.elasticsearch.cluster.metadata.MetaData.convertFromWildcards(MetaData.java:614)
at org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:513)
at org.elasticsearch.action.support.replication.TransportIndicesReplicationOperationAction.doExecute(TransportIndicesReplicationOperationAction.java:78)
at org.elasticsearch.action.support.replication.TransportIndicesReplicationOperationAction.doExecute(TransportIndicesReplicationOperationAction.java:44)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
at org.elasticsearch.client.support.AbstractClient.deleteByQuery(AbstractClient.java:164)
at org.elasticsearch.action.deletebyquery.DeleteByQueryRequestBuilder.doExecute(DeleteByQueryRequestBuilder.java:158)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeIndex(ScrapingTransformer.java:313)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeData(ScrapingTransformer.java:166)
at org.seasar.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:143)
... 4 more
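The root cause in the trace is IndexMissingException: [compassion_test] missing, raised from a deleteByQuery call. Since the river config sets "overwrite" : true, it issues a delete-by-query against the target index on each crawl, and that fails when the index does not exist yet. A possible fix (my assumption, not confirmed in this thread) is to create the compassion_test index before registering the river; a minimal sketch, using default-style settings:

```shell
# Hypothetical fix sketch: create the target index before the river runs,
# so the overwrite delete-by-query has an existing index to act on.
# The shard/replica counts here are illustrative assumptions.
BODY='{"settings":{"number_of_shards":5,"number_of_replicas":1}}'

# Sanity-check the JSON body locally before sending it to the cluster.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "json ok"

# Requires a running Elasticsearch node on localhost:9200:
# curl -XPUT 'localhost:9200/compassion_test' -d "$BODY"
```

After the index exists, re-register the river; the delete-by-query should then have a concrete index to target instead of throwing.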
(From here the log repeats the same pattern: "web.my_webJob is running." every 2 seconds, and roughly every 42 seconds another crawl of http://uat.compassiondev.net.au/ fails with the identical RobotSystemException / IndexMissingException: [compassion_test] missing stack trace shown above; repeats trimmed.)
at org.elasticsearch.action.deletebyquery.DeleteByQueryRequestBuilder.doExecute(DeleteByQueryRequestBuilder.java:158)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeIndex(ScrapingTransformer.java:313)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeData(ScrapingTransformer.java:166)
at org.seasar.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:143)
... 4 more
[2013-12-19 07:08:32,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:08:34,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:08:36,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:08:38,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:08:40,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:08:42,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
Thanks for all your help!
Hmm... just in case, could you create the compassion_test index with
curl -XPUT "localhost:9200/compassion_test"
before
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d ...
Great news, that fixed it! Thanks so much for the help :)
Out of interest, is the "my_web" in curl -XPUT 'localhost:9200/_river/my_web/_meta'
interchangeable with something else? What is the purpose of this syntax?
Thanks again!
Yes, you can replace "my_web" with anything you want. The command creates a river configuration, and "my_web" is the name of that river configuration ("my_web" is just a sample name).
Thanks - that makes sense.
Last question (I promise!) - I seem to be getting duplicates in my search results when I query my index (it appears to store multiple copies of the data for individual URLs, likely gathered on each crawl). I have added "overwrite" : true
to my river creation command, but I still seem to be getting duplicate search results.
Do you know a quick way this can be fixed?
Thanks again.
I looked into it. The "overwrite" option needs a mapping for "url". Could you try the following:
# Remove River
curl -XDELETE 'localhost:9200/_river/my_web'
# Delete Index
curl -XDELETE "localhost:9200/compassion_test"
# Create Index
curl -XPUT "localhost:9200/compassion_test"
# Create a mapping for Index (overwrite option needs a mapping for "url")
curl -XPUT "localhost:9200/compassion_test/my_web/_mapping" -d '
{
"my_web" : {
"dynamic_templates" : [
{
"url" : {
"match" : "url",
"mapping" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
},
{
"method" : {
"match" : "method",
"mapping" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
},
{
"charSet" : {
"match" : "charSet",
"mapping" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
},
{
"mimeType" : {
"match" : "mimeType",
"mapping" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
}
]
}
}
'
# Start River
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
"type" : "web",
"crawl" : {
"index" : "compassion_test",
"url" : ["http://uat.compassiondev.net.au/"],
"includeFilter" : ["http://uat.compassiondev.net.au/.*"],
"maxDepth" : 3,
"maxAccessCount" : 100,
"numOfThread" : 5,
"interval" : 1000,
"overwrite" : true,
"target" : [
{
"pattern" : {
"url" : "http://uat.compassiondev.net.au/.*",
"mimeType" : "text/html"
},
"properties" : {
"title" : {
"text" : "title"
},
"body" : {
"text" : "div#page_content",
"trimSpaces" : true
}
}
}
]
},
"schedule" : {
"cron" : "*/2 * * * * ?"
}
}'
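After registering the river, it can help to confirm the _meta document actually exists before waiting for the next scheduled crawl. This is a sketch: against a live node you would run the GET shown in the comment; here a canned response of the same shape (the field values are made up) is checked for the flag indicating the document was found.

```shell
# Against a live node you would run:
#   curl -XGET "localhost:9200/_river/my_web/_meta?pretty"
# A successful response looks roughly like the canned sample below
# (field values here are made up for illustration).
sample='{"_index":"_river","_type":"my_web","_id":"_meta","exists":true}'
# Check for the flag that indicates the _meta document was found.
echo "$sample" | grep -q '"exists":true' && echo "river _meta document exists"
```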
I'll improve README.md later.
Thanks, I'll give that a try and will let you know how I get on.
Great news, it seems to be working! Thanks for all the help.
How can I check whether the crawled data was inserted into the Elasticsearch index?
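One way to check is to query the index's document count and watch it grow between crawls. This is a sketch using the index name from this thread; against a live node you would run the curl shown in the comment, and here a canned response of the same shape (the count of 42 is made up) is parsed to show which field to look at.

```shell
# Against a live node you would run:
#   curl -XGET "localhost:9200/compassion_test/_count?pretty"
# and inspect the "count" field. The sample below mimics that response
# shape; the value 42 is made up for illustration.
sample='{"count":42,"_shards":{"total":5,"successful":5,"failed":0}}'
# Extract the document count from the JSON response.
count=$(echo "$sample" | python3 -c 'import sys,json; print(json.load(sys.stdin)["count"])')
echo "indexed documents: $count"
```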
Hi there,
I've been using this plugin now for a few weeks with no issues (I'm running version 1.0.1) until I decided a few days ago to remove all my indexes and create new ones again from scratch.
Unfortunately, I now can't seem to create my crawler indexes. I run the appropriate cURL command to create the index and receive the
{"ok":true...}
JSON response, but when I try to query the index I get an IndexMissingException.
The process I'm following is:
a. Install the robot index (as per the instructions):
b. I then attempt to create an index using:
I receive the following json response:
But the index doesn't seem to exist (I receive the exception mentioned above)...
Is there something that I've missed? Any help would be greatly appreciated. Thanks!