codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

Cannot create index? #5

Closed (timcreatewell closed this issue 10 years ago)

timcreatewell commented 10 years ago

Hi there,

I've been using this plugin for a few weeks now with no issues (I'm running version 1.0.1), but a few days ago I decided to remove all my indexes and recreate them from scratch.

Unfortunately, I now can't seem to create my crawler indexes. I run the appropriate cURL command to create the index and receive the {"ok":true...} JSON response, but when I try to query the index I receive an IndexMissingException.

My process is as follows:

a. Install the robot index (as per the instructions):

curl -XPUT '192.168.1.26:9200/robot/'

b. I then attempt to create an index using:

curl -XPUT '192.168.1.26:9200/_river/my_web/_meta' -d "{
    \"type\" : \"web\",
    \"crawl\" : {
        \"index\" : \"compassion_test\",
        \"url\" : [\"http://uat.compassiondev.net.au/\"],
        \"includeFilter\" : [\"http://uat.compassiondev.net.au/.*\"],
        \"maxDepth\" : 3,
        \"maxAccessCount\" : 100,
        \"numOfThread\" : 5,
        \"interval\" : 1000,
        \"overwrite\" : true,
        \"target\" : [
          {
            \"pattern\" : {
              \"url\" : \"http://uat.compassiondev.net.au/.*\",
              \"mimeType\" : \"text/html\"
            },
            \"properties\" : {
              \"title\" : {
                \"text\" : \"title\"
              },
              \"body\" : {
                \"text\" : \"div#page_content\",
                \"trimSpaces\" : true
              }
            }
          }
        ]
    }
}"

I receive the following JSON response:

{"ok":true,"_index":"_river","_type":"my_web","_id":"_meta","_version":1}

But the index doesn't seem to exist (I receive the exception mentioned above)...
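
(For reference, a quick way to check whether a given index actually exists is a HEAD request against it, assuming the same host as above:

curl -I '192.168.1.26:9200/compassion_test'

HTTP 200 means the index exists; 404 matches the IndexMissingException seen when querying.)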

Is there something that I've missed? Any help would be greatly appreciated. Thanks!

marevol commented 10 years ago

Could you check the logs/elasticsearch.log file? I'd like to know whether it contains any exceptions (or stack traces) from river-web.

timcreatewell commented 10 years ago

Here are the contents of my log file after I create a new index (using the cURL request above). Only the "robot" index exists:

[2013-12-19 06:58:18,892][INFO ][cluster.metadata         ] [Bloodlust] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2013-12-19 06:58:19,053][INFO ][cluster.metadata         ] [Bloodlust] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 06:58:19,074][INFO ][river.routing            ] [Bloodlust] no river _meta document found, retrying in 1000 ms
[2013-12-19 06:58:20,083][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: my_web
[2013-12-19 06:58:20,083][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2013-12-19 06:58:20,094][INFO ][cluster.metadata         ] [Bloodlust] [_river] update_mapping [my_web] (dynamic)

Thanks!

marevol commented 10 years ago

Thank you for checking it!

I checked your curl command that creates the river configuration, and found that a property is missing. Please add a schedule property:

"schedule" : {
    "cron" : "0 0 6 * * ?"
 }
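
For context, this is Quartz cron syntax rather than Unix cron: the fields are seconds, minutes, hours, day-of-month, month, and day-of-week, so "0 0 6 * * ?" fires once a day at 06:00. A hypothetical schedule firing at the top of every hour would be:

"schedule" : {
    "cron" : "0 0 * * * ?"
}
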
timcreatewell commented 10 years ago

Hi there,

I've just tried the following:

  1. Removed the "_river" and "robot" indexes that I could see;
  2. Restarted Elasticsearch;
  3. Created a new "robot" index;
  4. Created a new index (with cron running every 2 mins - I realised I missed that!):
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d "{
    \"type\" : \"web\",
    \"crawl\" : {
        \"index\" : \"compassion_test\",
        \"url\" : [\"http://uat.compassiondev.net.au/\"],
        \"includeFilter\" : [\"http://uat.compassiondev.net.au/.*\"],
        \"maxDepth\" : 3,
        \"maxAccessCount\" : 100,
        \"numOfThread\" : 5,
        \"interval\" : 1000,
        \"overwrite\" : true,
        \"target\" : [
          {
            \"pattern\" : {
              \"url\" : \"http://uat.compassiondev.net.au/.*\",
              \"mimeType\" : \"text/html\"
            },
            \"properties\" : {
              \"title\" : {
                \"text\" : \"title\"
              },
              \"body\" : {
                \"text\" : \"div#page_content\",
                \"trimSpaces\" : true
              }
            }
          }
        ]
    },
    \"schedule\" : {
        \"cron\" : \"*/2 * * * * ?\"
    }
}"

The log file then looks like this:

[2013-12-19 07:03:33,852][INFO ][cluster.metadata         ] [Bloodlust] [_river] deleting index
[2013-12-19 07:03:33,859][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Unscheduling  CrawlJob...
[2013-12-19 07:03:38,017][INFO ][cluster.metadata         ] [Bloodlust] [robot] deleting index
[2013-12-19 07:03:52,203][INFO ][node                     ] [Bloodlust] stopping ...
[2013-12-19 07:03:52,223][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Bloodlust] Stopping Scheduler...
[2013-12-19 07:03:52,224][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Bloodlust] Stopping S2Container...
[2013-12-19 07:03:52,224][INFO ][node                     ] [Bloodlust] stopped
[2013-12-19 07:03:52,224][INFO ][node                     ] [Bloodlust] closing ...
[2013-12-19 07:03:52,232][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Bloodlust] Closing Scheduler...
[2013-12-19 07:03:52,232][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED shutting down.
[2013-12-19 07:03:52,232][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED paused.
[2013-12-19 07:03:52,250][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED shutdown complete.
[2013-12-19 07:03:52,250][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Bloodlust] Closing S2Container...
[2013-12-19 07:03:52,262][INFO ][node                     ] [Bloodlust] closed
[2013-12-19 07:03:52,821][INFO ][node                     ] [Wildboys] version[0.90.7], pid[5355], build[36897d0/2013-11-13T12:06:54Z]
[2013-12-19 07:03:52,822][INFO ][node                     ] [Wildboys] initializing ...
[2013-12-19 07:03:52,921][INFO ][plugins                  ] [Wildboys] loaded [QuartzPlugin, WebPlugin], sites [head]
[2013-12-19 07:03:54,017][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Wildboys] Creating Scheduler...
[2013-12-19 07:03:54,047][INFO ][org.quartz.impl.StdSchedulerFactory] Using default implementation for ThreadExecutor
[2013-12-19 07:03:54,050][INFO ][org.quartz.simpl.SimpleThreadPool] Job execution threads will use class loader of thread: main
[2013-12-19 07:03:54,061][INFO ][org.quartz.core.SchedulerSignalerImpl] Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
[2013-12-19 07:03:54,062][INFO ][org.quartz.core.QuartzScheduler] Quartz Scheduler v.2.2.0 created.
[2013-12-19 07:03:54,062][INFO ][org.quartz.simpl.RAMJobStore] RAMJobStore initialized.
[2013-12-19 07:03:54,063][INFO ][org.quartz.core.QuartzScheduler] Scheduler meta-data: Quartz Scheduler (v2.2.0) 'DefaultQuartzScheduler' with instanceId 'NON_CLUSTERED'
  Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
  NOT STARTED.
  Currently in standby mode.
  Number of jobs executed: 0
  Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 10 threads.
  Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.

[2013-12-19 07:03:54,063][INFO ][org.quartz.impl.StdSchedulerFactory] Quartz scheduler 'DefaultQuartzScheduler' initialized from default resource file in Quartz package: 'quartz.properties'
[2013-12-19 07:03:54,063][INFO ][org.quartz.impl.StdSchedulerFactory] Quartz scheduler version: 2.2.0
[2013-12-19 07:03:55,083][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Wildboys] Creating S2Container...
[2013-12-19 07:03:55,136][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-framework is 2.4.46.
[2013-12-19 07:03:55,137][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-extension is 2.4.46.
[2013-12-19 07:03:55,137][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Version of s2-tiger is 2.4.46.
[2013-12-19 07:03:56,022][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(client) of org.codelibs.elasticsearch.web.config.RiverConfig not found
[2013-12-19 07:03:56,176][INFO ][org.seasar.framework.container.factory.SingletonS2ContainerFactory] Running on [ENV]product, [DEPLOY MODE]Cool Deploy
[2013-12-19 07:03:56,252][INFO ][node                     ] [Wildboys] initialized
[2013-12-19 07:03:56,253][INFO ][node                     ] [Wildboys] starting ...
[2013-12-19 07:03:56,253][INFO ][org.codelibs.elasticsearch.quartz.service.ScheduleService] [Wildboys] Starting Scheduler...
[2013-12-19 07:03:56,253][INFO ][org.quartz.core.QuartzScheduler] Scheduler DefaultQuartzScheduler_$_NON_CLUSTERED started.
[2013-12-19 07:03:56,253][INFO ][org.codelibs.elasticsearch.web.service.S2ContainerService] [Wildboys] Starting S2Container...
[2013-12-19 07:03:56,307][INFO ][transport                ] [Wildboys] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.26:9300]}
[2013-12-19 07:03:59,331][INFO ][cluster.service          ] [Wildboys] new_master [Wildboys][pj_04k2DTTql4pdmMEnNKA][inet[/192.168.1.26:9300]], reason: zen-disco-join (elected_as_master)
[2013-12-19 07:03:59,356][INFO ][discovery                ] [Wildboys] elasticsearchtim/pj_04k2DTTql4pdmMEnNKA
[2013-12-19 07:03:59,376][INFO ][http                     ] [Wildboys] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.26:9200]}
[2013-12-19 07:03:59,376][INFO ][node                     ] [Wildboys] started
[2013-12-19 07:03:59,380][INFO ][gateway                  ] [Wildboys] recovered [0] indices into cluster_state
[2013-12-19 07:04:18,238][INFO ][cluster.metadata         ] [Wildboys] [robot] creating index, cause [api], shards [5]/[1], mappings []
[2013-12-19 07:04:37,355][INFO ][cluster.metadata         ] [Wildboys] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2013-12-19 07:04:37,606][INFO ][cluster.metadata         ] [Wildboys] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 07:04:37,628][INFO ][river.routing            ] [Wildboys] no river _meta document found, retrying in 1000 ms
[2013-12-19 07:04:38,640][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: my_web
[2013-12-19 07:04:38,640][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2013-12-19 07:04:38,679][INFO ][cluster.metadata         ] [Wildboys] [_river] update_mapping [my_web] (dynamic)
[2013-12-19 07:04:38,698][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2013-12-19 07:04:38,704][INFO ][cluster.metadata         ] [Wildboys] [robot] update_mapping [queue] (dynamic)
[2013-12-19 07:04:38,727][INFO ][cluster.metadata         ] [Wildboys] [robot] update_mapping [filter] (dynamic)
[2013-12-19 07:04:38,846][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://uat.compassiondev.net.au/
[2013-12-19 07:04:38,870][INFO ][org.seasar.robot.client.http.HcHttpClient] Checking URL: http://uat.compassiondev.net.au/robots.txt
[2013-12-19 07:04:40,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:42,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:44,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:46,004][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:48,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:50,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:52,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:54,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:56,004][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:04:58,003][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:05:00,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2013-12-19 07:05:01,207][ERROR][org.seasar.robot.helper.impl.LogHelperImpl] Crawling Exception at http://uat.compassiondev.net.au/
org.seasar.robot.RobotSystemException: Could not store data.
    at org.seasar.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:157)
    at org.seasar.robot.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:73)
    at org.seasar.robot.S2RobotThread.processResponse(S2RobotThread.java:384)
    at org.seasar.robot.S2RobotThread.run(S2RobotThread.java:183)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.indices.IndexMissingException: [compassion_test] missing
    at org.elasticsearch.cluster.metadata.MetaData.convertFromWildcards(MetaData.java:614)
    at org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:513)
    at org.elasticsearch.action.support.replication.TransportIndicesReplicationOperationAction.doExecute(TransportIndicesReplicationOperationAction.java:78)
    at org.elasticsearch.action.support.replication.TransportIndicesReplicationOperationAction.doExecute(TransportIndicesReplicationOperationAction.java:44)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
    at org.elasticsearch.client.support.AbstractClient.deleteByQuery(AbstractClient.java:164)
    at org.elasticsearch.action.deletebyquery.DeleteByQueryRequestBuilder.doExecute(DeleteByQueryRequestBuilder.java:158)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
    at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeIndex(ScrapingTransformer.java:313)
    at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeData(ScrapingTransformer.java:166)
    at org.seasar.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:143)
    ... 4 more
[... the "web.my_webJob is running." messages, the FaultTolerantClient warning, and the identical "Could not store data" / IndexMissingException stack trace repeat for each subsequent crawl attempt ...]

Thanks for all your help!

marevol commented 10 years ago

Hmm... just in case, could you create the compassion_test index with

curl -XPUT "localhost:9200/compassion_test"

before

curl -XPUT 'localhost:9200/_river/my_web/_meta' -d ...
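
Judging from the stack traces above, the likely reason this is needed: with "overwrite" : true the river runs a delete-by-query against compassion_test before storing each page, and delete-by-query fails on a missing index instead of auto-creating it the way an ordinary index request would (note the "creating index, cause [auto(index api)]" lines for _river). After creating it, the index can be confirmed with, for example:

curl -XGET "localhost:9200/compassion_test/_settings"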
timcreatewell commented 10 years ago

Great news, that fixed it! Thanks so much for the help :)

Out of interest, is the "my_web" in curl -XPUT 'localhost:9200/_river/my_web/_meta' interchangeable with something else? What is the purpose of this syntax?

Thanks again!

marevol commented 10 years ago

Yes, you can replace "my_web" with whatever name you want. The command creates a river configuration, and "my_web" is the name of that river configuration ("my_web" is just a sample name).
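
For example (hypothetical names, and only the properties already shown in this thread; defaults are assumed for the rest), a second river crawling a different site into its own index could be registered, and later removed, like this:

# register a river named "another_site"; the path segment before /_meta is the river name
curl -XPUT 'localhost:9200/_river/another_site/_meta' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "another_index",
        "url" : ["http://example.com/"],
        "includeFilter" : ["http://example.com/.*"]
    },
    "schedule" : {
        "cron" : "0 0 6 * * ?"
    }
}'
# remove that river again (stops its scheduled crawls)
curl -XDELETE 'localhost:9200/_river/another_site'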

timcreatewell commented 10 years ago

Thanks - that makes sense.

Last question (I promise!): I seem to be getting duplicates in my search results when I query my index; it seems to be storing multiple copies of the data for individual URLs, likely gathered each time it crawls. I have applied "overwrite" : true in my river configuration, but I still seem to be getting duplicate search results.

Do you know a quick way this can be fixed?

Thanks again.

marevol commented 10 years ago

I looked into it. The "overwrite" option needs a mapping for "url". Could you try the steps below:

# Remove River
curl -XDELETE 'localhost:9200/_river/my_web'
# Delete Index
curl -XDELETE "localhost:9200/compassion_test"
# Create Index
curl -XPUT "localhost:9200/compassion_test"
# Create a mapping for Index (overwrite option needs a mapping for "url")
curl -XPUT "localhost:9200/compassion_test/my_web/_mapping" -d '
{
  "my_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}
'
# Start River
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_test",
        "url" : ["http://uat.compassiondev.net.au/"],
        "includeFilter" : ["http://uat.compassiondev.net.au/.*"],
        "maxDepth" : 3,
        "maxAccessCount" : 100,
        "numOfThread" : 5,
        "interval" : 1000,
        "overwrite" : true,
        "target" : [
          {
            "pattern" : {
              "url" : "http://uat.compassiondev.net.au/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "title"
              },
              "body" : {
                "text" : "div#page_content",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}'
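
Once the mapping is in place, it can be verified before the first crawl with something like:

curl -XGET "localhost:9200/compassion_test/my_web/_mapping?pretty"

The reason for indexing "url" as not_analyzed is that the overwrite step can then match the previous document for a page by its exact URL; with an analyzed url field the delete query would not reliably find the old copies, which is presumably what left the duplicates behind.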

I'll improve README.md later.

timcreatewell commented 10 years ago

Thanks, I'll give that a try and will let you know how I get on.

timcreatewell commented 10 years ago

Great news, it seems to be working! Thanks for all the help.

selvas4u commented 10 years ago

How can I check whether the crawled data was inserted into the Elasticsearch index or not?
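
(For reference, assuming the index name used in this thread, one quick check is to count the documents in the target index after a crawl run:

curl -XGET "localhost:9200/compassion_test/_count?pretty"

A non-zero count means documents were stored; the documents themselves can be inspected with an ordinary _search request.)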