codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

NoClassSettingsException[Failed to load class with value [web]] #18

Closed timcreatewell closed 10 years ago

timcreatewell commented 10 years ago

Hi there,

I've just created a brand new CentOS VM (v6) and installed Elasticsearch v1.0.0.RC2 and elasticsearch-river-web v1.1.0 as per the instructions.

I then set up my crawl by running the following:

# create robot
curl -XPUT 'http://localhost:9200/robot/'

# Create Index
curl -XPUT "http://localhost:9200/compassion_uat/"

# create the duplicate mapping index
curl -XPUT "http://localhost:9200/compassion_uat/compassion_web/_mapping/" -d '
{
  "compassion_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}
'

# create the crawler
curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
        "incremental" : true,
        "overwrite" : true,
        "robotsTxt" : false,
        "userAgent" : "bingbot",
        "target" : [
          {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}
'

After doing this I cannot see any documents appearing in the index, so I have looked at the _river index and can see the following error:

NoClassSettingsException[Failed to load class with value [web]]; nested: ClassNotFoundException[web];

Have I missed a step?

Thanks, Tim.

marevol commented 10 years ago

Did you install the quartz plugin?

$ $ES_HOME/bin/plugin --install org.codelibs/elasticsearch-quartz/1.0.1

If yes, could you check the ES log file and provide the stack trace?
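
A minimal sketch of that log check, assuming an Elasticsearch 1.x log file whose path is passed as an argument. The function names are illustrative and not part of Elasticsearch or river-web. The idea is to look at the `[plugins` line, which lists what the node loaded at startup, and at any river creation failures.

```shell
#!/bin/sh
# Print the plugin-loading line(s) from an Elasticsearch log file.
# $1: path to the log file (assumed; the location depends on the install).
show_loaded_plugins() {
    grep -F '[plugins' "$1"
}

# Succeed (exit 0) when the log reports an empty plugin list,
# i.e. a line containing "loaded []".
no_plugins_loaded() {
    grep -F '[plugins' "$1" | grep -qF 'loaded []'
}

# Print any river creation failures recorded in the same log.
show_river_failures() {
    grep -F 'failed to create river' "$1"
}
```

If `no_plugins_loaded` succeeds, the node probably started without river-web, which would be consistent with the `ClassNotFoundException: web` in the river status.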

timcreatewell commented 10 years ago

Hi, yes I did install quartz:

[tim@localhost plugins]$ ls
head  kopf  quartz  river-web

The contents of my log are as follows:

[2014-03-25 23:50:35,994][INFO ][node                     ] [Lynx] version[1.0.0.RC2], pid[14524], build[a9d736e/2014-02-03T15:02:11Z]
[2014-03-25 23:50:35,994][INFO ][node                     ] [Lynx] initializing ...
[2014-03-25 23:50:35,999][INFO ][plugins                  ] [Lynx] loaded [], sites []
[2014-03-25 23:50:38,601][INFO ][node                     ] [Lynx] initialized
[2014-03-25 23:50:38,601][INFO ][node                     ] [Lynx] starting ...
[2014-03-25 23:50:38,671][INFO ][transport                ] [Lynx] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.20.122:9300]}
[2014-03-25 23:50:41,714][INFO ][cluster.service          ] [Lynx] new_master [Lynx][RooMfQuzSZ-zzYyhom5DZA][localhost.localdomain][inet[/192.168.20.122:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-25 23:50:41,741][INFO ][discovery                ] [Lynx] elasticsearch/RooMfQuzSZ-zzYyhom5DZA
[2014-03-25 23:50:41,840][INFO ][http                     ] [Lynx] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.20.122:9200]}
[2014-03-25 23:50:41,867][INFO ][gateway                  ] [Lynx] recovered [0] indices into cluster_state
[2014-03-25 23:50:41,867][INFO ][node                     ] [Lynx] started
[2014-03-25 23:54:45,904][INFO ][cluster.metadata         ] [Lynx] [robot] creating index, cause [api], shards [5]/[1], mappings []
[2014-03-25 23:54:53,657][INFO ][cluster.metadata         ] [Lynx] [compassion_uat] creating index, cause [api], shards [5]/[1], mappings []
[2014-03-25 23:55:07,465][INFO ][cluster.metadata         ] [Lynx] [compassion_uat] create_mapping [compassion_web]
[2014-03-25 23:55:25,630][INFO ][cluster.metadata         ] [Lynx] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-03-25 23:55:25,795][INFO ][cluster.metadata         ] [Lynx] [_river] update_mapping [compassion_web] (dynamic)
[2014-03-25 23:55:26,820][WARN ][river                    ] [Lynx] failed to create river [web][compassion_web]
org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class with value [web]
        at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:87)
        at org.elasticsearch.river.RiverModule.spawnModules(RiverModule.java:58)
        at org.elasticsearch.common.inject.ModulesBuilder.add(ModulesBuilder.java:44)
        at org.elasticsearch.river.RiversService.createRiver(RiversService.java:137)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
        at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: web
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:73)
        ... 9 more
[2014-03-25 23:55:26,832][INFO ][cluster.metadata         ] [Lynx] [_river] update_mapping [compassion_web] (dynamic)
[2014-03-25 23:57:35,306][INFO ][cluster.metadata         ] [Lynx] [_river] update_mapping [compassion_web] (dynamic)

If it helps, the installed java version is as follows:

[tim@localhost elasticsearch]$ java -version
java version "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Thanks for your help!

marevol commented 10 years ago

Thank you for the info. I do not think your ES is loading the river-web plugin from the plugins directory. Could you check the files in the $ES_HOME/plugins/river-web directory and also the file permissions?
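
A rough sketch of such a permission check, using only standard `find` options. The function name is hypothetical, and treating "readable by all users" as the requirement is an assumption: what is strictly needed depends on which user runs Elasticsearch.

```shell
#!/bin/sh
# List regular files that are not readable by everyone (missing any of
# the r bits in 444) and directories that are not readable and
# traversable by everyone (missing any of the bits in 555).
# Prints nothing when all entries pass.
# $1: plugin directory, e.g. "$ES_HOME/plugins/river-web"
find_restricted() {
    find "$1" \( -type f ! -perm -444 \) -o \( -type d ! -perm -555 \)
}
```

Any path printed by `find_restricted "$ES_HOME/plugins/river-web"` is a candidate for a `chmod`, which should be checked against your own security requirements first.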

timcreatewell commented 10 years ago

I've just checked and all the files are there. I ran chmod 755 across them and the plugin now seems to load. However, I am now receiving the following error:

[2014-03-26 01:19:04,221][ERROR][org.seasar.robot.helper.impl.LogHelperImpl] System Error.
org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures {[rhyrVnOJTpi6KBrzvo30Nw][compassion_uat][2]: SearchParseException[[compassion_uat][2]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"https://compassionau.custhelp.com/ci/sitemap/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]]; nested: SearchParseException[[compassion_uat][2]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]]; }{[rhyrVnOJTpi6KBrzvo30Nw][compassion_uat][1]: SearchParseException[[compassion_uat][1]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"https://compassionau.custhelp.com/ci/sitemap/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]]; nested: SearchParseException[[compassion_uat][1]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]]; }{[rhyrVnOJTpi6KBrzvo30Nw][compassion_uat][0]: SearchParseException[[compassion_uat][0]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"https://compassionau.custhelp.com/ci/sitemap/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]]; nested: SearchParseException[[compassion_uat][0]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]]; }{[rhyrVnOJTpi6KBrzvo30Nw][compassion_uat][4]: SearchParseException[[compassion_uat][4]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"https://compassionau.custhelp.com/ci/sitemap/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]]; nested: SearchParseException[[compassion_uat][4]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]]; }{[rhyrVnOJTpi6KBrzvo30Nw][compassion_uat][3]: SearchParseException[[compassion_uat][3]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"https://compassionau.custhelp.com/ci/sitemap/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]]; nested: SearchParseException[[compassion_uat][3]: query[url:https://compassionau.custhelp.com/ci/sitemap/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]]; }
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:272)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:224)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:205)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:203)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:186)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I am also receiving this on my qbox.io cluster (qbox.io support confirmed this this morning). Have you seen this error before?

marevol commented 10 years ago

I have not seen it... Could you check the mapping?

curl -XGET localhost:9200/compassion_uat/compassion_web/_mapping?pretty

If the mapping is not correct, I think it's better to recreate the compassion_uat index.

timcreatewell commented 10 years ago

The request returns this:

{
  "compassion_uat" : {
    "mappings" : {
      "compassion_web" : {
        "dynamic_templates" : [ {
          "url" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "url"
          }
        }, {
          "method" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "method"
          }
        }, {
          "charSet" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "charSet"
          }
        }, {
          "mimeType" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "mimeType"
          }
        } ],
        "properties" : { }
      }
    }
  }
}

As far as I can tell it's correct?

marevol commented 10 years ago

"properties" : { }

The properties object is empty, so compassion_web has no field mappings. Could you re-register the river with "incremental":false? If it works, please re-register it with "incremental":true again.
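
As a sketch only, the empty-properties condition can be detected by piping the mapping response through a small filter. The function name is hypothetical, and the pattern assumes the empty object is printed on a single line, as in the `?pretty` output quoted above.

```shell
#!/bin/sh
# Succeed (exit 0) when the mapping JSON read from stdin contains an
# empty "properties" object, meaning no fields (such as lastModified)
# have been mapped yet.
has_empty_properties() {
    grep -Eq '"properties"[[:space:]]*:[[:space:]]*\{[[:space:]]*\}'
}

# Example (requires a running node):
#   curl -s -XGET 'localhost:9200/compassion_uat/compassion_web/_mapping?pretty' \
#     | has_empty_properties && echo "no field mappings yet"
```

When the check succeeds, an incremental crawl is likely to hit the "No mapping found for [lastModified]" error shown earlier in this thread.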

timcreatewell commented 10 years ago

Hi there,

I removed all the indexes and started over with the river set to "incremental": false - I can see documents being indexed which is great!

When I update the river by running:

curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
        "incremental" : true,
        "overwrite" : true,
        "robotsTxt" : false,
        "userAgent" : "bingbot",
        "target" : [
          {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}
'

... everything still seems to be working?

marevol commented 10 years ago

Thank you for checking it. Incremental crawling needs a mapping to exist before the crawl starts. It works now because the non-incremental crawl created the mapping. I'll fix this problem in the next release.
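
Until such a fix is available, the two-step workaround from this thread (crawl once with incremental set to false, then update the river with incremental set to true) could be scripted roughly as follows. This is a sketch under assumptions: `set_incremental` is a hypothetical helper, and `river.json` is an assumed file holding the river definition posted above.

```shell
#!/bin/sh
# Rewrite the "incremental" flag in a river definition read from stdin.
# $1: the value to set, either true or false.
set_incremental() {
    sed -E 's/("incremental"[[:space:]]*:[[:space:]]*)(true|false)/\1'"$1"'/'
}

# Example (requires a running node):
#   set_incremental false < river.json \
#     | curl -XPUT 'localhost:9200/_river/compassion_web/_meta' -d @-
#   ... wait until the first crawl has indexed documents ...
#   set_incremental true < river.json \
#     | curl -XPUT 'localhost:9200/_river/compassion_web/_meta' -d @-
```

The helper only edits the JSON text; it does not check that the first crawl has finished, so the wait between the two requests is left to the operator.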

timcreatewell commented 10 years ago

Thanks for the help - really appreciate it!

srinivasv2 commented 10 years ago

Hi marevol,

I did as you said: I created a crawling river with incremental:false initially, then deleted it and recreated it with incremental:true, after which it failed to index files. Please let me know if I made a mistake.

This is the log stack trace:

[2014-03-26 19:43:12,763][INFO ][cluster.metadata ] [Intermec] [_river] update_mapping es_htmls
[2014-03-26 19:43:12,768][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: es_htmls
[2014-03-26 19:43:12,768][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2014-03-26 19:43:12,771][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2014-03-26 19:43:12,774][INFO ][cluster.metadata ] [Intermec] [_river] update_mapping es_htmls
[2014-03-26 19:43:12,864][ERROR][org.seasar.robot.helper.impl.LogHelperImpl] System Error.
java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Date
        at org.codelibs.elasticsearch.web.robot.service.EsUrlQueueService.poll(EsUrlQueueService.java:107)
        at org.seasar.robot.S2RobotThread.run(S2RobotThread.java:128)
        at java.lang.Thread.run(Thread.java:722)
[2014-03-26 19:43:13,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.es_htmlsJob is running.
[2014-03-26 19:43:14,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.es_htmlsJob is running.

marevol commented 10 years ago

Could you re-create the "robot" index?

curl -XDELETE 'localhost:9200/robot/'
curl -XPUT 'localhost:9200/robot/'

srinivasv2 commented 10 years ago

The mistake was on my end. I first created a river with incremental:false, then deleted the river and re-created it with incremental:true. What I concluded is that simply updating the river with incremental:true is enough; there is no need to delete and re-create it.

That fixed the issue for me.

Thanks, Srinivas

marevol commented 10 years ago

Filed #22 and #24. The problems raised in this issue will be fixed in the next release.

selvas4u commented 10 years ago

I got the following log when I tried the above steps with incremental:false:

[2014-07-24 14:56:53,509][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2014-07-24 14:56:53,525][INFO ][cluster.metadata ] [Black Tarantula] [_river] update_mapping compassion_web
[2014-07-24 14:59:33,250][INFO ][cluster.metadata ] [Black Tarantula] [[_river]] remove_mapping [[compassion_web]]
[2014-07-24 14:59:33,252][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Unscheduling CrawlJob...
[2014-07-24 14:59:33,260][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Deleted one time river: compassion_web

When I run this command:

curl -XGET localhost:9200/compassion_uat/compassion_web/_mapping?pretty

I still get an empty mapping.

yatendra commented 10 years ago

I am also getting the same "org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class with value [web]". I tried setting incremental: false when creating the crawl but it didn't help.

FYI I am using ES 1.3.0, river web plugin 1.3.0 and quartz 1.0.1

When I do

curl -XGET localhost:9200/webindex/my_web/_mapping?pretty

I get -

{
  "webindex" : {
    "mappings" : {
      "my_web" : {
        "dynamic_templates" : [ {
          "url" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "url"
          }
        }, {
          "method" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "method"
          }
        }, {
          "charSet" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "charSet"
          }
        }, {
          "mimeType" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "mimeType"
          }
        } ],
        "properties" : { }
      }
    }
  }
}

Somehow the properties are not being set even when I set incremental to false.

curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "webindex",
        "url" : ["http://gta.wikia.com"],
        "includeFilter" : ["http://gta.wikia.com/.*"],
        "maxDepth" : 3,
        "maxAccessCount" : 1000000,
        "numOfThread" : 5,
        "interval" : 1000,
        "incremental" : false,
        "target" : [
          {
            "pattern" : {
              "url" : "http://gta.wikia.com/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "title"
              },
              "body" : {
                "text" : "body"
              },
              "bodyAsHtml" : {
                "html" : "body"
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "0 0 6 * * ?"
    }
}'

marevol commented 10 years ago

Closing this issue because it mixes multiple problems. If you see NoClassSettingsException, I think the river-web installation failed.

yatendra commented 10 years ago

@marevol actually river-web was installed successfully, but I am still getting this exception:

[2014-08-04 10:47:21,080][INFO ][cluster.metadata         ] [elasticsearch_0] [webindex] creating index, cause [api], shards [5]/[1], mappings []
[2014-08-04 10:48:25,455][INFO ][cluster.metadata         ] [elasticsearch_0] [webindex] create_mapping [my_web]
[2014-08-04 10:59:18,146][INFO ][cluster.metadata         ] [elasticsearch_0] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-08-04 10:59:18,251][INFO ][cluster.metadata         ] [elasticsearch_0] [_river] update_mapping [my_web] (dynamic)
[2014-08-04 10:59:19,267][WARN ][river                    ] [elasticsearch_0] failed to create river [web][my_web]
org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class with value [web]
        at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:87)
        at org.elasticsearch.river.RiverModule.spawnModules(RiverModule.java:58)
        at org.elasticsearch.common.inject.ModulesBuilder.add(ModulesBuilder.java:44)
        at org.elasticsearch.river.RiversService.createRiver(RiversService.java:137)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
        at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:95)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: web
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.elasticsearch.river.RiverModule.loadTypeModule(RiverModule.java:73)
        ... 9 more
[2014-08-04 10:59:19,279][INFO ][cluster.metadata         ] [elasticsearch_0] [_river] update_mapping [my_web] (dynamic)