apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Elasticsearch java.net.UnknownHostException: http #633

Closed: tony-boxed closed this issue 6 years ago

tony-boxed commented 6 years ago

I'm using everything out of the box: the latest StormCrawler with the latest Elasticsearch module from GitHub, and the latest Elasticsearch itself (though I'm currently trying an older version of ES and getting the same issue).

Every time, when running:

storm jar target/synopdoc-1.0.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 86400000

I fail on this:

29161 [Thread-20-__metricscom.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer-executor[2 2]] ERROR c.d.s.e.m.MetricsConsumer - Can't connect to ElasticSearch java.lang.RuntimeException: java.net.UnknownHostException: http

I have no idea how to fix this, and there's essentially no information on Google. Like I said, everything is out of the box; I have not made any changes to any settings files.

The ES_IndexInit.sh script executes successfully and creates the two indices, so port 9200 is clearly up and reachable.
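(For reference, a quick way to confirm that from the command line, assuming a default local Elasticsearch listening on port 9200:)

    curl -s http://localhost:9200                      # cluster name, version, basic info
    curl -s 'http://localhost:9200/_cat/indices?v'     # lists the indices ES_IndexInit.sh created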

Any help at all is greatly appreciated.

jnioche commented 6 years ago

Hi @tony-boxed, I can't reproduce the issue. I built SC from the master branch with mvn clean install, then generated a new project with

    mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.11-SNAPSHOT

I then copied es-crawler.flux and es-conf.yaml from the ES module with

    cp /data/storm-crawler/external/elasticsearch/es-*.* .

and added

        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-elasticsearch</artifactId>
            <version>1.11-SNAPSHOT</version>
        </dependency>

to the pom, then ran mvn clean package. Finally:

    storm jar target/test-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 86400000

It does not fetch anything, as I haven't injected any URLs, but it doesn't produce the error above.

Maybe check the content of the es-conf file, in particular:

    es.status.addresses: "http://localhost:9200"
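(Since the stack trace above comes from the MetricsConsumer, it may be worth checking all three address settings rather than just the status one. A rough sketch of that part of es-conf.yaml, assuming a single local node and mirroring the value suggested above; exact key names can vary between versions:)

    # each component reads its own address setting
    es.indexer.addresses: "http://localhost:9200"
    es.status.addresses: "http://localhost:9200"
    es.metrics.addresses: "http://localhost:9200"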

You can find tutorials on YouTube.

tony-boxed commented 6 years ago

When I attempt to execute your archetype command I get this:

The desired archetype does not exist (com.digitalpebble.stormcrawler:storm-crawler-archetype:1.11-SNAPSHOT)

jnioche commented 6 years ago

Did you clone the SC repo and build it with 'mvn clean install'? You can use the archetype from 1.10; it won't make much difference as long as you point to the 1.11-SNAPSHOT dependency for the ES module in the project's pom.
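(For example, something along these lines, assuming the 1.10 archetype is available from Maven Central; the ES dependency in the generated pom is then bumped to 1.11-SNAPSHOT as shown earlier:)

    mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.10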

tony-boxed commented 6 years ago

OK, I used 1.10 and it appears to be fixed after following your steps. Thank you very much; I will now attempt to inject URLs and perform a real crawl.
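(For anyone following along, injection is typically done with the es-injector.flux topology that sits alongside es-crawler.flux in the ES module, using the seed file location configured in that flux; a sketch, assuming the same jar as above:)

    # the --sleep value just controls how long the local topology runs
    storm jar target/synopdoc-1.0.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 30000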

jnioche commented 6 years ago

Glad it's sorted. Closing for now, you can use StackOverflow if you have any questions.