Closed — tony-boxed closed this issue 6 years ago
Hi @tony-boxed,
I can't reproduce the issue. I built SC from the master branch with mvn clean install, then generated a new project with
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.11-SNAPSHOT
I then copied the es-crawler.flux and es-conf.yaml from the ES module
cp /data/storm-crawler/external/elasticsearch/es-*.* .
Added
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-elasticsearch</artifactId>
  <version>1.11-SNAPSHOT</version>
</dependency>
to the pom then
mvn clean package
Finally
storm jar target/test-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 86400000
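For context, the --sleep value passed to Flux is in milliseconds, so 86400000 keeps the local topology alive for 24 hours:

```shell
# 86400000 ms -> seconds -> hours
echo $((86400000 / 1000 / 3600))   # prints: 24
```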
It does not fetch anything, as I haven't injected any URLs, but it doesn't produce the error above.
Maybe check the content of the es-conf file, in particular
es.status.addresses: "http://localhost:9200"
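For reference, the stock es-conf.yaml defines one address setting per component, so it is worth checking all of them, not just the status one. A hedged sketch of the relevant entries (key names taken from the ES module's sample config; adjust if your version differs):

```yaml
# Elasticsearch node addresses used by each component
es.indexer.addresses: "http://localhost:9200"
es.metrics.addresses: "http://localhost:9200"
es.status.addresses: "http://localhost:9200"
```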
You can find tutorials on YouTube.
When I attempt to execute your archetype command I get this:
The desired archetype does not exist (com.digitalpebble.stormcrawler:storm-crawler-archetype:1.11-SNAPSHOT)
Did you clone the SC repo and build with 'mvn clean install'? You can use the archetype from 1.10 instead; it won't make much difference as long as you point to the 1.11-SNAPSHOT dependency for the ES module in the project's pom.
OK, I used 1.10 and it appears to be fixed after following your steps. Thank you very much; I will now attempt to inject URLs and perform a real crawl.
Glad it's sorted. Closing for now, you can use StackOverflow if you have any questions.
I'm using everything out of the box: the latest StormCrawler with the latest Elasticsearch module from GitHub and the latest Elasticsearch (though I'm currently trying an older ES version and getting the same issue).
Every time I run:
storm jar target/synopdoc-1.0.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 86400000
I fail on this:
29161 [Thread-20-__metricscom.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer-executor[2 2]] ERROR c.d.s.e.m.MetricsConsumer - Can't connect to ElasticSearch java.lang.RuntimeException: java.net.UnknownHostException: http
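One possible reading of UnknownHostException: http (an assumption on my part, not confirmed from the SC code) is that the address string is being split naively on ':', so the scheme ends up treated as the hostname. A minimal shell illustration of that failure mode:

```shell
addr="http://localhost:9200"
# a naive host:port split takes everything before the first ':'
host="${addr%%:*}"
echo "$host"   # prints: http
```

If that is what is happening, the plain host:port form (localhost:9200, without the scheme) may be what this SC version expects in the config.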
I have no idea how to fix this, and there's essentially no information on Google. Like I said, everything is out of the box; I have not made any changes to any settings files.
The ES_IndexInit.sh script executes successfully and creates the two indices, so Elasticsearch is clearly up and reachable on port 9200.
Any help at all is greatly appreciated.