USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars, 143 forks

Failed to construct kafka producer #157

Closed: misterpilou closed this issue 6 years ago

misterpilou commented 6 years ago

The Kafka feature doesn't work with all default args. I have Kafka 1.1.0 running on localhost:9092. Here is the error trace:

2018-04-27 01:35:40 ERROR Executor:95 [Executor task launch worker-0] - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:335)
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
    at edu.usc.irds.sparkler.pipeline.SparklerProducer.<init>(SparklerProducer.scala:45)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$storeContentKafka$1.apply(Crawler.scala:235)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$storeContentKafka$1.apply(Crawler.scala:234)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: -ktp
    at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:45)
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:275)
    ... 14 more

thammegowda commented 6 years ago

Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: -ktp

This indicates an error in the CLI args. Please paste the full command line that you used to invoke Sparkler. Also, if you edited the Sparkler config (the .yaml file), please let me know.

misterpilou commented 6 years ago

I ran build/bin/sparkler.sh crawl -id 26-04-2018-00-16 -ke -kls -ktp. After experimenting a little, I found that it happens when -ktp comes after -kls.

misterpilou commented 6 years ago

I did edit sparkler-default.yaml, but I went back to master after stashing my changes.

thammegowda commented 6 years ago

Hmm,

here is how things work:

Here is what went wrong in your command-line args: you tried to customize the Kafka address with -kls -ktp, but -kls expects a value, so -ktp was read as the value of -kls, and it doesn't match the host:port pattern (an example of a correct value is localhost:9020).
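To illustrate the failure mode, here is a minimal Python sketch of a naive left-to-right option parser. It is not Sparkler's actual argument parser: the option table, the function names `parse_args` and `invalid_servers`, and the host:port regex are all assumptions made for illustration. It only shows how a flag that expects a value can swallow the next flag, and why that value then fails a host:port check.

```python
import re

# Hypothetical option table (an assumption, not Sparkler's real definitions).
VALUE_FLAGS = {"-id", "-kls", "-ktp"}   # flags assumed to take one value
SWITCH_FLAGS = {"-ke"}                  # boolean switch, takes no value

# Simplified host:port check; the real Kafka client validation is stricter.
HOST_PORT = re.compile(r"^[^\s:,]+:\d+$")


def parse_args(argv):
    """Parse left to right; a value flag consumes the next token verbatim."""
    opts = {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in SWITCH_FLAGS:
            opts[tok] = True
            i += 1
        elif tok in VALUE_FLAGS:
            if i + 1 >= len(argv):
                raise ValueError(f"{tok} expects a value")
            # The next token is taken as the value even if it looks like a flag.
            opts[tok] = argv[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected token {tok}")
    return opts


def invalid_servers(value):
    """Return the comma-separated entries that do not look like host:port."""
    return [s for s in value.split(",") if not HOST_PORT.match(s.strip())]


# The arguments from the command line reported in this thread.
opts = parse_args(["-id", "26-04-2018-00-16", "-ke", "-kls", "-ktp"])
print(opts["-kls"])                       # prints: -ktp
print(invalid_servers(opts["-kls"]))      # prints: ['-ktp']
print(invalid_servers("localhost:9092"))  # prints: []
```

In this sketch, "-ktp" ends up stored as the value of "-kls" and is rejected by the host:port check, which mirrors the "Invalid url in bootstrap.servers: -ktp" message in the trace.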

So, simple things first:

  1. Just add -ke and leave the rest to the defaults, which are defined in the config file.
  2. Listen to sparkler_26-04-2018-00-16 to receive the data.

misterpilou commented 6 years ago

Ah, okay, thanks for the information. So it's not an issue at all.