OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
http://oryx.io
Apache License 2.0
1.79k stars 405 forks

Oryx serving layer fails to start #289

Closed: msumner91 closed this 8 years ago

msumner91 commented 8 years ago

Issue: It appears that the oryx-serving layer from release 2.1.2 fails to start with an exception (using the ALS config example). A fresh build of the latest version does not have this classpath issue, but throws a different exception related to Kafka connectivity. I can only assume I am missing something here?

Context: CDH 5.7, Kafka 0.8.2.0, Scala library 2.10.6 running inside docker container.

Exception on startup with oryx-serving 2.1.2:

SEVERE: Exception sending context initialized event to listener instance of class com.cloudera.oryx.lambda.serving.ModelManagerListener
java.lang.NoSuchMethodError: kafka.admin.AdminUtils.topicExists(Lorg/I0Itec/zkclient/ZkClient;Ljava/lang/String;)Z
    at com.cloudera.oryx.kafka.util.KafkaUtils.topicExists(KafkaUtils.java:93)
    at com.cloudera.oryx.lambda.serving.ModelManagerListener.contextInitialized(ModelManagerListener.java:113)
    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4812)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5255)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1408)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1398)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Apr 20, 2016 2:14:43 PM org.apache.catalina.core.StandardContext startInternal
SEVERE: One or more listeners failed to start. Full details will be found in the appropriate container log file
Apr 20, 2016 2:14:43 PM org.apache.catalina.core.StandardContext startInternal
SEVERE: Context [Oryx] startup failed due to previous errors

I have placed the necessary jars in /opt/cloudera/parcels/CDH/jars and added anything that caused an exception to compute-classpath.sh (e.g. the Scala library):

commons-cli-1.2.jar
commons-collections-3.2.2.jar
commons-configuration-1.6.jar
hadoop-auth.jar
hadoop-common.jar
hadoop-hdfs.jar
hadoop-yarn-applications-distributedshell-2.6.0-cdh5.7.0.jar
htrace-core4-4.0.1-incubating.jar
protobuf-java-2.5.0.jar
scala-library-2.10.6.jar
snappy-java-1.0.4.1.jar
spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar
zookeeper-copy.jar
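For reference, a quick way to see which of these jars bundle Kafka classes at all (and could therefore shadow the intended Kafka version) is to scan the zip indexes; this is a hypothetical helper, and the parcels directory is just the one used above:

```shell
# scan_for_kafka DIR: print every jar under DIR whose zip index mentions
# Kafka's AdminUtils class. Zip entries store their paths as plain strings,
# so grepping the raw jar file is sufficient.
scan_for_kafka() {
  for j in "$1"/*.jar; do
    [ -f "$j" ] || continue
    grep -q 'kafka/admin/AdminUtils' "$j" && echo "$j"
  done
}

# e.g.: scan_for_kafka /opt/cloudera/parcels/CDH/jars
```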

Exception with latest build of oryx-serving: When a new build is run for oryx-serving, we instead get an exception connecting to Kafka before data is ingested (1), and a new exception after an attempt to ingest data is made (2):

(1):

2016-04-20 14:19:44,104 WARN ConsumerFetcherThread:83 [ConsumerFetcherThread-OryxGroup-ServingLayer-1461161982826_quickstart.cloudera-1461161982997-dfd4c392-0-0], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@24a09ae2. Possible cause: java.nio.BufferUnderflowException

(2):

2016-04-20 14:23:29,941 WARN  ConsumerFetcherThread:83 [ConsumerFetcherThread-OryxGroup-ServingLayer-1461162155849_quickstart.cloudera-1461162155987-293080da-0-0], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@1ff07ee6. Possible cause: java.nio.BufferUnderflowException
2016-04-20 14:23:29,949 WARN  DefaultEventHandler:89 Failed to send producer request with correlation id 101 to broker 0 with data for partitions [OryxInput,3],[OryxInput,1],[OryxInput,2],[OryxInput,0]
java.nio.BufferUnderflowException
    at java.nio.Buffer.nextGetIndex(Buffer.java:506)
    at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:361)
    at kafka.api.ProducerResponse$.readFrom(ProducerResponse.scala:43)
    at kafka.producer.SyncProducer.send(SyncProducer.scala:110)
    at kafka.producer.async.DefaultEventHandler.kafka$producer$async$DefaultEventHandler$$send(DefaultEventHandler.scala:259)
    at kafka.producer.async.DefaultEventHandler$$anonfun$dispatchSerializedData$2.apply(DefaultEventHandler.scala:110)
    at kafka.producer.async.DefaultEventHandler$$anonfun$dispatchSerializedData$2.apply(DefaultEventHandler.scala:102)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at kafka.producer.async.DefaultEventHandler.dispatchSerializedData(DefaultEventHandler.scala:102)
    at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:75)
    at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:105)
    at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:88)
    at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:68)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:67)
    at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:45)
2016-04-20 14:23:29,962 INFO  DefaultEventHandler:68 Back off for 100 ms before retrying send. Remaining retries = 3

Command used to run oryx-serving: ./oryx-run.sh serving --conf als-example.conf --app-jar oryx-serving-2.x.0.jar

Config used:

kafka-brokers = "localhost:9092"
zk-servers = "localhost:2181"
hdfs-base = "hdfs:///user/example/Oryx"

oryx {
  id = "ALSExample"
  als {
    rescorer-provider-class = null
  }
  input-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  update-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  batch {
    streaming {
      generation-interval-sec = 300
      num-executors = 4
      executor-cores = 8
      executor-memory = "4g"
    }
    update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
    storage {
      data-dir = ${hdfs-base}"/data/"
      model-dir = ${hdfs-base}"/model/"
    }
    ui {
      port = 4040
    }
  }
  speed {
    model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
    ui {
      port = 4041
    }
  }
  serving {
    model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
    application-resources = "com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
    api {
      port = 8080
    }
  }
}
srowen commented 8 years ago

Interesting, so the first error is complaining that it can't find the Kafka 0.8 version of topicExists. Indeed, Oryx 2.1 has to be used with Kafka 0.8; master (2.2) uses Kafka 0.9 (which is its own problem, since Spark Streaming doesn't work with 0.9 yet!)

My first question is: are you sure you're providing the Kafka 0.8 libraries at runtime? This suggests it's picking up 0.9. Kafka is a separate parcel from CDH 5.7, but note that the "2.0.0" distribution of Kafka for Cloudera is based on 0.9. If that's getting picked up, that would explain it. Do you maybe have it downloaded and deployed?

Are you saying you modified your build or your cluster? You shouldn't have to do either. What did you change? I am concerned this could have caused the problems.

The rest could be a knock-on problem from having mismatched Kafka versions.
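One way to check which Kafka is actually winning at runtime is to disassemble AdminUtils from whichever jar ends up on the classpath; as a sketch (the helper and jar path are illustrative, and my understanding is that 0.8's topicExists takes a ZkClient while 0.9 changed the signature, which would produce exactly the NoSuchMethodError above):

```shell
# show_topic_exists_sig JAR: print the topicExists signature(s) that a jar's
# copy of kafka.admin.AdminUtils declares. Kafka 0.8's version takes an
# org.I0Itec.zkclient.ZkClient; 0.9 changed the argument types, which is
# what triggers a NoSuchMethodError like the one reported.
show_topic_exists_sig() {
  javap -classpath "$1" kafka.admin.AdminUtils | grep -F 'topicExists'
}

# e.g.: show_topic_exists_sig /opt/cloudera/parcels/CDH/jars/kafka_2.10-0.8.2.0-kafka-1.4.0.jar
```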

msumner91 commented 8 years ago

Thanks for coming back...

We are using the CDH 1.4.0 Kafka package (which is based on 0.8), installed from a Dockerfile using: RUN sudo yum clean all && sudo yum -y install kafka && sudo yum -y install kafka-server

To confirm, the file at /usr/lib/kafka/cloudera/cdh_version.properties contains: version=0.8.2.0-kafka1.4.0

The only changes were to:

  1. Copy the necessary jars to the /opt/cloudera/parcels/CDH/jars
  2. Edit config file for kafka/zookeeper hosts to localhost
  3. Add three dependencies to compute-classpath (scala-library 2.10.6, plus optionally kafka_2.10 and kafka-clients)

Running compute-classpath gives:

/opt/cloudera/parcels/CDH/jars/spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar
/opt/cloudera/parcels/CDH/jars/scala-library-2.10.6.jar
/opt/cloudera/parcels/CDH/jars/kafka_2.10-0.8.2.0-kafka-1.4.0.jar
/opt/cloudera/parcels/CDH/jars/kafka-clients-0.8.2.0-kafka-1.4.0.jar  <-- running with or without these two jars still causes the exception
/opt/cloudera/parcels/CDH/jars/zookeeper-copy.jar
/opt/cloudera/parcels/CDH/jars/htrace-core4-4.0.1-incubating.jar
/opt/cloudera/parcels/CDH/jars/commons-cli-1.2.jar
/opt/cloudera/parcels/CDH/jars/commons-collections-3.2.2.jar
/opt/cloudera/parcels/CDH/jars/commons-configuration-1.6.jar
/opt/cloudera/parcels/CDH/jars/protobuf-java-2.5.0.jar
/opt/cloudera/parcels/CDH/jars/snappy-java-1.0.4.1.jar
/opt/cloudera/parcels/CDH/jars/hadoop-yarn-applications-distributedshell-2.6.0-cdh5.7.0.jar

Do you have any suggestions on how to provide the libraries at runtime (other than placing the spark-examples and/or Kafka jars into /opt/cloudera/parcels/CDH/jars/)?

As you say, we will avoid using 2.2 for now, given that Spark Streaming only supports 0.8.x.

Compute-classpath file:

function printLatest() {
  ls -1 /opt/cloudera/parcels/CDH/jars/$1 2>/dev/null | grep -vE "tests.jar$" | tail -1
}

# For Spark-based batch and speed layer, the only thing that needs to be supplied, really,
# are the Kafka libraries that the cluster uses. The Spark Examples jar happens to ship these
# and is maybe both easier to find and more harmonized than a stand-alone Kafka distro on
# the cluster, but this is a hacky way to acquire it
printLatest "spark-examples-*.jar"
printLatest "scala-library*"
printLatest "kafka_*" --> issue occurs with or without
printLatest "kafka-*" --> issue occurs with or without

# The remaining dependencies support the Serving Layer, which needs Hadoop, Kafka,
# and ZK dependencies
printLatest "spark-assembly-*.jar"
printLatest "zookeeper-*.jar"
printLatest "hadoop-auth-*.jar"
printLatest "hadoop-common-*.jar"
printLatest "hadoop-hdfs-2*.jar"
printLatest "htrace-core4-*.jar"
printLatest "commons-cli-1*.jar"
printLatest "commons-collections-*.jar"
printLatest "commons-configuration-*.jar"
printLatest "protobuf-java-2.5*.jar"
printLatest "snappy-java-*.jar"

# These are needed for submitting the serving layer in YARN mode
printLatest "hadoop-yarn-applications-distributedshell-*.jar"
srowen commented 8 years ago

OK, you're not using parcels to install Kafka? That could be OK, but you really shouldn't copy jars into the CDH parcels directory for any reason.

The thing is, strangely, spark-examples has all the Kafka classes too (since it contains Kafka examples). So the Kafka parcel doesn't matter much. You also don't need to supply Scala libraries; those are in the Spark assembly which spark-submit implicitly uses. So I am wondering if something else is wrong here. You should need no modifications at all.

I think problems are somehow coming up from these modifications. Did you rebuild or deploy anything else manually? I still don't see an obvious way that an incompatible Kafka is getting involved.

Any chance you can start over without these changes from a fresh install?

msumner91 commented 8 years ago

We will try with a completely fresh CDH container (still copying over jars instead of installing via parcels), but without the Scala/Kafka jars and including spark-assembly (instead of scala-library). Hence there should be no modifications to the standard release other than the als-example config, and no modifications to CDH 5.7 except for copying the jars.

No custom rebuild or redeployment here: the Oryx install was a straight wget from the 2.1.2 releases page; we only modified the config and installed Kafka.

CDH was only modified insofar as copying jars to the parcels directory so compute-classpath would pick them up, and adding Scala to the .sh file (which we can now replace with spark-assembly).

Will try this again from a clean CDH image and come back with any new developments - thanks for the quick responses it is appreciated!

msumner91 commented 8 years ago

Update: just tested this on a fresh Cloudera quickstart container and unfortunately hit the same issue. Are you suggesting that one of the copied jars also contains a newer version of Kafka, which could be conflicting? I can check each of them with jar tf if there is a specific item I should search for...

Manual changes were as follows:

Compute-classpath now gives this output (jars whose names lacked a '-' were renamed to use '-x' so they are included by compute-classpath without editing the .sh script):

/opt/cloudera/parcels/CDH/jars/spark-examples-x.jar
/opt/cloudera/parcels/CDH/jars/spark-assembly-x.jar
/opt/cloudera/parcels/CDH/jars/zookeeper-copy.jar
/opt/cloudera/parcels/CDH/jars/hadoop-auth-x.jar
/opt/cloudera/parcels/CDH/jars/hadoop-common-x.jar
/opt/cloudera/parcels/CDH/jars/htrace-core4-4.0.1-incubating.jar
/opt/cloudera/parcels/CDH/jars/commons-cli-1.2.jar
/opt/cloudera/parcels/CDH/jars/commons-collections-3.2.2.jar
/opt/cloudera/parcels/CDH/jars/commons-configuration-1.6.jar
/opt/cloudera/parcels/CDH/jars/protobuf-java-2.5.0.jar
/opt/cloudera/parcels/CDH/jars/snappy-java-1.0.4.1.jar

Results from grepping the spark-examples jar for the AdminUtils class that contains the missing method:

[root@quickstart oryx]# jar tf /opt/cloudera/parcels/CDH/jars/spark-examples-x.jar | grep -i kafka | grep -i adminutils
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$5.class
kafka/admin/AdminUtils$$anonfun$createOrUpdateTopicPartitionAssignmentPathInZK$2.class
kafka/admin/AdminUtils$$anonfun$addPartitions$1.class
kafka/admin/AdminUtils$$anonfun$3.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$1.class
kafka/admin/AdminUtils$$anonfun$9.class
kafka/admin/AdminUtils$$anonfun$4.class
kafka/admin/AdminUtils$$anonfun$fetchAllEntityConfigs$1.class
kafka/admin/AdminUtils$.class
kafka/admin/AdminUtils$$anonfun$7.class
kafka/admin/AdminUtils$$anonfun$8.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$7.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$4.class
kafka/admin/AdminUtils$$anonfun$getManualReplicaAssignment$1$$anonfun$6.class
kafka/admin/AdminUtils$$anonfun$assignReplicasToBrokers$1$$anonfun$apply$mcVI$sp$1.class
kafka/admin/AdminUtils$$anonfun$10.class
kafka/admin/AdminUtils$$anonfun$11.class
kafka/admin/AdminUtils$$anonfun$createOrUpdateTopicPartitionAssignmentPathInZK$1.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$2$$anonfun$apply$mcZI$sp$2.class
kafka/admin/AdminUtils$$anonfun$fetchTopicMetadataFromZk$1.class
kafka/admin/AdminUtils$$anonfun$5.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$8.class
kafka/admin/AdminUtils$$anonfun$fetchAllTopicConfigs$1.class
kafka/admin/AdminUtils$$anonfun$createOrUpdateTopicPartitionAssignmentPathInZK$3$$anonfun$apply$3.class
kafka/admin/AdminUtils$$anonfun$2.class
kafka/admin/AdminUtils$$anonfun$writeTopicPartitionAssignment$3.class
kafka/admin/AdminUtils$$anonfun$1.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$2.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$6.class
kafka/admin/AdminUtils$$anonfun$writeTopicPartitionAssignment$2.class
kafka/admin/AdminUtils.class
kafka/admin/AdminUtils$$anonfun$deleteAllConsumerGroupInfoForTopicInZK$1.class
kafka/admin/AdminUtils$$anonfun$getManualReplicaAssignment$1.class
kafka/admin/AdminUtils$$anonfun$assignReplicasToBrokers$1.class
kafka/admin/AdminUtils$$anonfun$10$$anonfun$apply$1$$anonfun$apply$mcZI$sp$1.class
kafka/admin/AdminUtils$$anonfun$kafka$admin$AdminUtils$$getBrokerInfoFromCache$1.class
kafka/admin/AdminUtils$$anonfun$kafka$admin$AdminUtils$$getBrokerInfoFromCache$2.class
kafka/admin/AdminUtils$$anonfun$fetchEntityConfig$1.class
kafka/admin/AdminUtils$$anonfun$writeTopicPartitionAssignment$1.class
kafka/admin/AdminUtils$$anonfun$createOrUpdateTopicPartitionAssignmentPathInZK$3.class
srowen commented 8 years ago

Let me try it on my cluster, which I just updated to 5.7, as a double-check. You shouldn't have to change any JAR file names; the script should find them with its regex. Is that not happening? I'm still not sure what you mean about copying files from /usr/lib. Nothing should need to be modified anywhere in the install.

msumner91 commented 8 years ago

Thanks for the help - hopefully this should shed some light on the situation.

Files were copied from /usr/lib into the parcels directory because they are already installed as jars in the CDH quick start image and should be able to be copied into the parcels directory for the script to pick up without needing to install via a parcel.

Example: cp /usr/lib/spark/lib/spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar /opt/cloudera/parcels/CDH/jars/

The renaming was only needed because the regex assumes a final '-' in the filenames, and I was copying a version of the jar without a '-', so this can be ignored. I have just attempted running the serving layer with jars that do match the compute-classpath regex, in case there was a difference in the jars (e.g. copying spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar instead of /usr/lib/spark/lib/spark-examples.jar into the parcel dir), but this has not stopped the exception, unfortunately.
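For what it's worth, the trailing-dash behaviour is easy to demonstrate in isolation: a glob like spark-examples-*.jar simply never matches a file named spark-examples.jar.

```shell
# Demonstrate why the script's glob skips a jar named without the trailing dash.
demo_dir=$(mktemp -d)
touch "$demo_dir/spark-examples.jar" "$demo_dir/spark-examples-1.6.0.jar"
ls -1 "$demo_dir"/spark-examples-*.jar   # matches only spark-examples-1.6.0.jar
```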

srowen commented 8 years ago

Oh, you're using the quickstart VM, right. That layout could be different. OK, well that much should be OK; you can also modify where the script looks for jars. OK about the final dash.

msumner91 commented 8 years ago

Yes - I should have mentioned we are using the quickstart VM docker container for this (the idea being we can have it just start up and work with Oryx in a single command).

By default this has the jars needed, but in /usr/lib and not in a parcels directory.

Please let us know when you have had a chance to try the same, and whether this is a genuine issue or something on our side causing the Kafka mismatch. In the meantime, I'll give the parcels approach a try :)

srowen commented 8 years ago

I can reproduce this out of the box, yes. That's surprising since local tests seem to compile fine vs CDH 5.7 artifacts. I think I know why that might not be the same as working on the cluster.

Even I am somewhat confused about compatibility here. CDH offers Kafka 0.8 and 0.9 parcels. Spark Streaming upstream still builds only with 0.8. However, digging into the CDH 5.7 POMs, it looks like CDH builds Spark Streaming against Kafka 0.9. I know things like "accessing secure Kafka" (which was added in 0.9) are marked as a known issue and not supported. What this may mean is that it's actually fine to build Spark against 0.9, even if that doesn't mean it can fully use Kafka 0.9. I'm asking internally to confirm.

Whatever happens to work here, it may become problematic to write an app that uses Kafka 0.8 APIs directly while Spark Streaming underneath uses 0.9 APIs, even if both are somehow wire-compatible with 0.8 and 0.9. If so, the solution would be to forge ahead with a 2.2 release to work with CDH 5.7+, since then I will have learned that Spark + Kafka 0.9 actually does work as intended (modulo security) in 5.7.

Otherwise, I will look into whether it happens to work to read Kafka classes from the proper parcel directory, rather than get it from the spark-examples JAR.

srowen commented 8 years ago

Hm. Option 2 doesn't quite work. I have a change to use the KAFKA parcel libs, which works fine for the serving layer (and is more logical, really). But the same thing won't work for the Spark Streaming jobs, since they ultimately pull in Kafka 0.9 libs on the client side, locally. Let me try to confirm that's really true. If so, it means we may need to require 0.9 for CDH 5.7+.

msumner91 commented 8 years ago

What you say makes a lot of sense...

I think what you are saying is that the workaround here (prior to Oryx 2.2.0) is to use CDH 5.5 (which presumably builds Spark Streaming with Kafka 0.8 instead of 0.9).

Then in Oryx releases after 2.1.2, require Kafka 0.9 for CDH 5.7+, since CDH builds Spark Streaming with 0.9.

srowen commented 8 years ago

5.6 even. Yeah that should be fine. But of course, it needs to work with 5.7, and I wasn't aware of this potential difference (which I'll have to track down to confirm). 2.2+ would require 0.9 if I'm right. Are you able to use Java 8?

msumner91 commented 8 years ago

Yes we can use JDK 1.8 if needed, but it isn't bundled with CDH 5.7 by default... If there is anything else we can do to help on this one, please don't hesitate to ask.

srowen commented 8 years ago

Hm, I use JDK 8 on my CDH 5.7 cluster... I know it works and is easily installed, but I forget whether it comes pre-installed or as an option. In any event, I think the right thing is to go ahead and require Java 8 for 2.2+ releases.

srowen commented 8 years ago

Yep, so that's the answer, that I didn't catch: CDH 5.7 actually requires Kafka 0.9, but thankfully does fully support it. http://www.cloudera.com/documentation/kafka/latest/topics/kafka_known_issues.html

I will verify that the build from master works on my cluster as a sense check (it already passes integration tests, of course). It will require Java 8. However, this could be the easiest way forward for you. Naturally, this means the 2.2 release should go ahead now.

srowen commented 8 years ago

I verified that the 2.2.0-SNAPSHOT builds work as expected. Note that you need Java 8 installed; you will likely need update-alternatives to make sure that's what java runs, to set JAVA_HOME, and also to configure CDH to use Java 8 (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_java_home_location.html). Not trivial, but that's really all there is to it, and some of this will even be done automatically in some installation paths.
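Concretely, the Java 8 switch looks roughly like the following; package names and JDK paths vary per distro and JDK build, so treat every path below as illustrative rather than exact:

```shell
# Illustrative only: exact package names and JDK install paths differ per distro.
sudo yum -y install java-1.8.0-openjdk-devel       # or install an Oracle JDK 8 rpm
sudo update-alternatives --config java             # interactively select the Java 8 binary
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk   # point JAVA_HOME at the JDK 8 home
# Finally, configure CDH itself to use Java 8, e.g. via the Java Home Directory
# setting in Cloudera Manager, per the Cloudera doc linked above.
```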

msumner91 commented 8 years ago

Hi Sean,

That may be the case - I had already run a build of 2.2.0-SNAPSHOT with Java 8 (and update-alternatives/setting the JAVA_HOME env) to try to get around the original Kafka issue.

As I mentioned in the initial question, this resulted in a second exception unrelated to the missing method exception.

Are you saying you have tested that the serving layer functions as expected on the latest build (i.e. it can ingest data properly)? That is not the experience I encountered when testing it.

srowen commented 8 years ago

Yeah, I didn't see that exception; producing to the queue is fine. I think it's somehow because you're using Kafka 0.8. The message I got is that CDH 5.7 as a whole requires Kafka 0.9.