Hydrospheredata / mist

Serverless proxy for Spark cluster
http://hydrosphere.io/mist/
Apache License 2.0

Failed to execute job because of missing mist_worker.jar #528

Closed: zero88 closed this issue 5 years ago

zero88 commented 5 years ago

Context

downtime="3600s"
max-conn-failures=5
max-parallel-jobs=1
precreated=false
run-options=""
spark-conf {
    "spark.master"="spark://spark-master:7077"
    "spark.submit.deployMode"="cluster"
    "spark.dynamicAllocation.enabled"="true"
    "spark.shuffle.service.enabled"="true"
}
streaming-duration="1s"

Log

mist_1           | 2018-11-09 10:59:42,857 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
spark-master_1   | 2018-11-09 10:59:42,937 INFO  org.apache.spark.deploy.master.Master Registering worker 29c1c06e51e3:9099 with 8 cores, 13.7 GB RAM
spark-worker_1   | 2018-11-09 10:59:42,966 INFO  org.apache.spark.deploy.worker.Worker Successfully registered with master spark://spark-master:7077
mist_1           | 2018-11-09 10:59:43,184 INFO  akka.remote.Remoting Starting remoting
mist_1           | 2018-11-09 10:59:43,412 INFO  akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://mist@172.20.0.5:2551]
mist_1           | 2018-11-09 10:59:43,521 INFO  org.flywaydb.core.internal.util.VersionPrinter Flyway 4.1.1 by Boxfuse
mist_1           | 2018-11-09 10:59:43,826 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory Database: jdbc:h2:file:/opt/mist/data/recovery.db (H2 1.4)
mist_1           | 2018-11-09 10:59:44,014 INFO  org.flywaydb.core.internal.command.DbValidate Successfully validated 2 migrations (execution time 00:00.018s)
mist_1           | 2018-11-09 10:59:44,027 INFO  org.flywaydb.core.internal.command.DbMigrate Current version of schema "PUBLIC": 2
mist_1           | 2018-11-09 10:59:44,027 INFO  org.flywaydb.core.internal.command.DbMigrate Schema "PUBLIC" is up to date. No migration necessary.
mist_1           | 2018-11-09 10:59:44,540 INFO  io.hydrosphere.mist.master.MasterServer$ LogsSystem started
mist_1           | 2018-11-09 10:59:46,042 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mist_1           | 2018-11-09 10:59:46,995 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
mist_1           | 2018-11-09 10:59:47,264 INFO  akka.remote.Remoting Starting remoting
mist_1           | 2018-11-09 10:59:47,601 INFO  akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://mist-info-provider@127.0.0.1:40605]
mist_1           | 2018-11-09 10:59:48,197 INFO  io.hydrosphere.mist.master.MasterServer$ FunctionInfoProvider started
mist_1           | 2018-11-09 10:59:48,646 INFO  io.hydrosphere.mist.master.MasterServer$ Main service started
mist_1           | 2018-11-09 10:59:49,686 INFO  io.hydrosphere.mist.master.MasterServer$ Http interface started
mist_1           | 2018-11-09 10:59:49,692 INFO  io.hydrosphere.mist.master.Master$ Mist master started
mist_1           | 2018-11-09 11:00:04,797 INFO  io.hydrosphere.mist.master.execution.ContextFrontend Starting executor k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe for k8s-master
mist_1           | 2018-11-09 11:00:04,833 INFO  io.hydrosphere.mist.master.execution.ContextFrontend Context k8s-master - connected state(active connections: 0, max: 1)
mist_1           | 2018-11-09 11:00:04,845 INFO  io.hydrosphere.mist.master.execution.workers.starter.LocalSparkSubmit Try submit local worker k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1, cmd: /opt/spark/bin/spark-submit --conf spark.eventLog.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.submit.deployMode=cluster --conf spark.master=spark://spark-master:7077 --conf spark.eventLog.dir=/data/spark/events --conf spark.dynamicAllocation.enabled=true --conf spark.eventLog.compress=true --class io.hydrosphere.mist.worker.Worker /opt/mist/mist-worker.jar --master 172.20.0.5:2551 --name k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1
spark-master_1   | 2018-11-09 11:00:07,315 INFO  org.apache.spark.deploy.master.Master Driver submitted org.apache.spark.deploy.worker.DriverWrapper
spark-master_1   | 2018-11-09 11:00:07,318 INFO  org.apache.spark.deploy.master.Master Launching driver driver-20181109110007-0000 on worker worker-20181109105941-29c1c06e51e3-9099
spark-worker_1   | 2018-11-09 11:00:07,355 INFO  org.apache.spark.deploy.worker.Worker Asked to launch driver driver-20181109110007-0000
spark-worker_1   | 2018-11-09 11:00:07,367 INFO  org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 11:00:07,390 INFO  org.apache.spark.util.Utils Copying /opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 11:00:07,400 INFO  org.apache.spark.deploy.worker.DriverRunner Killing driver process!
spark-worker_1   | 2018-11-09 11:00:07,404 WARN  org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
spark-master_1   | 2018-11-09 11:00:07,460 INFO  org.apache.spark.deploy.master.Master Removing driver: driver-20181109110007-0000
spark-master_1   | 2018-11-09 11:00:12,769 INFO  org.apache.spark.deploy.master.Master 172.20.0.5:40290 got disassociated, removing it.
spark-master_1   | 2018-11-09 11:00:12,770 INFO  org.apache.spark.deploy.master.Master 172.20.0.5:42207 got disassociated, removing it.
mist_1           | 2018-11-09 11:00:12,897 ERROR io.hydrosphere.mist.master.execution.workers.ExclusiveConnector Could not start worker connection
mist_1           | java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:00:06,479 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:00:12,424 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86);  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102);   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107);   at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526); at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253); at java.nio.file.Files.copy(Files.java:1274);   at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664);    at org.apache.spark.util.Utils$.copyFile(Utils.scala:635);  at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719);   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509); at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155); at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173); at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
mist_1           |  at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1           |  at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1           |  at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
mist_1           |  at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
mist_1           |  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
mist_1           |  at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
mist_1           |  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
mist_1           |  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
mist_1           |  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
mist_1           |  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Job log

INFO 2018-11-09T11:58:22.53 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] Waiting worker connection
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] InitializedEvent(externalId=None)
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] QueuedEvent
ERROR 2018-11-09T11:59:02.636 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] FailedEvent with Error: 
 java.lang.RuntimeException: Context is broken
    at io.hydrosphere.mist.master.execution.JobActor$$anonfun$io$hydrosphere$mist$master$execution$JobActor$$initial$1.applyOrElse(JobActor.scala:59)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at io.hydrosphere.mist.master.execution.JobActor.akka$actor$Timers$$super$aroundReceive(JobActor.scala:24)
    at akka.actor.Timers$class.aroundReceive(Timers.scala:44)
    at io.hydrosphere.mist.master.execution.JobActor.aroundReceive(JobActor.scala:24)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:58:56,046 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:59:01,870 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86);  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102);   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107);   at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526); at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253); at java.nio.file.Files.copy(Files.java:1274);   at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664);    at org.apache.spark.util.Utils$.copyFile(Utils.scala:635);  at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719);   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509); at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155); at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173);at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
    at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
    at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
    at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
    at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Local worker log

2018-11-09 11:58:24,154 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-11-09 11:58:30,109 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
    at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
    at java.nio.file.Files.copy(Files.java:1274)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664)
    at org.apache.spark.util.Utils$.copyFile(Utils.scala:635)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509)
    at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)
    at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)

Suspicion: the path to mist-worker.jar seems to be hard-coded from $MIST_HOME, so the Spark worker looks for the jar at the Mist host's local path (see the excerpt and the workaround sketch below):

org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
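
For context: with spark.submit.deployMode=cluster on a standalone master, the driver is launched on one of the Spark worker nodes, and that node's DriverRunner tries to read the application jar from the submitted file: path locally, which is why /opt/mist/mist-worker.jar must also exist on the worker. A minimal workaround sketch, assuming a docker-compose setup with the container names from the logs above (hypothetical; adjust to your actual names), is to copy the jar to the same path on every Spark worker before submitting:

# Hypothetical container names taken from the compose service prefixes above.
# Extract mist-worker.jar from the Mist container ...
docker cp mist_1:/opt/mist/mist-worker.jar ./mist-worker.jar
# ... then recreate the same path on each Spark worker so the file: URL
# resolves on the node that actually launches the driver.
docker exec spark-worker_1 mkdir -p /opt/mist
docker cp ./mist-worker.jar spark-worker_1:/opt/mist/mist-worker.jar

This mirrors the manual copy described later in the thread; the actual fix landed in #529.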
dos65 commented 5 years ago

Thanks for reporting! Could you check if there is any additional information in the logs directory? There should be a log file whose name starts with local-worker.

zero88 commented 5 years ago

I added more details to the description. By the way, which log4j.properties does the worker use? I would like to redirect all output to the console log.

dos65 commented 5 years ago

Could you also check whether mist-worker.jar exists in the $MIST_HOME directory on the node where Mist was run?

I've just tested your setup on my machine and it works.

Actually, the logs in the local-worker file are the stdout of the spark-submit process that is used to submit mist-worker.

zero88 commented 5 years ago

Below is the command that runs when the worker node does have mist-worker.jar. It is tricky because I had to copy the jar manually from the Mist container to the worker.

mist_1           | 2018-11-09 13:26:42,772 INFO  io.hydrosphere.mist.master.execution.workers.starter.LocalSparkSubmit Try submit local worker k8s-master_e0642e8b-a3af-41ee-bc1b-faafb363341e_1, cmd: /opt/spark/bin/spark-submit --conf spark.eventLog.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.submit.deployMode=cluster --conf spark.master=spark://spark-master:7077 --conf spark.eventLog.dir=/data/spark/events --conf spark.dynamicAllocation.enabled=true --conf spark.eventLog.compress=true --class io.hydrosphere.mist.worker.Worker /opt/mist/mist-worker.jar --master 172.22.0.4:2551 --name k8s-master_e0642e8b-a3af-41ee-bc1b-faafb363341e_1
spark-master_1   | 2018-11-09 13:26:45,455 INFO  org.apache.spark.deploy.master.Master Driver submitted org.apache.spark.deploy.worker.DriverWrapper
spark-master_1   | 2018-11-09 13:26:45,458 INFO  org.apache.spark.deploy.master.Master Launching driver driver-20181109132645-0000 on worker worker-20181109120654-c6e613739ff2-9099
spark-worker_1   | 2018-11-09 13:26:45,509 INFO  org.apache.spark.deploy.worker.Worker Asked to launch driver driver-20181109132645-0000
spark-worker_1   | 2018-11-09 13:26:45,529 INFO  org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109132645-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 13:26:45,570 INFO  org.apache.spark.util.Utils Copying /opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109132645-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 13:26:45,813 INFO  org.apache.spark.deploy.worker.DriverRunner Launch Command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.app.name=io.hydrosphere.mist.worker.Worker" "-Dspark.eventLog.enabled=true" "-Dspark.driver.supervise=false" "-Dspark.shuffle.service.enabled=true" "-Dspark.submit.deployMode=cluster" "-Dspark.jars=file:/opt/mist/mist-worker.jar" "-Dspark.master=spark://spark-master:7077" "-Dspark.eventLog.dir=/data/spark/events" "-Dspark.eventLog.compress=true" "-Dspark.dynamicAllocation.enabled=true" "-Dspark.rpc.askTimeout=10s" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@c6e613739ff2:9099" "/opt/spark/work/driver-20181109132645-0000/mist-worker.jar" "io.hydrosphere.mist.worker.Worker" "--master" "172.22.0.4:2551" "--name" "k8s-master_e0642e8b-a3af-41ee-bc1b-faafb363341e_1"

Just a reminder: in my failing case the worker has nothing related to Mist, just plain Spark and no mist-worker.jar. Does Mist support this case? The strange thing is that the worker uses two jars with different locations but the same content: -Dspark.jars=file:/opt/mist/mist-worker.jar and /opt/spark/work/driver-20181109132645-0000/mist-worker.jar

zero88 commented 5 years ago

Actually, the logs in the local-worker file are the stdout of the spark-submit process that is used to submit mist-worker.

We have two logs: one is job-[which id?] and the other is local-worker-[which id?], and these ids don't match each other. Is that correct?

Back to my question: if it comes from spark-submit and I didn't set any configuration, it should go to the console, as that is Spark's default behavior. Is there any configuration to redirect the log to the console instead of a file?
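
For reference, a minimal log4j.properties sketch that would send the worker's Spark output to the console instead of a file, assuming the log4j 1.x setup bundled with Spark 2.x; the file name and the way it is passed to the worker JVM are assumptions here, not a documented Mist setting:

# Hypothetical file, e.g. /opt/mist/log4j-console.properties; it could be passed
# through spark-submit, for example via
#   --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/mist/log4j-console.properties"
# (in cluster deploy mode the file must exist on the node that runs the driver).
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n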

dos65 commented 5 years ago

Finally, I found out what's going on. I had expected that on a standalone cluster, even in cluster deploy mode, it would be enough to provide a local path to mist-worker.jar for spark-submit. This problem is fixed in #529.

About logs: the job-$id and local-worker-$id files are created by Mist. local-worker-$id is the stdout of the process that Mist spawns when it starts a new worker, and job-$id contains the logs that Mist collects from the Spark driver application during job invocation. Both are used for debugging purposes. You can find more information here: glossary and invocation.

dos65 commented 5 years ago

I've just merged #529 into master. @zero-88, it would be great if you could build Mist from the master branch and report back whether it solves your problem. Feel free to reopen this issue if it doesn't help.

zero88 commented 5 years ago

Thanks @dos65, I will try it today and let you know. Do you intend to release a hotfix version?

dos65 commented 5 years ago

I will try it today and let you know. Do you intend to release a hotfix version?

Great! If it works, I'll publish 1.1.1

zero88 commented 5 years ago

Awesome, it works like a charm. Please make a new release version when you have time. Thanks!

dos65 commented 5 years ago

@zero-88 Done!