Hydrospheredata / mist

Serverless proxy for Spark cluster
http://hydrosphere.io/mist/
Apache License 2.0

ContextFrontend: Ask worker connection for context failed #543

Closed. rajexp closed this issue 5 years ago.

rajexp commented 5 years ago

Jobs complete successfully on most occasions, but recently the Mist server failed jobs with the error "executor was terminated". After that, for a certain duration, the Mist server returned a 500 error for other jobs. The recorded logs are provided below:

    2019-03-27 02:24:36 WARN  ReliableDeliverySupervisor:131 - Association with remote system [akka.tcp://mist-info-provider@127.0.0.1:38177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://mist-info-provider@127.0.0.1:38177]] Caused by: [Connection refused: /127.0.0.1:38177]
    2019-03-27 02:24:37 WARN  RemoteWatcher:131 - Detected unreachable: [akka.tcp://mist-worker-Big-Query-3-v1_66b0b2a4-624b-4d52-b947-de1445870c80-pool-1@x.x.x.x:46424]  
    2019-03-27 02:24:37 WARN  RemoteWatcher:131 - Detected unreachable: [akka.tcp://mist-worker-Big-Query-1-v1_46079ac9-5e6b-44c0-a736-4eca7735d41d-pool-1@y.y.y.y:40868]  
    2019-03-27 02:24:37 WARN  RemoteWatcher:131 - Detected unreachable: [akka.tcp://mist-info-provider@127.0.0.1:38177]
    2019-03-27 02:24:37 INFO  JobActor:107 - Job fa70b0a9-617b-4de1-b71a-8dcef2f25f55 completed with error  
    2019-03-27 02:24:37 INFO  JobActor:107 - Job 6cd0a160-05ae-4ad5-bc4b-6abb0bac063d completed with error  
    2019-03-27 02:24:37 INFO  SharedConnector:107 - Releasing connection: requested 0, pooled 0, in use 1, starting: 0  
    2019-03-27 02:24:37 INFO  SharedConnector:107 - Releasing connection: requested 0, pooled 0, in use 0, starting: 0  
    2019-03-27 02:24:37 INFO  SharedConnector:107 - Released unused connection  
    2019-03-27 02:24:37 INFO  ContextFrontend:107 - Context Context-1 - move to inactive state  
    2019-03-27 02:24:37 INFO  ContextFrontend:107 - Context Context-3 - move to inactive state  
    2019-03-27 02:24:37 ERROR RestartSupervisor:143 - Reference for FunctionInfoProvider was terminated. Restarting

I am also continuously getting the following error in the Mist logs:

    2019-03-27 04:00:01 INFO  ContextFrontend:107 - Context-1 - connected state(active connections: 0, max: 1)
    2019-03-27 04:00:09 ERROR SharedConnector:159 - Could not start worker connection
    java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 1 and out: Ivy Default Cache set to: /home/cassandra/.ivy2/cache;The jars for the packages stored in: /home/cassandra/.ivy2/jars;:: loading settings :: url = jar:file:/cassandra/spark2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml;org.apache.hadoop#hadoop-aws added as a dependency;org.apache.hadoop#hadoop-client added as a dependency;com.typesafe#config added as a dependency;:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0;   confs: [default];   found org.apache.hadoop#hadoop-aws;2.7.4 in spark-list; found org.apache.hadoop#hadoop-common;2.7.4 in spark-list;  found org.apache.hadoop#hadoop-annotations;2.7.4 in spark-list; found com.google.guava#guava;11.0.2 in spark-list;  found com.google.code.findbugs#jsr305;3.0.0 in spark-list;  found commons-cli#commons-cli;1.2 in spark-list;    found org.apache.commons#commons-math3;3.1.1 in spark-list; found xmlenc#xmlenc;0.52 in spark-list; found commons-httpclient#commons-httpclient;3.1 in spark-list;  found commons-logging#commons-logging;1.1.3 in spark-list;  found commons-codec#commons-codec;1.4 in spark-list;    found commons-io#commons-io;2.4 in spark-list;  found commons-net#commons-net;3.1 in spark-list;    found commons-collections#commons-collections;3.2.2 in spark-list;  found javax.servlet#servlet-api;2.5 in spark-list;  found org.mortbay.jetty#jetty;6.1.26 in spark-list; found org.mortbay.jetty#jetty-util;6.1.26 in spark-list
        at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
        at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    2019-03-27 04:00:09 ERROR ContextFrontend:159 - Ask new worker connection for Context-2 failed
    java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 1 and out: Ivy Default Cache set to: /home/cassandra/.ivy2/cache;The jars for the packages stored in: /home/cassandra/.ivy2/jars;:: loading settings :: url = jar:file:/cassandra/spark2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml;org.apache.hadoop#hadoop-aws added as a dependency;org.apache.hadoop#hadoop-client added as a dependency;com.typesafe#config added as a dependency;:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0;   confs: [default];   found org.apache.hadoop#hadoop-aws;2.7.4 in spark-list; found org.apache.hadoop#hadoop-common;2.7.4 in spark-list;  found org.apache.hadoop#hadoop-annotations;2.7.4 in spark-list; found com.google.guava#guava;11.0.2 in spark-list;  found com.google.code.findbugs#jsr305;3.0.0 in spark-list;  found commons-cli#commons-cli;1.2 in spark-list;    found org.apache.commons#commons-math3;3.1.1 in spark-list; found xmlenc#xmlenc;0.52 in spark-list; found commons-httpclient#commons-httpclient;3.1 in spark-list;  found commons-logging#commons-logging;1.1.3 in spark-list;  found commons-codec#commons-codec;1.4 in spark-list;    found commons-io#commons-io;2.4 in spark-list;  found commons-net#commons-net;3.1 in spark-list;    found commons-collections#commons-collections;3.2.2 in spark-list;  found javax.servlet#servlet-api;2.5 in spark-list;  found org.mortbay.jetty#jetty;6.1.26 in spark-list; found org.mortbay.jetty#jetty-util;6.1.26 in spark-list
        at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
        at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    2019-03-27 04:00:09 INFO  ContextFrontend:107 - Context-2 - connected state(active connections: 0, max: 1)
    2019-03-27 04:00:09 INFO  SharedConnector:107 - Pool is empty and we are able to start new one connection: inUse size :0

What is the possible cause? Is it related to some configuration issue? If it is, then why is it not happening for all jobs?

dos65 commented 5 years ago

Probably there are some errors in the context configuration. Could you check the additional process logs in the logs directory? There should be log files with names like local-worker-$context-name.
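In case it helps anyone locating those files: below is a minimal sketch (a hypothetical helper, not part of Mist) for printing the tail of the newest worker log for a given context. It assumes the layout described above, i.e. files under a `logs` directory named like `local-worker-<context>...`; adjust the directory and glob pattern for your installation.

```python
import glob
import os


def tail_worker_log(logs_dir: str, context: str, lines: int = 50) -> str:
    """Return the last `lines` lines of the newest worker log for `context`.

    Assumes files named like local-worker-<context>* under `logs_dir`,
    per the maintainer's comment; the naming is an assumption, so adjust
    the pattern if your log files look different.
    """
    pattern = os.path.join(logs_dir, f"local-worker-{context}*")
    candidates = glob.glob(pattern)
    if not candidates:
        return f"no worker log found for context {context!r} in {logs_dir}"
    # Pick the most recently modified file, in case restarts left several.
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, "r", errors="replace") as f:
        return "".join(f.readlines()[-lines:])


# Example: print(tail_worker_log("logs", "Context-2"))
```

The worker log should contain the full spark-submit output, so the actual reason for the "Process exited with status code 1" error (for example, a failed dependency resolution) is usually visible there rather than in the master log.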