apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Integration test failing due to Resource Staging Server #389

Closed ifilonenko closed 7 years ago

ifilonenko commented 7 years ago

This seems to be a recurring problem that causes integration test failures:

- Run PySpark Job on file from SUBMITTER with --py-files *** FAILED ***
  java.util.concurrent.TimeoutException: Timeout waiting for task.
  at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269)
  at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
  at org.apache.spark.deploy.kubernetes.integrationtest.SparkReadinessWatcher.waitUntilReady(SparkReadinessWatcher.scala:40)
  at org.apache.spark.deploy.kubernetes.integrationtest.ResourceStagingServerLauncher$$anonfun$launchStagingServer$7$$anonfun$apply$3.apply(ResourceStagingServerLauncher.scala:182)
  at org.apache.spark.deploy.kubernetes.integrationtest.ResourceStagingServerLauncher$$anonfun$launchStagingServer$7$$anonfun$apply$3.apply(ResourceStagingServerLauncher.scala:180)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
  at org.apache.spark.deploy.kubernetes.integrationtest.ResourceStagingServerLauncher$$anonfun$launchStagingServer$7.apply(ResourceStagingServerLauncher.scala:180)
  at org.apache.spark.deploy.kubernetes.integrationtest.ResourceStagingServerLauncher$$anonfun$launchStagingServer$7.apply(ResourceStagingServerLauncher.scala:177)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
  at org.apache.spark.deploy.kubernetes.integrationtest.ResourceStagingServerLauncher.launchStagingServer(ResourceStagingServerLauncher.scala:177)
  ...

The problem seems to stem from the launcher's readiness watcher timing out while it waits for the Resource Staging Server to become ready.
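
To illustrate why a server that never comes up surfaces in the test as `TimeoutException: Timeout waiting for task`, here is a minimal sketch of a readiness wait. It is not the project's `SparkReadinessWatcher`: the class name `ReadinessWaitSketch` and its methods are hypothetical, and it uses the JDK's `CompletableFuture` as a stand-in for the Guava future seen in the stack trace.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a readiness watcher; not the project's actual class.
public class ReadinessWaitSketch {
    // Completed only when something reports that the watched pod is ready.
    private final CompletableFuture<Boolean> ready = new CompletableFuture<>();

    // Would be called from a watch callback once the pod reports ready.
    public void markReady() {
        ready.complete(true);
    }

    // Blocks until markReady() has been called, or throws TimeoutException.
    public void waitUntilReady(long timeout, TimeUnit unit)
            throws TimeoutException, InterruptedException, ExecutionException {
        ready.get(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        ReadinessWaitSketch watcher = new ReadinessWaitSketch();
        try {
            // Nothing ever calls markReady(), as when the server crashes on startup.
            watcher.waitUntilReady(100, TimeUnit.MILLISECONDS);
            System.out.println("ready");
        } catch (TimeoutException e) {
            System.out.println("timed out waiting for readiness");
        }
    }
}
```

Under this model the timeout is only a symptom: the waiter cannot tell a slow start from a crash, so the pod's own logs are the place to look for the cause.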

ifilonenko commented 7 years ago

2017-08-03T21:56:18.687258759Z Exception in thread "main" java.lang.NoSuchMethodError: javax.ws.rs.core.Application.getProperties()Ljava/util/Map;
2017-08-03T21:56:18.687456172Z  at org.glassfish.jersey.server.ApplicationHandler.<init>(ApplicationHandler.java:331)
2017-08-03T21:56:18.687466244Z  at org.glassfish.jersey.servlet.WebComponent.<init>(WebComponent.java:392)
2017-08-03T21:56:18.687469925Z  at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177)
2017-08-03T21:56:18.687629590Z  at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369)
2017-08-03T21:56:18.687643414Z  at javax.servlet.GenericServlet.init(GenericServlet.java:244)
2017-08-03T21:56:18.687784290Z  at org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:640)
2017-08-03T21:56:18.687803045Z  at org.spark_project.jetty.servlet.ServletHolder.initialize(ServletHolder.java:419)
2017-08-03T21:56:18.687948619Z  at org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:875)
2017-08-03T21:56:18.687958416Z  at org.spark_project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349)
2017-08-03T21:56:18.687961259Z  at org.spark_project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778)
2017-08-03T21:56:18.687963978Z  at org.spark_project.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262)
2017-08-03T21:56:18.687969652Z  at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
2017-08-03T21:56:18.687972525Z  at org.spark_project.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
2017-08-03T21:56:18.688093362Z  at org.spark_project.jetty.server.Server.start(Server.java:411)
2017-08-03T21:56:18.688104968Z  at org.spark_project.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:106)
2017-08-03T21:56:18.688110344Z  at org.spark_project.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
2017-08-03T21:56:18.688120492Z  at org.spark_project.jetty.server.Server.doStart(Server.java:378)
2017-08-03T21:56:18.688369226Z  at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
2017-08-03T21:56:18.688381044Z  at org.apache.spark.deploy.rest.kubernetes.ResourceStagingServer.start(ResourceStagingServer.scala:83)
2017-08-03T21:56:18.688385632Z  at org.apache.spark.deploy.rest.kubernetes.ResourceStagingServer$.main(ResourceStagingServer.scala:135)
2017-08-03T21:56:18.688447591Z  at 

erikerlandson commented 7 years ago

This error seems to be associated with mixing Jersey 1 and 2 somehow. Could we have picked up some version drift with the rebase to 2.2?

erikerlandson commented 7 years ago

glassfish may also be implicated

erikerlandson commented 7 years ago

Looks like spark has been on jersey 2.x for a long time. Possibly some transitive dep issue between jersey 2.x and whatever glassfish is using
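
One way to check for this kind of conflict is to ask the JVM which jar a class was actually loaded from and whether it has the method named in the `NoSuchMethodError`. `Application.getProperties()` was added in JAX-RS 2.0, so an older 1.x API jar winning on the classpath would produce exactly this error. The sketch below is a generic probe; the class name `ClasspathProbe` and its helpers are hypothetical and not part of Spark.

```java
import java.security.CodeSource;
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of Spark or Jersey.
public class ClasspathProbe {
    // Returns the jar or directory a class was loaded from, if the JVM exposes it.
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(bootstrap or unknown)"
                : src.getLocation().toString();
    }

    // True if the class exposes a public method with the given name.
    public static boolean hasPublicMethod(Class<?> cls, String name) {
        return Arrays.stream(cls.getMethods())
                .anyMatch(m -> m.getName().equals(name));
    }

    // Summarizes where a class came from and whether the method is present.
    public static String describe(String className, String methodName) {
        try {
            // Load without initializing, so static initializers do not run.
            Class<?> cls = Class.forName(
                    className, false, ClasspathProbe.class.getClassLoader());
            return className + " loaded from " + locationOf(cls)
                    + "; " + methodName + " present: "
                    + hasPublicMethod(cls, methodName);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " not on classpath";
        }
    }

    public static void main(String[] args) {
        // The class and method named in the NoSuchMethodError above.
        System.out.println(
                describe("javax.ws.rs.core.Application", "getProperties"));
    }
}
```

Run with the same classpath as the failing server, this would show whether `javax.ws.rs.core.Application` is coming from a JAX-RS 1.x jar rather than the 2.x API that Jersey 2 expects.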

erikerlandson commented 7 years ago

Looks like a commit on July 21 from @mccheah involved some changes to glassfish, which might align with the timing of the recent failures: b7fdc23ccc5967de5799d8cf6f14289e71f29a1e

LOL, never mind, that is an old commit; the timing is from the rebase.

erikerlandson commented 7 years ago

If the problem is the RSS crashing from a missing-method error, I'd expect any integration test using the RSS to fail 100% of the time. Is anybody seeing these succeed? @foxish @mccheah @kimoonkim @ifilonenko

The only recent integration tests I see succeeding are ones cherry-picked to 2.1

kimoonkim commented 7 years ago

RSS integration test runs of #412 never succeeded. We tried 3 or 4 times. All failed.

ash211 commented 7 years ago

Has an integration test ever passed on branch-2.2-kubernetes?

foxish commented 7 years ago

Most recently passed in https://github.com/apache-spark-on-k8s/spark/pull/407

ifilonenko commented 7 years ago

@erikerlandson I was able to figure out a fix. I will be pushing a PR.

    <dependency>
      <groupId>com.fasterxml.jackson.jaxrs</groupId>
      <artifactId>jackson-jaxrs-json-provider</artifactId>
      <exclusions>
        <exclusion>
          <groupId>javax.ws.rs</groupId>
          <artifactId>jsr311-api</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

erikerlandson commented 7 years ago

@ifilonenko did you identify what precipitated this? Was it #365?

erikerlandson commented 7 years ago

xref: #420

erikerlandson commented 7 years ago

@apache-spark-on-k8s/contributors I merged #420, which should allow integration tests to start passing again.