apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

using resource staging server - driver unable to load config map or secret #505

Closed luck02 closed 7 years ago

luck02 commented 7 years ago

I get this result when trying to initialize the driver:

MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506142014799-init-config" not found

"volumes": [
      {
        "name": "spark-init-properties",
        "configMap": {
          "name": "kmderived-1506142014799-init-config",
          "items": [
            {
              "key": "download-submitted-files",
              "path": "spark-init.properties"
            }
          ],
          "defaultMode": 420
        }
      }
    ]
⇒  kubectl describe configmap kmderived-1506142014799-init-config
Name:       kmderived-1506142014799-init-config
Namespace:  default
Labels:     <none>
Annotations:    <none>

Data
====
download-submitted-files:
----
#Java properties built from Kubernetes config map with name: kmderived-1506142014799-init-config and config map key: $configMapKey
#Fri Sep 22 21:48:39 PDT 2017
spark.kubernetes.initcontainer.downloadJarsSecretLocation=/mnt/secrets/spark-init/downloadSubmittedJarsSecret
spark.kubernetes.initcontainer.downloadFilesSecretLocation=/mnt/secrets/spark-init/downloadSubmittedFilesSecret
spark.kubernetes.initcontainer.downloadJarsResourceIdentifier=eed35aaa-427e-478d-bef3-5a1db65027fb
spark.kubernetes.initcontainer.downloadFilesResourceIdentifier=ca549af3-f58a-4a48-bd79-749cb959c5ea
spark.ssl.kubernetes.resourceStagingServer.enabled=false
spark.kubernetes.resourceStagingServer.uri=http\://asfdsafsdaf-23423423432.us-east-1.elb.amazonaws.com\:10000
spark.kubernetes.mountdependencies.filesDownloadDir=/var/spark-data/spark-files
spark.kubernetes.mountdependencies.jarsDownloadDir=/var/spark-data/spark-jars

Events: <none>

This config map does exist; the same is true of the secret.

Also, the spark-submit call hangs for quite a while:

2017-09-22 21:46:55 WARN  ConfigurationUtils:66 - Configuration with key spark.kubernetes.driver.labels is deprecated. Use configurations with prefix spark.kubernetes.driver.label.<key> instead.
2017-09-22 21:46:55 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-22 21:46:55 INFO  SecurityManager:54 - Changing view acls to: garylucas
2017-09-22 21:46:55 INFO  SecurityManager:54 - Changing modify acls to: garylucas
2017-09-22 21:46:55 INFO  SecurityManager:54 - Changing view acls groups to:
2017-09-22 21:46:55 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-09-22 21:46:55 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(garylucas); groups with view permissions: Set(); users  with modify permissions: Set(garylucas); groups with modify permissions: Set()
---- Output hangs here for at least 2-3 minutes
2017-09-22 21:48:41 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:

Note the nearly two minutes between the last SecurityManager line and the pod status update. Is that normal?

I've got about 13 MB of jars uploading, so that may be normal; however, it seems pretty slow:

      --packages org.apache.hadoop:hadoop-aws:2.7.3,joda-time:joda-time:2.9.3,mysql:mysql-connector-java:5.1.35,com.amazonaws:aws-java-sdk:1.7.4\

If it's just slow, that's fine, but I'm more wondering whether something is timing out on the back end.

luck02 commented 7 years ago

Is it possible it's a race condition where the pod gets created before the configmap and secret?

The creation timestamps:

- pod: "creationTimestamp": "2017-09-23T05:02:36Z"
- config map: "creationTimestamp": "2017-09-23T05:02:37Z"
- secret: "creationTimestamp": "2017-09-23T05:02:36Z"

I'm not sure what the initialization looks like for this, i.e. will the pod immediately error out, or will it retry?
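If it is a create-order race, one way to probe it from outside would be to poll until the ConfigMap is visible before expecting the mount to succeed. Here's a generic poll helper as an illustrative sketch only — `check` stands in for something like a `kubectl get configmap` call, and the stand-in below just simulates the ConfigMap appearing after a few polls:

```python
import time

def wait_for(check, timeout=60.0, interval=1.0):
    """Poll check() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Stand-in check: pretend the ConfigMap becomes visible on the third poll.
calls = {"n": 0}

def configmap_exists():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(configmap_exists, timeout=10.0, interval=0.01))  # True
```

In practice the kubelet does its own retries on volume mounts, so a transient "not found" right after pod creation would normally resolve itself the same way.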

luck02 commented 7 years ago

I'm just digging around in the spark-submit code, trying to figure out the interplay between the staging server and pod scheduling. Looks like everyone is busy trying to merge upstream (which is awesome!).

luck02 commented 7 years ago

Hrm. I tweaked my setup a bit.

I baked my jars into the images (which makes more sense anyway). Submission is now very fast, which is great, but I still get the same error:

The pod can't mount the config map or secret.

  ./infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/bin/spark-submit \
      --verbose \
      --deploy-mode cluster \
      --master k8s://http://127.0.0.1:8001 \
      --kubernetes-namespace default \
      --jars local:/opt/jars/hadoop-aws-2.7.3.jar,local:/opt/jars/aws-java-sdk-1.7.4.jar,local:/opt/jars/joda-time-2.9.3.jar,local:/opt/jars/mysql-connector-java-5.1.35.jar\
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account \
      --conf spark.kubernetes.docker.image.pullPolicy=Always \
      --conf spark.driver.memory=14G \
      --conf spark.driver.cores=8 \
      --conf spark.executor.instances=10 \
      --conf spark.executor.cores=6\
      --conf spark.executor.memory=20G \
      --conf spark.app.name=kmderived \
      --conf spark.kubernetes.driver.docker.image=$SPARK_DRIVER \
      --conf spark.kubernetes.executor.docker.image=$SPARK_EXECUTOR \
      --conf spark.kubernetes.initcontainer.docker.image=$SPARK_INIT \
      --conf spark.kubernetes.driver.labels=role=kissmetrics \
      --conf spark.kubernetes.executor.labels=role=kissmetrics \
      --conf spark.kubernetes.resourceStagingServer.uri=http://someaws-identifier.aws-region-1.elb.amazonaws.com:10000 \
      --py-files ./jobs.zip \
      ./driver.py

And results are:

MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506310930218-init-config" not found

Manually confirmed that the config map does exist. My working theory is some sort of race condition.

In addition, the secrets and config maps are not cleaned up, so they're left lying around after a failure.

I'm almost at the point of actually starting to debug the behaviour, but it's slow going, as I'm just not familiar enough with the Spark codebase or with Scala/Java development in general (my background is primarily Golang / .NET / JS with a sprinkling of Scala).

sahilprasad commented 7 years ago

Hey @luck02! Not sure if I can help too much, but in my experience this is often the by-product of a different issue than the one reported. Due to the lack of consistent logging and debugging tools, a smaller error can cause this ConfigMap failure to occur while not being the reason your job submission failed at all.

Regardless, can you confirm that you are prepending local://, instead of just local, in front of the pre-baked JARs that you provide to the --jars param? I.e. --jars local://{path to jar 1}, ...

luck02 commented 7 years ago

@sahilprasad it was local:/ I changed it to local:// and got the same general issue.

I've also now tried without the staging server by cutting the zip down to only the files I needed for this one job, i.e. less than 10 KB.

Similar problem, but with the staging service taken completely out of the picture:

spark submit:

  ./infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/bin/spark-submit \
      --verbose \
      --deploy-mode cluster \
      --master k8s://http://127.0.0.1:8001 \
      --kubernetes-namespace default \
      --jars local://opt/jars/hadoop-aws-2.7.3.jar,local://opt/jars/aws-java-sdk-1.7.4.jar,local://opt/jars/joda-time-2.9.3.jar,local://opt/jars/mysql-connector-java-5.1.35.jar\
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account \
      --conf spark.kubernetes.docker.image.pullPolicy=Always \
      --conf spark.driver.memory=14G \
      --conf spark.driver.cores=8 \
      --conf spark.executor.instances=10 \
      --conf spark.executor.cores=6\
      --conf spark.executor.memory=20G \
      --conf spark.app.name=kmderived \
      --conf spark.kubernetes.driver.docker.image=$SPARK_DRIVER \
      --conf spark.kubernetes.executor.docker.image=$SPARK_EXECUTOR \
      --conf spark.kubernetes.initcontainer.docker.image=$SPARK_INIT \
      --conf spark.kubernetes.driver.labels=role=kissmetrics \
      --conf spark.kubernetes.executor.labels=role=kissmetrics \
      --py-files file://$PATH_TO_JOB \
      file://$PATH_TO_DRIVER

pod logs:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY $PYSPARK_FILES $SPARK_DRIVER_ARGS'
python: can't open file '/var/spark-data/spark-files/driver.py': [Errno 2] No such file or directory
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 2
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)

and this is in the events:

MountVolume.SetUp failed for volume "submitted-files" : secrets "spark-9ed113390ce643e793baf9f9e83514cb-submitted-files" not found

That secret does exist, not sure where to go from here.

It's obvious the issue isn't in the staging server. I also just confirmed that the secret in question is correct. I see my driver.py source code and a blob of zipped data in the other one.

sahilprasad commented 7 years ago

Can you try to submit again, with one more / in each of your JARs? It should be local:// (the adapter) with the actual path appended to it. For example, local:// + /opt/spark/jars/something.jar. Just want to rule this out as a possible cause for the error.
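For what it's worth, the reason the slash count matters is visible if you parse the URI: with only two slashes, the first path segment is treated as the authority (host) component, so `opt` silently disappears from the path. A quick illustration using Python's standard library (the jar name is just taken from the submit script above):

```python
from urllib.parse import urlparse

# Two slashes: "opt" is parsed as the netloc (host), and the path loses it.
u = urlparse("local://opt/jars/hadoop-aws-2.7.3.jar")
print(u.scheme, u.netloc, u.path)   # local opt /jars/hadoop-aws-2.7.3.jar

# Three slashes: empty netloc, so the full absolute path survives.
v = urlparse("local:///opt/jars/hadoop-aws-2.7.3.jar")
print(v.scheme, v.netloc, v.path)   # local  /opt/jars/hadoop-aws-2.7.3.jar
```

So `local:///opt/jars/foo.jar` (scheme plus an absolute path) is the unambiguous form.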

luck02 commented 7 years ago

Absolutely, thanks for helping @sahilprasad

I added the additional / as suggested; no change. I flat out removed the --jars line from my spark-submit script... no change.

luck02 commented 7 years ago

Here's my whole spark-submit output; not sure if it'll help.

garylucas@gary-lucas:~/dev/ub-data-derpmetrics|DAT-spark-on-k8s⚡
⇒  ./run_page_spark.bash
/Users/garylucas/dev/ub-data-derpmetrics/jobs.zip
/Users/garylucas/dev/ub-data-derpmetrics/driver.py
docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-derpmetrics:latest
Using properties file: null
Parsed arguments:
  master                  k8s://http://127.0.0.1:8001
  deployMode              cluster
  executorMemory          20G
  executorCores           6
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            14G
  driverCores             8
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            10
  files                   null
  pyFiles                 file:///Users/garylucas/dev/ub-data-derpmetrics/jobs.zip
  archives                null
  mainClass               null
  primaryResource         file:///Users/garylucas/dev/ub-data-derpmetrics/driver.py
  name                    kmderived
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  (spark.kubernetes.executor.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-derpmetrics:latest)
  (spark.kubernetes.docker.image.pullPolicy,Always)
  (spark.executor.memory,20G)
  (spark.driver.memory,14G)
  (spark.executor.instances,10)
  (spark.kubernetes.authenticate.driver.serviceAccountName,spark-k8s-service-account)
  (spark.driver.cores,8)
  (spark.app.name,kmderived)
  (spark.kubernetes.driver.labels,role=derpmetrics)
  (spark.kubernetes.driver.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest)
  (spark.kubernetes.initcontainer.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest)
  (spark.kubernetes.executor.labels,role=derpmetrics)
  (spark.executor.cores,6)

Main class:
org.apache.spark.deploy.kubernetes.submit.Client
Arguments:
--primary-py-file
file:///Users/garylucas/dev/ub-data-derpmetrics/driver.py
--main-class
org.apache.spark.deploy.PythonRunner
--other-py-files
file:///Users/garylucas/dev/ub-data-derpmetrics/jobs.zip
System properties:
(spark.kubernetes.executor.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-derpmetrics:latest)
(spark.kubernetes.docker.image.pullPolicy,Always)
(spark.executor.memory,20G)
(spark.driver.memory,14G)
(spark.executor.instances,10)
(SPARK_SUBMIT,true)
(spark.kubernetes.namespace,default)
(spark.kubernetes.authenticate.driver.serviceAccountName,spark-k8s-service-account)
(spark.driver.cores,8)
(spark.app.name,kmderived)
(spark.kubernetes.driver.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest)
(spark.kubernetes.driver.labels,role=derpmetrics)
(spark.kubernetes.initcontainer.docker.image,docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest)
(spark.submit.deployMode,cluster)
(spark.master,k8s://http://127.0.0.1:8001)
(spark.kubernetes.executor.labels,role=derpmetrics)
(spark.executor.cores,6)
Classpath elements:

2017-09-24 21:49:36 WARN  ConfigurationUtils:66 - Configuration with key spark.kubernetes.driver.labels is deprecated. Use configurations with prefix spark.kubernetes.driver.label.<key> instead.
2017-09-24 21:49:38 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: kmderived-1506314976209-driver
     namespace: default
     labels: role -> derpmetrics, spark-app-selector -> spark-5f563affbb7547fa9562c78bfd66c763, spark-role -> driver
     pod uid: ef5e9e46-a1ac-11e7-b97c-026e1e112030
     creation time: 2017-09-25T04:49:38Z
     service account name: spark-k8s-service-account
     volumes: submitted-files, spark-k8s-service-account-token-fn4qg
     node name: N/A
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2017-09-24 21:49:38 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: kmderived-1506314976209-driver
     namespace: default
     labels: role -> derpmetrics, spark-app-selector -> spark-5f563affbb7547fa9562c78bfd66c763, spark-role -> driver
     pod uid: ef5e9e46-a1ac-11e7-b97c-026e1e112030
     creation time: 2017-09-25T04:49:38Z
     service account name: spark-k8s-service-account
     volumes: submitted-files, spark-k8s-service-account-token-fn4qg
     node name: ip-10-253-50-77.ec2.internal
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2017-09-24 21:49:38 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: kmderived-1506314976209-driver
     namespace: default
     labels: role -> derpmetrics, spark-app-selector -> spark-5f563affbb7547fa9562c78bfd66c763, spark-role -> driver
     pod uid: ef5e9e46-a1ac-11e7-b97c-026e1e112030
     creation time: 2017-09-25T04:49:38Z
     service account name: spark-k8s-service-account
     volumes: submitted-files, spark-k8s-service-account-token-fn4qg
     node name: ip-10-253-50-77.ec2.internal
     start time: 2017-09-25T04:49:38Z
     container images: docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
2017-09-24 21:49:38 INFO  Client:54 - Waiting for application kmderived to finish...
2017-09-24 21:49:40 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: kmderived-1506314976209-driver
     namespace: default
     labels: role -> derpmetrics, spark-app-selector -> spark-5f563affbb7547fa9562c78bfd66c763, spark-role -> driver
     pod uid: ef5e9e46-a1ac-11e7-b97c-026e1e112030
     creation time: 2017-09-25T04:49:38Z
     service account name: spark-k8s-service-account
     volumes: submitted-files, spark-k8s-service-account-token-fn4qg
     node name: ip-10-253-50-77.ec2.internal
     start time: 2017-09-25T04:49:38Z
     container images: docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
     phase: Failed
     status: [ContainerStatus(containerID=docker://075a6b4aff0db13aa24a5acc30392227192aa7c5fd8a050156e490809fdfb930, image=docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest, imageID=docker-pullable://docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics@sha256:ec68cf8707428bf83f2fb4e0683aa94bfd7acfd0927de29212a3ee9eb3b8ecdb, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://075a6b4aff0db13aa24a5acc30392227192aa7c5fd8a050156e490809fdfb930, exitCode=1, finishedAt=2017-09-25T04:49:39Z, message=null, reason=Error, signal=null, startedAt=2017-09-25T04:49:39Z, additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
2017-09-24 21:49:40 INFO  LoggingPodStatusWatcherImpl:54 - Container final statuses:

     Container name: spark-kubernetes-driver
     Container image: docker_ecr.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
     Container state: Terminated
     Exit code: 1
2017-09-24 21:49:40 INFO  Client:54 - Application kmderived finished.

I'm using a PodPreset to provide the actual secrets.

Huh, I'm going to make that fail and see if anything changes (I doubt it). (Nope, it had no effect.)

sahilprasad commented 7 years ago

Of course!

Let's try keeping the --jars out of the equation, and adding the resource staging server specs back in. Not sure if the pod presets are contributing to this error, but as I mentioned, the configmap mounting is a rather cryptic error, so we can't be certain about the cause yet.

luck02 commented 7 years ago

That gave me the following error:

MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506316079615-init-config" not found

I didn't see an attempt to load an actual secret, so that's different. But presumably that's just because the --jars are out of the picture.

I did, however, also scrape the YAML from a previously failed pod creation and was able to launch it manually. It did mount the secret / config maps, but then crashed with the following error, which might be suspicious:
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY $PYSPARK_FILES $SPARK_DRIVER_ARGS'
Error: Could not find or load main class .opt.spark.jars.RoaringBitmap-0.5.11.jar

But presumably that's just related to having no startup arguments; in any case, the config map was in fact present.

Thanks for all the assistance; I need to sign off for the evening (PDT timezone).

sahilprasad commented 7 years ago

Interesting. I wonder if this is related to my PR #464. Try adding just the spark-examples JAR with the --jars param to the spec of your latest submission; if it works, then it's related. If not, hopefully the resulting error is more revealing of the actual issue.

luck02 commented 7 years ago

Not much in the way of meaningful change:

#!/usr/bin/env bash

ECR_URL="514588561872.dkr.ecr.us-east-1.amazonaws.com"
SPARK_DRIVER="$ECR_URL/spark-driver-py-derpmetrics:latest"
SPARK_EXECUTOR="$ECR_URL/spark-executor-py-derpmetrics:latest"
SPARK_INIT="$ECR_URL/spark-init-derpmetrics:latest"
PATH_TO_JOB="$PWD/jobs.zip"
PATH_TO_DRIVER="$PWD/jobs/derpmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py"
PATH_TO_SPARK_EXAMPLES="$PWD/infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0-SNAPSHOT.jar"

echo $PATH_TO_JOB
echo $PATH_TO_DRIVER

echo $SPARK_DRIVER
echo $SPARK_EXECUTOR

  ./infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/bin/spark-submit \
      --verbose \
      --deploy-mode cluster \
      --master k8s://http://127.0.0.1:8001 \
      --kubernetes-namespace default \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account \
      --conf spark.kubernetes.docker.image.pullPolicy=Always \
      --conf spark.driver.memory=4G \
      --conf spark.driver.cores=2 \
      --conf spark.executor.instances=1 \
      --conf spark.executor.cores=2\
      --conf spark.executor.memory=6G \
      --conf spark.app.name=kmderived \
      --conf spark.kubernetes.driver.docker.image=$SPARK_DRIVER \
      --conf spark.kubernetes.executor.docker.image=$SPARK_EXECUTOR \
      --conf spark.kubernetes.initcontainer.docker.image=$SPARK_INIT \
      --conf spark.kubernetes.driver.labels=role=derpmetrics \
      --conf spark.kubernetes.executor.labels=role=derpmetrics \
      --conf spark.kubernetes.resourceStagingServer.uri=http://herpip-421161154.us-east-1.elb.amazonaws.com:10000 \
      --jars file://$PATH_TO_SPARK_EXAMPLES\
      file://$PATH_TO_DRIVER

Results in: MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506351376470-init-config" not found

And pod failed to initialize.

Just reading your PR now

luck02 commented 7 years ago

Your PR doesn't look closely related, but who knows. I tried one more time with just my Python file and still got the same error:

MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506353114000-init-config" not found

I'm still wondering about mount order, etc. Correct me if I'm wrong, but the problem has to be happening in the context of spark-submit.

I got to the point of being able to test a spark-submit against Spark running locally in debug mode, but hadn't done any significant debugging yet. Given that my submit steps are so simple, I'll probably wire my spark-submit args up to local-dev and start troubleshooting there.

It seems as if the secret / config map is being created after the pod spec is executed. But presumably that's just behaviour, not root cause. Will update if I find anything, and if we get another Spark 2.2 cut I'll try that too.

I did run integration tests last night against my local build and all was 'good'. Also, when I bake all my code into the images it seems to work, with the exception of: https://github.com/apache-spark-on-k8s/spark/issues/504

In fact, that blockage was why I tried to move to submitting local resources. If I wanted to debug that issue I really wanted to be able to iterate faster.

I'm wondering if perhaps my local snapshot build is broken in some subtle way:

garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  git rev-parse HEAD
8a0f485d83f84a9b556f7634fed8cb1b65e12159
garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  ./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3

As mentioned in #504 I'm building my own due to broken service account issues in the release. If a new build gets cut I'll try that.

mccheah commented 7 years ago

The pod does get created before we create the config map and secret. This is because we need the Pod's UID to create the owner reference graph from the config map and secret back up to the pod. But if the pod is created first, it should eventually recover after the API server detects that the dependent objects are created even after the fact. @foxish - could this perhaps be a bug in Kubernetes? @luck02 what Kubernetes version is being used?
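For context, the owner-reference wiring described here looks roughly like the following. This is a hand-built sketch of the metadata, not the actual submission-client code, and the pod name/UID values are made up for illustration: the ConfigMap's metadata carries an `ownerReferences` entry pointing at the pod's UID, which is why the pod has to exist before the ConfigMap can be created.

```python
# Sketch of the ownerReferences metadata that ties the init ConfigMap back to
# the driver pod. Field names follow the Kubernetes ObjectMeta/OwnerReference
# schema; names and UIDs below are hypothetical.
def owned_configmap(name, data, pod_name, pod_uid):
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "ownerReferences": [{
                "apiVersion": "v1",
                "kind": "Pod",
                "name": pod_name,   # the owner must already exist...
                "uid": pod_uid,     # ...because its UID is required here
            }],
        },
        "data": data,
    }

cm = owned_configmap(
    "kmderived-init-config",
    {"download-submitted-files": "spark.kubernetes.initcontainer..."},
    pod_name="kmderived-driver",
    pod_uid="ef5e9e46-a1ac-11e7-b97c-026e1e112030",
)
print(cm["metadata"]["ownerReferences"][0]["uid"])
```

The ownerReference is what lets Kubernetes garbage-collect the ConfigMap and secret when the driver pod is deleted, which is the motivation for this create-pod-first ordering.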

luck02 commented 7 years ago

I see I've got some version skew; that could be doing me in too. I'll update and try again.

garylucas@gary-lucas:~/dev/ub-data-hub|dat-730⚡
⇒  kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T19:06:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4+coreos.0", GitCommit:"4bb697e04f7c356347aee6ffaa91640b428976d5", GitTreeState:"clean", BuildDate:"2017-08-22T08:43:47Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
mccheah commented 7 years ago

1.7.x should be new enough, though, so I don't see that being the root issue.

luck02 commented 7 years ago

Upgraded my kubectl; no impact, as predicted.

mccheah commented 7 years ago

Are there any Kubernetes logs that can be read that might give more insight here?

mccheah commented 7 years ago

Can we also remove any PodPresets? Read a bit earlier that those are being applied, and just want to reduce the number of variables at play here.

luck02 commented 7 years ago

I tried that last evening; it didn't seem to affect the situation. Just reconfirmed: MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506375022936-init-config" not found

What's interesting is that the pod shows Pending after the failure and then switches to Failed.

Here's the pod describe:

⇒  kubectl describe pod kmderived-1506375022936-driver
Name:       kmderived-1506375022936-driver
Namespace:  default
Node:       ip-10-253-51-182.ec2.internal/10.253.51.182
Start Time: Mon, 25 Sep 2017 14:30:26 -0700
Labels:     spark-app-selector=spark-ec57f4301e754e728468e6141ff0e013
        spark-role=driver
Annotations:    spark-app-name=kmderived
Status:     Failed
IP:     10.2.75.15
Init Containers:
  spark-init:
    Container ID:   docker://f96c6130a921c46b48813cf9127ed05cb4a87d158a608d215628788ec858c93d
    Image:      ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest
    Image ID:       docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics@sha256:0f244ccfaf6eeedf793c808616b2867ccb7c709c74d6fea1694faaf558684a50
    Port:       <none>
    Args:
      /etc/spark-init/spark-init.properties
    State:      Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 25 Sep 2017 14:30:37 -0700
      Finished:     Mon, 25 Sep 2017 14:31:10 -0700
    Ready:      False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/spark-init from spark-init-properties (rw)
      /mnt/secrets/spark-init from spark-init-secret (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-k8s-service-account-token-fn4qg (ro)
      /var/spark-data/spark-files from download-files (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
  spark-kubernetes-driver:
    Container ID:
    Image:      ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
    Image ID:
    Port:       <none>
    State:      Waiting
      Reason:       PodInitializing
    Ready:      False
    Restart Count:  0
    Limits:
      memory:   4505Mi
    Requests:
      cpu:  2
      memory:   4Gi
    Environment:
      SPARK_DRIVER_MEMORY:  4G
      SPARK_DRIVER_CLASS:   org.apache.spark.deploy.PythonRunner
      SPARK_DRIVER_ARGS:
      SPARK_MOUNTED_FILES_DIR:  /var/spark-data/spark-files
      PYSPARK_PRIMARY:      /var/spark-data/spark-files/Compute__Derived_NonAppPageViewed.py
      PYSPARK_FILES:        null
      SPARK_JAVA_OPT_0:     -Dspark.kubernetes.docker.image.pullPolicy=Always
      SPARK_JAVA_OPT_1:     -Dspark.kubernetes.namespace=default
      SPARK_JAVA_OPT_2:     -Dspark.kubernetes.initcontainer.executor.configmapkey=download-submitted-files
      SPARK_JAVA_OPT_3:     -Dspark.kubernetes.driver.pod.name=kmderived-1506375022936-driver
      SPARK_JAVA_OPT_4:     -Dspark.files=/var/spark-data/spark-files/Compute__Derived_NonAppPageViewed.py
      SPARK_JAVA_OPT_5:     -Dspark.kubernetes.driver.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-derpmetrics:latest
      SPARK_JAVA_OPT_6:     -Dspark.submit.deployMode=cluster
      SPARK_JAVA_OPT_7:     -Dspark.driver.blockManager.port=7079
      SPARK_JAVA_OPT_8:     -Dspark.app.name=kmderived
      SPARK_JAVA_OPT_9:     -Dspark.driver.cores=2
      SPARK_JAVA_OPT_10:    -Dspark.driver.host=kmderived-1506375022936-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_11:    -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account
      SPARK_JAVA_OPT_12:    -Dspark.executor.cores=2
      SPARK_JAVA_OPT_13:    -Dspark.driver.port=7078
      SPARK_JAVA_OPT_14:    -Dspark.kubernetes.initcontainer.executor.stagingServerSecret.name=kmderived-1506375022936-init-secret
      SPARK_JAVA_OPT_15:    -Dspark.app.id=spark-ec57f4301e754e728468e6141ff0e013
      SPARK_JAVA_OPT_16:    -Dspark.kubernetes.initcontainer.executor.stagingServerSecret.mountDir=/mnt/secrets/spark-init
      SPARK_JAVA_OPT_17:    -Dspark.executor.memory=6G
      SPARK_JAVA_OPT_18:    -Dspark.kubernetes.resourceStagingServer.uri=http://ab5edaaa8a01911e7b97c026e1e11203-421161154.us-east-1.elb.amazonaws.com:10000
      SPARK_JAVA_OPT_19:    -Dspark.driver.bindAddress=kmderived-1506375022936-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_20:    -Dspark.executor.instances=1
      SPARK_JAVA_OPT_21:    -Dspark.kubernetes.executor.podNamePrefix=kmderived-1506375022936
      SPARK_JAVA_OPT_22:    -Dspark.master=k8s://http://127.0.0.1:8001
      SPARK_JAVA_OPT_23:    -Dspark.driver.memory=4G
      SPARK_JAVA_OPT_24:    -Dspark.kubernetes.initcontainer.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest
      SPARK_JAVA_OPT_25:    -Dspark.kubernetes.executor.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-derpmetrics:latest
      SPARK_JAVA_OPT_26:    -Dspark.kubernetes.initcontainer.executor.configmapname=kmderived-1506375022936-init-config
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-k8s-service-account-token-fn4qg (ro)
      /var/spark-data/spark-files from download-files (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
  Type      Status
  Initialized   False
  Ready     False
  PodScheduled  True
Volumes:
  spark-init-properties:
    Type:   ConfigMap (a volume populated by a ConfigMap)
    Name:   kmderived-1506375022936-init-config
    Optional:   false
  download-jars-volume:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  download-files:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-init-secret:
    Type:   Secret (a volume populated by a Secret)
    SecretName: kmderived-1506375022936-init-secret
    Optional:   false
  spark-k8s-service-account-token-fn4qg:
    Type:   Secret (a volume populated by a Secret)
    SecretName: spark-k8s-service-account-token-fn4qg
    Optional:   false
QoS Class:  Burstable
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From                    SubObjectPath           Type        Reason          Message
  --------- --------    -----   ----                    -------------           --------    ------          -------
  2m        2m      1   default-scheduler                           Normal      Scheduled       Successfully assigned kmderived-1506375022936-driver to ip-10-253-51-182.ec2.internal
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "download-files"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "download-jars-volume"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Warning     FailedMount     MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "kmderived-1506375022936-init-config" not found
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "spark-k8s-service-account-token-fn4qg"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Warning     FailedMount     MountVolume.SetUp failed for volume "spark-init-secret" : secrets "kmderived-1506375022936-init-secret" not found
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "spark-init-secret"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "spark-init-properties"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal  spec.initContainers{spark-init} Normal      Pulling         pulling image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal  spec.initContainers{spark-init} Normal      Pulled          Successfully pulled image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest"
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal  spec.initContainers{spark-init} Normal      Created         Created container
  2m        2m      1   kubelet, ip-10-253-51-182.ec2.internal  spec.initContainers{spark-init} Normal      Started         Started container
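Note the event ordering above: `FailedMount` for the config map and init secret, followed a moment later by `SuccessfulMountVolume` for both. That pattern suggests the kubelet simply raced the submission client creating those resources, and its built-in retry covered the gap — so the `configmaps ... not found` warning on its own may be transient. For anyone scripting around `spark-submit` and wanting to wait out this kind of race explicitly, a minimal generic poll-until-true sketch (the predicate — e.g. "does the config map exist yet" — is yours to supply; nothing here is part of Spark or Kubernetes):

```python
import time

def wait_for(predicate, timeout=60.0, interval=2.0):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True if the predicate succeeded within the deadline,
    False otherwise. Useful for "resource exists yet?" checks where
    the creator and the consumer race each other.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

The predicate could shell out to `kubectl get configmap <name>` and test the exit code, for example.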
mccheah commented 7 years ago

Is there anything in the kubelet or the API server logs?

luck02 commented 7 years ago

Ok, that helped. Apologies, I should have checked this earlier. Our log introspection tooling is pretty weak at the moment.

There is a pile of logs:

Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: ++ id -u
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: + myuid=0
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: ++ id -g
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: + mygid=0
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: ++ getent passwd 0
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: + uidentry=root:x:0:0:root:/root:/bin/ash
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: + '[' -z root:x:0:0:root:/root:/bin/ash ']'
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: + /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer /etc/spark-init/spark-init.properties
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:14 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting init-container to download Spark application dependencies.
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing view acls to: root
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing modify acls to: root
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing view acls groups to: 
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing modify acls groups to: 
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing view acls to: root
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing modify acls to: root
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing view acls groups to: 
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - Changing modify acls groups to: 
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init: 2017-09-26 04:27:15 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download jars from resource staging server...
Sep 25 21:27:34 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:  2017-09-26 04:27:15 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download files from resource staging server...
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:  Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult: 
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$waitForFutures$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:187)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$waitForFutures$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:187)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer.waitForFutures(KubernetesSparkDependencyDownloadInitContainer.scala:186)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer.run(KubernetesSparkDependencyDownloadInitContainer.scala:140)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$.main(KubernetesSparkDependencyDownloadInitContainer.scala:222)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer.main(KubernetesSparkDependencyDownloadInitContainer.scala)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:  Caused by: java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: connect timed out
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:279)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.WaitableCallback$class.waitForCompletion(KubernetesSparkDependencyDownloadInitContainer.scala:53)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.DownloadTarGzCallback.waitForCompletion(KubernetesSparkDependencyDownloadInitContainer.scala:57)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$kubernetes$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1$$anonfun$apply$3.apply(KubernetesSparkDependencyDownloadInitContainer.scala:167)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$kubernetes$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1$$anonfun$apply$3.apply(KubernetesSparkDependencyDownloadInitContainer.scala:156)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.Option.foreach(Option.scala:257)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$kubernetes$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:156)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$kubernetes$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:155)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.Option.foreach(Option.scala:257)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer.org$apache$spark$deploy$rest$kubernetes$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer(KubernetesSparkDependencyDownloadInitContainer.scala:155)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply$mcV$sp(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.lang.Thread.run(Thread.java:748)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:  Caused by: java.net.SocketTimeoutException: connect timed out
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.PlainSocketImpl.socketConnect(Native Method)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at java.net.Socket.connect(Socket.java:589)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:220)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:146)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:186)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:129)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Sep 25 21:27:46 data-cluster-slow-test k8s.kmderived-1506400028781-driver_default_spark-init:   ... 3 more

I'm not familiar enough with the under-the-hood machinery to form an opinion.
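The root cause in the trace is `java.net.SocketTimeoutException: connect timed out` — the init-container can't reach the resource staging server at the `spark.kubernetes.resourceStagingServer.uri` value. One thing worth ruling out is plain network reachability from inside the cluster to that ELB address. A small diagnostic sketch (not part of Spark; the URI below is illustrative) that unescapes the Java-properties form of the URI, as it appears in the config map dump (`http\://...\:10000`), and attempts a raw TCP connect:

```python
import socket
from urllib.parse import urlparse

def parse_staging_uri(prop_value):
    # Java .properties files escape ':' as '\:'; strip those escapes
    # before parsing so urlparse sees a normal URI.
    return urlparse(prop_value.replace("\\:", ":"))

def check_staging_server(prop_value, timeout=5.0):
    """Try a plain TCP connect to the staging server host:port.

    Returns True if a connection could be opened, False on timeout
    or refusal -- mirroring what the init-container's HTTP client
    would hit before any HTTP exchange happens.
    """
    uri = parse_staging_uri(prop_value)
    try:
        socket.create_connection((uri.hostname, uri.port), timeout=timeout).close()
        return True
    except OSError:
        return False
```

Running the check from a throwaway pod in the same namespace (rather than from a workstation) matters here, since the ELB may be reachable from one network but not the other.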

luck02 commented 7 years ago

Here's a fuller log including all the kubelet output, but it's big and crufty:

Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.037608       1 gc_controller.go:62] PodGC is force deleting Pod: default:kmderived-1506400028781-driver
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.082009       1 gc_controller.go:166] Forced deletion of orphaned Pod kmderived-1506400028781-driver succeeded
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.104348       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506400028781-driver-svc
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.267595       1 gc_controller.go:62] PodGC is force deleting Pod: default:kmderived-1506399639732-driver
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.310623       1 gc_controller.go:166] Forced deletion of orphaned Pod kmderived-1506399639732-driver succeeded
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.310824       1 gc_controller.go:62] PodGC is force deleting Pod: default:kmderived-1506399812445-driver
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.350596       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506399639732-driver-svc
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.354022       1 gc_controller.go:166] Forced deletion of orphaned Pod kmderived-1506399812445-driver succeeded
Sep 26 08:16:54 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:16:54.379049       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506399812445-driver-svc
Sep 26 08:18:57 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:18:56.792076       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506400028781-driver-svc
Sep 26 08:24:22 data-cluster-slow-test k8s.kube-dns-3664836949-15b1l_kube-system_kubedns: I0926 15:24:21.595476       1 dns.go:552] Could not find endpoints for service "kmderived-1506400028781-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:24:58 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:24:57.793296       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506400028781-driver-svc
Sep 26 08:31:04 data-cluster-slow-test k8s.kube-dns-3664836949-h9370_kube-system_kubedns: I0926 15:31:04.281200       1 dns.go:552] Could not find endpoints for service "kmderived-1506400028781-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:33:06 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:33:05.793954       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506400028781-driver-svc
Sep 26 08:38:50 data-cluster-slow-test k8s.kube-dns-3664836949-15b1l_kube-system_kubedns: I0926 15:38:49.606999       1 dns.go:552] Could not find endpoints for service "kmderived-1506400028781-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:43:09 data-cluster-slow-test k8s.kube-dns-3664836949-h9370_kube-system_kubedns: I0926 15:43:09.293169       1 dns.go:552] Could not find endpoints for service "kmderived-1506400028781-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:50:07 data-cluster-slow-test k8s.kube-controller-manager-ip-10-253-46-89.ec2.internal_kube-sy: I0926 15:50:06.795441       1 servicecontroller.go:740] Service has been deleted default/kmderived-1506400028781-driver-svc
Sep 26 08:52:07 data-cluster-slow-test systemd: {"_BOOT_ID":"55547778ef634c26a15c3b46b6f2e020","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_SYSTEMD_SLICE":"system.slice","_CAP_EFFECTIVE":"3fffffffff","_TRANSPORT":"stdout","_MACHINE_ID":"50e74f29f36b43eca88a7f72305e96b0","_SELINUX_CONTEXT":"system_u:system_r:kernel_t:s0","_HOSTNAME":"ip-10-253-50-31.ec2.internal","_SYSTEMD_UNIT":"kubelet.service","_SYSTEMD_INVOCATION_ID":"499393ff4e984b5a81b921137ec4d4bd","SYSLOG_IDENTIFIER":"kubelet-wrapper","_PID":"1727","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_COMM":"kubelet","_EXE":"/var/lib/rkt/pods/run/3e947ebc-2d9f-4100-a899-2bd270d4748e/stage1/rootfs/opt/stage2/hyperkube/rootfs/hyperkube","_CMDLINE":"/kubelet --cni-conf-dir=/etc/kubernetes/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --node-labels data.unbounce.net/role=worker --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster-dns=10.253.50.31 --cluster-domain=cluster.local --cloud-provider=aws --cert-dir=/etc/kubernetes/ssl --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --require-kubeconfig","MESSAGE":"I0926 15:52:07.010718    1727 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume \"spark-init-properties\" (UniqueName: \"kubernetes.io/configmap/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-properties\") pod \"kmderived-1506441123636-driver\" (UID: \"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")"}
Sep 26 08:52:07 data-cluster-slow-test systemd: {"_BOOT_ID":"55547778ef634c26a15c3b46b6f2e020","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_SYSTEMD_SLICE":"system.slice","_CAP_EFFECTIVE":"3fffffffff","_TRANSPORT":"stdout","_MACHINE_ID":"50e74f29f36b43eca88a7f72305e96b0","_SELINUX_CONTEXT":"system_u:system_r:kernel_t:s0","_HOSTNAME":"ip-10-253-50-31.ec2.internal","_SYSTEMD_UNIT":"kubelet.service","_SYSTEMD_INVOCATION_ID":"499393ff4e984b5a81b921137ec4d4bd","SYSLOG_IDENTIFIER":"kubelet-wrapper","_PID":"1727","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_COMM":"kubelet","_EXE":"/var/lib/rkt/pods/run/3e947ebc-2d9f-4100-a899-2bd270d4748e/stage1/rootfs/opt/stage2/hyperkube/rootfs/hyperkube","_CMDLINE":"/kubelet --cni-conf-dir=/etc/kubernetes/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --node-labels data.unbounce.net/role=worker --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster-dns=10.253.50.31 --cluster-domain=cluster.local --cloud-provider=aws --cert-dir=/etc/kubernetes/ssl --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --require-kubeconfig","MESSAGE":"I0926 15:52:07.010758    1727 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume \"download-files\" (UniqueName: \"kubernetes.io/empty-dir/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-download-files\") pod \"kmderived-1506441123636-driver\" (UID: \"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")"}
Sep 26 08:52:07 data-cluster-slow-test systemd: {"_BOOT_ID":"55547778ef634c26a15c3b46b6f2e020","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_SYSTEMD_SLICE":"system.slice","_CAP_EFFECTIVE":"3fffffffff","_TRANSPORT":"stdout","_MACHINE_ID":"50e74f29f36b43eca88a7f72305e96b0","_SELINUX_CONTEXT":"system_u:system_r:kernel_t:s0","_HOSTNAME":"ip-10-253-50-31.ec2.internal","_SYSTEMD_UNIT":"kubelet.service","_SYSTEMD_INVOCATION_ID":"499393ff4e984b5a81b921137ec4d4bd","SYSLOG_IDENTIFIER":"kubelet-wrapper","_PID":"1727","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_COMM":"kubelet","_EXE":"/var/lib/rkt/pods/run/3e947ebc-2d9f-4100-a899-2bd270d4748e/stage1/rootfs/opt/stage2/hyperkube/rootfs/hyperkube","_CMDLINE":"/kubelet --cni-conf-dir=/etc/kubernetes/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --node-labels data.unbounce.net/role=worker --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster-dns=10.253.50.31 --cluster-domain=cluster.local --cloud-provider=aws --cert-dir=/etc/kubernetes/ssl --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --require-kubeconfig","MESSAGE":"I0926 15:52:07.010784    1727 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume \"download-jars-volume\" (UniqueName: \"kubernetes.io/empty-dir/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-download-jars-volume\") pod \"kmderived-1506441123636-driver\" (UID: \"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")"}
Sep 26 08:52:07 data-cluster-slow-test systemd: {"_BOOT_ID":"55547778ef634c26a15c3b46b6f2e020","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_SYSTEMD_SLICE":"system.slice","_CAP_EFFECTIVE":"3fffffffff","_TRANSPORT":"stdout","_MACHINE_ID":"50e74f29f36b43eca88a7f72305e96b0","_SELINUX_CONTEXT":"system_u:system_r:kernel_t:s0","_HOSTNAME":"ip-10-253-50-31.ec2.internal","_SYSTEMD_UNIT":"kubelet.service","_SYSTEMD_INVOCATION_ID":"499393ff4e984b5a81b921137ec4d4bd","SYSLOG_IDENTIFIER":"kubelet-wrapper","_PID":"1727","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_COMM":"kubelet","_EXE":"/var/lib/rkt/pods/run/3e947ebc-2d9f-4100-a899-2bd270d4748e/stage1/rootfs/opt/stage2/hyperkube/rootfs/hyperkube","_CMDLINE":"/kubelet --cni-conf-dir=/etc/kubernetes/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --node-labels data.unbounce.net/role=worker --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster-dns=10.253.50.31 --cluster-domain=cluster.local --cloud-provider=aws --cert-dir=/etc/kubernetes/ssl --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --require-kubeconfig","MESSAGE":"I0926 15:52:07.010806    1727 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume \"spark-k8s-service-account-token-fn4qg\" (UniqueName: \"kubernetes.io/secret/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-k8s-service-account-token-fn4qg\") pod \"kmderived-1506441123636-driver\" (UID: \"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")"}
Sep 26 08:52:07 data-cluster-slow-test kubelet-wrapper: I0926 15:52:07.010943    1727 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "spark-init-secret" (UniqueName: "kubernetes.io/secret/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-secret") pod "kmderived-1506441123636-driver" (UID: "a5e157ac-a2d2-11e7-a071-0e80cdd21cb6")
Sep 26 08:52:07 data-cluster-slow-test kubelet-wrapper: E0926 15:52:07.114849    1727 configmap.go:202] Couldn't get configMap default/kmderived-1506441123636-init-config: configmaps "kmderived-1506441123636-init-config" not found
Sep 26 08:52:07 data-cluster-slow-test kubelet-wrapper: E0926 15:52:07.115008    1727 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/configmap/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-properties\" (\"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")" failed. No retries permitted until 2017-09-26 15:52:07.614916423 +0000 UTC (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "spark-init-properties" (UniqueName: "kubernetes.io/configmap/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-properties") pod "kmderived-1506441123636-driver" (UID: "a5e157ac-a2d2-11e7-a071-0e80cdd21cb6") : configmaps "kmderived-1506441123636-init-config" not found
Sep 26 08:52:07 data-cluster-slow-test kubelet-wrapper: E0926 15:52:07.117656    1727 secret.go:201] Couldn't get secret default/kmderived-1506441123636-init-secret: secrets "kmderived-1506441123636-init-secret" not found
Sep 26 08:52:07 data-cluster-slow-test kubelet-wrapper: E0926 15:52:07.117749    1727 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-secret\" (\"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6\")" failed. No retries permitted until 2017-09-26 15:52:07.617719263 +0000 UTC (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "spark-init-secret" (UniqueName: "kubernetes.io/secret/a5e157ac-a2d2-11e7-a071-0e80cdd21cb6-spark-init-secret") pod "kmderived-1506441123636-driver" (UID: "a5e157ac-a2d2-11e7-a071-0e80cdd21cb6") : secrets "kmderived-1506441123636-init-secret" not found
Sep 26 08:52:07 data-cluster-slow-test k8s.kube-scheduler-ip-10-253-48-222.ec2.internal_kube-system_kub: I0926 15:52:06.832447       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"kmderived-1506441123636-driver", UID:"a5e157ac-a2d2-11e7-a071-0e80cdd21cb6", APIVersion:"v1", ResourceVersion:"11214775", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned kmderived-1506441123636-driver to ip-10-253-50-31.ec2.internal
Sep 26 08:52:07 data-cluster-slow-test k8s.kube-dns-3664836949-h9370_kube-system_kubedns: I0926 15:52:07.294174       1 dns.go:552] Could not find endpoints for service "kmderived-1506441123636-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:52:08 data-cluster-slow-test k8s.kube-dns-3664836949-15b1l_kube-system_kubedns: I0926 15:52:07.282993       1 dns.go:552] Could not find endpoints for service "kmderived-1506441123636-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
Sep 26 08:52:08 data-cluster-slow-test kubelet-wrapper: I0926 15:52:07.831701    1727 kuberuntime_manager.go:457] Container {Name:spark-kubernetes-driver Image:514588561872.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:SPARK_DRIVER_MEMORY Value:4G ValueFrom:nil} {Name:SPARK_DRIVER_CLASS Value:org.apache.spark.deploy.PythonRunner ValueFrom:nil} {Name:SPARK_DRIVER_ARGS Value: ValueFrom:nil} {Name:SPARK_MOUNTED_FILES_DIR Value:/var/spark-data/spark-files ValueFrom:nil} {Name:PYSPARK_PRIMARY Value:/var/spark-data/spark-files/Compute__Derived_NonAppPageViewed.py ValueFrom:nil} {Name:PYSPARK_FILES Value:null ValueFrom:nil} {Name:SPARK_JAVA_OPT_0 Value:-Dspark.kubernetes.docker.image.pullPolicy=Always ValueFrom:nil} {Name:SPARK_JAVA_OPT_1 Value:-Dspark.kubernetes.namespace=default ValueFrom:nil} {Name:SPARK_JAVA_OPT_2 Value:-Dspark.kubernetes.initcontainer.executor.configmapkey=download-submitted-files ValueFrom:nil} {Name:SPARK_JAVA_OPT_3 Value:-Dspark.kubernetes.driver.docker.image=514588561872.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest ValueFrom:nil} {Name:SPARK_JAVA_OPT_4 Value:-Dspark.files=/var/spark-data/spark-files/Compute__Derived_NonAppPageViewed.py ValueFrom:nil} {Name:SPARK_JAVA_OPT_5 Value:-Dspark.submit.deployMode=cluster ValueFrom:nil} {Name:SPARK_JAVA_OPT_6 Value:-Dspark.driver.blockManager.port=7079 ValueFrom:nil} {Name:SPARK_JAVA_OPT_7 Value:-Dspark.app.name=kmderived ValueFrom:nil} {Name:SPARK_JAVA_OPT_8 Value:-Dspark.driver.bindAddress=kmderived-1506441123636-driver-svc.default.svc.cluster.local ValueFrom:nil} {Name:SPARK_JAVA_OPT_9 Value:-Dspark.driver.cores=2 ValueFrom:nil} {Name:SPARK_JAVA_OPT_10 Value:-Dspark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account ValueFrom:nil} {Name:SPARK_JAVA_OPT_11 Value:-Dspark.app.id=spark-067e91b9e35e43d580fea426ac797f3a ValueFrom:nil} {Name:SPARK_JAVA_OPT_12 Value:-Dspark.executor.cores=2 ValueFrom:nil} {Name:SPARK_JAVA_OPT_13 Value:-Dspark.driver.port=7078 ValueFrom:nil} {Name:SPARK_JAVA_OPT_14 Value:-Dspark.driver.host=kmderived-1506441123636-driver-svc.default.svc.cluster.local ValueFrom:nil} {Name:SPARK_JAVA_OPT_15 Value:-Dspark.kubernetes.initcontainer.executor.stagingServerSecret.mountDir=/mnt/secrets/spark-init ValueFrom:nil} {Name:SPARK_JAVA_OPT_16 Value:-Dspark.executor.memory=6G ValueFrom:nil} {Name:SPARK_JAVA_OPT_17 Value:-Dspark.kubernetes.initcontainer.executor.stagingServerSecret.name=kmderived-1506441123636-init-secret ValueFrom:nil} {Name:SPARK_JAVA_OPT_18 Value:-Dspark.kubernetes.resourceStagingServer.uri=http://ab5edaaa8a01911e7b97c026e1e11203-421161154.us-east-1.elb.amazonaws.com:10000 ValueFrom:nil} {Name:SPARK_JAVA_OPT_19 Value:-Dspark.kubernetes.driver.pod.name=kmderived-1506441123636-driver ValueFrom:nil} {Name:SPARK_JAVA_OPT_20 Value:-Dspark.executor.instances=1 ValueFrom:nil} {Name:SPARK_JAVA_OPT_21 Value:-Dspark.master=k8s://http://127.0.0.1:8001 ValueFrom:nil} {Name:SPARK_JAVA_OPT_22 Value:-Dspark.kubernetes.initcontainer.executor.configmapname=kmderived-1506441123636-init-config ValueFrom:nil} {Name:SPARK_JAVA_OPT_23 Value:-Dspark.driver.memory=4G ValueFrom:nil} {Name:SPARK_JAVA_OPT_24 Value:-Dspark.kubernetes.executor.podNamePrefix=kmderived-1506441123636 ValueFrom:nil} {Name:SPARK_JAVA_OPT_25 Value:-Dspark.kubernetes.initcontainer.docker.image=514588561872.dkr.ecr.us-east-1.amazonaws.com/spark-init-kissmetrics:latest ValueFrom:nil} {Name:SPARK_JAVA_OPT_26 Value:-Dspark.kubernetes.executor.docker.image=514588561872.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:4723834880 scale:0} d:{Dec:<nil>} s:4505Mi Format:BinarySI}] Requests:map[cpu:{i:{value:2 scale:0} d:{Dec:<nil>} s:2 Format:DecimalSI} memory:{i:{value:4294967296 scale:0} d:{Dec:<nil>} s:4Gi Format:BinarySI}]} VolumeMounts:[{Name:download-jars-volume ReadOnly:false MountPath:/var/spark-data/spark-jars SubPath:} {Name:download-files ReadOnly:false MountPath:/var/spark-data/spark-files SubPath:} {Name:spark-k8s-service-account-token-
Sep 26 08:52:11 data-cluster-slow-test k8s.cluster-autoscaler-1682651915-z6j80_kube-system_cluster-auto:  I0926 15:52:10.518471       1 cluster.go:89] Fast evaluation: node ip-10-253-50-31.ec2.internal cannot be removed: default/kmderived-1506441123636-driver is not replicated
Sep 26 08:52:21 data-cluster-slow-test k8s.cluster-autoscaler-1682651915-z6j80_kube-system_cluster-auto:  I0926 15:52:21.177298       1 cluster.go:89] Fast evaluation: node ip-10-253-50-31.ec2.internal cannot be removed: default/kmderived-1506441123636-driver is not replicated
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  ++ id -u
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  + myuid=0
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  ++ id -g
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  + mygid=0
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  ++ getent passwd 0
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  + uidentry=root:x:0:0:root:/root:/bin/ash
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  + '[' -z root:x:0:0:root:/root:/bin/ash ']'
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  + /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer /etc/spark-init/spark-init.properties
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:19 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting init-container to download Spark application dependencies.
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing view acls to: root
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing modify acls to: root
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing view acls groups to: 
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing modify acls groups to: 
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing view acls to: root
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing modify acls to: root
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing view acls groups to: 
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - Changing modify acls groups to: 
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download files from resource staging server...
Sep 26 08:52:21 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  2017-09-26 15:52:20 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download jars from resource staging server...
Sep 26 08:52:28 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  [2017-09-26T15:52:28Z] Incoming HTTP/1.1 GET /api/v1/pod/default/kmderived-1506441123636-driver request from 10.2.122.0:47336: {}
Sep 26 08:52:28 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  Getting details of kmderived-1506441123636-driver pod in default namespace
Sep 26 08:52:28 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  Found 10 events related to kmderived-1506441123636-driver pod in default namespace
Sep 26 08:52:31 data-cluster-slow-test k8s.cluster-autoscaler-1682651915-z6j80_kube-system_cluster-auto:  I0926 15:52:31.367843       1 cluster.go:89] Fast evaluation: node ip-10-253-50-31.ec2.internal cannot be removed: default/kmderived-1506441123636-driver is not replicated
Sep 26 08:52:31 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  [2017-09-26T15:52:31Z] Incoming HTTP/1.1 GET /api/v1/log/source/default/kmderived-1506441123636-driver/pod request from 10.2.122.0:47336: {}
Sep 26 08:52:31 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  [2017-09-26T15:52:31Z] Incoming HTTP/1.1 GET /api/v1/log/default/kmderived-1506441123636-driver request from 10.2.122.0:47336: {}
Sep 26 08:52:42 data-cluster-slow-test k8s.cluster-autoscaler-1682651915-z6j80_kube-system_cluster-auto:  I0926 15:52:41.564109       1 cluster.go:89] Fast evaluation: node ip-10-253-50-31.ec2.internal cannot be removed: default/kmderived-1506441123636-driver is not replicated
Sep 26 08:52:45 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  [2017-09-26T15:52:45Z] Incoming HTTP/1.1 GET /api/v1/pod/default/kmderived-1506441123636-driver request from 10.2.122.0:47336: {}
Sep 26 08:52:45 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  Getting details of kmderived-1506441123636-driver pod in default namespace
Sep 26 08:52:45 data-cluster-slow-test k8s.kubernetes-dashboard-439418514-207jv_kube-system_kubernetes-:  Found 10 events related to kmderived-1506441123636-driver pod in default namespace
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult: 
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$waitForFutures$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:187)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$waitForFutures$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:187)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer.waitForFutures(KubernetesSparkDependencyDownloadInitContainer.scala:186)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer.run(KubernetesSparkDependencyDownloadInitContainer.scala:140)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$.main(KubernetesSparkDependencyDownloadInitContainer.scala:222)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer.main(KubernetesSparkDependencyDownloadInitContainer.scala)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  Caused by: java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: connect timed out
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:279)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.WaitableCallback$class.waitForCompletion(KubernetesSparkDependencyDownloadInitContainer.scala:53)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.DownloadTarGzCallback.waitForCompletion(KubernetesSparkDependencyDownloadInitContainer.scala:57)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$k8s$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1$$anonfun$apply$3.apply(KubernetesSparkDependencyDownloadInitContainer.scala:167)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$k8s$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1$$anonfun$apply$3.apply(KubernetesSparkDependencyDownloadInitContainer.scala:156)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.Option.foreach(Option.scala:257)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$k8s$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:156)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$org$apache$spark$deploy$rest$k8s$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:155)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.Option.foreach(Option.scala:257)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer.org$apache$spark$deploy$rest$k8s$KubernetesSparkDependencyDownloadInitContainer$$downloadResourcesFromStagingServer(KubernetesSparkDependencyDownloadInitContainer.scala:155)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply$mcV$sp(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at org.apache.spark.deploy.rest.k8s.KubernetesSparkDependencyDownloadInitContainer$$anonfun$1.apply(KubernetesSparkDependencyDownloadInitContainer.scala:104)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.lang.Thread.run(Thread.java:748)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:  Caused by: java.net.SocketTimeoutException: connect timed out
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.PlainSocketImpl.socketConnect(Native Method)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at java.net.Socket.connect(Socket.java:589)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:220)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:146)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:186)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:129)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Sep 26 08:52:51 data-cluster-slow-test k8s.kmderived-1506441123636-driver_default_spark-init:   ... 3 more
Sep 26 08:52:53 data-cluster-slow-test systemd: {"_BOOT_ID":"55547778ef634c26a15c3b46b6f2e020","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_SYSTEMD_SLICE":"system.slice","_CAP_EFFECTIVE":"3fffffffff","_TRANSPORT":"stdout","_MACHINE_ID":"50e74f29f36b43eca88a7f72305e96b0","_SELINUX_CONTEXT":"system_u:system_r:kernel_t:s0","_HOSTNAME":"ip-10-253-50-31.ec2.internal","_SYSTEMD_UNIT":"kubelet.service","_SYSTEMD_INVOCATION_ID":"499393ff4e984b5a81b921137ec4d4bd","SYSLOG_IDENTIFIER":"kubelet-wrapper","_PID":"1727","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_COMM":"kubelet","_EXE":"/var/lib/rkt/pods/run/3e947ebc-2d9f-4100-a899-2bd270d4748e/stage1/rootfs/opt/stage2/hyperkube/rootfs/hyperkube","_CMDLINE":"/kubelet --cni-conf-dir=/etc/kubernetes/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni --container-runtime=docker --rkt-path=/usr/bin/rkt --rkt-stage1-image=coreos.com/rkt/stage1-coreos --node-labels data.unbounce.net/role=worker --register-node=true --allow-privileged=true --pod-manifest-path=/etc/kubernetes/manifests --cluster-dns=10.253.50.31 --cluster-domain=cluster.local --cloud-provider=aws --cert-dir=/etc/kubernetes/ssl --tls-cert-file=/etc/kubernetes/ssl/worker.pem --tls-private-key-file=/etc/kubernetes/ssl/worker-key.pem --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml --require-kubeconfig","MESSAGE":"E0926 15:52:52.925532    1727 kuberuntime_container.go:59] Can't make a ref to pod \"kmderived-1506441123636-driver_default(a5e157ac-a2d2-11e7-a071-0e80cdd21cb6)\", container spark-init: selfLink was empty, can't make reference"}
luck02 commented 7 years ago

What's interesting is that I'm seeing app logs in there (socket timeout exception) but they don't show up in the pod logs in the UI. Not sure if that's expected.

Ahh, ok: so it's blowing up in k8s.kmderived-1506441123636-driver_default_spark-init, i.e. before the primary container has done anything, which makes sense. But it's ultra confusing when those logs aren't available in the UI.

luck02 commented 7 years ago

One thing I was never sure on was exactly how the resource staging server communicates with running processes, i.e. how does a running init-container find its resource staging server?

luck02 commented 7 years ago

Here's the spec on my resource staging server pod:

⇒  kubectl describe pod spark-resource-staging-server-4091971298-8ncwj
Name:       spark-resource-staging-server-4091971298-8ncwj
Namespace:  default
Node:       ip-10-253-42-225.ec2.internal/10.253.42.225
Start Time: Mon, 25 Sep 2017 16:32:55 -0700
Labels:     pod-template-hash=4091971298
        resource-staging-server-instance=default
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"spark-resource-staging-server-4091971298","uid":"b5c340fe-a019-1...
Status:     Running
IP:     10.2.31.25
Created By: ReplicaSet/spark-resource-staging-server-4091971298
Controlled By:  ReplicaSet/spark-resource-staging-server-4091971298
Containers:
  spark-resource-staging-server:
    Container ID:   docker://b39929d218a61d4b88cb667810b7f84e11c479cbb5ad1a0ca5cd3a5fe0fd3c5e
    Image:      kubespark/spark-resource-staging-server:v2.2.0-kubernetes-0.3.0
    Image ID:       docker-pullable://kubespark/spark-resource-staging-server@sha256:151b58248a1a0bc9a1bbc82e186a64cfb36dc108bead07a22d1e69e5e1c2ead3
    Port:       <none>
    Args:
      /etc/spark-resource-staging-server/resource-staging-server.properties
    State:      Running
      Started:      Mon, 25 Sep 2017 16:33:06 -0700
    Ready:      True
    Restart Count:  0
    Limits:
      cpu:  4
      memory:   1Gi
    Requests:
      cpu:      2
      memory:       768Mi
    Environment:    <none>
    Mounts:
      /etc/spark-resource-staging-server from resource-staging-server-properties (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-00b84 (ro)
Conditions:
  Type      Status
  Initialized   True
  Ready     True
  PodScheduled  True
Volumes:
  resource-staging-server-properties:
    Type:   ConfigMap (a volume populated by a ConfigMap)
    Name:   spark-resource-staging-server-config
    Optional:   false
  default-token-00b84:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-00b84
    Optional:   false
QoS Class:  Burstable
Node-Selectors: <none>
Tolerations:    <none>
Events:     <none>

and service:

⇒  kubectl describe services spark-resource-staging-service
Name:           spark-resource-staging-service
Namespace:      default
Labels:         <none>
Annotations:        <none>
Selector:       resource-staging-server-instance=default
Type:           LoadBalancer
IP:         10.3.0.161
LoadBalancer Ingress:   lb_ingress.etc
Port:           <unset> 10000/TCP
NodePort:       <unset> 32496/TCP
Endpoints:      10.2.31.25:10000
Session Affinity:   None
Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  2d        19m     64  service-controller          Normal      UpdatedLoadBalancer Updated load balancer with new hosts

It's whitelisted as well, so it's not open to the world. We'll probably sort out the TLS/SSL on our load balancer.
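If TLS were instead terminated at the staging server itself, the properties file mounted at /etc/spark-resource-staging-server/resource-staging-server.properties (shown in the pod spec above) is where the SSL settings would go. A minimal sketch, assuming Spark's standard spark.ssl.&lt;namespace&gt;.* key convention; the keystore path and password are placeholders, and the exact key names should be checked against the resource staging server docs:

```properties
# Hypothetical TLS config for the staging server itself. The 'enabled' key
# matches the one seen in the generated init-container config map; the
# keystore keys are assumptions based on Spark's spark.ssl.* convention.
spark.ssl.kubernetes.resourceStagingServer.enabled=true
spark.ssl.kubernetes.resourceStagingServer.keyStore=/mnt/secrets/staging-server/keyStore.jks
spark.ssl.kubernetes.resourceStagingServer.keyStorePassword=changeit
```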

mccheah commented 7 years ago

One thing I was never sure on was exactly how the resource staging server communicates with running processes, i.e. how does a running init-container find its resource staging server?

This should be derived from spark.kubernetes.resourceStagingServer.uri, which must be a URI that is routable both from the submission client and from the init-container.
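To make that concrete, a sketch of the relevant submission property as it would appear in this thread's setup (the ELB hostname is the one pasted in the logs earlier, not a recommended value):

```properties
# Must be reachable from the machine running spark-submit (to upload
# jars/files) AND from the init-containers inside the cluster (to
# download them), since both sides use this same URI.
spark.kubernetes.resourceStagingServer.uri=http://ab5edaaa8a01911e7b97c026e1e11203-421161154.us-east-1.elb.amazonaws.com:10000
```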

mccheah commented 7 years ago

Ahh, ok: so it's blowing up in k8s.kmderived-1506441123636-driver_default_spark-init, i.e. before the primary container has done anything, which makes sense. But it's ultra confusing when those logs aren't available in the UI.

Init-container logs in the UI might be a question for @foxish, but we actually had this information in the kubectl describe pod output from earlier and happened to miss it:

Init Containers:
  spark-init:
    Container ID:   docker://f96c6130a921c46b48813cf9127ed05cb4a87d158a608d215628788ec858c93d
    Image:      ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics:latest
    Image ID:       docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-derpmetrics@sha256:0f244ccfaf6eeedf793c808616b2867ccb7c709c74d6fea1694faaf558684a50
    Port:       <none>
    Args:
      /etc/spark-init/spark-init.properties
    State:      Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 25 Sep 2017 14:30:37 -0700
      Finished:     Mon, 25 Sep 2017 14:31:10 -0700
    Ready:      False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/spark-init from spark-init-properties (rw)
      /mnt/secrets/spark-init from spark-init-secret (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-k8s-service-account-token-fn4qg (ro)
      /var/spark-data/spark-files from download-files (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
luck02 commented 7 years ago

Ok, I think this was added in 2.2: spark.kubernetes.resourceStagingServer.internal.uri

The issue is that I had to whitelist that URL, and the whitelist doesn't include the cluster.

I'll test that shortly and then close this.

Also, I moved to the 2.2 release. Thanks!

luck02 commented 7 years ago

Ok, so it was a combination of things:

1. The IP for the staging server was only visible to my whitelists.
2. The documentation (as far as I read) didn't obviously indicate that the URL needed to be accessible both internally and externally. It might have been there.
3. The logs were appearing in the spark-init container, which doesn't (for whatever reason) log to the primary console, so the problem was effectively obfuscated.
4. PEBKAC.

The fix was that Spark 2.2 brought in an internal URI (spark.kubernetes.resourceStagingServer.internal.uri).
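For anyone landing here with the same symptom, a sketch of the resulting two-URI configuration: the external hostname below is a placeholder, the in-cluster service name comes from the kubectl describe services output above, and internal.uri is only available from the 2.2 release onward.

```properties
# Reachable from the submission client outside the cluster
# (this one can stay behind the whitelisted load balancer).
spark.kubernetes.resourceStagingServer.uri=http://my-whitelisted-lb.example.com:10000
# Reachable from pods inside the cluster, so the init-containers
# bypass the external load balancer and its whitelist entirely.
spark.kubernetes.resourceStagingServer.internal.uri=http://spark-resource-staging-service.default.svc.cluster.local:10000
```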