JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo
Apache License 2.0

Notebook connectivity issue #35

Closed: kyprifog closed this issue 4 years ago

kyprifog commented 4 years ago

Can't tell if this is a sparkmagic issue. I started Livy using the attached Helm chart and ran the kubectl exec example, which ran fine. Then I installed the jupyter-sparkmagic Helm chart after changing values.yaml to have:

livyEndpoint: "livy.livy:80"

which is the endpoint I think needs to be referenced here (I also tried livyheadless.livy:8998, livy.livy:8998, and the corresponding cluster IPs of both services exposed by Livy, as well as the same ports on the pods directly, bypassing the services).
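As a sanity check, here is a minimal sketch that can be run from a notebook cell in the jupyter-sparkmagic pod to probe each candidate endpoint (the hostnames and ports are just the candidates listed above):

import requests

# Probe each candidate Livy endpoint; hostnames/ports are the ones tried above.
candidates = [
    "http://livy.livy:80",
    "http://livy.livy:8998",
    "http://livyheadless.livy:8998",
]
for url in candidates:
    try:
        r = requests.get(url + "/sessions", timeout=3)
        print(url, "->", r.status_code)
    except requests.RequestException as exc:
        print(url, "->", exc)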

However, if I run the following in a notebook:

%load_ext sparkmagic.magics
%manage_spark

to try to establish a session against that Livy endpoint, all I get is this error:

HttpClientException: Error sending http request and maximum retry encountered.

This happens even though I can reach those endpoints from the jupyter-sparkmagic pod with curl and telnet. One strange thing: if I exec into the livy-0 pod spun up by the Livy chart and run:

livy-server status

it reports that the server is not running, even though the kubectl exec example worked and Livy appears to be running as PID 1.

kyprifog commented 4 years ago

Some updates: debugging a manual connection to Livy has sent me down a rabbit hole (sparkmagic doesn't seem to offer any easy way to do this):

import socket
import time

def isOpen(ip, port):
    # Try up to 3 times to open a TCP connection to ip:port.
    up = False
    for i in range(3):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.settimeout(1)
            s.connect((ip, port))
            up = True
        except OSError:
            time.sleep(1)
        finally:
            if up:
                s.shutdown(socket.SHUT_RDWR)
            s.close()
        if up:
            break

    path = ip + ":" + str(port)
    if up:
        print(path + " is UP")
    else:
        print(path + " is DOWN")

isOpen("100.96.14.16", 8998)  # cluster IP for livy-0

>>

100.96.14.16:8998 is UP

import json, pprint, requests, textwrap
host = 'http://100.96.14.16:8998' 
headers = {'Content-Type': 'application/json'}
spark_conf = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "livy-spark"
}

data = {'kind': 'spark', 'conf': spark_conf}
r1 = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r1.json()

>>

{'id': 35,
 'name': None,
 'appId': None,
 'owner': None,
 'proxyUser': None,
 'state': 'starting',
 'kind': 'spark',
 'appInfo': {'driverLogUrl': None,
  'sparkUiUrl': None,
  'executorLogUrls': None},
 'log': ['stdout: ', '\nstderr: ', '\nKubernetes Diagnostics: ']}
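For reference, here is a minimal polling sketch (assuming the standard Livy GET /sessions/{id} endpoint and reusing host, headers, and r1 from the snippet above) to wait for the session to leave the starting state:

import time

# Poll the session state until it is no longer starting (standard Livy REST API).
session_url = host + '/sessions/' + str(r1.json()['id'])
while True:
    state = requests.get(session_url, headers=headers).json()['state']
    print(state)
    if state not in ('not_started', 'starting'):
        break
    time.sleep(5)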

data = {
  'code': textwrap.dedent("""
    val NUM_SAMPLES = 100000;
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random();
      val y = Math.random();
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _);
    println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
    """)
}

statements_url = session_url + '/statements'

r = requests.post(statements_url, data=json.dumps(data), headers=headers)
pprint.pprint(r.json())

r = requests.get(session_url, headers=headers)
pprint.pprint(r.json())

>>

'java.lang.IllegalStateException: Session is in state starting'
{'appId': 'spark-ec26dd2ece084a27920610a43cb810c7',
 'appInfo': {'driverLogUrl': None, 'executorLogUrls': None, 'sparkUiUrl': None},
 'id': 35,
 'kind': 'spark',
 'log': ['\t\tspark-kubernetes-driver:',
         '\t\t\timage: '
         'sasnouskikh/livy-spark:0.7.0-incubating-spark_2.4.3_2.11-hadoop_3.2.0',
         '\t\t\trequests: cpu=Quantity(amount=1, format=null, '
         'additionalProperties={}), memory=Quantity(amount=1408Mi, '
         'format=null, additionalProperties={})',
         '\t\t\tlimits: memory=Quantity(amount=1408Mi, format=null, '
         'additionalProperties={})',
         '\t\t\tcommand: [] [driver, --properties-file, '
         '/opt/spark/conf/spark.properties, --class, '
         'org.apache.livy.rsc.driver.RSCDriverBootstrapper, spark-internal]',
         '\tconditions:',
         '\t\tPodCondition(lastProbeTime=null, '
         'lastTransitionTime=2020-07-08T15:08:54Z, message=null, reason=null, '
         'status=True, type=Initialized, additionalProperties={})',
         '\t\tPodCondition(lastProbeTime=null, '
         'lastTransitionTime=2020-07-08T15:08:55Z, message=null, reason=null, '
         'status=True, type=Ready, additionalProperties={})',
         '\t\tPodCondition(lastProbeTime=null, '
         'lastTransitionTime=2020-07-08T15:08:55Z, message=null, reason=null, '
         'status=True, type=ContainersReady, additionalProperties={})',
         '\t\tPodCondition(lastProbeTime=null, '
         'lastTransitionTime=2020-07-08T15:08:54Z, message=null, reason=null, '
         'status=True, type=PodScheduled, additionalProperties={})'],
 'name': None,
 'owner': None,
 'proxyUser': None,
 'state': 'starting'}

So the session never leaves the starting state. Looking at the Kubernetes logs for the session's driver pod, I found:

2020-07-08 15:09:01,040 INFO  [pool-3-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Added JAR file:///opt/livy/repl_2.11-jars/livy-repl_2.11-0.7.0-incubating-SNAPSHOT.jar at spark://livy-session-35-1594220933140-driver-svc.livy.svc:7078/jars/livy-repl_2.11-0.7.0-incubating-SNAPSHOT.jar with timestamp 1594220941040
2020-07-08 15:09:01,041 INFO  [pool-3-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Added JAR file:///opt/livy/repl_2.11-jars/livy-core_2.11-0.7.0-incubating-SNAPSHOT.jar at spark://livy-session-35-1594220933140-driver-svc.livy.svc:7078/jars/livy-core_2.11-0.7.0-incubating-SNAPSHOT.jar with timestamp 1594220941041
2020-07-08 15:09:01,041 INFO  [pool-3-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Added JAR file:///opt/livy/repl_2.11-jars/commons-codec-1.9.jar at spark://livy-session-35-1594220933140-driver-svc.livy.svc:7078/jars/commons-codec-1.9.jar with timestamp 1594220941041
2020-07-08 15:09:01,998 INFO  [kubernetes-executor-snapshots-subscribers-1] k8s.ExecutorPodsAllocator (Logging.scala:logInfo(54)) - Going to request 2 executors from Kubernetes.
2020-07-08 15:09:02,082 WARN  [OkHttp https://kubernetes.default.svc/...] internal.WatchConnectionManager (WatchConnectionManager.java:onFailure(197)) - Exec Failure: HTTP 403, Status: 403 - 
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:228)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:195)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-07-08 15:09:02,093 WARN  [pool-3-thread-1] k8s.ExecutorPodsWatchSnapshotSource (Logging.scala:logWarning(87)) - Kubernetes client has been closed (this is expected if the application is shutting down.)
2020-07-08 15:09:02,094 ERROR [pool-3-thread-1] spark.SparkContext (Logging.scala:logError(91)) - Error initializing SparkContext.
io.fabric8.kubernetes.client.KubernetesClientException: 
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201)
    at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:570)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:197)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "kubernetes-dispatcher-0" java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@16bcaca6 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@5af9178f[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

which looks like an instance of this error:

https://stackoverflow.com/questions/57887672/expected-http-101-response-but-was-403-forbidden
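To double-check the permissions angle, here is a minimal sketch (assuming the default in-cluster service account mount paths and the "livy" namespace) that reproduces the pod-watch call from inside the driver pod:

import requests

# Read the service account credentials mounted into the pod (Kubernetes defaults).
token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
ca_path = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
with open(token_path) as f:
    token = f.read().strip()

# Try the same pod-watch request that failed with 403 in the driver log above.
resp = requests.get(
    "https://kubernetes.default.svc/api/v1/namespaces/livy/pods?watch=true",
    headers={"Authorization": "Bearer " + token},
    verify=ca_path,
    stream=True,
    timeout=5,
)
print(resp.status_code)  # 403 would mean the driver's service account cannot watch pods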

So I tried adding the Kubernetes client JARs like so:

import json, pprint, requests, textwrap
host = 'http://100.96.14.16:8998'
headers = {'Content-Type': 'application/json'}
spark_conf = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "livy-spark"
}

jars = [
    "/opt/spark/jars/kubernetes-client-4.1.2.jar",
    "/opt/spark/jars/kubernetes-model-common-4.1.2.jar",
    "/opt/spark/jars/kubernetes-model-4.1.2.jar",
    "/opt/spark/jars/okhttp-3.11.0.jar",
    "/opt/spark/jars/okio-1.15.0.jar"
]

data = {'kind': 'spark', 'conf': spark_conf, 'jars': jars}
r1 = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r1.json()

>>

{'msg': 'requirement failed: Local path /opt/spark/jars/kubernetes-client-4.1.2.jar cannot be added to user sessions.'}

even though if I exec into the livy-0 pod and run:

ls /opt/spark/jars/kubernetes-client-4.1.2.jar 

it shows that the file is there. So now I'm really confused...

I'm especially confused that none of these issues showed up during the kubectl exec example, which runs fine.

Should I use a livy chart other than the one provided?

kyprifog commented 4 years ago

Combining this PR: https://github.com/jahstreet/spark-on-kubernetes-helm/pull/36 with making sure I deploy Jupyter in the same "livy" namespace lets me submit jobs manually through the Livy API, but I am still having trouble getting sparkmagic to work; I get the same error.

kyprifog commented 4 years ago

Following the README in the jupyter-sparkmagic chart more closely and making sure to deploy it to the jupyter-sparkmagic namespace seems to have resolved the issue.

jahstreet commented 4 years ago

Haven't gone through it yet; cool that you've managed to solve it on your own already. Thanks for the PR and the extensive debugging. I appreciate your contribution, @kyprifog.

kyprifog commented 4 years ago

No problem, the Livy image works great, thanks for putting it together. For the record, I am using a standard notebook image derived from the docker-stacks images with sparkmagic installed, not the Jupyter Helm chart you provided, and I am able to submit jobs to Livy with no problem, which is great! I thought there was more to your Jupyter notebook image, but after looking at it, it seems very similar to a plain docker-stacks image with sparkmagic installed, which made it easy to use our existing notebook image living outside the k8s cluster. I am still debugging some issues I'm having with sparkmagic, however.

jahstreet commented 4 years ago

its very similar to just a docker stacks image with sparkmagic installed

That is true indeed; the initial idea was to keep it that way for easy upgrades if necessary. In the future it could probably be derived from the fork + patch branch to make the relationship more explicit.