It looks like a configuration issue.
Lighter tries to create the Spark driver pod, but fails due to PeerUnverifiedException: Hostname kubernetes.default.svc.cluster.local not verified. kubernetes.default.svc.cluster.local is a DNS name for accessing the Kubernetes API service; it should be available on all Kubernetes installations.
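To rule Lighter out, a minimal stdlib Python sketch run inside the Lighter pod should fail the same way (paths and port are the usual in-cluster defaults; this is an assumption about your setup):

import socket
import ssl

CA = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
HOST = "kubernetes.default.svc.cluster.local"

ctx = ssl.create_default_context(cafile=CA)
with socket.create_connection((HOST, 443)) as sock:
    # The handshake verifies both the trust chain (via the mounted CA) and the
    # hostname; a failure here mirrors the PeerUnverifiedException Lighter reports.
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        print("hostname verified; peer subject:", tls.getpeercert()["subject"])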
Maybe something is wrong with the Service Account or Role binding (I'd imagine the errors would be different in that case, but maybe...). Have you followed the documentation at https://github.com/exacaster/lighter/blob/master/docs/kubernetes.md?
Did you change anything in that documentation to fit your needs? If you did everything according to the documentation, these Kubernetes resources should be present:
➜ ~ kubectl get pods -n spark
NAME READY STATUS RESTARTS AGE
lighter-5965d6ffb8-qf457 1/1 Running 0 2d21h
➜ ~ kubectl get sa -n spark
NAME SECRETS AGE
default 1 619d
spark 1 583d
➜ ~ kubectl get rolebinding -n spark
NAME ROLE AGE
lighter-spark Role/lighter-spark 583d
➜ ~
You can also try changing LIGHTER_KUBERNETES_MASTER to k8s://kubernetes.default.svc:443, or to the corresponding IP address.
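To pick a value that actually matches the certificate, the same handshake sketch as above with the hostname check disabled prints which names the API server certificate covers (illustrative; the chain is still verified via the mounted CA):

import socket
import ssl

CA = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

ctx = ssl.create_default_context(cafile=CA)
ctx.check_hostname = False  # only inspecting the certificate, not the name
with socket.create_connection(("kubernetes.default.svc", 443)) as sock:
    with ctx.wrap_socket(sock) as tls:
        print(tls.getpeercert().get("subjectAltName"))

Any DNS name or IP listed there should work in the master URL.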
Hi @pdambrauskas,
Gotcha! I'm just surprised, because if Lighter simply inherits the Kubernetes service account under /var/run/secrets/kubernetes.io, the CA certificate is present there:
root@lighter-79bfc75b4b-q25rs:/var/run/secrets/kubernetes.io/serviceaccount# ls
ca.crt namespace token
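(For illustration, a hypothetical stdlib call that consumes exactly these mounted files; the endpoint is the standard in-cluster API address:)

import ssl
import urllib.request

base = "/var/run/secrets/kubernetes.io/serviceaccount"
with open(base + "/token") as f:
    token = f.read()

# Trust the mounted CA and authenticate with the service account token.
ctx = ssl.create_default_context(cafile=base + "/ca.crt")
req = urllib.request.Request(
    "https://kubernetes.default.svc/version",
    headers={"Authorization": "Bearer " + token},
)
print(urllib.request.urlopen(req, context=ctx).read().decode())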
I did follow the Kubernetes guide you reference, and I seem to have all the required resources:
coder@coder-niklasrosenstein-workspace:~/git/cluster-configuration > kubectl get pods -n spark
NAME READY STATUS RESTARTS AGE
lighter-79bfc75b4b-q25rs 1/1 Running 0 14h
lighter-db-postgresql-0 1/1 Running 0 37h
spark-master-0 1/1 Running 0 37h
spark-worker-0 1/1 Running 0 37h
spark-worker-1 1/1 Running 0 34h
coder@coder-niklasrosenstein-workspace:~/git/cluster-configuration > kubectl get sa -n spark
NAME SECRETS AGE
default 1 14d
spark 1 14d
coder@coder-niklasrosenstein-workspace:~/git/cluster-configuration > kubectl get rolebinding -n spark
NAME ROLE AGE
lighter-spark Role/lighter-spark 14d
You don't seem to have Spark installed separately; does spark-master just live in a different namespace in your case?
I've tried setting LIGHTER_KUBERNETES_MASTER to k8s://IP:6443, where IP is the address of the first network interface on which the Kubernetes API service is running (I currently run a k0s distribution on a single node). That seems to have changed the results a bit.
14:09:00.125 [scheduled-executor-thread-23] INFO c.e.l.a.sessions.SessionHandler - Start provisioning permanent sessions.
14:09:00.125 [scheduled-executor-thread-23] INFO c.e.l.a.sessions.SessionHandler - End provisioning permanent sessions.
14:09:00.127 [scheduled-executor-thread-11] INFO c.e.l.a.sessions.SessionHandler - Launching Application[id='f4788106-d005-49e9-be74-30b9c1216baf', type=SESSION, state=NOT_STARTED, appId='null', appInfo='null', submitParams=SubmitParams[name='session_d3d8a4aa-2de3-47e2-bf20-5a4fc0f3a648', file='http://lighter.spark:8080/lighter/jobs/shell_wrapper.py', master='null', mainClass='null', numExecutors=1, executorCores=1, executorMemory='1000M', driverCores=1, driverMemory='1000M', args=[], pyFiles=[], files=[], jars=[], archives=[], conf={}], createdAt=2023-01-14T14:08:57.274658, contactedAt=null]
Jan 14, 2023 2:09:00 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: /home/app/spark//bin/load-spark-env.sh: line 68: ps: command not found
14:09:00.559 [scheduled-executor-thread-12] INFO c.e.l.application.batch.BatchHandler - Completed 0 jobs
14:09:00.562 [scheduled-executor-thread-8] INFO c.e.l.application.batch.BatchHandler - Processing scheduled batches, found empty slots: 15, using 10
14:09:00.563 [scheduled-executor-thread-8] INFO c.e.l.application.batch.BatchHandler - Waiting launches to complete
Jan 14, 2023 2:09:01 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Jan 14, 2023 2:09:01 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:01 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 WARN DriverServiceFeatureStep: Driver's hostname would preferably be session-d3d8a4aa-2de3-47e2-bf20-5a4fc0f3a648-5809d285b09cc25d-driver-svc, but this is too long (must be <= 63 characters). Falling back to use spark-6613ee85b09cc4a8-driver-svc as the driver service's name.
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 INFO ShutdownHookManager: Shutdown hook called
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-354cb523-92fb-4a13-ac8e-5fc65ebb8273
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-f669604b-f230-4fe6-b4ed-270ab5ef5d2b
Jan 14, 2023 2:09:02 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 23/01/14 14:09:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-9425cac7-d3a7-409d-b090-ec1e46df7ca1
14:09:02.596 [launcher-proc-1] INFO c.e.l.backend.ClusterSparkListener - State change. AppId: null, State: LOST
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by py4j.reflection.ReflectionShim (file:/home/app/libs/py4j-0.10.9.7.jar) to method java.util.ArrayList$Itr.next()
WARNING: Please consider reporting this to the maintainers of py4j.reflection.ReflectionShim
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
14:09:30.569 [scheduled-executor-thread-12] INFO c.e.l.application.batch.BatchHandler - Completed 0 jobs
14:09:30.571 [scheduled-executor-thread-16] INFO c.e.l.application.batch.BatchHandler - Processing scheduled batches, found empty slots: 15, using 10
14:09:30.571 [scheduled-executor-thread-16] INFO c.e.l.application.batch.BatchHandler - Waiting launches to complete
14:10:00.125 [scheduled-executor-thread-1] INFO c.e.l.a.sessions.SessionHandler - Start provisioning permanent sessions.
14:10:00.125 [scheduled-executor-thread-1] INFO c.e.l.a.sessions.SessionHandler - End provisioning permanent sessions.
14:10:00.457 [scheduled-executor-thread-9] INFO c.e.l.a.ApplicationStatusHandler - Tracking Application[id='f4788106-d005-49e9-be74-30b9c1216baf', type=SESSION, state=STARTING, appId='null', appInfo='null', submitParams=SubmitParams[name='session_d3d8a4aa-2de3-47e2-bf20-5a4fc0f3a648', file='http://lighter.spark:8080/lighter/jobs/shell_wrapper.py', master='null', mainClass='null', numExecutors=1, executorCores=1, executorMemory='1000M', driverCores=1, driverMemory='1000M', args=[], pyFiles=[], files=[], jars=[], archives=[], conf={}], createdAt=2023-01-14T14:08:57.274658, contactedAt=2023-01-14T14:09:00.131426], info: ApplicationInfo[state=IDLE, applicationId='spark-bf370e5a0e784d49a86d927ff4873144']
14:10:00.578 [scheduled-executor-thread-6] INFO c.e.l.application.batch.BatchHandler - Completed 0 jobs
14:10:00.579 [scheduled-executor-thread-10] INFO c.e.l.application.batch.BatchHandler - Processing scheduled batches, found empty slots: 15, using 10
14:10:00.579 [scheduled-executor-thread-10] INFO c.e.l.application.batch.BatchHandler - Waiting launches to complete
14:10:00.866 [Thread-3] INFO c.e.l.a.s.p.p.PythonSessionIntegration - Waiting: [Statement[id='9d86b013-dfac-411e-95a7-1b8e4fe75dc8', code='spark', output=null, state='waiting', createdAt='2023-01-14T14:10:00.860492']]
14:10:00.877 [Thread-3] WARN c.e.l.a.s.p.p.PythonSessionIntegration - Handling response for f4788106-d005-49e9-be74-30b9c1216baf : 9d86b013-dfac-411e-95a7-1b8e4fe75dc8 --- {content={text/plain=<pyspark.sql.session.SparkSession object at 0x7f6e8474e5b0>}}
14:10:01.182 [Thread-3] INFO c.e.l.a.s.p.p.PythonSessionIntegration - Waiting: [Statement[id='23fb2f98-5221-40ba-b151-a24b24da46e1', code='', output=null, state='waiting', createdAt='2023-01-14T14:10:01.171992']]
14:10:01.192 [Thread-3] WARN c.e.l.a.s.p.p.PythonSessionIntegration - Handling response for f4788106-d005-49e9-be74-30b9c1216baf : 23fb2f98-5221-40ba-b151-a24b24da46e1 --- {error=IndexError, message=pop from empty list, traceback=[Traceback (most recent call last):
, File "/tmp/spark-daeb26e9-9c22-4361-8210-b9992b4fdf49/shell_wrapper.py", line 112, in exec
self._exec_then_eval(code.rstrip())
, File "/tmp/spark-daeb26e9-9c22-4361-8210-b9992b4fdf49/shell_wrapper.py", line 103, in _exec_then_eval
last = ast.Interactive([block.body.pop()])
, IndexError: pop from empty list
]}
14:10:30.584 [scheduled-executor-thread-6] INFO c.e.l.application.batch.BatchHandler - Completed 0 jobs
14:10:30.585 [scheduled-executor-thread-20] INFO c.e.l.application.batch.BatchHandler - Processing scheduled batches, found empty slots: 15, using 10
14:10:30.585 [scheduled-executor-thread-20] INFO c.e.l.application.batch.BatchHandler - Waiting launches to complete
You don't seem to have Spark installed separately; does spark-master just live in a different namespace in your case?
We do not run Spark in standalone mode, so no Spark master is needed. More details here: https://spark.apache.org/docs/latest/running-on-kubernetes.html
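The difference is just the master URL the driver is given; a rough PySpark sketch (URLs illustrative, the usual Kubernetes image/executor conf omitted):

from pyspark.sql import SparkSession

# On Kubernetes the "master" is the API server itself (k8s:// scheme); a
# standalone master (spark:// scheme) is a separate process we don't run.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    # .master("spark://spark-master:7077")  # standalone mode, not used here
    .appName("example")
    .getOrCreate()
)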
That seems to have changed the results a bit.
It looks like the session launched more or less successfully, and the Spark session was created:
14:10:00.866 [Thread-3] INFO c.e.l.a.s.p.p.PythonSessionIntegration - Waiting: [Statement[id='9d86b013-dfac-411e-95a7-1b8e4fe75dc8', code='spark', output=null, state='waiting', createdAt='2023-01-14T14:10:00.860492']]
14:10:00.877 [Thread-3] WARN c.e.l.a.s.p.p.PythonSessionIntegration - Handling response for f4788106-d005-49e9-be74-30b9c1216baf : 9d86b013-dfac-411e-95a7-1b8e4fe75dc8 --- {content={text/plain=<pyspark.sql.session.SparkSession object at 0x7f6e8474e5b0>}}
But your next statement failed:
14:10:01.182 [Thread-3] INFO c.e.l.a.s.p.p.PythonSessionIntegration - Waiting: [Statement[id='23fb2f98-5221-40ba-b151-a24b24da46e1', code='', output=null, state='waiting', createdAt='2023-01-14T14:10:01.171992']]
14:10:01.192 [Thread-3] WARN c.e.l.a.s.p.p.PythonSessionIntegration - Handling response for f4788106-d005-49e9-be74-30b9c1216baf : 23fb2f98-5221-40ba-b151-a24b24da46e1 --- {error=IndexError, message=pop from empty list, traceback=[Traceback (most recent call last):
I do not understand how you managed to send empty code with this statement (code=''); our Jupyter notebook just skips these kinds of statements. Did the session eventually fail? Have you tried executing more statements? Can you check the Lighter UI to see the status of your session, and check the logs of the session driver pod? It looks like the session did not fail; only one of your statements got an error response.
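For reference, the traceback is reproducible with plain ast, independent of Spark: parsing empty code yields a module with an empty body, so there is nothing to pop:

import ast

block = ast.parse("", mode="exec")
print(block.body)  # [] - an empty module has no statements
block.body.pop()   # IndexError: pop from empty list, as in the traceback above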
I think that was caused by running a cell with just %spark in it.
Executing this example from SparkMagic
%%spark
numbers = sc.parallelize([1, 2, 3, 4])
print('First element of numbers is {} and its description is:\n{}'.format(numbers.first(), numbers.toDebugString()))
gives me
15:24:37.727 [Thread-3] INFO c.e.l.a.s.p.p.PythonSessionIntegration - Waiting: [Statement[id='72d32f7f-abd9-497a-b494-538bc974a5de', code='numbers = sc.parallelize([1, 2, 3, 4])
print('First element of numbers is {} and its description is:\n{}'.format(numbers.first(), numbers.toDebugString()))\
', output=null, state='waiting', createdAt='2023-01-14T15:24:37.716379']]
15:24:37.735 [Thread-3] WARN c.e.l.a.s.p.p.PythonSessionIntegration - Handling response for f4788106-d005-49e9-be74-30b9c1216baf : 72d32f7f-abd9-497a-b494-538bc974a5de --- {error=SyntaxError, message=unexpected EOF while parsing (<unknown>, line 2), traceback=[Traceback (most recent call last):
, File "/tmp/spark-daeb26e9-9c22-4361-8210-b9992b4fdf49/shell_wrapper.py", line 112, in exec
self._exec_then_eval(code.rstrip())
, File "/tmp/spark-daeb26e9-9c22-4361-8210-b9992b4fdf49/shell_wrapper.py", line 100, in _exec_then_eval
block = ast.parse(code, mode='exec')
, File "/usr/lib/python3.9/ast.py", line 50, in parse
return compile(source, filename, mode, flags,
, File "<unknown>", line 2
, print('First element of numbers is {} and its description is:\n{}'.format(numbers.first(), numbers.toDebugString()))\
, ^
, SyntaxError: unexpected EOF while parsing
]}
(and basically the same error shown in the notebook)
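I suspect the stray backslash at the end of the submitted code (visible at the end of the statement in the log above) is enough on its own; at least on plain Python 3.9 it reproduces the same error:

import ast

# The statement ends with a lone backslash, i.e. a line continuation with
# nothing after it, which the parser rejects.
code = "numbers = sc.parallelize([1, 2, 3, 4])\nprint(numbers.first())\\"
ast.parse(code, mode="exec")  # SyntaxError: unexpected EOF while parsing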
The Lighter UI shows no Batches and one Session:
Just use spark without %%.
Also, I see that you've selected scala as your session language. Not sure if you've noticed, but Lighter sessions only support Python, so it makes no difference what language you choose in the UI; Lighter will always start a PySpark session.
I've tried your example in my notebook:
I've also managed to run a spark cell in a Python session (like you tried):
When I try to start an application via SparkMagic through Lighter, the application seems to fail to start and the Lighter logs show the following error:
I'm unsure where exactly that certificate error occurs. Is it in the Spark driver pod that Lighter presumably starts in the Kubernetes cluster where I have both it and Spark running? (Although I have not seen a pod spawning between clicking "Create Session" in SparkMagic and the error showing up in the Lighter logs.) Or is it in the Lighter configuration, which I need to make aware of my cluster's CA?