JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo
Apache License 2.0

403 Forbidden issue #24

Closed duongnt closed 4 years ago

duongnt commented 4 years ago

I tried to install in our GKE cluster and got this error:

2020-03-06 10:04:58,536 WARN  [OkHttp https://kubernetes.default.svc/...] internal.WatchConnectionManager (WatchConnectionManager.java:onFailure(197)) - Exec Failure: HTTP 403, Status: 403 - 
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:228)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:195)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-03-06 10:04:58,543 WARN  [pool-3-thread-1] k8s.ExecutorPodsWatchSnapshotSource (Logging.scala:logWarning(87)) - Kubernetes client has been closed (this is expected if the application is shutting down.)
2020-03-06 10:04:58,544 ERROR [pool-3-thread-1] spark.SparkContext (Logging.scala:logError(91)) - Error initializing SparkContext.
io.fabric8.kubernetes.client.KubernetesClientException: 
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201)
    at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:570)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:197)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-03-06 10:04:58,554 INFO  [pool-3-thread-1] server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped Spark@6fc37701{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2020-03-06 10:04:58,557 INFO  [pool-3-thread-1] ui.SparkUI (Logging.scala:logInfo(54)) - Stopped Spark web UI at http://livy-session-2-1583489087099-driver-svc.livy.svc:4040
2020-03-06 10:04:58,561 INFO  [pool-3-thread-1] k8s.KubernetesClusterSchedulerBackend (Logging.scala:logInfo(54)) - Shutting down all executors
2020-03-06 10:04:58,565 INFO  [dispatcher-event-loop-2] k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint (Logging.scala:logInfo(54)) - Asking each executor to shut down
2020-03-06 10:04:58,785 ERROR [kubernetes-executor-snapshots-subscribers-1] util.Utils (Logging.scala:logError(91)) - Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [livy]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:364)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.InterruptedIOException: interrupted
    at okio.Timeout.throwIfReached(Timeout.java:146)
    at okio.Okio$1.write(Okio.java:76)
    at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
    at okio.RealBufferedSink.flush(RealBufferedSink.java:224)
    at okhttp3.internal.http1.Http1Codec.finishRequest(Http1Codec.java:166)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:84)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:107)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
    at okhttp3.RealCall.execute(RealCall.java:77)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:227)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:787)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:357)
    ... 17 more
2020-03-06 10:04:58,854 INFO  [dispatcher-event-loop-6] spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(54)) - MapOutputTrackerMasterEndpoint stopped!
2020-03-06 10:04:58,861 INFO  [pool-3-thread-1] memory.MemoryStore (Logging.scala:logInfo(54)) - MemoryStore cleared
2020-03-06 10:04:58,862 INFO  [pool-3-thread-1] storage.BlockManager (Logging.scala:logInfo(54)) - BlockManager stopped
2020-03-06 10:04:58,868 INFO  [pool-3-thread-1] storage.BlockManagerMaster (Logging.scala:logInfo(54)) - BlockManagerMaster stopped
2020-03-06 10:04:58,868 WARN  [pool-3-thread-1] metrics.MetricsSystem (Logging.scala:logWarning(66)) - Stopping a MetricsSystem that is not running
2020-03-06 10:04:58,870 INFO  [dispatcher-event-loop-4] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(54)) - OutputCommitCoordinator stopped!
2020-03-06 10:04:58,883 INFO  [pool-3-thread-1] spark.SparkContext (Logging.scala:logInfo(54)) - Successfully stopped SparkContext
Exception in thread "main" java.lang.NullPointerException
    at org.apache.livy.rsc.driver.JobWrapper.cancel(JobWrapper.java:90)
    at org.apache.livy.rsc.driver.RSCDriver.shutdown(RSCDriver.java:127)
    at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:364)
    at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:93)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I did some research and apparently the kubernetes client jar is too old: https://stackoverflow.com/questions/57643079/kubernetes-watchconnectionmanager-exec-failure-http-403

I followed the suggestions there and replaced the jars; however, after that I got this error:

/opt/entrypoint.sh: line 45: /opt/spark/conf/spark-defaults.conf: Read-only file system
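
The jar swap can also be baked into a custom Spark image instead of being patched at runtime. A rough sketch of such a Dockerfile (the base image tag and the client version 4.6.1 are assumptions; pick the client version that matches your Kubernetes API from the fabric8 compatibility matrix):

```dockerfile
# Hypothetical base image tag; substitute the Spark image your chart actually uses
FROM sasnouskikh/spark:2.4.4

# Remove the old fabric8 Kubernetes client jars bundled with Spark
RUN rm /opt/spark/jars/kubernetes-client-*.jar \
       /opt/spark/jars/kubernetes-model-*.jar \
       /opt/spark/jars/kubernetes-model-common-*.jar

# Add a client version compatible with your cluster's Kubernetes API
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/4.6.1/kubernetes-model-4.6.1.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model-common/4.6.1/kubernetes-model-common-4.6.1.jar /opt/spark/jars/
```

Building a fresh image also sidesteps writes into `/opt/spark/conf`, which is read-only in the running container.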
jahstreet commented 4 years ago

Hi @duongnt , thank you for creating the issue. You're right: the root cause of the error you've seen is the old version of the fabric8 Java Kubernetes client. Please refer to the compatibility matrix, and note that you should upgrade the client jars on both the Livy and Spark classpaths. There are also some explanations here.

Could you please share the steps you took to replace the jars so I can reproduce your current issue? Also, which version of GKE have you tried?

kyprifog commented 4 years ago

I received the same error. This seems related: https://github.com/kubernetes/kubernetes/issues/82131

The solution suggested there seems to be upgrading to Spark 2.4.5.

jahstreet commented 4 years ago

@kyprifog , regarding the 403 Forbidden: yes, Spark 2.4.5 depends on fabric8 Kubernetes client 4.6.1, which is compatible with Kubernetes API 1.12.0 - 1.15.3. If you are running one of these versions, then upgrading Spark should work for you. The current Livy on Kubernetes build also depends on an older client version and needs to be upgraded as well (I have it on my nearest roadmap). It is strange to me that @duongnt gets the `/opt/entrypoint.sh: line 45: /opt/spark/conf/spark-defaults.conf: Read-only file system` error.
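
The 1.12.0 - 1.15.3 window can be checked mechanically against the version string your cluster reports (e.g. from `kubectl version`). A minimal sketch; the function names and the GKE-style sample string are illustrative, and the range endpoints are taken from the fabric8 4.6.1 compatibility note above:

```python
def parse_version(v):
    """Turn a version string like '1.14.10-gke.27' into a comparable tuple (1, 14, 10)."""
    core = v.lstrip("v").split("-")[0]  # drop a leading 'v' and any vendor suffix
    return tuple(int(part) for part in core.split("."))

def fabric8_461_compatible(cluster_version):
    """True if the Kubernetes API version falls inside 1.12.0 - 1.15.3 (fabric8 client 4.6.1)."""
    return (1, 12, 0) <= parse_version(cluster_version) <= (1, 15, 3)

print(fabric8_461_compatible("1.14.10-gke.27"))  # a GKE-style server version
print(fabric8_461_compatible("1.16.2"))
```

Tuple comparison handles the lexicographic ordering of (major, minor, patch) directly, so no version library is needed for this quick check.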