Open ashokkumarrathore opened 1 month ago
OK, this means it is failing to get the AppId because the call timed out.
I think this is the call that fails: withRetry(kubernetesClient.getApplications().find(_.getApplicationTag.contains(appTag))). But I am not sure why it would fail if the driver is there and running fine. We need to check whether there is an API behaviour change in the new client version.
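To make the failure mode concrete, here is a minimal sketch of what a withRetry helper like this typically does (illustrative only; RetrySketch and its shape are assumptions, not Livy's actual implementation). If every attempt fails, the last exception escapes, which is what surfaces as the "failed to get app" error:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Retry a by-name operation up to `attempts` times, rethrowing the
  // final failure if all attempts are exhausted.
  @tailrec
  def withRetry[T](attempts: Int)(op: => T): T = Try(op) match {
    case Success(v)                 => v
    case Failure(_) if attempts > 1 => withRetry(attempts - 1)(op)
    case Failure(e)                 => throw e
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // Simulated flaky API call: succeeds on the third attempt.
    val appId = withRetry(3) {
      calls += 1
      if (calls < 3) throw new RuntimeException("transient API failure")
      "spark-app-123"
    }
    println(s"$appId after $calls calls")
  }
}
```

The point of the sketch is that the retry wrapper cannot help when the underlying call fails deterministically (e.g. a permissions error), which matches the behaviour described below.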
There is a bug due to which the original exception is masked. Here's the original exception:
24/11/11 16:40:31 INFO SparkKubernetesApp: (Failed to get app from tag: ,io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc.cluster.local/api/v1/pods?labelSelector=spark-role%3Ddriver%2Cspark-app-tag%2Cspark-app-selector. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked
The service account has permission on the namespace where the job is submitted. Does Livy look in all namespaces for the app, or just the namespace to which the job was submitted? I am curious why the same setup works with older versions of Spark/Hadoop but not the new one. Please let me know if you have any inputs.
I think it is an issue with the implementation. In a multi-tenant cluster, the service account might not have permission on all namespaces. We should look for the job within its own namespace to avoid this issue. This is actually a regression: Spark jobs (on K8s) work fine if I use the build from before we added Spark-on-K8s support.
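The proposed fix can be modelled in a few lines. This is an illustrative sketch only (Pod, findClusterWide, and findInNamespace are hypothetical names, not Livy's or fabric8's API): a cluster-wide pod search needs list permission on every namespace, while a namespace-scoped search only needs it on the job's own namespace.

```scala
// Toy model of driver-pod lookup by spark-app-tag label.
case class Pod(namespace: String, labels: Map[String, String])

object NamespaceScopedLookup {
  val pods: Seq[Pod] = Seq(
    Pod("tenant-a", Map("spark-role" -> "driver", "spark-app-tag" -> "livy-batch-0")),
    Pod("tenant-b", Map("spark-role" -> "driver", "spark-app-tag" -> "livy-batch-1"))
  )

  // Cluster-wide: in a multi-tenant cluster this is what triggers the
  // Forbidden error, since it implies listing pods in every namespace.
  def findClusterWide(tag: String): Option[Pod] =
    pods.find(_.labels.get("spark-app-tag").contains(tag))

  // Namespace-scoped: restrict the search to the submission namespace,
  // which is the only one the service account is guaranteed to see.
  def findInNamespace(ns: String, tag: String): Option[Pod] =
    pods.filter(_.namespace == ns).find(_.labels.get("spark-app-tag").contains(tag))

  def main(args: Array[String]): Unit = {
    println(findInNamespace("tenant-a", "livy-batch-0").isDefined)
    println(findInNamespace("tenant-a", "livy-batch-1").isDefined)
  }
}
```

With the real fabric8 client, the analogous change would be scoping the pod list call to a single namespace instead of querying across the cluster.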
Thank you @ashokkumarrathore for the details. I see that you have created issue https://github.com/apache/incubator-livy/issues/461. Do you already have a potential fix in mind? If so, we can work on the related code changes together.
I think there should be multiple changes.
Yes, this seems like a great idea.
I upgraded the Kubernetes client from 5.6.0 to 6.5.1 to address P0 vulnerabilities in dependencies, and I am trying to run a simple job.
The job is submitted and succeeds. However, Livy marks it as failed because it is not able to get the app status. The relevant log from the Livy server is pasted below. I am also debugging it, but let me know if there is something I can try.
24/10/16 08:17:58 ERROR SparkKubernetesApp: Error while refreshing Kubernetes state
java.lang.IllegalStateException: Promise already completed.
        at scala.concurrent.Promise.complete(Promise.scala:53)
        at scala.concurrent.Promise.complete$(Promise.scala:52)
        at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
        at scala.concurrent.Promise.failure(Promise.scala:104)
        at scala.concurrent.Promise.failure$(Promise.scala:104)
        at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:187)
        at org.apache.livy.utils.SparkKubernetesApp.org$apache$livy$utils$SparkKubernetesApp$$monitorSparkKubernetesApp(SparkKubernetesApp.scala:299)
        at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.$anonfun$run$9(SparkKubernetesApp.scala:210)
        at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.$anonfun$run$9$adapted(SparkKubernetesApp.scala:204)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.run(SparkKubernetesApp.scala:204)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
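For context, the "Promise already completed" IllegalStateException is what scala.concurrent.Promise throws when the same promise is completed twice: failure (and success) throw on a second completion, while tryFailure/trySuccess return false instead. A minimal reproduction (the demo method and its names are mine, not Livy's code):

```scala
import scala.concurrent.Promise
import scala.util.Try

object PromiseCompletion {
  // Completes a promise once, then tries to fail it again two ways.
  // Returns (secondCompleteThrew, tryFailureAccepted).
  def demo(): (Boolean, Boolean) = {
    val p = Promise[String]()
    p.success("spark-app-123")                                    // first completion: fine
    val threw = Try(p.failure(new RuntimeException("timeout"))).isFailure  // throws IllegalStateException
    val accepted = p.tryFailure(new RuntimeException("timeout"))  // safe variant: returns false
    (threw, accepted)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

This suggests the monitor loop at SparkKubernetesApp.scala:299 is attempting to fail a promise that was already completed on an earlier iteration, which would be avoided by the try-variant or by guarding on completion state.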