askhatri / livycluster

Apache License 2.0
3 stars · 1 fork

Livy unable to get spark k8s app status correctly #4

Open ashokkumarrathore opened 1 month ago

ashokkumarrathore commented 1 month ago

I upgraded the Kubernetes client from 5.6.0 to 6.5.1 to address P0 vulnerabilities in dependencies and am trying to run a simple job.

It submits the job, and the job succeeds. However, Livy marks it as failed because it is not able to get the app status. The relevant log from the Livy server is pasted below. I am also debugging it, but let me know if there is something I can try.

24/10/16 08:17:58 ERROR SparkKubernetesApp: Error while refreshing Kubernetes state
java.lang.IllegalStateException: Promise already completed.
    at scala.concurrent.Promise.complete(Promise.scala:53)
    at scala.concurrent.Promise.complete$(Promise.scala:52)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
    at scala.concurrent.Promise.failure(Promise.scala:104)
    at scala.concurrent.Promise.failure$(Promise.scala:104)
    at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:187)
    at org.apache.livy.utils.SparkKubernetesApp.org$apache$livy$utils$SparkKubernetesApp$$monitorSparkKubernetesApp(SparkKubernetesApp.scala:299)
    at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.$anonfun$run$9(SparkKubernetesApp.scala:210)
    at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.$anonfun$run$9$adapted(SparkKubernetesApp.scala:204)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at org.apache.livy.utils.SparkKubernetesApp$KubernetesAppMonitorRunnable.run(SparkKubernetesApp.scala:204)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
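
For context on the IllegalStateException above: a scala.concurrent.Promise can be completed only once, and any further complete/success/failure call throws exactly this error. A minimal standalone sketch (the promise name is illustrative, not Livy's actual field):

import scala.concurrent.Promise
import scala.util.Try

object PromiseReplay extends App {
  val appIdPromise = Promise[String]()           // illustrative name only
  appIdPromise.success("spark-app-123")          // first completion wins
  // A second completion attempt throws IllegalStateException: Promise already completed.
  println(Try(appIdPromise.failure(new RuntimeException("app lookup timed out"))))
  // tryFailure/trySuccess complete at most once and report the outcome as a Boolean instead.
  println(appIdPromise.tryFailure(new RuntimeException("app lookup timed out")))  // false
}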

ashokkumarrathore commented 1 month ago

The exception is thrown from here: https://github.com/apache/incubator-livy/blob/6097af1cdd536ebbe1d7eacb1513a440a5fa2784/server/src/main/scala/org/apache/livy/utils/SparkKubernetesApp.scala#L298

askhatri commented 1 month ago

OK, this means it is failing to get the AppId because the lookup timed out.

ashokkumarrathore commented 1 month ago

I think this is the call that fails: withRetry(kubernetesClient.getApplications().find(_.getApplicationTag.contains(appTag))). But I am not sure why it would fail if the driver is there and running fine. I need to see if there is some API behaviour change in the new version.
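
One way to check whether the fabric8 6.x call itself behaves differently is to run the same label-based driver lookup outside Livy. A rough standalone sketch, assuming Scala 2.12 and fabric8 6.5.x; the namespace and tag values are placeholders (Spark on K8s labels driver pods with spark-role=driver and spark-app-tag):

import io.fabric8.kubernetes.client.KubernetesClientBuilder
import scala.collection.JavaConverters._

object DriverLookupCheck extends App {
  val appTag       = "livy-batch-0-example"   // placeholder app tag
  val jobNamespace = "spark-jobs"             // placeholder namespace the job was submitted to

  val client = new KubernetesClientBuilder().build()
  try {
    val drivers = client.pods()
      .inNamespace(jobNamespace)              // scope the query to one namespace
      .withLabel("spark-role", "driver")      // standard label set by Spark on K8s
      .list().getItems.asScala

    drivers.find { p =>
      Option(p.getMetadata.getLabels)
        .flatMap(ls => Option(ls.get("spark-app-tag")))
        .exists(_.contains(appTag))
    } match {
      case Some(pod) => println(s"Found driver ${pod.getMetadata.getName}, phase=${pod.getStatus.getPhase}")
      case None      => println(s"No driver pod found for tag $appTag")
    }
  } finally client.close()
}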

ashokkumarrathore commented 1 week ago

There is a bug due to which the original exception is masked. Here is the original exception:

24/11/11 16:40:31 INFO SparkKubernetesApp: (Failed to get app from tag: ,io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc.cluster.local/api/v1/pods?labelSelector=spark-role%3Ddriver%2Cspark-app-tag%2Cspark-app-selector. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked

The service account has permission on the namespace where the job is submitted. Does Livy look in all namespaces for the app, or just the namespace to which the job was submitted? I am curious why the same setup works with older versions of Spark/Hadoop but not the new one. Please let me know if you have any inputs.
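
For reference, the GET in that error hits /api/v1/pods, i.e. a cluster-wide pod listing, which needs "list pods" at cluster scope (a ClusterRole); a namespace-scoped listing goes to /api/v1/namespaces/{ns}/pods and a Role in the job's namespace is enough. A sketch of the two forms with the fabric8 6.x client (method names here are illustrative, not Livy's code):

import io.fabric8.kubernetes.api.model.PodList
import io.fabric8.kubernetes.client.KubernetesClient

object DriverQueries {
  // Cluster-wide: GET /api/v1/pods — requires cluster-scoped list permission.
  def listDriversClusterWide(client: KubernetesClient): PodList =
    client.pods().inAnyNamespace().withLabel("spark-role", "driver").list()

  // Namespace-scoped: GET /api/v1/namespaces/{ns}/pods — a namespace Role suffices.
  def listDriversInNamespace(client: KubernetesClient, jobNamespace: String): PodList =
    client.pods().inNamespace(jobNamespace).withLabel("spark-role", "driver").list()
}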

ashokkumarrathore commented 1 week ago

I think it is an issue with the implementation. In a multi-tenant cluster, the service account might not have permission on all namespaces. We should look for the job within its namespace to avoid this issue. This is actually a regression: Spark jobs (on K8s) work fine if I use the build from before we added Spark K8s support.

askhatri commented 1 week ago

Thank you @ashokkumarrathore for the details. I see that you have created issue https://github.com/apache/incubator-livy/issues/461. Do you have a potential fix in mind already? If so, we can try to make the related code changes together.

ashokkumarrathore commented 4 days ago

I think there should be multiple changes (a rough sketch follows the list):

  1. KubernetesClient: we can initialise it with the default namespace, but when we call getApplications(), we should use the namespace the job was submitted to.
  2. Currently we just initialise the KubernetesClient with livyConf. We need to see whether the namespace can be overridden after object creation; if not, we need to defer initialising the client until later.
  3. We also need to think about how sharing the K8s client works. If clients are scoped to different namespaces, they can only be shared when their configs match.
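
A very rough sketch of how points 1–3 could fit together; all names below are hypothetical, nothing here exists in Livy today. The idea is one client per distinct connection config derived from livyConf, with the namespace supplied per lookup rather than fixed at client construction:

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}
import scala.collection.JavaConverters._
import scala.collection.concurrent.TrieMap

object KubernetesClientCache {
  // One client per distinct connection config (master URL, auth, ...), shared across namespaces.
  private val clients = TrieMap.empty[String, KubernetesClient]

  // `configKey` stands in for whatever uniquely identifies the settings taken from livyConf.
  def clientFor(configKey: String): KubernetesClient =
    clients.getOrElseUpdate(configKey, new KubernetesClientBuilder().build())

  // Namespace-scoped lookup: avoids the cluster-wide GET /api/v1/pods that fails
  // when the service account only has rights in the job's namespace.
  def findDriver(configKey: String, jobNamespace: String, appTag: String): Option[Pod] =
    clientFor(configKey).pods()
      .inNamespace(jobNamespace)
      .withLabel("spark-role", "driver")
      .list().getItems.asScala
      .find { pod =>
        Option(pod.getMetadata.getLabels)
          .flatMap(ls => Option(ls.get("spark-app-tag")))
          .exists(_.contains(appTag))
      }
}
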
askhatri commented 4 days ago

Yes, this seems like a great idea.