apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/

[Bug] Spark engine cannot exit when the authz plugin is enabled #4270

Closed pan3793 closed 1 year ago

pan3793 commented 1 year ago

Code of Conduct

Search before asking

Describe the bug

It was reported by some users that when using the Spark Ranger plugin (the Kyuubi authz module) with Kyuubi, the Driver JVM process cannot exit even after SparkSQLEngine and SparkContext were stopped.

After some investigation with jstack, I found the JVM hangs because of non-daemon threads opened by the Ranger audit plugins, i.e. ES and Solr.

See RANGER-3787 for details about the issue related to the ES audit plugin.
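
For illustration only (this is not Kyuubi code), the following minimal Scala snippet shows the underlying JVM behavior: a single non-daemon thread is enough to keep the process alive after the main thread returns, which is what the Ranger audit threads do to the Spark driver here.

    // Toy example: one non-daemon thread keeps the JVM alive after main()
    // returns, mirroring what the Ranger audit threads do to the Spark driver
    // after SparkContext.stop().
    object NonDaemonThreadDemo {
      def main(args: Array[String]): Unit = {
        val t = new Thread(() => Thread.sleep(Long.MaxValue), "audit-like-thread")
        t.setDaemon(false) // the default; a daemon thread would not block JVM exit
        t.start()
        println("main finished") // printed immediately, but the JVM does not exit
      }
    }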

Affects Version(s)

master/1.6

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

As you can see, the SparkSQLEngine and SparkContext were stopped, but the JVM process is still alive.

23/02/08 03:45:03 INFO EngineServiceDiscovery: Clean up discovery service due to this is connection share level.
23/02/08 03:45:03 INFO EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
23/02/08 03:45:03 INFO SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
23/02/08 03:45:03 INFO SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
23/02/08 03:45:03 INFO SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
23/02/08 03:45:03 INFO SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
23/02/08 03:45:03 INFO SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
23/02/08 03:45:03 INFO SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
23/02/08 03:45:03 INFO SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
23/02/08 03:45:03 INFO SparkSQLBackendService: Service[SparkSQLBackendService] is stopped.
23/02/08 03:45:03 INFO SparkSQLEngine: Service[SparkSQLEngine] is stopped.
23/02/08 03:45:03 INFO SparkTBinaryFrontendService: Finished closing SessionHandle [a1176c3c-e2a4-497c-b0fe-af6907017284]
23/02/08 03:45:03 INFO SparkUI: Stopped Spark web UI at http://spark-8bca9c862f1fda07-driver-svc.spark.svc:4040
23/02/08 03:45:03 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
23/02/08 03:45:03 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/02/08 03:45:03 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
23/02/08 03:45:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/02/08 03:45:04 INFO MemoryStore: MemoryStore cleared
23/02/08 03:45:04 INFO BlockManager: BlockManager stopped
23/02/08 03:45:04 INFO BlockManagerMaster: BlockManagerMaster stopped
23/02/08 03:45:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/02/08 03:45:04 INFO SparkContext: Successfully stopped SparkContext
23/02/08 03:45:53 INFO BaseAuditHandler: Audit Status Log: name=sparkSql.async.batch, finalDestination=sparkSql.async.batch.solr, interval=01:00.008 minutes, events=4, succcessCount=2, totalEvents=4, totalSuccessCount=2

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

The ranger audit configuration

    <property>
      <name>xasecure.audit.is.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr</name>
      <value>true</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr.batch.filespool.dir</name>
      <value>/var/log/spark/audit/solr/spool</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr.zookeepers</name>
      <value>192.168.1.80:2181/infra-solr</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr.urls</name>
      <value>NONE</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr.user</name>
      <value>ranger_solr</value>
    </property>
    <property>
      <name>xasecure.audit.destination.solr.password</name>
      <value>[REDACTED]</value>
    </property>

The jstack output of the stuck Driver JVM

"zkConnectionManagerCallback-5-thread-1" #134 prio=5 os_prio=0 tid=0x00007f2c90008800 nid=0x108b waiting on condition [0x00007f2ca98e8000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000fae5b628> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

The zkConnectionManagerCallback-* thread is actually created by the Solr ZooKeeper client.

Are you willing to submit PR?

pan3793 commented 1 year ago

cc @bowenliang123 @zhouyifan279

bowenliang123 commented 1 year ago

Let's register a shutdown hook with org.apache.spark.util.ShutdownHookManager to properly clean up the plugin by calling RangerBasePlugin.cleanup(); see the sketch below.

Among the plugins provided in Ranger, the plugins for Kafka, HBase, and Solr do clean up manually, but the one for Hive doesn't.
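
A minimal sketch of that idea, assuming the engine code can reach Spark's ShutdownHookManager (it is private[spark], so real code would need to sit under the org.apache.spark package or use an accessible equivalent) and that the authz plugin exposes its RangerBasePlugin instance; the package and object names here are hypothetical:

    // Sketch only: assumes access to the private[spark] ShutdownHookManager
    // and a handle on the RangerBasePlugin instance used by the authz plugin.
    package org.apache.spark.kyuubi

    import org.apache.ranger.plugin.service.RangerBasePlugin
    import org.apache.spark.util.ShutdownHookManager

    object RangerPluginCleanup {
      /** Register a JVM shutdown hook that calls RangerBasePlugin.cleanup(),
       *  stopping the audit providers (Solr/ES) and their non-daemon threads
       *  so the driver JVM can exit. */
      def install(plugin: RangerBasePlugin): Unit = {
        ShutdownHookManager.addShutdownHook { () =>
          plugin.cleanup()
        }
      }
    }

addShutdownHook returns a handle, so the hook could also be removed again if the plugin were ever torn down earlier at runtime.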