apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug] kyuubi webui kill engine failed #6790

Status: Closed (SGITLOGIN closed this issue 2 weeks ago)

SGITLOGIN commented 2 weeks ago

Describe the bug

Hello, I killed the engine from the Kyuubi web UI. All sessions and operations are shown as stopped, but the engine itself keeps running. The engine log repeats "EngineServiceDiscovery: 1 connection(s) are active, delay shutdown", and the engine does not stop until kyuubi.session.engine.idle.timeout expires.

Affects Version(s)

1.9.2, 1.10.0

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

24/11/01 17:54:06 INFO ZookeeperDiscoveryClient: Created a /kyuubi_1.9.2_USER_SPARK_SQL/changguowei/default/serverUri=ali-odp-test-01.huan.tv:46373;version=1.9.2;spark.driver.memory=3g;spark.executor.memory=3g;kyuubi.engine.id=application_1730447465039_0023;kyuubi.engine.url=ali-odp-test-01.huan.tv:44237;refId=6a07379c-4d18-49d7-b113-9a22c113351b;sequence=0000000137 on ZooKeeper for KyuubiServer uri: ali-odp-test-01.huan.tv:46373
24/11/01 17:54:06 INFO EngineServiceDiscovery: Registered EngineServiceDiscovery in namespace /kyuubi_1.9.2_USER_SPARK_SQL/changguowei/default.
24/11/01 17:54:06 INFO EngineServiceDiscovery: Service[EngineServiceDiscovery] is started.
24/11/01 17:54:06 INFO SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is started.
24/11/01 17:54:06 INFO SparkSQLEngine: Service[SparkSQLEngine] is started.
24/11/01 17:54:06 INFO SparkTBinaryFrontendService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V10
24/11/01 17:54:06 WARN SparkTBinaryFrontendService: No matching Hive token found for engine metastore uris thrift://ali-odp-test-01.huan.tv:9083,thrift://ali-odp-test-02.huan.tv:9083
24/11/01 17:54:06 WARN SparkTBinaryFrontendService: Ignore token with earlier issue date: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ha-nn, Ident: (token for changguowei: HDFS_DELEGATION_TOKEN owner=changguowei, renewer=changguowei, realUser=hive/ali-odp-test-01.huan.tv@HUAN.TV, issueDate=1730453394322, maxDate=1731058194322, sequenceNumber=64314, masterKeyId=371)
24/11/01 17:54:06 INFO SparkSQLSessionManager: Opening session for changguowei@........
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi/session: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi/session/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi/stop: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO ServerInfo: Adding filter to /kyuubi/gracefulstop: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
24/11/01 17:54:06 INFO SparkSQLEngine: 
    Spark application name: kyuubi_USER_SPARK_SQL_changguowei_default_6a07379c-4d18-49d7-b113-9a22c113351b
          application ID:  application_1730447465039_0023
          application tags: KYUUBI,6a07379c-4d18-49d7-b113-9a22c113351b
          application web UI: http://ali-odp-test-02.huan.tv:8088/proxy/application_1730447465039_0023,http://ali-odp-test-01.huan.tv:8088/proxy/application_1730447465039_0023
          master: yarn
          version: 3.4.2.1.2.2.0-130
          driver: [cpu: 1, mem: 3g]
          executor: [cpu: 1, mem: 3g, maxNum: 10]
    Start time: Fri Nov 01 17:53:24 CST 2024

    User: changguowei (shared mode: USER)
    State: STARTED

24/11/01 17:54:06 INFO SparkSQLSessionManager: changguowei's SparkSessionImpl with SessionHandle [6a07379c-4d18-49d7-b113-9a22c113351b] is opened, current opening sessions 1
24/11/01 17:54:06 WARN SparkTBinaryFrontendService: No matching Hive token found for engine metastore uris thrift://ali-odp-test-01.huan.tv:9083,thrift://ali-odp-test-02.huan.tv:9083
24/11/01 17:54:06 WARN SparkTBinaryFrontendService: Ignore token with earlier issue date: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ha-nn, Ident: (token for changguowei: HDFS_DELEGATION_TOKEN owner=changguowei, renewer=changguowei, realUser=hive/ali-odp-test-01.huan.tv@HUAN.TV, issueDate=1730453394322, maxDate=1731058194322, sequenceNumber=64314, masterKeyId=371)
24/11/01 17:54:07 INFO OperationLog: Creating operation log file /hwdata/kyuubi/apache-kyuubi-1.9.2-bin/work/engine_operation_logs/6a07379c-4d18-49d7-b113-9a22c113351b/2c34f952-6a5b-4ebd-ab78-cc37670ccde6
24/11/01 17:54:07 INFO SetCurrentCatalog: Processing changguowei's query[2c34f952-6a5b-4ebd-ab78-cc37670ccde6]: INITIALIZED_STATE -> RUNNING_STATE, statement:
SetCurrentCatalog
24/11/01 17:54:07 INFO SetCurrentCatalog: Processing changguowei's query[2c34f952-6a5b-4ebd-ab78-cc37670ccde6]: RUNNING_STATE -> FINISHED_STATE, time taken: 0.006 seconds
24/11/01 17:54:07 INFO SetCurrentCatalog: statementId=2c34f952-6a5b-4ebd-ab78-cc37670ccde6, operationRunTime=0 ms, operationCpuTime=0 ms
24/11/01 17:54:07 INFO DAGScheduler: Asked to cancel job group 2c34f952-6a5b-4ebd-ab78-cc37670ccde6
24/11/01 17:54:30 WARN ZookeeperDiscoveryClient: This Kyuubi instance ali-odp-test-01.huan.tv:46373 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
24/11/01 17:54:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:54:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:54:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:54:55 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
24/11/01 17:54:55 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1
24/11/01 17:54:55 INFO ExecutorAllocationManager: Executors 1 removed due to idle timeout.
24/11/01 17:54:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
24/11/01 17:54:58 INFO DAGScheduler: Executor lost: 1 (epoch 0)
24/11/01 17:54:58 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
24/11/01 17:54:58 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ali-odp-test-03.huan.tv, 43721, None)
24/11/01 17:54:58 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
24/11/01 17:54:58 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
24/11/01 17:54:58 INFO YarnScheduler: Executor 1 on ali-odp-test-03.huan.tv killed by driver.
24/11/01 17:54:58 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 1, unexpectedly exited: 0).
24/11/01 17:55:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:55:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 1
24/11/01 17:55:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:55:17 INFO SparkTBinaryFrontendService: Received request of closing SessionHandle [6a07379c-4d18-49d7-b113-9a22c113351b]
24/11/01 17:55:17 INFO SparkSQLSessionManager: changguowei's SparkSessionImpl with SessionHandle [6a07379c-4d18-49d7-b113-9a22c113351b] is closed, current opening sessions 0
24/11/01 17:55:17 INFO SparkSessionImpl: sessionId=6a07379c-4d18-49d7-b113-9a22c113351b, sessionRunTime=0 ms, sessionCpuTime=0 ms
24/11/01 17:55:17 INFO SparkTBinaryFrontendService: Finished closing SessionHandle [6a07379c-4d18-49d7-b113-9a22c113351b]
24/11/01 17:55:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:55:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:55:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:55:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 17:56:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:56:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 17:57:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:57:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 17:58:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:58:48 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ali-odp-test-05.huan.tv:33342 in memory (size: 3.5 KiB, free: 1458.6 MiB)
24/11/01 17:58:48 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ali-odp-test-01.huan.tv:40710 in memory (size: 3.5 KiB, free: 1458.6 MiB)
24/11/01 17:58:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 17:59:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 17:59:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 18:00:10 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:20 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:30 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:40 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:00:50 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:01:00 INFO EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
24/11/01 18:01:06 INFO SparkSQLSessionManager: Checking sessions timeout, current count: 0
24/11/01 18:01:06 INFO SparkSQLSessionManager: Idled for more than 300000 ms, terminating
24/11/01 18:01:06 INFO SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
24/11/01 18:01:06 INFO SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
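
The delays in the log above can be worked out from four of its timestamps. The snippet below is a small Python sketch for that arithmetic only; the variable names are illustrative, and the timestamps are copied from the log lines for de-registration, session close, the first "current count: 0" check, and the "Idled for more than 300000 ms" message.

```python
from datetime import datetime

# Timestamp layout used by the engine log, e.g. "24/11/01 17:54:30".
LOG_FORMAT = "%y/%m/%d %H:%M:%S"


def parse(ts):
    """Parse a log timestamp into a datetime."""
    return datetime.strptime(ts, LOG_FORMAT)


deregistered = parse("24/11/01 17:54:30")       # de-registered from ZooKeeper
session_closed = parse("24/11/01 17:55:17")     # last session closed
first_empty_check = parse("24/11/01 17:56:06")  # first "current count: 0"
terminated = parse("24/11/01 18:01:06")         # "Idled for more than 300000 ms"

wait_after_kill = (terminated - deregistered).total_seconds()
wait_after_close = (terminated - session_closed).total_seconds()
idle_window = (terminated - first_empty_check).total_seconds()

print(wait_after_kill, wait_after_close, idle_window)
```

The engine kept running for 396 seconds after de-registration and 349 seconds after the last session closed. The 300 seconds between the first empty session check and termination match the 300000 ms in the log message, which is consistent with the engine stopping through the idle timeout rather than through the kill request.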

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

No response

wForget commented 2 weeks ago

This issue seems to be caused by activeSessionCount not being updated, which prevents the while loop from ending. Would you like to submit a pull request to fix this?

https://github.com/apache/kyuubi/blob/d3520ddbcea96ec55c525600126047c44c7adb35/kyuubi-ha/src/main/scala/org/apache/kyuubi/ha/client/ServiceDiscovery.scala#L70-L73
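
The suspected failure mode can be illustrated outside Kyuubi. The following is a minimal Python sketch, not the Scala code at the link above: SessionManager, stop_gracefully, on_poll, max_polls, and the reread flag are all hypothetical names introduced for illustration. It assumes a shutdown routine that polls a session count and exits once the count reaches zero, and compares re-reading the live count on every poll with reusing a value captured before the loop.

```python
class SessionManager:
    """Hypothetical stand-in for the engine's session manager."""

    def __init__(self, active):
        self.active = active

    def close_session(self):
        self.active = max(0, self.active - 1)


def stop_gracefully(manager, on_poll, max_polls=50, reread=True):
    """Poll until no sessions remain and return the number of polls used.

    reread=False models the suspected bug: the count is captured once and
    never refreshed, so the loop only ends when max_polls is exhausted.
    """
    snapshot = manager.active  # read once, before the loop
    polls = 0
    while polls < max_polls:
        count = manager.active if reread else snapshot
        if count <= 0:
            break
        polls += 1
        # Stands in for the sleep between "delay shutdown" log lines,
        # during which sessions may close.
        on_poll(polls)
    return polls


def run(reread):
    manager = SessionManager(active=1)

    def on_poll(n):
        # The only session closes during the fifth wait.
        if n == 5:
            manager.close_session()

    return stop_gracefully(manager, on_poll, reread=reread)


fresh_polls = run(reread=True)
stale_polls = run(reread=False)
print(fresh_polls, stale_polls)  # prints: 5 50
```

With a live read, the loop exits on the first poll after the session closes. With the stale value, it keeps reporting an active connection until something else ends it, which resembles the repeated "1 connection(s) are active, delay shutdown" lines in the engine log that continue after the session count drops to 0.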

SGITLOGIN commented 2 weeks ago

I'm sorry, I don't know how to submit a pull request.

wForget commented 2 weeks ago

@SGITLOGIN Thank you for reporting this issue. If you are interested in contributing to Kyuubi, you can read https://kyuubi.readthedocs.io/en/master/contributing/doc/get_started.html to learn how to submit a PR.

SGITLOGIN commented 2 weeks ago

ok