apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug] Batch session not closed after batch job finished #6026

Closed jia-zhengwei closed 8 months ago

jia-zhengwei commented 8 months ago

### Describe the bug

I submit a Spark job via the Kyuubi batch REST API, but the batch session is not closed after the Spark job finishes.

If I do not call the batch DELETE API or the session DELETE API, the batch session will not close until `kyuubi.batch.session.idle.timeout` is reached. Is it possible to configure `kyuubi.batch.session.idle.timeout` per batch via the batch API?
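
For reference, a minimal sketch of the workflow described above, assuming the Kyuubi REST frontend is reachable at `localhost:10099` (the default REST port); the resource path, class name, and job name are placeholders:

```python
# Minimal sketch of the batch REST workflow, assuming a Kyuubi server at
# localhost:10099; resource/className/name are placeholders.
import time

import requests

KYUUBI = "http://localhost:10099/api/v1"

# Submit a Spark batch job (POST /api/v1/batches).
batch = requests.post(f"{KYUUBI}/batches", json={
    "batchType": "SPARK",
    "resource": "/path/to/spark-examples.jar",         # placeholder
    "className": "org.apache.spark.examples.SparkPi",  # placeholder
    "name": "spark-pi-demo",
}).json()
batch_id = batch["id"]

# Poll until the batch reaches a terminal state (GET /api/v1/batches/{id}).
while batch["state"] not in ("FINISHED", "ERROR", "CANCELED"):
    time.sleep(5)
    batch = requests.get(f"{KYUUBI}/batches/{batch_id}").json()
print(batch_id, batch["state"])

# The server-side batch session stays open even now; without an explicit
# DELETE it lives until kyuubi.batch.session.idle.timeout expires.
requests.delete(f"{KYUUBI}/batches/{batch_id}")
```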


### Affects Version(s)

1.7.1

### Kyuubi Server Log Output

_No response_

### Kyuubi Engine Log Output

I can see a lot of log files in this folder.

Some of the logs follow:

```
24/01/22 22:43:22 INFO [dag-scheduler-event-loop] MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.7 KiB, free 912.3 MiB)
24/01/22 22:43:22 INFO [dag-scheduler-event-loop] MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.6 KiB, free 912.3 MiB)
24/01/22 22:43:22 INFO [dispatcher-BlockManagerMaster] BlockManagerInfo: Added broadcast_0_piece0 in memory on master-1-2.com:39409 (size: 3.6 KiB, free: 912.3 MiB)
24/01/22 22:43:22 INFO [dag-scheduler-event-loop] SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1520
24/01/22 22:43:22 INFO [dag-scheduler-event-loop] DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at isEmpty at KyuubiSparkUtil.scala:48) (first 15 tasks are for partitions Vector(0))
24/01/22 22:43:22 INFO [dag-scheduler-event-loop] YarnScheduler: Adding task set 0.0 with 1 tasks resource profile 0
24/01/22 22:43:37 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:43:52 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:44:07 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:44:18 INFO [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: Audit Status Log: name=sparkSql.async.batch.solr, interval=01:00.117 minutes, events=1, failedCount=1, totalEvents=1, totalFailedCount=1
24/01/22 22:44:18 WARN [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: failed to log audit event: {"repoType":3,"repo":"emr-hive","reqUser":"hive","evtTime":"2023-12-15 14:19:24.186","access":"DESCDATABASE","resource":"database\u003ddefault","resType":"database","action":"_any","result":1,"agent":"sparkSql","policy":9,"enforcer":"ranger-acl","agentHost":"master-1-2","logType":"RangerAudit","id":"14b7e240-b7e1-4110-913f-7faed9290c30-0","seq_num":1,"event_count":1,"event_dur_ms":0,"tags":[],"cluster_name":"c-5be1773fb21069f2","policy_version":1}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://master-1-1.com:6083/solr/ranger_audits: Server error writing document id 14b7e240-b7e1-4110-913f-7faed9290c30-0 to the index
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:668) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:643) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.lambda$addDocsToSolr$0(SolrAuditDestination.java:464) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.addDocsToSolr(SolrAuditDestination.java:472) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.log(SolrAuditDestination.java:253) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.provider.BaseAuditHandler.logJSON(BaseAuditHandler.java:193) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:827) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:757) ~[ranger-spark-plugin-2.3.0.jar:?]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
24/01/22 22:44:18 WARN [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: Log failure count: 1 in past 01:00.012 minutes; 2 during process lifetime
24/01/22 22:44:18 ERROR [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] AuditFileSpool: Error sending logs to consumer. provider=sparkSql.async.batch, consumer=sparkSql.async.batch.solr
24/01/22 22:44:18 INFO [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] AuditFileSpool: Destination is down. sleeping for 30000 milli seconds. indexQueue=3, queueName=sparkSql.async.batch, consumer=sparkSql.async.batch.solr
24/01/22 22:44:22 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:44:37 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:44:52 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:45:07 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:45:18 INFO [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: Audit Status Log: name=sparkSql.async.batch.solr, interval=01:00.006 minutes, events=1, failedCount=1, totalEvents=2, totalFailedCount=2
24/01/22 22:45:18 WARN [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: failed to log audit event: {"repoType":3,"repo":"emr-hive","reqUser":"hive","evtTime":"2023-12-15 14:19:24.186","access":"DESCDATABASE","resource":"database\u003ddefault","resType":"database","action":"_any","result":1,"agent":"sparkSql","policy":9,"enforcer":"ranger-acl","agentHost":"master-1-2","logType":"RangerAudit","id":"14b7e240-b7e1-4110-913f-7faed9290c30-0","seq_num":1,"event_count":1,"event_dur_ms":0,"tags":[],"cluster_name":"c-5be1773fb21069f2","policy_version":1}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://master-1-1:6083/solr/ranger_audits: Server error writing document id 14b7e240-b7e1-4110-913f-7faed9290c30-0 to the index
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:668) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:643) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.lambda$addDocsToSolr$0(SolrAuditDestination.java:464) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.addDocsToSolr(SolrAuditDestination.java:472) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.destination.SolrAuditDestination.log(SolrAuditDestination.java:253) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.provider.BaseAuditHandler.logJSON(BaseAuditHandler.java:193) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:827) ~[ranger-spark-plugin-2.3.0.jar:?]
        at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:757) ~[ranger-spark-plugin-2.3.0.jar:?]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
24/01/22 22:45:18 WARN [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] BaseAuditHandler: Log failure count: 1 in past 01:00.006 minutes; 3 during process lifetime
24/01/22 22:45:18 ERROR [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] AuditFileSpool: Error sending logs to consumer. provider=sparkSql.async.batch, consumer=sparkSql.async.batch.solr
24/01/22 22:45:18 INFO [sparkSql.async.batch_sparkSql.async.batch.solr_destWriter] AuditFileSpool: Destination is down. sleeping for 30000 milli seconds. indexQueue=3, queueName=sparkSql.async.batch, consumer=sparkSql.async.batch.solr
24/01/22 22:45:22 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/01/22 22:45:37 WARN [task-starvation-timer] YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
```


### Kyuubi Server Configurations

_No response_

### Kyuubi Engine Configurations

_No response_

### Additional context

_No response_

### Are you willing to submit PR?

- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [X] No. I cannot submit a PR at this time.
jia-zhengwei commented 8 months ago

If I call the batch DELETE operation for batch id 22f87ccc-9f20-4346-ba7e-025b644d4b7b, it returns the response below. The session with id 22f87ccc-9f20-4346-ba7e-025b644d4b7b is deleted, but the batch with the same id is not, even though the batch job has finished.

```json
{
    "success": false,
    "msg": "batch 22f87ccc-9f20-4346-ba7e-025b644d4b7b is already terminal so can not kill it."
}
```
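
For anyone hitting the same response, a hedged sketch of checking the batch state first, so DELETE is only attempted while the batch is still running; host and port are placeholders as before:

```python
# Check the batch state and only try to kill a non-terminal batch,
# avoiding the "already terminal" response shown above.
import requests

KYUUBI = "http://localhost:10099/api/v1"
batch_id = "22f87ccc-9f20-4346-ba7e-025b644d4b7b"

state = requests.get(f"{KYUUBI}/batches/{batch_id}").json()["state"]
if state in ("PENDING", "RUNNING"):
    print(requests.delete(f"{KYUUBI}/batches/{batch_id}").json())
else:
    print(f"batch {batch_id} is already terminal ({state}); nothing to kill")
```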
jia-zhengwei commented 8 months ago

@turboFei Hello, please help to double-check this.

turboFei commented 8 months ago

Yes, it is by design.

If the batch session were closed, you could not retrieve the batch log from the server side.

So we have to keep the batch session alive until it is closed by the idle timeout.
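
This is why the log can still be paged after the job finishes; a minimal sketch using the `GET /api/v1/batches/{batchId}/localLog` endpoint, where the `logRowSet` field name is assumed from the REST API docs and host/port are placeholders:

```python
# While the batch session is kept alive, the submission log can still be
# fetched from the server, page by page, via the localLog endpoint.
import requests

KYUUBI = "http://localhost:10099/api/v1"
batch_id = "22f87ccc-9f20-4346-ba7e-025b644d4b7b"

log = requests.get(f"{KYUUBI}/batches/{batch_id}/localLog",
                   params={"from": 0, "size": 100}).json()
for line in log.get("logRowSet", []):
    print(line)
```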

jia-zhengwei commented 8 months ago

> Yes, it is by design.
>
> If the batch session were closed, you could not retrieve the batch log from the server side.
>
> So we have to keep the batch session alive until it is closed by the idle timeout.

OK, got it, thanks. I will close this issue. But I suggest that the idle timeout be configurable via the batch API rather than only globally.
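
For example, the create request could carry the timeout as a per-batch conf override. This is a hypothetical sketch of the suggestion, not existing behavior; the resource and class name are placeholders:

```python
# Hypothetical sketch: pass the idle timeout per batch in the create
# request instead of relying on one global server value.
# NOTE: this per-batch override is the feature being requested here,
# not something Kyuubi supports today.
import requests

requests.post("http://localhost:10099/api/v1/batches", json={
    "batchType": "SPARK",
    "resource": "/path/to/spark-examples.jar",         # placeholder
    "className": "org.apache.spark.examples.SparkPi",  # placeholder
    "conf": {
        "kyuubi.batch.session.idle.timeout": "PT5M",   # requested override
    },
})
```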

jia-zhengwei commented 8 months ago

As @turboFei said: https://github.com/apache/kyuubi/issues/6026#issuecomment-1917905709