apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[FEATURE] Run Kyuubi Engine on Alibaba Cloud MaxCompute or AWS Glue #3409

Open kevinclcn opened 2 years ago

kevinclcn commented 2 years ago

Describe the feature

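Currently, Kyuubi Engine can run on YARN or K8s to execute jobs submitted via JDBC. In cloud-native environments, however, cloud providers usually offer elastic compute resources, such as Alibaba Cloud MaxCompute and AWS Glue. If Kyuubi Engine could run on MaxCompute and Glue, it would greatly reduce the cost of running and maintaining Spark.

Alibaba Cloud API for running Spark jobs on MaxCompute: https://help.aliyun.com/document_detail/102357.html

AWS API for running Spark jobs on Glue: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob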

Motivation

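Currently, Kyuubi Engine can only run on YARN or K8s, so in a cloud-native environment you have to provision either EMR resources or K8s compute nodes. This raises two problems:

  1. EMR and K8s resources are not elastic: when there are few jobs, they cannot scale in to cut hardware costs, and when there are many jobs, they cannot scale out to speed up computation.
  2. In the cloud, if you use elastic compute resources such as MaxCompute, JDBC access is only possible through an interactive query engine such as Trino, so offline jobs and interactive queries end up with SQL dialects that are not fully consistent.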

Describe the solution

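Running Kyuubi Engine on elastic Spark compute resources such as MaxCompute and Glue would let offline batch jobs and interactive queries share the same Spark SQL capabilities. It would also make the compute resources elastic, saving infrastructure and operations costs.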

Additional context

No response

github-actions[bot] commented 2 years ago

Hello @kevinclcn, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi (Incubating).

pan3793 commented 2 years ago

I had a quick look at the doc, and I think Kyuubi should work out of the box with MaxCompute, but not Glue. Since Kyuubi uses spark-submit to create the Spark engine app, technically you can deploy Kyuubi in any environment as long as there is a runnable spark-submit (requires Spark 3.x) under $SPARK_HOME/bin.
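
As a quick way to check the precondition mentioned here, the following is a minimal sketch that tests whether an executable spark-submit exists under $SPARK_HOME/bin. The class and method names are hypothetical and not part of Kyuubi, and the check does not verify the Spark version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Kyuubi.
final class SparkSubmitCheck {

    /** Returns true if sparkHome/bin/spark-submit is a regular, executable file. */
    static boolean hasRunnableSparkSubmit(String sparkHome) {
        if (sparkHome == null || sparkHome.isEmpty()) {
            return false;
        }
        Path sparkSubmit = Paths.get(sparkHome, "bin", "spark-submit");
        return Files.isRegularFile(sparkSubmit) && Files.isExecutable(sparkSubmit);
    }

    public static void main(String[] args) {
        // Reads SPARK_HOME from the environment, as the comment above suggests.
        String sparkHome = System.getenv("SPARK_HOME");
        System.out.println("SPARK_HOME=" + sparkHome
            + ", runnable spark-submit: " + hasRunnableSparkSubmit(sparkHome));
    }
}
```

Running it on the machine that hosts the Kyuubi server shows whether that environment can launch an engine at all; whether the launched engine is then reachable is a separate question.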

pan3793 commented 2 years ago

@kevinclcn would you like to try deploying Kyuubi on MaxCompute? Docs are welcome too.

kevinclcn commented 2 years ago

Sure.

badbye commented 1 year ago

I had a quick look at the doc, and I think Kyuubi should work out of the box with MaxCompute, but not Glue. Since Kyuubi uses spark-submit to create the Spark engine app, technically you can deploy Kyuubi in any environment as long as there is a runnable spark-submit (requires Spark 3.x) under $SPARK_HOME/bin.

I'm trying to run Kyuubi with ADB Spark (it is similar to MaxCompute Spark), and I got this error in ADB Spark:

at org.apache.kyuubi.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
23/03/27 20:46:27 ERROR ConnectionState: Connection timed out for connection string (beijing-datascience-dev-01:2181)

I'm using a standalone Kyuubi with an EmbeddedZookeeper service, so the question is: how do I set the ZooKeeper connection string to the ip:port format instead of hostname:port? The remote Spark server does not know my hostname.

I've tried setting kyuubi.zookeeper.embedded.client.port.address to the public IP, but it does not work.

pan3793 commented 1 year ago

The embedded ZK is not recommended for production; it is designed for local testing. Please deploy a dedicated ZK first.

badbye commented 1 year ago

After fixing the connection between ZooKeeper and ADB Spark, I got a connect timeout error on the client side:

2023-03-28 11:47:11.728 INFO org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient: Get service instance:21.25.1.59:45625 and version:Some(1.6.1-incubating) under /kyuubi_1.6.1-incubating_USER_SPARK_SQL/test/default
2023-03-28 11:47:11.768 ERROR org.apache.kyuubi.session.KyuubiSessionImpl: Opening engine [kyuubi_USER_SPARK_SQL_test_default_32adf216-e872-48a9-a87e-6789ef2d4a4c 21.25.1.59:45625] for test session failed
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:266) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createTProtocol(KyuubiSyncThriftClient.scala:455) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createClient(KyuubiSyncThriftClient.scala:471) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1(KyuubiSessionImpl.scala:128) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1$adapted(KyuubiSessionImpl.scala:113) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.ha.client.DiscoveryClientProvider$.withDiscoveryClient(DiscoveryClientProvider.scala:36) ~[kyuubi-ha_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.session.KyuubiSessionImpl.openEngineSession(KyuubiSessionImpl.scala:113) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at org.apache.kyuubi.operation.LaunchEngine.$anonfun$runInternal$2(LaunchEngine.scala:49) ~[kyuubi-server_2.12-1.6.1-incubating.jar:1.6.1-incubating]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_271]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_271]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_271]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_271]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_271]
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_271]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476) ~[?:1.8.0_271]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218) ~[?:1.8.0_271]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200) ~[?:1.8.0_271]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394) ~[?:1.8.0_271]
    at java.net.Socket.connect(Socket.java:606) ~[?:1.8.0_271]
    at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.9.3.jar:0.9.3]
    ... 14 more
2023-03-28 11:47:11.774 INFO org.apache.curator.framework.imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
2023-03-28 11:47:11.777 INFO org.apache.zookeeper.ZooKeeper: Session: 0x10926b572df0001 closed
2023-03-28 11:47:11.777 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0x10926b572df0001
2023-03-28 11:47:11.789 INFO org.apache.kyuubi.operation.LaunchEngine: Processing test's query[19ab56d1-a2eb-429e-a858-6d96b0ffdbbb]: RUNNING_STATE -> ERROR_STATE, time taken: 60.261 seconds
Error: org.apache.kyuubi.KyuubiSQLException: Error operating LaunchEngine: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:266)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createTProtocol(KyuubiSyncThriftClient.scala:455)
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createClient(KyuubiSyncThriftClient.scala:471)
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1(KyuubiSessionImpl.scala:128)
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1$adapted(KyuubiSessionImpl.scala:113)
    at org.apache.kyuubi.ha.client.DiscoveryClientProvider$.withDiscoveryClient(DiscoveryClientProvider.scala:36)
    at org.apache.kyuubi.session.KyuubiSessionImpl.openEngineSession(KyuubiSessionImpl.scala:113)
    at org.apache.kyuubi.operation.LaunchEngine.$anonfun$runInternal$2(LaunchEngine.scala:49)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
    at java.net.Socket.connect(Socket.java:606)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:221)
    ... 14 more

    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
    at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:75)
    at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:56)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.kyuubi.operation.LaunchEngine.$anonfun$runInternal$2(LaunchEngine.scala:51)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:266)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createTProtocol(KyuubiSyncThriftClient.scala:455)
    at org.apache.kyuubi.client.KyuubiSyncThriftClient$.createClient(KyuubiSyncThriftClient.scala:471)
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1(KyuubiSessionImpl.scala:128)
    at org.apache.kyuubi.session.KyuubiSessionImpl.$anonfun$openEngineSession$1$adapted(KyuubiSessionImpl.scala:113)
    at org.apache.kyuubi.ha.client.DiscoveryClientProvider$.withDiscoveryClient(DiscoveryClientProvider.scala:36)
    at org.apache.kyuubi.session.KyuubiSessionImpl.openEngineSession(KyuubiSessionImpl.scala:113)
    at org.apache.kyuubi.operation.LaunchEngine.$anonfun$runInternal$2(LaunchEngine.scala:49)
    ... 5 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
    at java.net.Socket.connect(Socket.java:606)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:221)
    ... 14 more (state=,code=0)
Beeline version 1.6.1-incubating by Apache Kyuubi (Incubating)

Any ideas on how to fix it? @pan3793

Part of my Kyuubi conf:

kyuubi.session.engine.login.timeout = 30
kyuubi.session.engine.alive.probe.interval = 30
kyuubi.session.engine.alive.timeout = 120
kyuubi.session.engine.alive.probe.enabled = true

pan3793 commented 1 year ago

Get service instance:21.25.1.59:45625 and version:Some(1.6.1-incubating) under /kyuubi_1.6.1-incubating_USER_SPARK_SQL/test/default

Does ADB Spark allow Kyuubi Server to access the Driver through IP directly?
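
One way to answer this from the Kyuubi server host is a plain TCP connect test against the address the engine registered. The sketch below is a minimal illustration using only the JDK; the class name is hypothetical, and the default host and port are simply the values from the log line quoted above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic for illustration only; not part of Kyuubi.
final class EngineReachability {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connect timeouts, refused connections, and unknown hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are the engine address from the log in this thread.
        String host = args.length > 0 ? args[0] : "21.25.1.59";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 45625;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

If this prints false on the Kyuubi server host while the engine is running, the "connect timed out" error is a network routing problem rather than a Kyuubi configuration problem.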

pan3793 commented 1 year ago

Also, kyuubi.session.engine.login.timeout = 30 means 30 ms. I suppose you expected 30 s, not 30 ms; the suggested format is PT30S.

pan3793 commented 1 year ago

Kyuubi uses the ISO-8601 standard duration format; please read the comments of java.time.Duration for more details.
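
To make the unit difference concrete, here is a minimal sketch built on java.time.Duration. The parseTimeout helper is hypothetical and is not Kyuubi's actual config parser; it only mirrors the behavior described in this thread, where a bare number is read as milliseconds and anything else is parsed as an ISO-8601 duration.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not Kyuubi's config parser.
final class TimeoutFormat {

    /**
     * Assumption taken from the thread: a bare integer means milliseconds.
     * Any other value is handed to Duration.parse as an ISO-8601 duration.
     */
    static Duration parseTimeout(String value) {
        String trimmed = value.trim();
        if (trimmed.matches("\\d+")) {
            return Duration.ofMillis(Long.parseLong(trimmed));
        }
        return Duration.parse(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(parseTimeout("30").toMillis());    // 30 (milliseconds)
        System.out.println(parseTimeout("PT30S").toMillis()); // 30000
        System.out.println(parseTimeout("PT2M").toMillis());  // 120000
    }
}
```

Under that reading, the configuration posted earlier would give a 30 ms login timeout, while PT30S gives the 30 seconds that was presumably intended.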

badbye commented 1 year ago

Get service instance:21.25.1.59:45625 and version:Some(1.6.1-incubating) under /kyuubi_1.6.1-incubating_USER_SPARK_SQL/test/default

Does ADB Spark allow Kyuubi Server to access the Driver through IP directly?

No, the Kyuubi server cannot access this IP; I'll try to fix it. I see, so I guess the whole workflow is:

  1. The Kyuubi server looks up a Spark session in ZooKeeper; if there is none, it starts a Spark session via spark-submit.
  2. After the Spark session has started, it registers itself in ZooKeeper.
  3. The Kyuubi server finds the Spark session in ZooKeeper and tries to connect to it directly.

badbye commented 1 year ago

Also, kyuubi.session.engine.login.timeout = 30 means 30 ms. I suppose you expected 30 s, not 30 ms; the suggested format is PT30S.

Sorry, my bad. I've read the doc, I just forgot the unit.

pan3793 commented 1 year ago

Yes, that's exactly how Kyuubi works, you got it.
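
The three-step workflow confirmed here can be sketched with an in-memory map standing in for ZooKeeper. Everything in this block (Registry, EngineLauncher, resolveEngine) is hypothetical and exists only for illustration; it is not Kyuubi's code, and the launch is synchronous here, whereas a real server would presumably wait for the engine to register.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup / launch / register / connect flow.
final class EngineDiscoverySketch {

    /** Stand-in for ZooKeeper: maps an engine namespace to the engine's host:port. */
    static final class Registry {
        private final Map<String, String> instances = new ConcurrentHashMap<>();

        Optional<String> lookup(String namespace) {
            return Optional.ofNullable(instances.get(namespace));
        }

        void register(String namespace, String hostPort) {
            instances.put(namespace, hostPort);
        }
    }

    /** Stand-in for spark-submit: a started engine registers itself (step 2). */
    interface EngineLauncher {
        void launch(String namespace, Registry registry);
    }

    /** Returns the host:port the server would connect to (step 3). */
    static String resolveEngine(String namespace, Registry registry, EngineLauncher launcher) {
        // Step 1: reuse an engine that is already registered.
        Optional<String> existing = registry.lookup(namespace);
        if (existing.isPresent()) {
            return existing.get();
        }
        // Step 1 (continued): nothing registered, so launch a new engine.
        launcher.launch(namespace, registry);
        // Step 3: read back the address the engine registered.
        return registry.lookup(namespace).orElseThrow(
            () -> new IllegalStateException("Engine did not register under " + namespace));
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // Namespace and address are the values from the log in this thread.
        String namespace = "/kyuubi_1.6.1-incubating_USER_SPARK_SQL/test/default";
        EngineLauncher launcher = (ns, reg) -> reg.register(ns, "21.25.1.59:45625");
        System.out.println(resolveEngine(namespace, registry, launcher));
    }
}
```

The sketch also shows why the registered address matters: the server connects to whatever host:port the engine wrote into the registry, so an address on an unreachable network makes the final step fail even though discovery succeeded.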

badbye commented 1 year ago

Get service instance:21.25.1.59:45625 and version:Some(1.6.1-incubating) under /kyuubi_1.6.1-incubating_USER_SPARK_SQL/test/default

Does ADB Spark allow Kyuubi Server to access the Driver through IP directly?

It turns out the ADB Spark cluster has two NICs (network interface cards), and the default NIC is used when the service starts. Is there a way to get it to start and register on the second NIC?

badbye commented 1 year ago

It seems to use the findLocalInetAddress function to find the default IP.

Currently there is no easy way to use the second NIC, am I right? @pan3793

pan3793 commented 1 year ago

Yes, we need to enhance this part to make it more flexible, e.g. by introducing an address-binding election strategy; it would also help in K8s environments.
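
To illustrate what such a strategy could look like, here is a minimal sketch that picks a bind address by IP prefix instead of taking the default NIC. It is not Kyuubi's implementation: the class name, the prefix-matching rule, and the default prefix "10." are all assumptions, and a real strategy might match on CIDR ranges or interface names instead.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Optional;

// Hypothetical address selection for illustration only; not part of Kyuubi.
final class BindAddressSelector {

    /** Picks the first non-loopback IPv4 address whose text form starts with preferredPrefix. */
    static Optional<InetAddress> select(List<InetAddress> candidates, String preferredPrefix) {
        return candidates.stream()
            .filter(a -> a instanceof Inet4Address)
            .filter(a -> !a.isLoopbackAddress())
            .filter(a -> a.getHostAddress().startsWith(preferredPrefix))
            .findFirst();
    }

    /** Collects the addresses of every network interface that is currently up. */
    static List<InetAddress> localAddresses() throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result;
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (nic.isUp()) {
                result.addAll(Collections.list(nic.getInetAddresses()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        // The prefix is an assumed example; pass the subnet the server can reach.
        String prefix = args.length > 0 ? args[0] : "10.";
        System.out.println(select(localAddresses(), prefix));
    }
}
```

On a host with two NICs, passing the prefix of the subnet that the Kyuubi server can reach would select the second NIC's address rather than the default one.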

badbye commented 1 year ago

Yes, we need to enhance this part to make it more flexible, e.g. by introducing an address-binding election strategy; it would also help in K8s environments.

Cool. I guess this is the last problem to make it work. I may not have the ability to contribute the code, but I'd like to write a doc. Let me know if there is any progress on this feature.

badbye commented 1 year ago

Finally solved, I wrote a doc: https://gist.github.com/badbye/2618d6ef47a042427836d4ba9518e203