apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[TASK][MEDIUM] Support Amazon EMR Serverless on AWS #4458

Open davidshtian opened 1 year ago

davidshtian commented 1 year ago

Describe the feature

Support Amazon EMR Serverless as a Kyuubi Spark engine to minimize operational cost and achieve truly serverless Spark SQL. Amazon EMR Serverless is not supported yet because it has no JDBC connection.

Motivation

Achieve truly serverless Spark SQL on the AWS cloud.

Describe the solution

Amazon EMR Serverless makes it easy for users to run Spark without configuring, managing, and scaling clusters or servers.

github-actions[bot] commented 1 year ago

Hello @davidshtian, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

yaooqinn commented 1 year ago

Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent Cloud and Aliyun, who provide similar EMR services, have provided JDBC through Kyuubi.

davidshtian commented 1 year ago

> Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent Cloud and Aliyun, who provide similar EMR services, have provided JDBC through Kyuubi.

Thanks for your response~

Kyuubi can be deployed on AWS EMR in cluster mode (EMR on EC2), but then the cluster still needs to be managed and operated, while EMR Serverless is fully serverless and managed, which brings more flexibility and lets it handle different scenarios. EMR Serverless has no JDBC connection and it uses AWS API to submit the job, so it would be better to have support for adapting EMR Serverless to Kyuubi. Thanks~

yaooqinn commented 1 year ago

> EMR Serverless has no JDBC connection and it uses AWS API to submit the job

Sorry for being late. Never used AWS. Correct me if I am wrong, do you mean that an AWS EMR cluster does not support spark-submit?

PauloMigAlmeida commented 6 months ago

Hi @yaooqinn,

> do you mean that an AWS EMR cluster does not support spark-submit?

EMR (on EC2) does support submitting via the spark-submit utility. However, EMR Serverless does not.

For EMR Serverless, one can submit jobs using either one of the AWS SDKs or the AWS CLI.

It looks something like this:

aws emr-serverless start-job-run \
    --application-id <EMR_Serverless_App_Id> \
    --execution-role-arn arn:aws:iam::012345678901:role/my-cool-emr-exec-role \
    --job-driver 'sparkSubmit={entryPoint=s3://my-bucket/script_to_be_executed.py}' \
    --configuration-overrides '{"monitoringConfiguration": {"managedPersistenceMonitoringConfiguration": {"enabled": true}, "cloudWatchLoggingConfiguration": {"enabled": true, "logGroupName": "/aws/emr-serverless/my-logs"}}}'
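
For illustration, the same command can be assembled programmatically. This is a minimal Python sketch that only builds the argument list for the CLI call shown above; it does not call AWS. The function name `build_start_job_run_cmd` and all argument values are hypothetical, and it assumes the JSON form of `--job-driver` rather than the shorthand form.

```python
import json
import shlex


def build_start_job_run_cmd(application_id, execution_role_arn, entry_point, log_group):
    """Return the argv list for `aws emr-serverless start-job-run` (hypothetical helper)."""
    # Same structure as the shorthand 'sparkSubmit={entryPoint=...}', expressed as JSON.
    job_driver = {"sparkSubmit": {"entryPoint": entry_point}}
    overrides = {
        "monitoringConfiguration": {
            "managedPersistenceMonitoringConfiguration": {"enabled": True},
            "cloudWatchLoggingConfiguration": {
                "enabled": True,
                "logGroupName": log_group,
            },
        }
    }
    return [
        "aws", "emr-serverless", "start-job-run",
        "--application-id", application_id,
        "--execution-role-arn", execution_role_arn,
        "--job-driver", json.dumps(job_driver),
        "--configuration-overrides", json.dumps(overrides),
    ]


if __name__ == "__main__":
    cmd = build_start_job_run_cmd(
        "my-app-id",  # placeholder application id
        "arn:aws:iam::012345678901:role/my-cool-emr-exec-role",
        "s3://my-bucket/script_to_be_executed.py",
        "/aws/emr-serverless/my-logs",
    )
    # Print a shell-quoted form that could be pasted into a terminal.
    print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine with the AWS CLI and credentials configured.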

yaooqinn commented 6 months ago

It looks like we need a specific version of the org.apache.kyuubi.engine.ProcBuilder for AWS. Also cc @pan3793

pan3793 commented 6 months ago

To support AWS EMR Serverless Spark, we need to implement the following interfaces in Kyuubi:

org.apache.kyuubi.engine.ApplicationOperation (for querying and canceling job)
org.apache.kyuubi.engine.ProcBuilder (for submitting job)

According to the EMR Docs, I think we can use the CLI aws emr-serverless to implement that.
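
As a rough illustration of the querying and canceling side, the sketch below builds `aws emr-serverless get-job-run` and `cancel-job-run` commands and parses the job state from the CLI's JSON output. It is written in Python for brevity and is not the Kyuubi interface itself; the helper names are hypothetical, and the set of terminal states is an assumption that should be checked against the EMR Serverless API reference.

```python
import json

# Assumption: states after which a job run no longer changes.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def get_job_run_cmd(application_id, job_run_id):
    """argv for querying a job run; --output json forces machine-readable output."""
    return [
        "aws", "emr-serverless", "get-job-run",
        "--application-id", application_id,
        "--job-run-id", job_run_id,
        "--output", "json",
    ]


def cancel_job_run_cmd(application_id, job_run_id):
    """argv for canceling a job run."""
    return [
        "aws", "emr-serverless", "cancel-job-run",
        "--application-id", application_id,
        "--job-run-id", job_run_id,
    ]


def parse_job_state(cli_stdout):
    """Extract the state string from the JSON printed by get-job-run."""
    return json.loads(cli_stdout)["jobRun"]["state"]


def is_terminal(state):
    """True when the job run has finished, failed, or been canceled."""
    return state in TERMINAL_STATES
```

A submitter would pair these with a start-job-run command and poll `get-job-run` until `is_terminal` returns true.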

One concern is the integration tests, I'm not an AWS user, seems that localstack also does not support AWS EMR Serverless? I'm afraid that functionality without CI verification is fragile.

pan3793 commented 6 months ago

BTW, do GCP and Azure have similar services?

PauloMigAlmeida commented 6 months ago

@pan3793, @yaooqinn

> I'm not an AWS user, seems that localstack also does not support AWS EMR Serverless?

You are right, localstack doesn't support the endpoints for dealing with EMR serverless yet

> I'm afraid that functionality without CI verification is fragile.

I agree. Does the Apache Software Foundation have AWS accounts that can be used for CI/CD purposes? If so, that would be the fastest way to address this, as I think we may not be the first ones needing this type of integration test.

If it doesn't, you may want to try the AWS promotional credits for Open Source projects. More info at https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/

Regardless of the option you go with, count on me to help with the AWS implementation details (IAM permissions, service configuration, SDK, CLI options and so on). That can speed up the development of this feature.

yaooqinn commented 6 months ago

It looks like ASF Infra doesn't have AWS resources for CI/CD. https://infra.apache.org/build-supported-services.html

PauloMigAlmeida commented 6 months ago

Another option would be to mock responses expected from the AWS services involved.

I've seen that done before in other projects where localstack didn't support the required service. Would it address the CI tests concern?
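
A minimal sketch of what such mocking could look like, using Python's `unittest.mock`. The client here is a `Mock`, not a real SDK client; the method and field names (`start_job_run`, `get_job_run`, `jobRun`, `state`) are modeled on the AWS SDK naming as an assumption, and `submit_and_wait` and `make_fake_client` are hypothetical helpers. A real poller would also sleep between polls.

```python
from unittest.mock import Mock

# Assumption: states after which polling can stop.
TERMINAL = ("SUCCESS", "FAILED", "CANCELLED")


def submit_and_wait(client, application_id, role_arn, entry_point, max_polls=10):
    """Submit a job through `client` and poll until a terminal state or max_polls."""
    run = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=role_arn,
        jobDriver={"sparkSubmit": {"entryPoint": entry_point}},
    )
    job_run_id = run["jobRunId"]
    for _ in range(max_polls):
        response = client.get_job_run(applicationId=application_id, jobRunId=job_run_id)
        state = response["jobRun"]["state"]
        if state in TERMINAL:
            return state
    return "TIMED_OUT"


def make_fake_client(states):
    """Build a mock client that replays the given sequence of job states."""
    client = Mock()
    client.start_job_run.return_value = {"applicationId": "app-test", "jobRunId": "run-test"}
    client.get_job_run.side_effect = [{"jobRun": {"state": s}} for s in states]
    return client
```

A test can then call `submit_and_wait(make_fake_client(["PENDING", "RUNNING", "SUCCESS"]), ...)` and assert on the returned state without any AWS account.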

yaooqinn commented 6 months ago

> Another option would be to mock responses expected from the AWS services involved.

This is a necessary step for local dev. AWS promotional credits might be necessary for setting up the integration tests.

pan3793 commented 6 months ago

cc @zhaohehuhu, you may be interested in this feature

zhaohehuhu commented 6 months ago

Yup. I'm going to implement this feature.

PauloMigAlmeida commented 5 months ago

@zhaohehuhu just checking in to see if you need any help with the AWS side of things

zhaohehuhu commented 5 months ago

> @zhaohehuhu just checking in to see if you need any help with the AWS side of things

Thanks. It's going well so far. I've already finished the draft code and am about to do a round of testing. @PauloMigAlmeida

PauloMigAlmeida commented 5 months ago

@zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3

Instructions on how to deploy and run it are in the gist too.

zhaohehuhu commented 5 months ago

> @zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3
>
> Instructions on how to deploy and run it are in the gist too.

Thanks. I may contact you if needed.

zhaohehuhu commented 4 months ago

@pan3793 plz assign it to me.

zhaohehuhu commented 4 months ago

@PauloMigAlmeida I deployed a Kyuubi server on EC2. When the Kyuubi server talks to the Spark engine in EMR Serverless, it always reports a connection timeout. The Kyuubi server and EMR Serverless are already in the same VPC. How should this be set up? Do you have any idea about it?

PauloMigAlmeida commented 4 months ago

@zhaohehuhu EMRServerless doesn't run in a customer-managed VPC. It's accessible via an API that should be invoked using one of the AWS SDKs instead.

API method: https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html
AWS SDK for Java: https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/emrserverless/EmrServerlessClient.html#startJobRun(software.amazon.awssdk.services.emrserverless.model.StartJobRunRequest)

PS: It requires that an EMRServerless application is created.

Does this help?

zhaohehuhu commented 4 months ago

Thanks @PauloMigAlmeida. EMR Serverless runs in a VPC managed by AWS. I just wonder whether it is possible for the Kyuubi service to talk to the Spark engine in EMR Serverless through the Thrift protocol.

PauloMigAlmeida commented 4 months ago

@zhaohehuhu I'm almost certain that it isn't supported but let me check that internally first and I will come back to you with an answer tomorrow.

On a separate note, if communicating via thrift isn't possible, is there any alternative that could be explored instead?

zhaohehuhu commented 4 months ago

Thanks. It looks like it's hard for Kyuubi service running on EC2 or others to access Amazon EMR Serverless Spark. I will discuss it with @pan3793.

PauloMigAlmeida commented 4 months ago

@zhaohehuhu I got hold of an EMR Serverless specialist internally. Thrift communication isn't possible on EMR Serverless at the moment =/

pan3793 commented 4 months ago

@PauloMigAlmeida do you know the exact restriction? TCP inbound traffic or something?

PauloMigAlmeida commented 4 months ago

@pan3793 It seems that we don't have a Thrift server running on those nodes for the serverless offering.

pan3793 commented 4 months ago

@PauloMigAlmeida Kyuubi uses Thrift as the internal RPC protocol between the Kyuubi server and the Spark driver, and it automatically bootstraps a Thrift server on the Spark driver. So the question is whether EMR Serverless Spark allows Thrift (a TCP-based protocol) traffic between outside and inside. I suppose it should work, since the Jupyter Notebook case runs in a similar way (not Thrift, but presumably another TCP-based protocol). https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interactive-workloads.html

I had an offline talk with @zhaohehuhu. According to his feedback, the driver successfully launched the Thrift RPC server and registered it to ZooKeeper, so the Kyuubi server got the Thrift RPC server address but could NOT establish the connection. We have had similar issues reported on another public cloud vendor, where the cause was dual NICs (https://github.com/apache/kyuubi/issues/6296). I'm not sure what the exact issue is on AWS EMR Serverless.
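
One way to narrow down this kind of failure is a plain TCP reachability check from the Kyuubi server host against the address the driver registered. The sketch below is a generic Python probe, not part of Kyuubi; `can_connect` is a hypothetical helper, and the host and port would come from whatever the engine registered in ZooKeeper.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

If this returns False from the Kyuubi host while the driver reports that its RPC server is listening, the problem is likely network reachability rather than the Thrift layer.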

PauloMigAlmeida commented 4 months ago

@pan3793 got your point now.

Does the thrift communication initiate from the EMR serverless to the kyuubi server? Or is it the other way around?

In case it's the former:

Out of curiosity, what are the security group rules for both the Kyuubi server and EMR Serverless (with VPC)?

I'm aware that EMR serverless can establish connections within a VPC

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html

The example above was for integration with databases, but the setup should be roughly the same if the flow is from EMR Serverless to the Kyuubi server.

pan3793 commented 4 months ago

@PauloMigAlmeida The connection is initiated from the Kyuubi server to the Spark driver. The detailed steps are:

  1. When a new connection comes in, Kyuubi looks up ZooKeeper to find a reusable Spark application. If none is found, it performs a spark-submit to launch a new Spark app (a Kyuubi-customized Spark app, called the Kyuubi Spark SQL engine).
  2. After the Spark driver starts, it launches a Thrift RPC server and registers itself to ZooKeeper, so that the Kyuubi server knows how to connect to this RPC server.
  3. The Kyuubi server connects to the Spark driver and forwards queries to it.
  4. The Spark application terminates itself (and deregisters from ZooKeeper) after an idle timeout (no active connections).
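
The steps above can be sketched as a small simulation. This is a toy Python model, not Kyuubi code: an in-memory dictionary stands in for ZooKeeper, `launch_engine` is a hypothetical callable standing in for spark-submit, and keying engines by user is an assumption made only to keep the example short.

```python
class EngineRegistry:
    """In-memory stand-in for the ZooKeeper namespace where engines register."""

    def __init__(self):
        self._engines = {}

    def lookup(self, user):
        return self._engines.get(user)

    def register(self, user, address):
        self._engines[user] = address

    def deregister(self, user):
        self._engines.pop(user, None)


class Gateway:
    """Toy model of the server-side flow described in steps 1 to 4."""

    def __init__(self, registry, launch_engine):
        self.registry = registry
        self.launch_engine = launch_engine  # hypothetical stand-in for spark-submit
        self.launched = 0

    def open_session(self, user):
        # Step 1: look for a reusable engine; launch one if none is registered.
        address = self.registry.lookup(user)
        if address is None:
            address = self.launch_engine(user)
            self.launched += 1
            # Step 2: in the real flow the driver registers itself after starting.
            self.registry.register(user, address)
        # Step 3: the server would now open a connection TO this address.
        return address

    def engine_idle_timeout(self, user):
        # Step 4: the engine terminates and removes its registration.
        self.registry.deregister(user)
```

The point the model makes explicit is in step 3: the server dials the driver's address, so the driver must accept inbound connections.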

PauloMigAlmeida commented 4 months ago

@pan3793 I was afraid you were going to say that (Kyuubi server to Spark driver flow).

I triple-checked that internally: at the moment, inbound connections to EMR Serverless are not possible. I can't share the nitty-gritty of why, but it seems to be by design.

I just filed a Product Feature Request with the service team, so AWS is aware that this is something customers would want.

Implementation-wise, are there any alternatives to circumvent that limitation?

zhaohehuhu commented 4 months ago

Outbound traffic from EMR Serverless can be handled through the VPC, but inbound traffic still seems to be a problem.

sirajulm commented 1 month ago

There could be a workaround using an external Hive metastore on AWS RDS, or reusing the Thrift server of an existing EMR cluster. This would mean additional resources need to be created and would make testing more complex.

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#external-metastore