apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

Support for HTTP frontend service #32

Closed yaooqinn closed 2 years ago

yaooqinn commented 6 years ago

Expected behavior

Actual behavior.

Only the binary frontend is supported now.

yanghua commented 3 years ago

As discussed under #835, I will create a proposal for this feature.

yaooqinn commented 3 years ago

thanks @yanghua

yanghua commented 3 years ago

Hi @yaooqinn, I opened a proposal for this feature as a Google Doc (for more convenient discussion). The link is here: https://docs.google.com/document/d/1yc-fk28oWbCNZjdFr1UXFbMtwSDQTpcwcNObn9ODnhQ/edit?usp=sharing

It is not complete yet, so please focus on the "API design" section for now. If you have any questions, please let me know.

yaooqinn commented 3 years ago

Thanks, @yanghua. The design doc LGTM. BTW, shall we also consider merging these APIs from HiveServer2? FYI: https://issues.apache.org/jira/browse/HIVE-4752

yanghua commented 3 years ago

Sounds good, I will add them later.

yaooqinn commented 3 years ago

I am not quite sure whether they can be merged. If you find anything blocking, please let us know; we can do those separately.

yanghua commented 3 years ago

I am currently adding all the missing interfaces from org.apache.hive.service.cli.ICLIService to my document. Can we land these APIs in the first phase?

yaooqinn commented 3 years ago

Yes, looks great.

yanghua commented 3 years ago

Hi @yaooqinn I have updated the design doc and added a mapping sub-item for each API so that we can see the relationship between these APIs and the methods of ICLIService interface.

yaooqinn commented 3 years ago

Thank you very much. The API design looks good to me; I just left some minor revision suggestions.

yanghua commented 3 years ago

Thanks, I have addressed your concerns and replied to your suggestions.

yanghua commented 3 years ago

@yaooqinn I will continue to polish the "Implementation" section of the design document.

BTW, do you have a preference on the implementation language? This is about technology selection.

In my view, Javalin is an option for building a RESTful API server.

What's your opinion?

yaooqinn commented 3 years ago

Thanks. Javalin looks interesting; I am OK with it personally.

pan3793 commented 3 years ago

@yanghua In my experience, mixing Kotlin and Scala is a little painful, and the Kyuubi codebase is currently mostly Scala. Personally, I recommend using Java or Scala and avoiding new languages unless necessary.

yanghua commented 3 years ago

Hi @pan3793 Thanks for sharing your thoughts.

> it's a little painful to mix use Kotlin and Scala. I recommend using Java or Scala and avoid introducing new languages if not necessary.

You mean that using Javalin would introduce the Kotlin language into the Kyuubi project, right?

Let me clarify this and try to address your concern.

Yes, the Javalin project is written in Kotlin. While from a development view, we do not need to know and interact with Kotlin.

If we introduce Javalin, we only need to add a Maven dependency for it and then use Java to build an app framework, as the Hudi codebase does. It's all about Java code at the development level. You will find that Hudi does not contain a single line of Kotlin code.

My thought is that we could write the code in Java and build around Javalin. Javalin is very light and is used to provide the timeline service in Hudi. That's why I add it to my choice list.

I am not sure whether the above clarification is sufficient to address your concerns.

Correct me if I have misunderstood your comment.

pan3793 commented 3 years ago

Personally, I love Kotlin and use Kotlin with Spring Boot in production heavily.

> While from a development view, we do not need to know and interact with Kotlin.

This is an ideal vision, but what if developers without a Kotlin background want to learn how Javalin works, extend the framework, or debug inside the framework?

> It's all about Java code at the development level.

Hudi uses Scala only in some modules, but Kyuubi uses Scala everywhere, so we may not be able to avoid invoking Kotlin functions from Scala. If developers have no idea how Kotlin code compiles to bytecode, they may not know how to invoke a Kotlin function.

> Javalin is very light and is used to provide the timeline service in Hudi.

After a quick look at the Javalin docs, the API design looks clean and light, but from a dependency perspective it is not so light. No offence, but I don't think Hudi did a great job on dependency management; I saw lots of Hudi users complain about jar conflicts in DingTalk groups.

> That's why I add it to my choice list.

Do we have other options? A pure Java framework makes more sense to me.

yanghua commented 3 years ago

I think the purpose of a layered software stack and of framework encapsulation is to decouple and build abstractions, so that users do not have to deal with too much detail. Still, I respect your familiarity with Scala and Kotlin and your rich experience in using them.

> Do we have other options? A pure Java framework makes more sense to me.

Yes, the other option is purer and more low-level: we can build it on native Netty.

This choice also has pros and cons. The advantage is that it is more versatile and does not introduce more, or niche, dependencies. The disadvantage is that we need to write more code and understand more of the relatively low-level details.

This approach has been applied successfully in the "flink-runtime-web" module.

WDYT?

pan3793 commented 3 years ago

Netty is a nice network library, and it would be the first candidate for low-level or custom-defined protocols, e.g. the MySQL or PostgreSQL protocol. But as you said, it's too low-level for HTTP or WebSocket.

Quote from "flink-runtime-web":

> The server side of the dashboard is implemented using Netty with Netty Router for REST paths.

From the project commit history, Netty Router has not been maintained since 2017, so it would be risky for Kyuubi to adopt it.

Have you considered Jetty or Spring Boot?

Jetty is light and Spring Boot is heavy. To my knowledge, those two are the most widely used HTTP server frameworks in the Java ecosystem (correct me if I'm wrong).

For this case, the light one Jetty would be better.

Jetty is adopted by Apache Spark, Apache Hive, Apache Druid, etc. to build HTTP services. Trino/Presto builds on top of airlift which depends on Jetty. Seems Javalin also depends on Jetty?

Besides, there is a naive usage of Jetty in Kyuubi PrometheusReporterService, maybe we can build a unified HTTP server which also covers this case.

pan3793 commented 3 years ago

There is also a PR, #437 (not merged), that uses Jetty to build a monitoring service.

yanghua commented 3 years ago

Sounds good.

> For this case, the light one Jetty would be better.

Between Jetty and Spring Boot, +1 for Jetty.

> Jetty is adopted by Apache Spark, Apache Hive, Apache Druid, etc. to build HTTP services. Trino/Presto builds on top of airlift which depends on Jetty. Seems Javalin also depends on Jetty?

It seems Jetty is a really widely used framework.

> Besides, there is a naive usage of Jetty in Kyuubi PrometheusReporterService, maybe we can build a unified HTTP server which also covers this case.

+1 to constraining the technology stack.

pan3793 commented 3 years ago

Also cc @turboFei

turboFei commented 3 years ago

+1 for Jetty

yanghua commented 3 years ago

Sounds good, thanks to everyone for sharing opinions. I will polish the "Implementation" section in the design document.

iodone commented 3 years ago

Akka-HTTP, Play, HTTP4s, Cask and Zio-HTTP are available in the Scala development stack; they are pure Scala and functional in style.

yanghua commented 3 years ago

Thanks for sharing your thoughts. IMO, we may deliberately lean towards the Java ecosystem here, which will help Kyuubi's developer community. Of course, the core premise is that Jetty, the selected framework, is sufficient to meet our needs.

iodone commented 3 years ago

@yanghua Very good proposal. I have a few questions.

Who are the target users of the REST API service? If it is for ordinary users, this API is effectively an HTTP implementation of ThriftCLIInterface, and it would be very complicated for users to call. The steps to submit a query are:

  1. create a session (createSession)
  2. create operations based on the session
  3. get the current state based on the operationHandler
  4. poll until success, and then get the results
  5. first get the metadata, and then fetch the results repeatedly until all of them have been retrieved
  6. release resources (close the operation and the session)

Submitting a query to get the results requires more than 6 HTTP API calls, and that does not include handling call exceptions. If the user forgets to call the release resource API, what should be done?
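
To make the cost concrete, here is a minimal sketch that runs the six steps above against an in-memory stand-in and counts round trips. `FakeGateway` and all of its method names are hypothetical (they are not Kyuubi's or HiveServer2's actual endpoints), and the two status polls are an arbitrary assumption. The `try`/`finally` blocks show one way a client could avoid forgetting the release step.

```python
from contextlib import contextmanager


class FakeGateway:
    """Hypothetical in-memory stand-in; each method models one HTTP call."""

    def __init__(self):
        self.calls = 0
        self.sessions = set()
        self.operations = {}
        self._next_id = 0

    def _new_id(self, prefix):
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    def open_session(self):
        self.calls += 1
        sid = self._new_id("session")
        self.sessions.add(sid)
        return sid

    def execute_statement(self, sid, sql):
        self.calls += 1
        if sid not in self.sessions:
            raise KeyError(sid)
        oid = self._new_id("op")
        # Assumption: the fake finishes on the second status poll.
        self.operations[oid] = {"polls_needed": 2, "sql": sql}
        return oid

    def get_status(self, oid):
        self.calls += 1
        op = self.operations[oid]
        op["polls_needed"] -= 1
        return "RUNNING" if op["polls_needed"] > 0 else "FINISHED"

    def get_result_metadata(self, oid):
        self.calls += 1
        return ["col"]

    def fetch_results(self, oid):
        self.calls += 1
        return [[1]]

    def close_operation(self, oid):
        self.calls += 1
        del self.operations[oid]

    def close_session(self, sid):
        self.calls += 1
        self.sessions.discard(sid)


@contextmanager
def session(gateway):
    """Open a session and always close it, even if the query fails."""
    sid = gateway.open_session()
    try:
        yield sid
    finally:
        gateway.close_session(sid)


def run_query(gateway, sql):
    with session(gateway) as sid:
        oid = gateway.execute_statement(sid, sql)
        try:
            while gateway.get_status(oid) != "FINISHED":
                pass  # a real client would sleep between polls
            columns = gateway.get_result_metadata(oid)
            rows = gateway.fetch_results(oid)
        finally:
            gateway.close_operation(oid)
    return columns, rows


gateway = FakeGateway()
columns, rows = run_query(gateway, "SELECT 1")
print(f"{gateway.calls} round trips for one query")
```

Under these assumptions a single `SELECT 1` costs eight round trips, and any client that skips the cleanup path leaves a session behind on the server.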

Will a client SDK, like the Hive JDBC SDK, later be built on top of the REST API? There is already a new Kyuubi Hive JDBC driver under development, which could completely cover the REST API's capabilities.

If the stateless HTTP protocol simply imitates the RPC calls, the API will be more complex for users. Is it possible to simplify the HTTP API from the user's point of view instead of simply copying the Thrift RPC calls?

Referring to Presto/Trino's Rest API, there are only 3 core APIs.

  1. postStatement (POST /v1/statement)
  2. getStatus (GET /v1/statement/queued/{queryId}/{slug}/{token})
  3. getQueryResults (GET /v1/statement/executing/{queryId}/{slug}/{token})
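
A minimal sketch of how a client drives such a statement-first API: post the SQL once, then keep following the `nextUri` field of each response until it is absent. The `nextUri` convention follows Trino's client protocol as I understand it; `FakeStatementServer`, its simplified URI layout (no slug or token), and the page contents are hypothetical stand-ins so the example runs without a server.

```python
class FakeStatementServer:
    """Hypothetical in-memory stand-in for a statement-first REST API."""

    def __init__(self, pages):
        self._pages = pages      # result pages handed out one per GET
        self.requests = []       # log of (method, path) pairs

    def post(self, path, body):
        self.requests.append(("POST", path))
        # The query is only queued; the response says where to look next.
        return {"id": "q1", "nextUri": "/v1/statement/queued/q1/0"}

    def get(self, uri):
        self.requests.append(("GET", uri))
        page = int(uri.rsplit("/", 1)[1])
        response = {"id": "q1", "data": self._pages[page]}
        if page + 1 < len(self._pages):
            response["nextUri"] = f"/v1/statement/executing/q1/{page + 1}"
        return response


def run_statement(server, sql):
    """Submit SQL and collect rows by following nextUri until it disappears."""
    rows = []
    response = server.post("/v1/statement", sql)
    while "nextUri" in response:
        response = server.get(response["nextUri"])
        rows.extend(response.get("data", []))
    return rows


server = FakeStatementServer(pages=[[], [[1], [2]], [[3]]])
rows = run_statement(server, "SELECT x FROM t")
print(rows)
print(len(server.requests), "requests")
```

The client never names a session or an operation; the only state it carries between requests is the last `nextUri`.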

The APIs for session and operation management are very well designed, and their target users are administrators and operators. But for ordinary users to submit SQL, I feel it is slightly complicated.

pan3793 commented 3 years ago

Thanks @iodone for participating and sharing your thoughts.

> Submitting a query to get the results requires more than 6 HTTP API calls ... for ordinary users to submit SQL, I feel it is slightly complicated

I agree with you and have a basic idea to simplify it.

We can make session info optional in operations APIs, then create a new session if session info is absent, and return the session id to the client to make sure the user can reuse the session in following request.
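
A minimal sketch of this idea, assuming a hypothetical `StatementFrontend` class: when no `session_id` is supplied the frontend opens one implicitly and returns it, so the client can reuse or close it later. The class, its method names, and the response fields are illustrative only.

```python
import uuid


class StatementFrontend:
    """Hypothetical frontend where session_id is optional on execute()."""

    def __init__(self):
        self.sessions = {}  # session_id -> statements executed in it

    def open_session(self):
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = []
        return session_id

    def execute(self, sql, session_id=None):
        implicit = session_id is None
        if implicit:
            # No session info in the request: create one on the fly.
            session_id = self.open_session()
        elif session_id not in self.sessions:
            raise KeyError(f"unknown session: {session_id}")
        self.sessions[session_id].append(sql)
        # Always echo the session id so the caller can reuse or release it.
        return {"session_id": session_id, "implicit_session": implicit}

    def close_session(self, session_id):
        self.sessions.pop(session_id, None)


frontend = StatementFrontend()
first = frontend.execute("USE analytics")
second = frontend.execute("SELECT 1", session_id=first["session_id"])
statements_in_session = list(frontend.sessions[first["session_id"]])
frontend.close_session(first["session_id"])
```

The first call needs no prior setup, while the second reuses the returned id, so session-level state such as the current database carries over between requests.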

yanghua commented 3 years ago

Hi @iodone, sorry for the late reply. Thanks for chiming in and sharing your thoughts.

To answer your question:

> Who are the target users of these RESTful APIs?

I did not make assumptions about that. The original intention was to expose Kyuubi's capabilities through the HTTP protocol. The target users can be administrators or regular users who submit queries. The APIs can be used directly or indirectly (for example, through a Kyuubi UI built on top of them).

Thank you for your feedback. Yes, this API format is very inconvenient for users who call it directly.

That is because the design of these APIs is resource-based. From a resource point of view, creating the Session first and then creating the Operation feels natural. But it brings complexity and a higher cost of understanding to direct users, and even lower performance.

I personally suggest that we do not make assumptions about the target users of these APIs; otherwise, this matter will become more complicated in the future.

With that in mind, we would be glad to bring in your suggestions, such as:

> Referring to Presto/Trino's Rest API, there are only 3 core APIs.
>
>   1. postStatement (POST /v1/statement)
>   2. getStatus (GET /v1/statement/queued/{queryId}/{slug}/{token})
>   3. getQueryResults (GET /v1/statement/executing/{queryId}/{slug}/{token})

We can introduce a resource with a "statement" as the first-level resource, so that it is more convenient for some users. It may go a little against the resource model in HS2, but we should not always stick to the rules.

WDYT?

iodone commented 3 years ago

> a resource with a "statement" as the first-level resource

@yanghua I strongly agree with the design of a resource with a "statement" as the first-level resource, and with @pan3793's point. The discussion can be divided into two parts:

For users submitting SQL:

First of all, a statement-based resource does simplify the SQL submission interaction, as Presto already does.

But a statement-based resource will conflict with the overall design of Kyuubi's backend service module and operation module; in effect, the HTTP API and the Thrift API become two parallel solutions (this is a personal opinion).

In addition, we need to consider authentication compatibility between the HTTP API and the Thrift API, as well as how to manage statement resource state (held in memory) across multiple Kyuubi instances.

For administrators and operators who need to inspect the state of Kyuubi Thrift services:

The session and operation HTTP APIs are provided to help them view and intervene in running services, including:

  1. Session management
  2. Operation management

Some of the interfaces in @yanghua's document can be provided as APIs for managing Thrift context resources. Of course, could kyuubi-ctl be considered for this part, replacing the HTTP API implementation?

yanghua commented 3 years ago

> But a statement-based resource will conflict with the overall design of Kyuubi's backend service module and operation module; in effect, the HTTP API and the Thrift API become two parallel solutions (this is a personal opinion).

If we neither assume who the users of these APIs are nor limit the scope of these APIs, can we just expose the capabilities of Kyuubi services as much as possible? Currently, our main focus is on matching the capabilities of the HS2 Thrift RPC services. But Kyuubi's future capabilities should be a superset of those. For example, @turboFei wants to display Session/Operation logs.

I understand that your point is to classify APIs according to potential user roles, which is also reasonable. But along which dimension do we classify?

Without classification, we will only need to follow RESTful principles, and our core focus is on resources. Regarding "authentication", we can take the non-HS2 interface API as a special case.

In short, I personally have an open mind about whether to classify APIs. @pan3793 Can you chime in?

> Of course, could kyuubi-ctl be considered for this part, replacing the HTTP API implementation?

IMO, that could happen in the future. Even if kyuubi-ctl supports these capabilities, we still expect to expose them through the HTTP API so that we can build a visual UI for Kyuubi. Some basic management APIs are still needed.

iodone commented 3 years ago

@yanghua Thank you very much for your patient explanation; I get your point. Let's discuss the subsequent implementation based on

> a resource with a "statement" as the first-level resource

This covers just the main flow of a user submitting SQL. I can think of two options.

Solution 1

As @pan3793 mentioned above:

> We can make session info optional in operations APIs, then create a new session if session info is absent, and return the session id to the client to make sure the user can reuse the session in following request.

The underlying implementation still goes through the Thrift CLI interface, but the HTTP API exposes only the concept of a statement and hides the session. When submitting a query, the server creates the session first and then calls executeStatement on the session:

[image]

Benefits

  1. Simple to implement: we only need an HttpFrontendService that converts the user's statement-based requests into API calls to BackendThriftService (with some wrapping that is transparent to the user).
  2. Reuses most of Kyuubi's existing code and keeps the overall framework unchanged.

Disadvantages

  1. The Session is hidden from the user, so the timing of Session resource release is uncertain, which may lead to session resource leaks. Could a session-per-query mechanism plus periodic checks avoid this?
  2. The HTTP API uses short-lived connections, while Kyuubi and the Kyuubi engine establish a long-lived connection, which means that Session state is kept in memory to maintain the connection with the Kyuubi engine. We know that at least two HTTP API requests are required to execute a complete SQL query over the HTTP API. When Kyuubi is scaled out to multiple instances, HTTP API requests can be routed to any instance, so the first two requests may land on different Kyuubi instances, and the second request will not find the correct Session because it is stored on the other instance. We may need to introduce a Session sharing mechanism (with a third-party persistent storage component)?
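
The periodic check raised in the first disadvantage can be sketched as an idle-timeout sweep. This is only an illustration, not how Kyuubi manages sessions: `SessionRegistry`, the 30-second timeout, and the injected clock are all assumptions chosen to keep the example deterministic.

```python
import time


class SessionRegistry:
    """Hypothetical registry that expires sessions idle for too long."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_access = {}  # session_id -> time of last request

    def touch(self, session_id):
        """Record activity; called on every request that uses the session."""
        self._last_access[session_id] = self._clock()

    def sweep(self):
        """Close and return every session idle for at least the timeout."""
        now = self._clock()
        expired = [
            sid for sid, seen in self._last_access.items()
            if now - seen >= self.idle_timeout_s
        ]
        for sid in expired:
            del self._last_access[sid]  # a real server would free resources here
        return expired

    def active(self):
        return sorted(self._last_access)


# A fake clock makes the example deterministic.
now = [0.0]
registry = SessionRegistry(idle_timeout_s=30, clock=lambda: now[0])

registry.touch("a")   # session "a" last used at t=0
now[0] = 20.0
registry.touch("b")   # session "b" last used at t=20
now[0] = 35.0
expired = registry.sweep()
print(expired, registry.active())
```

A sweep like this bounds how long a leaked session can live, but it does nothing for the second disadvantage, since each instance would still only know about its own sessions.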

Solution 2

Based on @yanghua's point of view:

> we will only need to follow RESTful principles, and our core focus is on resources.

Another option is a fully RESTful architecture, entirely different from the Thrift CLI interface:

[image]

Benefits

  1. Push the state down to the Kyuubi engine to maintain. Kyuubi only forwards user requests and does not need to maintain Sessions in memory.
  2. Easier to extend: based on the Kyuubi engine HTTP API, we can also add some non-standard SQL or JDBC capabilities, like MLSQL: https://www.mlsql.tech/

Disadvantages

  1. We need to fully implement a set of HTTP APIs based on RESTful principles, so the workload is larger than for Solution 1.

Solution 1 has already landed in our production environment and is currently on trial. But I personally prefer Solution 2.

@yanghua @pan3793 WDYT?

pan3793 commented 3 years ago

@iodone, thanks for sharing your thoughts. Here are my views on the disadvantages of Solution 1 that you mentioned.

>   1. The Session is hidden from the user, so the timing of Session resource release is uncertain, which may lead to session resource leaks. Could a session-per-query mechanism plus periodic checks avoid this?

I'm not proposing to drop the session_id concept in the REST API; making session_id optional means the user can still provide a session_id.

The session is not hidden from users: we can still keep the REST API for sessions, and return the implicitly created session_id to the client so that the user can reuse or release the session in following requests.

>   2. The HTTP API uses short-lived connections, while Kyuubi and the Kyuubi engine establish a long-lived connection, which means that Session state is kept in memory to maintain the connection with the Kyuubi engine. We know that at least two HTTP API requests are required to execute a complete SQL query over the HTTP API. When Kyuubi is scaled out to multiple instances, HTTP API requests can be routed to any instance, so the first two requests may land on different Kyuubi instances, and the second request will not find the correct Session because it is stored on the other instance. We may need to introduce a Session sharing mechanism (with a third-party persistent storage component)?

I think the key point here is how to share state between Kyuubi Servers.

We have already had a discussion on it. The basic ideas are:

  1. Store the state in shared storage (e.g. ZooKeeper, MySQL, Redis) rather than in memory.
  2. Move some state from the Kyuubi Server to the engine side. Suppose the Kyuubi Servers only keep and share the mapping from session_id to engine instance; then any Kyuubi Server can establish the Thrift connection to the target engine when an HTTP request arrives.
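
The second idea can be sketched as follows. `SharedSessionStore` is an in-memory stand-in for the external storage, and `ServerInstance`, the engine address, and the reply format are all hypothetical; a real server would open a Thrift connection where the comment indicates.

```python
class SharedSessionStore:
    """Stand-in for shared storage such as ZooKeeper, MySQL, or Redis."""

    def __init__(self):
        self._engine_by_session = {}

    def put(self, session_id, engine_address):
        self._engine_by_session[session_id] = engine_address

    def get(self, session_id):
        return self._engine_by_session.get(session_id)


class ServerInstance:
    """Hypothetical server that keeps only a local connection cache."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.connections = {}  # session_id -> engine address (local cache)

    def open_session(self, session_id, engine_address):
        # Publish the mapping so that other instances can find the engine.
        self.store.put(session_id, engine_address)
        self.connections[session_id] = engine_address

    def handle(self, session_id, sql):
        engine = self.connections.get(session_id)
        if engine is None:
            # The session was opened elsewhere: look the engine up.
            engine = self.store.get(session_id)
            if engine is None:
                raise KeyError(f"unknown session: {session_id}")
            # A real server would open a Thrift connection to the engine here.
            self.connections[session_id] = engine
        return f"{self.name} ran {sql!r} on {engine}"


store = SharedSessionStore()
server_a = ServerInstance("server-a", store)
server_b = ServerInstance("server-b", store)

server_a.open_session("s1", "engine-host:10009")
reply = server_b.handle("s1", "SELECT 1")  # request routed to another instance
print(reply)
```

Because only the mapping is shared, a load balancer can send each request to any instance without sticky routing.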

yanghua commented 3 years ago

Hi @iodone, thank you for sharing and analyzing the pros and cons of the different solutions, and for your in-depth thinking.

@pan3793 has already answered some of the points (we can continue to discuss them). I will give some personal views on these two designs from a macro perspective.

On the whole, I personally prefer the Solution 1 you proposed. The existing design document is also based on this architecture.

That is because it reuses and complies with Kyuubi's overall design and abstractions. Regarding the REST service, I think the Kyuubi Server should not just act as a proxy forwarding layer in front of the Spark engine, as in Solution 2. It should shield end users from direct dependence on the engine layer (relying on interfaces that expose features unique to a single engine is also a direct dependency in a sense), and at the same time be architecturally prepared to support multiple engines underneath.

If we implement it based on Solution 2, although a proxy layer is hosted in the Kyuubi Server, the interface seems strongly coupled to the engine if we want to provide sufficient flexibility (correct me if I am wrong). So if we support another engine at the engine layer, this approach looks like a silo (a "chimney") rather than layer-by-layer decoupling.

Of course, abstraction means more basic and more universal capabilities, which is obviously at odds with the rich and diverse flexibility of a single engine. It seems that this is something we must give up?