Closed yaooqinn closed 2 years ago
As discussed under #835, I will create a proposal for this feature.
thanks @yanghua
Hi @yaooqinn I opened a proposal for this feature as a Google Doc (for more convenient discussion). The link is here: https://docs.google.com/document/d/1yc-fk28oWbCNZjdFr1UXFbMtwSDQTpcwcNObn9ODnhQ/edit?usp=sharing
It is not complete yet; please focus on the "API design" section. If you have any questions, please let me know.
Thanks, @yanghua. The design doc LGTM. BTW, shall we also consider merging these APIs from HiveServer2? FYI https://issues.apache.org/jira/browse/HIVE-4752
sounds good, will add them later.
I am not quite sure whether they can be merged or not. If you find any blockers, please let us know; we can handle those separately.
I am currently adding all the missing interfaces that come from org.apache.hive.service.cli.ICLIService to my documentation. Can we make these APIs land in the first phase?
Yes. looks great
Hi @yaooqinn I have updated the design doc and added a mapping sub-item for each API so that we can see the relationship between these APIs and the methods of the ICLIService interface.
thank you very much, the API design looks good to me, just left some minor revisions.
thanks, have addressed your concern and replied to your suggestion.
@yaooqinn I will continue to polish the design documentation about the "Implementation" section.
BTW, do you mind the implementation language? It's about technology selection.
In my view, Javalin seems to be an option for the RESTful API server.
What's your opinion?
Thanks. Javalin looks interesting; I am OK with it personally.
@yanghua In my previous experience, it's a little painful to mix Kotlin and Scala, and the Kyuubi codebase is currently mostly Scala. Personally, I recommend using Java or Scala and avoiding new languages unless necessary.
Hi @pan3793 Thanks for sharing your thoughts.
it's a little painful to mix Kotlin and Scala. I recommend using Java or Scala and avoiding new languages unless necessary.
You mean using Javalin would introduce the Kotlin language into the Kyuubi project, right?
Let me clarify this and try to address your concern.
Yes, the Javalin project is written in Kotlin. But from a development view, we do not need to know or interact with Kotlin.
Assuming we introduce Javalin, we only need to add a Maven dependency for it and then use Java to build an app framework, like the one in the Hudi codebase. It's all about Java code at the development level. You can see that Hudi does not contain a single line of Kotlin code.
My thought is that we could write the code in Java and build around Javalin. Javalin is very light and is used to provide the timeline service in Hudi. That's why I added it to my list of choices.
Not sure whether the above clarification is sufficient to answer your concerns.
Correct me if my understanding is wrong about your comment.
Personally, I love Kotlin and use Kotlin with Spring Boot in production heavily.
While from a development view, we do not need to know and interact with Kotlin.
This is an ideal vision, but what if developers without Kotlin background want to learn how Javalin works, how to extend the framework, and debug inside the framework?
It's all about Java code at the development level.
Hudi uses Scala only in some modules, but Kyuubi uses Scala everywhere, so we may not be able to avoid invoking Kotlin functions from Scala. If developers have no idea how Kotlin code compiles to bytecode, they may not know how to invoke a Kotlin function.
Javalin is very light and is used to provide the timeline service in Hudi.
I had a quick look at the Javalin docs. The API design looks clean and light, but from a dependency perspective it's not so light. No offence, but I don't think Hudi did a great job on dependency management; I have seen lots of Hudi users complain about jar conflicts in DingTalk groups.
That's why I add it to my choice list.
Do we have other options? A pure Java framework makes more sense to me.
I think the purpose of a layered software stack and the various encapsulations in a framework is to decouple and build abstractions, so that users do not have to perceive too much detail. Still, I respect your familiarity with Scala and Kotlin and your rich experience in using them.
Do we have other options? A pure Java framework makes more sense to me.
Yes, the other option is purer and lower-level: we can build it on native Netty.
This choice also has pros and cons. The advantage is that it is more versatile and does not introduce additional or niche dependencies. The disadvantage is that we need to write more code and understand more of the relatively low-level details.
This choice was successfully applied in the "flink-runtime-web" module.
WDYT?
Netty is a nice network library; it would be the first candidate for a low-level or custom protocol, e.g. the MySQL or PostgreSQL protocol. But as you said, it's too low-level for HTTP or WebSocket.
Quote from "flink-runtime-web"
The server side of the dashboard is implemented using Netty with Netty Router for REST paths.
From the project commit history, Netty Router has not been maintained since 2017, so it's risky for Kyuubi to adopt it.
Have you considered Jetty or Spring Boot?
Jetty is light and Spring Boot is heavy. To my knowledge, those two are the most widely used HTTP server frameworks in the Java ecosystem (correct me if I'm wrong).
For this case, the light one Jetty would be better.
Jetty is adopted by Apache Spark, Apache Hive, Apache Druid, etc. to build HTTP services. Trino/Presto builds on top of airlift which depends on Jetty. Seems Javalin also depends on Jetty?
Besides, there is a naive usage of Jetty in Kyuubi PrometheusReporterService, maybe we can build a unified HTTP server which also covers this case.
There is also PR #437 (not merged), which uses Jetty to build a monitoring service.
sounds good.
For this case, the light one Jetty would be better.
between Jetty and Spring Boot, +1 for Jetty
Jetty is adopted by Apache Spark, Apache Hive, Apache Druid, etc. to build HTTP services. Trino/Presto builds on top of airlift which depends on Jetty. Seems Javalin also depends on Jetty?
It seems Jetty is a really widely used framework.
Besides, there is a naive usage of Jetty in Kyuubi PrometheusReporterService, maybe we can build a unified HTTP server which also covers this case.
+1 to restrain the technology stack
Also cc @turboFei
+1 for Jetty
sounds good, thanks to everyone for sharing opinions. Will polish the "Implementation" section in the design documentation.
Akka-HTTP, Play, HTTP4s, Cask and Zio-HTTP are available in the Scala development stack; they are pure Scala and functional in style.
Thanks for sharing your thoughts. IMO, we may deliberately lean towards the Java ecosystem here, which will help Kyuubi's developer community. Of course, the core premise is that Jetty, the selected option, is sufficient to meet our needs.
@yanghua Very good proposal. I have a few questions.
Who are the target users of the REST API service? If it's for normal users, this API is actually an HTTP implementation of ThriftCLIInterface, and it would be very complicated for users to call. The steps to submit a query:
Submitting a query and getting the results requires more than 6 HTTP API calls, and that does not include handling failed calls. If the user forgets to call the API that releases resources, what should be done?
Will a client SDK, like the Hive JDBC SDK, later be provided on top of the REST API? There is already a new Kyuubi Hive JDBC under development, which could completely cover the REST API capabilities.
If the stateless HTTP protocol fully imitates the RPC calls, the user side will be more complex to use. Is it possible to simplify the HTTP API from the user's point of view instead of simply copying the Thrift RPC calls?
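To make the cost concrete, here is a minimal Python sketch of a client walking a Thrift-style session/operation lifecycle. `FakeKyuubiRest` and its method names are hypothetical in-memory stand-ins that loosely mirror ICLIService methods; they are not Kyuubi's actual API. Each method call stands for one HTTP round trip, and the nested try/finally is what a careful client would need so that resources are released even when a call fails.

```python
import itertools


class FakeKyuubiRest:
    """Hypothetical in-memory stand-in for a session/operation REST API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.open_sessions = set()
        self.open_operations = set()
        self.calls = 0  # number of simulated HTTP round trips

    def open_session(self):
        self.calls += 1
        session_id = f"session-{next(self._ids)}"
        self.open_sessions.add(session_id)
        return session_id

    def execute_statement(self, session_id, sql):
        self.calls += 1
        if session_id not in self.open_sessions:
            raise KeyError(session_id)
        operation_id = f"operation-{next(self._ids)}"
        self.open_operations.add(operation_id)
        return operation_id

    def get_operation_status(self, operation_id):
        self.calls += 1
        return "FINISHED"  # the fake finishes immediately

    def get_result_set_metadata(self, operation_id):
        self.calls += 1
        return ["col"]

    def fetch_results(self, operation_id):
        self.calls += 1
        return [[1]]

    def close_operation(self, operation_id):
        self.calls += 1
        self.open_operations.discard(operation_id)

    def close_session(self, session_id):
        self.calls += 1
        self.open_sessions.discard(session_id)


def run_query(api, sql):
    """Run one query, releasing server-side resources even on failure."""
    session_id = api.open_session()
    try:
        operation_id = api.execute_statement(session_id, sql)
        try:
            while api.get_operation_status(operation_id) != "FINISHED":
                pass  # a real client would sleep between polls
            columns = api.get_result_set_metadata(operation_id)
            rows = api.fetch_results(operation_id)
            return columns, rows
        finally:
            api.close_operation(operation_id)
    finally:
        api.close_session(session_id)
```

In this sketch a single query costs seven round trips in the best case, and a client that skips the finally blocks leaves the session and operation open on the server.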
Referring to Presto/Trino's Rest API, there are only 3 core APIs.
The API for session and operation management is very well designed, and the target users are administrators and operators. But for ordinary users to submit SQL, I feel it is slightly complicated
Thanks @iodone for participating and sharing your thoughts.
Submitting a query to get the results requires more than 6 HTTP API calls ... ... ... for ordinary users to submit SQL, I feel it is slightly complicated
I agree with you and have a basic idea to simplify it.
We can make session info optional in the operations APIs, create a new session if session info is absent, and return the session id to the client so that the user can reuse the session in the following requests.
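A minimal Python sketch of that idea, under the assumption of an in-memory session store. `SessionManager`, `submit_statement`, and the response field names are hypothetical and only illustrate the optional-session behaviour; they are not part of the design doc.

```python
import uuid


class SessionManager:
    """Hypothetical in-memory session store."""

    def __init__(self):
        self.sessions = {}  # session id -> statements run in that session

    def open_session(self):
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = []
        return session_id


def submit_statement(manager, sql, session_id=None):
    """Run a statement, creating a session only when none is supplied."""
    implicit = session_id is None
    if implicit:
        session_id = manager.open_session()
    elif session_id not in manager.sessions:
        raise KeyError(f"unknown session: {session_id}")
    manager.sessions[session_id].append(sql)
    # The session id is always returned so the caller can reuse or close it.
    return {
        "session_id": session_id,
        "implicit_session": implicit,
        "statement": sql,
    }
```

A first call without a session id gets one back; passing that id on the next call reuses the same session instead of opening a second one.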
Hi @iodone Sorry for the late reply. Thanks for chiming in and sharing your thoughts.
To answer your question:
Who are the target users of these RESTful APIs?
I did not make assumptions about it. The original intention was to expose Kyuubi's capabilities through the HTTP protocol. Its target users can be administrators or some regular users who submit queries. It can be used directly or indirectly (for example, after building a Kyuubi UI based on it, use it step by step).
Thank you for your feedback. Yes, this API format is very inconvenient for users who call it directly.
That is because these APIs are designed around resources. From a resource point of view, creating the Session first and then the Operation is consistent. But it brings complexity and a higher cost of understanding to direct users, and even lower performance.
I personally suggest that we do not make assumptions about the target users of these APIs; otherwise, this matter will become more complicated in the future.
So we are happy to incorporate your suggestions, such as:
Referring to Presto/Trino's Rest API, there are only 3 core APIs.
postStatement (POST /v1/statement)
getStatus (GET /v1/statement/queued/{queryId}/{slug}/{token})
getQueryResults (GET /v1/statement/executing/{queryId}/{slug}/{token})
We can introduce a resource with a "statement" as the first-level resource, so that it is more convenient for some users. It may go a little against the resource model in HS2, but we should not always stick to the rules.
WDYT?
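As a rough illustration of a statement-first flow, the Python sketch below follows the Trino-style pattern in which each response may carry a `nextUri` and the client keeps following it until it is absent. `FakeStatementServer` is a hypothetical in-memory stand-in, and its URI layout is simplified compared with the queued/executing paths quoted above.

```python
class FakeStatementServer:
    """Hypothetical server exposing a statement as the first-level resource."""

    def __init__(self, pages):
        self._pages = pages   # list of row batches to hand out in order
        self._queries = {}    # query id -> submitted SQL text

    def post_statement(self, sql):
        query_id = f"q{len(self._queries) + 1}"
        self._queries[query_id] = sql
        return {"id": query_id, "nextUri": f"/v1/statement/{query_id}/0"}

    def get(self, uri):
        # "/v1/statement/q1/0" splits into ['', 'v1', 'statement', 'q1', '0'].
        _, _, _, query_id, token = uri.split("/")
        if query_id not in self._queries:
            raise KeyError(query_id)
        index = int(token)
        response = {"id": query_id, "data": self._pages[index]}
        if index + 1 < len(self._pages):
            response["nextUri"] = f"/v1/statement/{query_id}/{index + 1}"
        return response


def fetch_all(server, sql):
    """Submit a statement and follow nextUri until the results are exhausted."""
    response = server.post_statement(sql)
    rows = []
    while "nextUri" in response:
        response = server.get(response["nextUri"])
        rows.extend(response.get("data", []))
    return rows
```

The client never sees a session or operation handle here; whether the server creates them behind the scenes is a separate implementation choice.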
a resource with a "statement" as the first-level resource
@yanghua I strongly agree with the design of a resource with a "statement" as the first-level resource, and with @pan3793's point. The discussion can be divided into two parts:
First of all, a statement-based resource does simplify the SQL submission interaction, as Presto already does.
But a statement-based resource would conflict with the overall design of Kyuubi's backend service module and operation module; the HTTP API and the Thrift API would effectively become two parallel solutions (this is a personal opinion).
In addition, we need to consider authentication compatibility between the HTTP API and the Thrift API, as well as how to manage statement resource state (held in memory) across multiple Kyuubi instances.
The session and operation HTTP APIs are provided to help them view and intervene in current services, including:
Some of the interfaces in @yanghua's documentation can be provided as APIs for Thrift context resource management. Of course, could kyuubi-ctl cover this part instead of the HTTP API implementation?
But a statement-based resource would conflict with the overall design of Kyuubi's backend service module and operation module; the HTTP API and the Thrift API would effectively become two parallel solutions (this is a personal opinion).
If we do not assume the users of these APIs, nor set limits on the scope of these APIs, can we just expose the capabilities of Kyuubi services as much as possible? Currently, our main focus is on matching capabilities equivalent to HS2 thrift rpc services. But Kyuubi's future ability should be a superset of this ability. For example, @turboFei wants to display Session/Operation logs.
I understand that your point is to classify APIs according to potential user roles. That is also reasonable. But along what dimension should we classify?
Without classification, we will only need to follow RESTful principles, and our core focus is on resources. Regarding "authentication", we can take the non-HS2 interface API as a special case.
In short, I personally have an open mind about whether to classify APIs. @pan3793 Can you chime in?
Of course, could kyuubi-ctl cover this part instead of the HTTP API implementation?
IMO, that could happen in the future. Even if kyuubi-ctl supports these capabilities, we still expect to expose them through the HTTP API so that we can build a visual UI for Kyuubi. Some basic management APIs are still needed.
@yanghua Thank you very much for your patient explanation; I get your point. Let's discuss the subsequent implementation based on
a resource with a "statement" as the first-level resource
Considering only the main flow of a user submitting SQL, I can think of two options.
As @pan3793 mentioned above:
We can make session info optional in the operations APIs, create a new session if session info is absent, and return the session id to the client so that the user can reuse the session in the following requests.
The underlying implementation still goes through the Thrift CLI interface, but the HTTP API exposes only the concept of a statement and hides the session. When submitting a query, the server creates the session first, and then calls executeStatement on the session:
Based on @yanghua's point of view:
we will only need to follow RESTful principles, and our core focus is on resources.
Another option is a fully RESTful architecture, entirely different from the Thrift CLI interface:
Solution-1 has already landed in our production environment and is currently on trial. But I personally prefer Solution-2.
@yanghua @pan3793 WDYT?
@iodone, thanks for sharing your thoughts. I will give my views on the disadvantages of Solution 1 that you mentioned.
- The Session is hidden from the user, so the timing of Session resource release is uncertain, which may lead to session resource leaks. Could a session-per-query mechanism plus periodic checks avoid this?
I'm not proposing to drop the session_id concept in the REST API; making session_id optional means the user can still provide a session_id.
The session is not hidden from users. We can still keep the REST API for sessions, and return the implicitly created session_id to the client so that the user can reuse or release the session in the following requests.
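The periodic check raised in the quoted concern could be sketched as follows. This is a hypothetical Python illustration, not Kyuubi's session manager: `SessionRegistry` and its methods are made-up names, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class SessionRegistry:
    """Hypothetical registry that closes sessions idle for too long."""

    def __init__(self, idle_timeout_seconds):
        self.idle_timeout_seconds = idle_timeout_seconds
        self._last_access = {}  # session id -> time of the last request

    def touch(self, session_id, now):
        """Record that a request used this session at time `now`."""
        self._last_access[session_id] = now

    def close_idle(self, now):
        """Drop and return the sessions idle for at least the timeout."""
        expired = [
            session_id
            for session_id, last in self._last_access.items()
            if now - last >= self.idle_timeout_seconds
        ]
        for session_id in expired:
            del self._last_access[session_id]
        return expired

    def active(self):
        return set(self._last_access)
```

A background task calling `close_idle` at a fixed interval would reclaim implicitly created sessions that a client never released, while sessions that keep receiving requests stay alive.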
- The HTTP API uses short-lived connections, while Kyuubi and the Kyuubi engine establish a long-lived connection directly, which means that Session state is kept in memory to maintain the connection with the Kyuubi engine. We know that at least two HTTP API requests are required to execute a complete SQL query over the HTTP API. When Kyuubi is scaled out to multiple instances, HTTP API requests can be routed to any instance, so the first two requests may land on different Kyuubi instances, and the second request will not find the correct Session because the Session is stored on the other instance. We may need to introduce a Session sharing mechanism (introducing a third-party persistent storage component)?
The key point here, I think, is how to share state between Kyuubi Servers.
We have already had a discussion on it. The basic idea is a session_id to engine instance mapping, so that any Kyuubi Server can establish the thrift connection to the target engine when an HTTP request arrives.
Hi @iodone Thank you for sharing and analyzing the pros and cons of different solutions, and for your in-depth thinking.
@pan3793 has already given answers to some points (we can continue to discuss this). I will give some personal views on these two designs from a macro perspective.
On the whole, I personally prefer the Solution 1 you proposed, because it reuses and complies with Kyuubi's overall design and abstraction. The existing design document is also based on this architecture.
Regarding the REST service, I think the Kyuubi Server should not just act as a proxy forwarding layer in front of the Spark Engine, as in Solution 2. It should shield end users from direct dependence on the engine layer (relying on interfaces that expose the unique features of a single engine is also a direct dependency in a sense), and at the same time prepare, in terms of architecture design, to support multiple engines at the bottom layer.
And if we implement it based on Solution 2, although it has a proxy layer hosted in the Kyuubi server, its interface seems to be strongly tied to the engine if we want to provide sufficient flexibility (correct me if I am wrong). So, if we support another engine at the Engine layer, this approach looks like a silo (chimney) rather than layer-by-layer decoupling.
Of course, abstraction means more basic and more universal capabilities. This is obviously contrary to the rich and diverse flexibility of a single engine. It seems that this is something that must be given up?
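The session_id to engine instance mapping raised earlier in the thread can be pictured with a small Python sketch. `SharedSessionStore` and `KyuubiServerStub` are hypothetical names, the store is a plain dictionary standing in for whatever storage all server instances can reach, and the engine address is made up.

```python
class SharedSessionStore:
    """Hypothetical stand-in for storage visible to every server instance."""

    def __init__(self):
        self._engine_by_session = {}

    def put(self, session_id, engine_address):
        self._engine_by_session[session_id] = engine_address

    def get(self, session_id):
        return self._engine_by_session.get(session_id)


class KyuubiServerStub:
    """Hypothetical server instance that keeps no session state locally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def open_session(self, session_id, engine_address):
        # Publish the mapping so any other instance can serve this session.
        self.store.put(session_id, engine_address)

    def handle_request(self, session_id):
        engine_address = self.store.get(session_id)
        if engine_address is None:
            raise KeyError(f"unknown session: {session_id}")
        # A real server would open a thrift connection to the engine here.
        return f"{self.name} -> {engine_address}"
```

With this shape, a session opened through one instance can be served by another, because the second instance looks up the engine address in the shared store instead of its own memory.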
Expected behavior
Actual behavior.
binary only now