StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

[Feature] StarRocks supports the Arrow Flight SQL protocol to enhance data transfer efficiency #50285

Open liubotao opened 2 months ago

liubotao commented 2 months ago

Feature request

Is your feature request related to a problem? Please describe. This feature request addresses the need for more efficient data transfer and query execution in StarRocks. Currently, transferring large data sets can be slow and resource-intensive, which limits analysis performance and degrades the user experience.

Describe the solution you'd like This feature adds support for Apache Arrow Flight SQL to StarRocks. Arrow Flight is a high-performance remote procedure call (RPC) framework specifically designed for large-scale data set transmission. By integrating Arrow Flight SQL, StarRocks can achieve more efficient data transfer and query execution, with expected data transfer speed improvements of up to 10 to 100 times.

Describe alternatives you've considered An alternative to integrating Arrow Flight SQL could be optimizing the existing data transfer mechanisms within StarRocks. However, these traditional optimizations may not achieve the same level of performance improvements as the specialized Arrow Flight SQL integration. Another alternative could be using different RPC frameworks, but Arrow Flight SQL's compatibility and high performance make it a preferable choice.

Additional context The Arrow Flight SQL service in StarRocks will support connections through the Python ADBC (Arrow Database Connectivity) standard interface and JDBC (Java Database Connectivity) drivers. This compatibility allows for significant acceleration of queries and data transfers without requiring changes to existing SQL statements, thus enhancing overall data analysis performance and user experience.
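As a concrete illustration of the ADBC path described above, the sketch below connects to the proposed Arrow Flight SQL service with the Python ADBC Flight SQL driver and fetches a result as an Arrow table. The host, port, and credentials are placeholders; the actual FE Arrow Flight port is configuration-dependent.

```python
# Minimal sketch: query StarRocks over Arrow Flight SQL via Python ADBC.
# Host/port/credentials below are illustrative placeholders.

def flight_sql_uri(host: str, port: int) -> str:
    """Build the gRPC URI the ADBC Flight SQL driver expects."""
    return f"grpc://{host}:{port}"

def query_arrow_table(host, port, user, password, sql):
    # Local import so the sketch is readable without the driver installed:
    # pip install adbc-driver-flightsql
    import adbc_driver_flightsql.dbapi as flight_sql

    conn = flight_sql.connect(
        flight_sql_uri(host, port),
        db_kwargs={"username": user, "password": password},
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)               # SQL text is passed through unchanged
        return cur.fetch_arrow_table()  # results arrive in Arrow format
    finally:
        conn.close()
```

Note that only the transport changes: existing SQL statements work as-is, which is the "no changes to existing SQL" property the proposal highlights.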

Technical Solution

Default Query Execution Steps

[Diagram: default query execution steps]

Arrow Flight SQL Query Execution Steps

[Diagram: Arrow Flight SQL query execution steps]

The upgrade plan for StarRocks to support Arrow Flight SQL is shown in the diagrams above, with the specific adjustments as follows:

  1. Addition of Arrow Flight SQL Service: Arrow Flight SQL services have been added to both the BE (Backend) and FE (Frontend) sides. The Arrow Flight SQL service on the FE side is responsible for returning the schema and endpoints of the query result to the client. The Arrow Flight SQL service on the BE side is responsible for converting the final query result into Arrow format and returning it to the client.

  2. Two-Stage Query: The original single-stage query process has been changed to a two-stage query process. For regular SQL queries, the client only needs to interact with the FE service, and all data is returned through the FE service. However, for Arrow Flight SQL queries, the query process is divided into two stages. The first stage retrieves the schema of the result and the endpoints where the final result data is stored. In the second stage, the client directly connects to the endpoints storing the query results, without the need to route data through the FE.

  3. New Protocol Introduction: A new protocol, TResultSinkType.ARROW_FLIGHT_PROTOCOL, is introduced to differentiate between regular SQL queries and Arrow Flight SQL queries. This allows both the FE and BE sides to handle and return the corresponding query results based on the different protocols.
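The two-stage flow in steps 1–3 can be sketched with the low-level pyarrow.flight client. The FE address and the use of the raw SQL text as the command descriptor are illustrative assumptions; the ADBC and JDBC drivers hide these details from applications.

```python
# Sketch of the two-stage Arrow Flight SQL fetch: stage 1 asks the FE for
# the result schema and endpoints, stage 2 pulls Arrow data directly from
# the BE endpoints, bypassing the FE on the data path.

def fetch_two_stage(fe_location: str, sql: str):
    import pyarrow.flight as flight  # local import: requires pyarrow

    fe = flight.FlightClient(fe_location)
    # Stage 1: FE plans the query and returns a FlightInfo containing the
    # result schema and the endpoints (BE nodes) holding the result data.
    descriptor = flight.FlightDescriptor.for_command(sql.encode())
    info = fe.get_flight_info(descriptor)

    tables = []
    # Stage 2: connect to each endpoint directly and read the Arrow
    # record batches using the ticket issued for that endpoint.
    for endpoint in info.endpoints:
        for location in endpoint.locations:
            be = flight.FlightClient(location)
            reader = be.do_get(endpoint.ticket)
            tables.append(reader.read_all())
    return tables
```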

alvin-celerdata commented 2 months ago

@liubotao Thanks for the issue ticket. I know you have created a PR to implement this feature, so I have some questions about it. First, can you make the pictures English-only?

  1. Do you need to open a new network port for this new protocol? Do you support the functionality that data in flight is encrypted?
  2. How is the authentication checked and how
alvin-celerdata commented 2 months ago

@liubotao In many user environments, the backend or compute nodes are not exposed to end clients. So if a client can only fetch results from the BE/CN, those users cannot use this feature in their environments. I would therefore suggest adding an option for the client to fetch data from the frontend; that would make this feature work in any user environment.

liubotao commented 2 months ago


> can you make the pictures with only English?

Yes, I will replace the images.

> Do you need to open a new network port for this new protocol? Do you support the functionality that data in flight is encrypted?

Both the FE and the BE need to open a new network port. Arrow Flight SQL focuses on fast data transmission and does not itself handle encryption or decryption.

> How is the authentication checked and how

Arrow Flight SQL uses the gRPC protocol; during authentication, the gRPC HTTP/2 headers carry the username and password. After authentication, the server returns a ticket, and all subsequent operations are authenticated using that ticket.
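That handshake can be sketched with the low-level pyarrow.flight client: the credentials go over gRPC once, the server answers with a bearer token (the "ticket" above), and later calls attach the token instead of the password. The server address and credentials here are placeholders; real applications would usually go through the ADBC or JDBC drivers.

```python
# Sketch of basic-auth handshake followed by token (ticket) reuse.

def authenticate(location: str, user: str, password: str):
    import pyarrow.flight as flight  # local import: requires pyarrow

    client = flight.FlightClient(location)
    # One round trip: username/password travel in the HTTP/2 headers and
    # the server responds with an authorization token pair.
    token_pair = client.authenticate_basic_token(user, password)
    # Subsequent calls present the token via call options, not the password.
    options = flight.FlightCallOptions(headers=[token_pair])
    return client, options
```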

liubotao commented 2 months ago

  1. Regarding fetching data through the FE: a data-relay function can be added to the FE's Arrow Flight SQL service, changing the data-retrieval endpoint from the BE side to the FE side. The data path then becomes: client -> FE Arrow Flight server -> BE Arrow result.
  2. Regarding security: when the client requests data from a BE, the request carries ticket information. The FE generates and encrypts the ticket; the BE decrypts it and, after verifying its validity, proceeds with data retrieval. This helps prevent third parties from intercepting the results.
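The ticket scheme in point 2 can be illustrated with a stdlib HMAC signature. The shared key, payload layout, and function names below are hypothetical sketches of the FE-signs/BE-verifies idea, not the actual StarRocks implementation.

```python
import base64
import hashlib
import hmac
import json

# Placeholder: in the sketch, FE and BE share this key out of band.
SHARED_KEY = b"fe-be-shared-secret"

def issue_ticket(query_id: str, user: str) -> bytes:
    """FE side (sketch): bind the ticket to a query and user, then sign it."""
    payload = json.dumps({"query_id": query_id, "user": user}).encode()
    sig = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest().encode()
    return base64.b64encode(payload) + b"." + sig

def verify_ticket(ticket: bytes) -> dict:
    """BE side (sketch): recompute the signature before serving result data."""
    payload_b64, sig = ticket.rsplit(b".", 1)
    payload = base64.b64decode(payload_b64)
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid ticket")
    return json.loads(payload)
```

A tampered ticket fails verification, which is the property that stops a third party from fetching results with a forged or modified ticket.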