Open xinyiZzz opened 11 months ago
Arrow Flight SQL is much faster than Pymysql and other MySQL-protocol-based libraries across all column types, with a more than 20x performance improvement in most large-volume read scenarios and up to 100x in some.
Compared with the traditional jdbc:mysql connection method, we tested the performance of reading data from Doris using three Arrow Flight SQL connection methods. The test code can be found by searching for "Arrow Flight" in https://github.com/apache/doris-websit. jdbc:mysql and jdbc:arrow-flight-sql both return data in JDBC ResultSet format, while Flight AdbcDriver and Flight JdbcDriver both return data in Arrow format.
| # | Column type / cost (unit: s) | JDK version | Java jdbc:mysql DriverManager | Java jdbc:arrow-flight-sql DriverManager | Java Flight AdbcDriver | Java Flight JdbcDriver | SQL |
|---|---|---|---|---|---|---|---|
| 1 | Int column (10M rows) | JDK 1.8 | 3.772 | 0.425 | 0.568 | 0.829 | select ClientIP from clickbench.hits limit 10000000; |
| | | JDK 17 | 3.826 | 0.451 | 0.510 | 0.756 | |
| 2 | Bool column (10M rows) | JDK 1.8 | 4.353 | 0.430 | 0.547 | 0.815 | select CounterClass from clickbench.hits where CounterClass!=0 limit 10000000; |
| | | JDK 17 | 3.755 | 0.421 | 0.491 | 0.751 | |
| 3 | String column (10M rows) | JDK 1.8 | 9.800 | 1.103 | 1.218 | 4.519 | select URL from clickbench.hits where URL!='' limit 10000000 |
| | | JDK 17 | 5.454 | 0.973 | 1.062 | 3.102 | |
| 4 | Mixed columns (1M rows) | JDK 1.8 | 8.478 | 1.799 | 2.123 | 13.431 | select * from clickbench.hits limit 1000000 |
| | | JDK 17 | 4.355 | 1.794 | 1.919 | 10.544 | |
| 5 | Decimal column (600M rows) | JDK 1.8 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 23.354 | 24.288 | Cannot get simple type for type DECIMAL | select l_extendedprice from tpch.lineitem; |
| | | JDK 17 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 23.357 | 23.701 | Cannot get simple type for type DECIMAL | |
| 6 | Decimal column (10M rows) | JDK 1.8 | 4.499 | 0.654 | 0.873 | Cannot get simple type for type DECIMAL | select l_extendedprice from tpch.lineitem limit 10000000; |
| | | JDK 17 | 3.456 | 0.660 | 0.789 | Cannot get simple type for type DECIMAL | |
| 7 | DATE column (600M rows) | JDK 1.8 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 24.323 | 24.559 | 122.514 | select l_commitdate from tpch.lineitem; |
| | | JDK 17 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 23.932 | 23.919 | 124.309 | |
| 8 | DATE column (10M rows) | JDK 1.8 | 4.554 | 0.636 | 0.864 | 2.784 | select l_commitdate from tpch.lineitem limit 10000000; |
| | | JDK 17 | 3.226 | 0.689 | 0.838 | 2.712 | |
| 9 | Mixed columns (1M rows) | JDK 1.8 | 1.690 | 0.892 | 1.070 | Cannot get simple type for type DECIMAL | select * from tpch.lineitem limit 1000000; |
| | | JDK 17 | 1.061 | 0.756 | 0.919 | Cannot get simple type for type DECIMAL | |
Conclusions that can be drawn from the test above:
Next, compare the performance of reading data from Doris and converting it to String with each connection method; in the loop body of each getResultSet or loadNextBatch, resultSet.toString() or VectorSchemaRoot.contentToTSVString() is called.
| # | Column type / cost (unit: s) | JDK version | Java jdbc:mysql DriverManager | Java jdbc:arrow-flight-sql DriverManager | Java Flight AdbcDriver | Java Flight JdbcDriver | SQL |
|---|---|---|---|---|---|---|---|
| 1 | Int column (10M rows) | JDK 1.8 | 11.972 | 2.239 | 1.797 | 2.078 | select ClientIP from clickbench.hits limit 10000000; |
| | | JDK 17 | 10.793 | 1.580 | 1.466 | 1.686 | |
| 2 | Bool column (10M rows) | JDK 1.8 | 6.514 | 2.337 | 1.332 | 1.581 | select CounterClass from clickbench.hits where CounterClass!=0 limit 10000000; |
| | | JDK 17 | 5.436 | 1.601 | 1.074 | 1.343 | |
| 3 | String column (10M rows) | JDK 1.8 | 12.376 | 2.378 | 4.536 | 7.855 | select URL from clickbench.hits where URL!='' limit 10000000 |
| | | JDK 17 | 7.460 | 1.766 | 5.674 | 6.709 | |
| 4 | Mixed columns (1M rows) | JDK 1.8 | 14.957 | 1.856 | 13.97 | 26.437 | select * from clickbench.hits limit 1000000 |
| | | JDK 17 | 7.840 | 1.818 | 13.24 | 25.421 | |
| 5 | Decimal column (600M rows) | JDK 1.8 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 86.219 | 72.473 | Cannot get simple type for type DECIMAL | select l_extendedprice from tpch.lineitem; |
| | | JDK 17 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 49.458 | 70.62 | Cannot get simple type for type DECIMAL | |
| 6 | Decimal column (10M rows) | JDK 1.8 | 6.556 | 2.256 | 2.131 | Cannot get simple type for type DECIMAL | select l_extendedprice from tpch.lineitem limit 10000000; |
| | | JDK 17 | 5.298 | 1.726 | 2.168 | Cannot get simple type for type DECIMAL | |
| 7 | DATE column (600M rows) | JDK 1.8 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 86.006 | 40.807 | 170.358 | select l_commitdate from tpch.lineitem; |
| | | JDK 17 | OutOfMemoryError: Java heap space (-Xms20G -Xmx40g) | 49.792 | 37.610 | 169.474 | |
| 8 | DATE column (10M rows) | JDK 1.8 | 7.117 | 2.253 | 1.532 | 3.484 | select l_commitdate from tpch.lineitem limit 10000000; |
| | | JDK 17 | 4.753 | 1.697 | 1.365 | 3.307 | |
| 9 | Mixed columns (1M rows) | JDK 1.8 | 2.126 | 0.992 | 3.309 | Cannot get simple type for type DECIMAL | select * from tpch.lineitem limit 1000000; |
| | | JDK 17 | 1.264 | 0.793 | 3.109 | Cannot get simple type for type DECIMAL | |
Conclusions that can be drawn from the test above:
Very cool feature! Can it be maintained on the confluence?
Yes, I will maintain it later, thanks~
I'd love to get involved in this issue, is there anything I can do to help?
@xinyiZzz Thank you for creating this feature request and for implementing Arrow Flight in Doris. Currently (Doris 2.1.0) we have only data retrieval. Are you also planning to implement data ingestion via Arrow Flight?
Hi @jpohanka, there are currently no plans to implement data ingestion, but we expect to support it in the future, especially for Spark and Flink. In the past we tested loading data into Doris from Spark via Arrow Flight, which reduced data serialization time by 10x.
Thanks for adding this! Wondering if there are plans for doing something similar on the federated query side. I think it would slot in nicely between data lake support and JDBC, allowing for the best of both worlds.
A simple use case would be to run queries against another Doris instance via ADBC instead of mysql protocol.
If someone wanted to try to implement this, what would be a good template for a partitioned / parallelisable data source? (assuming we'd want to distribute the query across multiple BEs).
Hi @aditanase , good suggestion!
We have thought about using Arrow Flight SQL to implement federated queries between multiple Doris clusters, replacing the current jdbc:mysql approach.
For partitioned data sources, users often split data sources by business. With storage-compute separation, resource isolation is easy to achieve; for Doris deployments that use local storage, different businesses often create separate Doris clusters. Arrow Flight SQL is helpful for federated queries and data migration between clusters.
Search before asking
Description
Doris implements a high-speed data link based on the Arrow Flight SQL protocol, allowing multiple languages to read large batches of data from Doris at high speed using SQL.
1. Motivation
In data science scenarios, it is often necessary to load large amounts of data from Doris to Python/Java/Spark. Loading data using Pymysql/Pandas or JDBC is very slow.
Nowadays, many big data systems use columnar in-memory data formats, while Mysql/JDBC/ODBC remain the mainstream protocols and standards for interacting with database systems. Their performance shortcomings have become increasingly obvious in today's big data world: data must be serialized from the system's own columnar format into the row-oriented format of Mysql/JDBC/ODBC, and then deserialized back into the client's columnar format, which significantly slows down data movement.
If both the source database and the target client support Arrow as a columnar in-memory format, transferring using the Arrow Flight SQL protocol eliminates the need to serialize and deserialize the data, thereby eliminating the overhead in this portion of the data transfer. Additionally, Arrow Flight can leverage multi-node and multi-core architectures to optimize throughput through full parallelization.
2. Introduction to Arrow Flight SQL
Apache Arrow Flight SQL is a protocol developed by the Apache Arrow community for interacting with database systems. ADBC clients use it to exchange data in the Arrow format with databases that implement the Arrow Flight SQL protocol, combining the speed of Arrow Flight with the ease of use of JDBC/ODBC. Some basic concepts are as follows:
- **Apache Arrow**: an efficient columnar in-memory format widely used for large-scale data processing, supported by many libraries in all major programming languages.
- **Apache Arrow Flight**: an RPC framework for transporting data in the Arrow format, allowing high-speed data exchange between different systems using Arrow.
- **Arrow Flight SQL**: although the Mysql/JDBC/ODBC protocols and standards are slower, they offer simple APIs for developers; Arrow Flight SQL was therefore introduced on top of Arrow Flight to provide a friendlier interface for interacting with database systems.
- **ADBC**: a driver that lets different languages access databases that implement the Arrow Flight SQL protocol, similar to JDBC/ODBC.
3. Implementation method
3.1 Principle
In Apache Doris, query results are organized in columnar format blocks. In previous versions, if you need to transfer this data to the target client through MySQL Client or JDBC/ODBC driver, you need to first serialize the blocks into row-format bytes. If the target client is a column-formatted data science component or column-formatted database like Pandas, you also need to deserialize the row-format bytes into column-formatted ones, and the serialization/deserialization operation is a very time-consuming process.
Using Arrow Flight SQL, Doris first converts the columnar Block into an equally columnar Arrow RecordBatch; this conversion is very fast, and no further serialization or deserialization is needed during transmission. The Python client then converts the Arrow RecordBatch into an equally columnar Pandas DataFrame, which is also very fast.
In addition, Arrow Flight SQL provides a universal JDBC driver, allowing applications that are fully compatible with the JDBC standard to interact with database systems over Arrow Flight SQL.
This accelerates reading from Doris in Python, as shown in the figure:
3.2 Outline design
1. The ADBC Client sends a query request to Doris FE and completes authentication on the first request.
2. FE parses the query plan and sends the Fragments to be executed to the BEs.
3. After a BE completes Prepare and Open for the Fragment, it returns the Schema of the query result in Arrow format to FE, starts executing the query, and puts the query results into a queue.
4. FE sends the QueryID, the Schema of the query result, and the BE addresses (Endpoints) where the query result is located back to the ADBC Client.
5. The ADBC Client requests the BE to pull the query results for the specified QueryID.
6. The BE returns the queued Arrow-format query results to the ADBC Client, which finishes after verifying the Schema of the results.
3.3 Detailed design
Arrow version: 13.0.0
Take the ADBC Low-Level API execution process as an example:
3.3.1 ADBC Client
1.1 db = adbc_driver_flightsql.connect(uri="grpc://ip:port?user=&password=")
Creates a Database handle, which can maintain multiple shared Connections at the same time. Parameters: Arrow Flight Server IP, port, username, password.
1.2 conn = adbc_driver_manager.AdbcConnection(db)
Creates a connection to the Database; this triggers authentication and fetches FlightSqlInfo.
Auth The first request to the Arrow Flight Server triggers authentication. The return value is a Bearer Token; every subsequent request to the Arrow Flight Server carries this Token.
getFlightInfoSqlInfo Requests the Arrow Flight Server to return SQL Info, including the SQL syntax supported by the database. The return value is the schema and endpoint of the SQL Info; the SQL Info itself is Arrow-format data, and the endpoint is still the current Doris FE flight server. All data exchanged over Arrow Flight is in Arrow format: typically, before fetching a piece of Arrow data, a first request obtains its endpoint and schema, wrapped in a FlightInfo, and a second request to the endpoint fetches the Arrow data and verifies the schema.
getStreamSqlInfo Request the endpoint to obtain SQL Info. The result is wrapped in ArrowArrayStream and associated with a ServerStreamListener.
1.3 stmt = adbc_driver_manager.AdbcStatement(conn)
Maintains the state of a query. It can be a one-shot query or a prepared statement that can be reused, though previous query results become invalid on reuse.
1.4 stmt.set_sql_query("select * from tpch.hdf5 limit 10;")
1.5 stream, _ = stmt.execute_query()
Executing Query returns a RecordBatchReader, wrapped in a RecordBatchStream.
getFlightInfoStatement Returns the Endpoints and Schema where the query results are located, which is the Metadata of the Stream.
getStreamStatement Returns a RecordBatchReader for reading query results.
1.6 reader = pyarrow.RecordBatchReader._import_from_c(stream.address)
Creates a Reader from the Stream.
1.7 arrow_data = reader.read_all()
read_all() repeatedly calls RecordBatchReader.ReadNext() to fetch the RecordBatches of the query result.
Corresponding code example:
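The code example itself did not survive extraction; steps 1.1 to 1.7 above can be assembled into a single sketch (package names per the ADBC Python bindings; a reachable Arrow Flight SQL server is assumed):

```python
# A sketch assembling steps 1.1-1.7 above into one function. Assumes the
# adbc-driver-manager, adbc-driver-flightsql, and pyarrow packages are
# installed and that an Arrow Flight SQL server is reachable at `uri`.
def run_low_level_query(uri: str, sql: str):
    import adbc_driver_flightsql
    import adbc_driver_manager
    import pyarrow

    db = adbc_driver_flightsql.connect(uri=uri)      # 1.1 Database handle
    conn = adbc_driver_manager.AdbcConnection(db)    # 1.2 auth + FlightSqlInfo
    stmt = adbc_driver_manager.AdbcStatement(conn)   # 1.3 holds query state
    stmt.set_sql_query(sql)                          # 1.4 bind the SQL text
    stream, _ = stmt.execute_query()                 # 1.5 ArrowArrayStream
    reader = pyarrow.RecordBatchReader._import_from_c(stream.address)  # 1.6
    return reader.read_all()                         # 1.7 drain into a Table
```

For example: `run_low_level_query("grpc://ip:port?user=&password=", "select * from tpch.hdf5 limit 10;")`.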
3.3.2 Doris FE
2.1 Authentication
Implements the arrow.flight.auth2-related interfaces to handle authentication when the ADBC client connects for the first time: extract the username and password from the request header and authenticate; generate a 130-bit Token, associate it with the user's permission information, and save it in a cache, whose size and Token expiration time can be adjusted in Config; finally return the Token to the ADBC client.
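The token cache described above can be sketched as follows. This is an illustrative Python sketch, not the actual Doris FE Java implementation; the size and TTL defaults are placeholders standing in for the values configurable in Config.

```python
# Illustrative sketch of a bearer-token cache: token -> user identity,
# with a size bound and a TTL, mirroring the behavior described above.
import secrets
import time
from collections import OrderedDict

class TokenCache:
    def __init__(self, max_size=512, ttl_seconds=3 * 24 * 3600):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._cache = OrderedDict()  # token -> (username, issued_at)

    def issue(self, username):
        # A long random token (130 hex characters here), echoing the post.
        token = secrets.token_hex(65)
        if len(self._cache) >= self.max_size:
            self._cache.popitem(last=False)  # evict the oldest entry
        self._cache[token] = (username, time.time())
        return token

    def check(self, token):
        """Return the username for a valid token, or None if unknown/expired."""
        entry = self._cache.get(token)
        if entry is None:
            return None
        username, issued_at = entry
        if time.time() - issued_at > self.ttl:
            del self._cache[token]
            return None
        return username
```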
2.2 getFlightInfoSqlInfo
Returns SQL Info in response to the Arrow Flight SQL request, implementing the two methods FlightSqlProducer.getFlightInfoSqlInfo() and FlightSqlProducer.getStreamSqlInfo().
When the Arrow Flight Server is initialized, it creates the FlightSqlProducer that serves ADBC requests. On initialization, the FlightSqlProducer binds the SQL Info, including the Arrow version, whether reads and writes are supported, whether DDL statements such as creating tables and modifying schemas are supported, the list of supported functions, and other SQL syntax.
2.3 getFlightInfoStatement
Executes the Query in response to the Arrow Flight SQL request and returns the Endpoints and Schema of the query results, implementing the FlightSqlProducer.getFlightInfoStatement() method.
Initialize ConnectContext. The first time ADBC Client makes an Execute Query request, it will initialize ConnectContext, which is a Session that stores information related to query execution, including user permissions, Session variables, etc.
Initialize the executor FlightStatementExecutor. Saves Query, QueryID, connectContext, and resultServerInfo.
Execute Query. Initialize QueryID and StmtExecutor, then executeArrowFlightQuery to generate the query plan, initialize and execute the Coordinator, and send the Fragment to the specified BE.
Get the Arrow Result Set Schema. Request the Arrow Flight Server of the BE where the Result Sink Node in the query plan is located. The latter will generate the Schema of the query result after the Fragment completes Prepare and Open.
Use the Query and QueryID to initialize the Ticket; use the Arrow Flight Server address of the BE where the Result Sink Node in the query plan is located (that is, the server address where the Arrow Result Set lives) together with the Ticket to initialize the FlightEndpoint; finally use the Arrow Result Set Schema and Endpoints to initialize the FlightInfo and send it back to the ADBC Client.
3.3.3 Doris BE
3.1 Execute Fragment
Executes the Fragment and returns the Arrow Result Set Schema. The overall execution process is the same as before; the difference is that the ResultSinkNode type in the Fragment is no longer MYSQL_PROTOCAL but ARROW_FLIGHT_PROTOCAL. After Prepare and Open complete, the Arrow Schema of the query result is put into a Map for FE to fetch, and an ArrowFlightResultWriter is initialized.
After the subsequent query results arrive at the ResultSink, use ArrowFlightResultWriter::append_block to convert the data block into a RecordBatch in Arrow format, and then put it into a separate queue BufferControlBlock, waiting for the ADBC Client to pull it.
3.2 GetStatement
After receiving the Endpoints sent back by Doris FE, the ADBC Client requests the Arrow Flight Server on the Doris BE that the Endpoints point to. On receiving the request, Doris BE first decodes the Ticket to obtain the SQL and QueryID, then uses the QueryID to find the previously saved Arrow Schema of the query result and initializes a RecordBatchReader to return, which the ADBC Client uses to pull data subsequently; this implements the FlightSqlServerBase::DoGetStatement() method.
In addition, when the ADBC Client first requests the BE's Arrow Flight Server, the header also contains the Bearer Token, but the HeaderAuthServerMiddleware and BearerAuthServerMiddleware used when the BE Arrow Flight Server is initialized are both no-ops, i.e. no verification is performed. So the BE Arrow Flight Server currently authorizes requests based on the QueryID: as long as the QueryID is correct, the ADBC Client is allowed to read the data.
3.3 ArrowFlightBatchReader::ReadNext
The ADBC Client repeatedly calls the ReadNext method of the previously returned RecordBatchReader to pull data, and the BE Arrow Flight Server uses the QueryID in the request to pull the Arrow-format RecordBatch from the BufferControlBlock and return it.
4. How to use
Using the Python-based ADBC Driver (requires Python >= 3.9) as an example, connect to Doris, which implements Arrow Flight SQL and supports common syntax such as DDL, DML, Session Variables, Show statements, etc.
Modify the configuration of Doris FE and BE:
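The configuration snippet itself was not preserved in this extraction; a minimal example, assuming the `arrow_flight_sql_port` option present in recent Doris releases (port numbers are placeholders, and the option is disabled when set to -1):

```
# fe.conf: enable the FE Arrow Flight SQL server
arrow_flight_sql_port = 9090

# be.conf: enable the BE Arrow Flight SQL server
arrow_flight_sql_port = 9091
```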
After Python connects via the ADBC Driver to Doris, which implements Arrow Flight SQL, the following uses various ADBC APIs to load the Clickbench data set from Doris into Python.
The execution results are as follows (repeated output omitted). Loading the 1-million-row, 105-column, 780 MB Clickbench data set from Doris takes 3 seconds.
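A sketch of such a load using the ADBC DB-API bindings (host, port, and credentials are placeholders; assumes the `adbc-driver-flightsql` and `pandas` packages are installed and the FE Arrow Flight SQL port is enabled):

```python
# Read a Doris query result into pandas over Arrow Flight SQL and time it.
import time

def load_table(sql: str, uri: str = "grpc://127.0.0.1:9090",
               user: str = "root", password: str = ""):
    import adbc_driver_manager
    import adbc_driver_flightsql.dbapi as flight_sql

    conn = flight_sql.connect(
        uri=uri,
        db_kwargs={
            adbc_driver_manager.DatabaseOptions.USERNAME.value: user,
            adbc_driver_manager.DatabaseOptions.PASSWORD.value: password,
        },
    )
    try:
        cursor = conn.cursor()
        start = time.time()
        cursor.execute(sql)
        table = cursor.fetchallarrow()   # pyarrow.Table, no row-format detour
        df = table.to_pandas()
        print(f"loaded {len(df)} rows in {time.time() - start:.2f}s")
        return df
    finally:
        conn.close()

# Example: load_table("select * from clickbench.hits limit 1000000;")
```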
5. Progress and TODO
upgrade thirdparty libs - again https://github.com/apache/doris/pull/23414
(step1) BE support Arrow Flight server, read data only https://github.com/apache/doris/pull/23765
(step2) FE support Arrow Flight server https://github.com/apache/doris/pull/24314
(step3) Support authentication and user session https://github.com/apache/doris/pull/24772
(step4) Support other DML and DDL statements, besides Select https://github.com/apache/doris/pull/25919
(step5) Support JDBC and PreparedStatement and Fix Bug https://github.com/apache/doris/pull/27661
(step6) Support regression test https://github.com/apache/doris/pull/27847
6. Test
```sql
CREATE TABLE `hdf5` (
  `k0` varchar(65532) NULL,
  `k1` float NULL,
  `k2` float NULL,
  ……
  `k95` float NULL
) ENGINE=OLAP
DUPLICATE KEY(`k0`)
DISTRIBUTED BY HASH(`k0`) BUCKETS 64;
```
6.1 Python
Compare the performance of Python using Pymysql, Pandas and Arrow Flight SQL to read Doris:
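The benchmark script itself is not included in this thread; a sketch of the kind of timing harness involved (hosts, ports, and credentials are placeholders; assumes the `pymysql` and `adbc-driver-flightsql` packages are installed and both endpoints are reachable):

```python
# Time the same query over the MySQL protocol vs. Arrow Flight SQL.
import time

def time_pymysql(sql: str, host: str = "127.0.0.1", port: int = 9030) -> float:
    import pymysql
    conn = pymysql.connect(host=host, port=port, user="root", password="")
    start = time.time()
    with conn.cursor() as cur:
        cur.execute(sql)
        cur.fetchall()        # rows arrive in MySQL's row-oriented format
    conn.close()
    return time.time() - start

def time_flight_sql(sql: str, uri: str = "grpc://127.0.0.1:9090") -> float:
    import adbc_driver_manager
    import adbc_driver_flightsql.dbapi as flight_sql
    conn = flight_sql.connect(
        uri=uri,
        db_kwargs={
            adbc_driver_manager.DatabaseOptions.USERNAME.value: "root",
            adbc_driver_manager.DatabaseOptions.PASSWORD.value: "",
        },
    )
    start = time.time()
    cur = conn.cursor()
    cur.execute(sql)
    cur.fetchallarrow()       # columnar Arrow batches, no row round trip
    conn.close()
    return time.time() - start
```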