apache / linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
https://linkis.apache.org/
Apache License 2.0

[Feature] Linkis result set discussion #1303

Open dlimeng opened 2 years ago

dlimeng commented 2 years ago

Search before asking

Problem Description

Linkis currently stores result sets in its custom Dolphin format; this issue proposes storing them in Parquet (or ORC) instead.

Description

1. Linkis storage: Parquet
2. Linkis storage: ORC

Use case

No response

solutions

1. Introduce Apache Parquet
2. Introduce Apache ORC

Anything else

No response

Are you willing to submit a PR?

dlimeng commented 2 years ago

This page describes the process for proposing breaking changes to Linkis.

- Introduction
- Storage supports a variety of file systems
- Result Set - Parquet
- Parquet composition
- Parquet Design
- Parquet implementation
- Result Set - ORC
- ORC composition
- Compare
- Release

Introduction

Linkis needs to store various kinds of data in files, for example Hive table data, while also preserving metadata such as field types, column names, and comments.

Storage supports a variety of file systems


Result Set - Parquet

Parquet composition

Parquet is just a storage format; it is language- and platform-independent and does not need to be bound to any data processing framework. The components that currently support Parquet include the following. Essentially all of the commonly used query engines and computing frameworks have already been adapted, and data generated by other serialization tools can easily be converted into the Parquet format.

- Query engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
- Computing frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, Protocol Buffers, POJOs

The schema of each data model contains multiple fields, and each field can itself contain nested fields. Every field has three attributes: repetition, data type, and field name. The repetition can be one of three values: required (occurs exactly once), repeated (occurs zero or more times), or optional (occurs zero or one time). The data type of a field is either group (a complex type) or primitive (a basic type). The primitive types are INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, and FIXED_LEN_BYTE_ARRAY.
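
For illustration, the sketch below uses parquet-mr's `MessageTypeParser` to define a schema exercising these attributes. The result-set layout (field names, `task_id`, `columns`, and so on) is hypothetical, not Linkis's actual schema.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ResultSetSchemaExample {
    public static void main(String[] args) {
        // Hypothetical result-set schema: each field declares a repetition
        // (required / optional / repeated), a type, and a name.
        // "columns" is a group (complex type); the rest are primitives.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message result_set {\n" +
            "  required binary task_id (UTF8);\n" +
            "  optional int64 created_at;\n" +
            "  repeated group columns {\n" +
            "    required binary name (UTF8);\n" +
            "    required binary data_type (UTF8);\n" +
            "    optional binary comment (UTF8);\n" +
            "  }\n" +
            "}");
        System.out.println(schema);
    }
}
```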

Parquet Design

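Parquet's on-disk design organizes data into row groups of column chunks, with the schema and per-row-group metadata kept in a file footer. A minimal sketch that inspects this layout with the parquet-hadoop API (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectParquetFooter {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; any Parquet file will do.
        Path path = new Path("/tmp/result_set.parquet");
        Configuration conf = new Configuration();
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            // The footer carries the full schema ...
            System.out.println(reader.getFileMetaData().getSchema());
            // ... and one metadata entry per row group (row count, byte size).
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                System.out.println(block.getRowCount() + " rows, "
                    + block.getTotalByteSize() + " bytes");
            }
        }
    }
}
```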

Parquet implementation

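As a minimal sketch of what writing a result set to Parquet could look like, using parquet-mr's example Group API (the schema, values, and path are hypothetical, not the proposed Linkis implementation):

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteResultSetExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column result set layout.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message row { required binary col_a (UTF8); required int32 col_b; }");
        SimpleGroupFactory factory = new SimpleGroupFactory(schema);

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                 .builder(new Path("/tmp/result_set.parquet"))
                 .withType(schema)
                 .build()) {
            // Each record is appended column by column against the schema.
            writer.write(factory.newGroup().append("col_a", "hello").append("col_b", 1));
            writer.write(factory.newGroup().append("col_a", "world").append("col_b", 2));
        }
    }
}
```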

Result Set - ORC

ORC composition

Unlike Parquet, ORC does not natively support nested data formats; nesting is instead supported through special handling of complex data types, for example:

CREATE TABLE orcStructTable(
  name string,
  course struct<course:string,score:int>,
  score map<string,int>,
  work_locations array<string>
)

Like Parquet, ORC files are stored in binary form, so they cannot be read directly. ORC files are also self-describing and contain a lot of metadata, all of which is serialized with Protocol Buffers.

- ORC file: an ordinary binary file stored on the file system. An ORC file can contain multiple stripes, and each stripe contains multiple records. These records are stored column by column, corresponding to the row group concept in Parquet.
- File-level metadata: includes the PostScript file description, file meta information (including statistics for the entire file), information about all stripes, and the file schema.
- Stripe: a group of rows forms a stripe. Reads of the file are done in units of row groups. A stripe is generally the size of an HDFS block and stores the index and data for each column.
- Stripe metadata: stores the stripe's position, statistics for each column in the stripe, and all stream types and positions.
- Row group: the smallest unit of the index. A stripe contains multiple row groups, each composed of 10,000 values by default.
- Stream: a stream represents a valid piece of data in the file, covering both index and data. The index stream stores the position and statistics of each row group; the data stream contains the actual data, whose layout is determined by the column type and encoding.
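
To make the stripe and row-batch model above concrete, here is a minimal write sketch with the orc-core API (the schema, values, and path are hypothetical): rows are filled into a VectorizedRowBatch, and the writer accumulates batches into stripes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteOrcExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical two-column layout: name string, score int.
        TypeDescription schema =
            TypeDescription.fromString("struct<name:string,score:int>");
        Writer writer = OrcFile.createWriter(new Path("/tmp/result_set.orc"),
            OrcFile.writerOptions(conf).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        BytesColumnVector name = (BytesColumnVector) batch.cols[0];
        LongColumnVector score = (LongColumnVector) batch.cols[1];
        for (int i = 0; i < 3; i++) {
            int row = batch.size++;
            name.setVal(row, ("user-" + i).getBytes(java.nio.charset.StandardCharsets.UTF_8));
            score.vector[row] = i * 10;
        }
        // Each addRowBatch call feeds rows into the current stripe;
        // the writer flushes a stripe once it reaches the configured size.
        writer.addRowBatch(batch);
        writer.close();
    }
}
```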

Compare

Hive:

- For wide tables, ORC data performs better than Parquet data.
- The ORC file storage format performs better in terms of storage space, data import speed, and query speed, and ORC can support ACID operations to a certain extent. It is currently the columnar storage format that is more strongly advocated within the Hive community.

Release

expected release 2022-03-31

refer to

Linkis result set discussion