apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

Release Notes 0.13.0 #4370

Closed EmmyMiao87 closed 3 years ago

EmmyMiao87 commented 4 years ago

New Feature

Query spill to disk

Doris supports spilling queries to disk for sorting and window functions. When enable_spilling is true and the query reaches its memory limit, the query spills intermediate data to disk, so it no longer fails because of a memory bottleneck. In version 0.13, spilling is supported only for sort and window functions.

[#3820] [#4151] [#4152]
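The spill idea can be illustrated with a small external sort. This is only a conceptual sketch in Python, not Doris's implementation: the function name external_sort and the max_in_memory parameter are hypothetical, and the memory limit is modeled as a simple row count.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory):
    """Sort ints while buffering at most max_in_memory values before spilling."""
    run_paths = []
    buffer = []

    def spill():
        # Write the current buffer to disk as one sorted run, then free it.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:  # hypothetical stand-in for a memory limit
            spill()

    if not run_paths:
        # Everything fit in memory, so no spill was needed.
        return sorted(buffer)
    if buffer:
        spill()

    def read_run(path):
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # k-way merge of the sorted runs; each run is streamed from disk.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)
```

The final result is collected into a list only to keep the example short; a real engine would stream the merged rows to the next operator.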

Support bitmap_union, hll_union and count in materialized view

Materialized views now support a richer set of aggregate functions: bitmap_union, hll_union and count. In an order-analysis scenario, for example, users can use count to analyze the number of orders along different dimensions. Bitmap and HLL results can also be precomputed for deduplication scenarios such as analyzing PV and UV data in website traffic. Doris automatically matches a user's query to the optimal materialized view to speed up the query.

[#3651] [#3677] [#3705] [#3873] [#4014]
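To make the pre-calculation idea concrete, here is a toy sketch in Python. It is not how Doris stores materialized views: a Python set stands in for a bitmap, and build_rollup and uv_across are hypothetical names. It shows why a per-dimension count and a per-dimension union can answer PV and UV questions without rescanning the raw rows.

```python
from collections import defaultdict


def build_rollup(rows):
    """rows: iterable of (page, user_id). Returns {page: (pv_count, user_id_set)}."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for page, user_id in rows:
        counts[page] += 1          # plays the role of count
        users[page].add(user_id)   # plays the role of bitmap_union
    return {page: (counts[page], users[page]) for page in counts}


def uv_across(rollup, pages):
    """Distinct users over several pages, computed from the rollup only."""
    merged = set()
    for page in pages:
        merged |= rollup[page][1]
    return len(merged)
```

For example, with rows for two pages, uv_across(rollup, ["home", "cart"]) unions the two precomputed sets instead of deduplicating the raw traffic again.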

Spark load

Spark load preprocesses the data to be imported using external Spark resources, which improves import performance for large data volumes and saves computing resources of the Doris cluster. It is mainly intended for the initial migration, when a large amount of data is imported into Doris.

[#3418] [#3712] [#3715] [#3716]

Support loading JSON data into Doris by RoutineLoad or StreamLoad

RoutineLoad and StreamLoad support a new data format: JSON. Data in JSON format is imported into Doris through the transform rules in the load statement. This is especially useful for log services whose raw data format is JSON, because users no longer need to convert the data to CSV in an outer layer.

[#3553]

Modify routine load

The properties of a routine load job, such as concurrency and Kafka consumption progress, can be modified with the ALTER ROUTINE LOAD statement. Only jobs in the PAUSED state can be modified. After a routine load job is modified, the newly set properties are used to plan its tasks the next time they are scheduled.

[#4158]

Support fetching _id from ES and creating tables with wildcard or alias indexes of ES

A native ES document has an _id field, which is the primary key of the ES index. Doris on ES can now fetch this field. Doris also supports creating external tables on alias or wildcard indexes such as log_*, so users can search all matching indexes through a single table.

[#3900] [#3968]

Logstash Doris output plugin

The Logstash plugin outputs data from Logstash to Doris. It uses the HTTP protocol to interact with the Doris FE HTTP interface and loads the data through Doris's stream load.

[#3800]

Support SELECT INTO OUTFILE

Doris now supports exporting query results to third-party file systems such as HDFS, S3 and BOS. The syntax follows the MySQL syntax manual, and the export format is CSV. The exported results can be provided to other users for download or to other systems for further processing. This is especially useful when the result set is too large to return through the MySQL protocol, for example a large number of ids produced by bitmap_to_string.

[#3584]

Support IN predicate in delete statement

The delete statement supports conditions with IN or NOT IN predicates. With this feature, users can delete rows that match several different values in one statement.

[#4006]

Enhancement

Compaction rules optimization

This optimization updates the strategy for triggering compaction. The new version merging strategy balances write amplification, space amplification and read performance (it tends to merge files of adjacent sizes). For the same number of versions, fewer merges are performed and the total number of files is lower.

[#4212]

Simplify the delete process to make it fast

The polling load checker used during deletion is removed and replaced by a txn callback, which reduces the response time of the delete command to the millisecond level.

[#3191]

Support simple transitivity on join predicate pushdown

When the columns in a query's filter predicate are the same as the columns in the join condition, the filter predicate can be propagated across the join and applied to the other table as well. This reduces the amount of data read and speeds up the query.

[#3453]
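The propagation described above can be sketched as follows. This is a simplified Python illustration, not the Doris planner: infer_filters is a hypothetical helper, predicates are plain tuples, and only equality join conditions are considered.

```python
from collections import defaultdict


def infer_filters(join_equalities, filters):
    """join_equalities: list of (col_a, col_b) pairs known to be equal.
    filters: list of (col, op, value) predicates.
    Returns the original filters plus those implied through the equalities."""
    # Build an undirected graph of columns that the join conditions equate.
    equal_to = defaultdict(set)
    for a, b in join_equalities:
        equal_to[a].add(b)
        equal_to[b].add(a)

    inferred = list(filters)
    seen = set(filters)
    for col, op, value in filters:
        # Visit every column reachable through equality from the filtered one.
        stack, visited = [col], {col}
        while stack:
            current = stack.pop()
            for other in sorted(equal_to[current]):
                if other in visited:
                    continue
                visited.add(other)
                stack.append(other)
                candidate = (other, op, value)
                if candidate not in seen:
                    seen.add(candidate)
                    inferred.append(candidate)
    return inferred
```

With the join condition t1.id = t2.id and the filter t1.id = 10, the sketch also produces t2.id = 10, which lets the scan of t2 discard rows before the join.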

Non-blocking OlapTableSink

In this optimization, the sending process and the row-adding process run concurrently in OlapTableSink, which improves load performance. In a test with a 56 GB broker load, the original version ran for 4 hours, and the new version halved that time.

[#3143]
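The general pattern of overlapping row batching with sending can be sketched with a bounded queue and a background thread. This is a generic Python illustration under assumed names (run_sink, send, batch_size), not the OlapTableSink code itself.

```python
import queue
import threading


def run_sink(rows, batch_size, send):
    """Batch rows on the calling thread while a background thread sends full batches."""
    pending = queue.Queue(maxsize=4)  # bounded so the producer cannot run far ahead
    errors = []

    def sender():
        while True:
            batch = pending.get()
            if batch is None:  # sentinel: no more batches
                return
            try:
                send(batch)
            except Exception as exc:  # remember the failure but keep draining
                errors.append(exc)

    worker = threading.Thread(target=sender)
    worker.start()

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            pending.put(batch)  # hand off, then keep adding rows immediately
            batch = []
    if batch:
        pending.put(batch)
    pending.put(None)
    worker.join()
    if errors:
        raise errors[0]
```

Because adding rows only blocks when the queue is full, slow sends no longer stall every row, which is the effect the release note describes.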

Support txn management at db level and use ArrayDeque to improve txn task performance

Transaction management is now partitioned at the db level, so databases do not block one another, which improves the execution efficiency of transaction tasks.

[#3369]

Improve the performance of query with IN predicate

A new config, max_pushdown_conditions_per_column, limits the number of conditions on a single column that can be pushed down to the storage engine. It is separate from the existing configuration that controls scan key splitting. Its default value is 1024. After the two configurations were separated, the QPS of Doris improved and CPU usage decreased.

[#3694]
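A minimal sketch of how such a limit could gate pushdown is shown below. It rests on an assumption that is not stated in the release note: that an IN predicate whose value count exceeds the limit is evaluated after the scan instead of being pushed down. The function plan_in_predicate and the returned dictionary shape are hypothetical.

```python
MAX_PUSHDOWN_CONDITIONS_PER_COLUMN = 1024  # default value given in the release note


def plan_in_predicate(column, values, limit=MAX_PUSHDOWN_CONDITIONS_PER_COLUMN):
    """Decide where an IN predicate on one column is evaluated (assumed behavior)."""
    distinct = sorted(set(values))
    predicate = (column, "IN", distinct)
    if len(distinct) <= limit:
        # Few enough conditions: let the storage engine filter rows.
        return {"pushed_down": [predicate], "evaluated_after_scan": []}
    # Too many conditions for this column: filter in the query layer instead.
    return {"pushed_down": [], "evaluated_after_scan": [predicate]}
```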

Optimized the speed of reading parquet files

The broker read path now keeps a cache buffer array when reading parquet files. When the broker is about to seek to a position and fetch data from a remote parquet file, it first tries to read that position from the cache buffer array. If the expected data is in the cache buffer array, the remote read is skipped. In tests, the load time of parquet files in broker load or spark load was halved.

[#3878]
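The read-through behavior can be sketched in Python as below. This is a simplification under stated assumptions: it keeps a single buffer instead of the buffer array mentioned above, and BufferedRemoteReader, fetch and buffer_size are hypothetical names; the remote file is modeled as a callable that returns bytes.

```python
class BufferedRemoteReader:
    """Serve positional reads from a local buffer, refetching only on a miss."""

    def __init__(self, fetch, buffer_size):
        self._fetch = fetch              # fetch(offset, length) -> bytes from the remote file
        self._buffer_size = buffer_size  # how much to read ahead on each remote fetch
        self._buf_start = 0
        self._buf = b""
        self.remote_reads = 0            # counter to make cache hits observable

    def read_at(self, offset, length):
        end = offset + length
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= offset and end <= buf_end):
            # Miss: fetch a larger block starting at the requested offset.
            self._buf = self._fetch(offset, max(length, self._buffer_size))
            self._buf_start = offset
            self.remote_reads += 1
        rel = offset - self._buf_start
        return self._buf[rel:rel + length]
```

Parquet readers tend to issue many small reads near each other (footer, column chunk headers, pages), so a read-ahead buffer of this kind turns several remote round trips into one.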

New Built-in Functions

Other

API Change

Credits

@ZhangYu0123 @wfjcmcb @Fullstop000 @sduzh @stalary @worker24h @chaoyli @vagetablechicken @jmk1011 @funyeah @wutiangan @gengjun-git @xinghuayu007 @EmmyMiao87 @songenjie @acelyc111 @yangzhg @Seaven @hexian55 @ChenXiaoFei @WingsGo @kangpinghuang @wangbo @weizuo93 @sdgshawn @skyduy @wyb @gaodayue @HappenLee @kangkaisen @wuyunfeng @HangyuanLiu @xy720 @liutang123 @caiconghui @liyuance @spaces-X @hffariel @decster @blackfox1983 @Astralidea @morningman @hf200012 @xbyang18 @Youngwb @imay @marising @caoyang10

EmmyMiao87 commented 4 years ago

Credits

@ZhangYu0123
@wfjcmcb
@Fullstop000
@sduzh
@Stalary
@worker24h
@chaoyli
@vagetablechicken
@jmk1011
@funyeah
@wutiangan
@gengjun-git
@xinghuayu007
@EmmyMiao87
@songenjie
@acelyc111
@yangzhg
@Seaven
@hexian55
@ChenXiaofei
@WingsGo
@kangpinghuang
@wangbo
@weizuo93
@sdgshawn
@skyduy
@wyb
@gaodayue
@HappenLee
@kangkaisen
@wuyunfeng
@HangyuanLiu
@xy720
@liutang123
@caiconghui
@liyuance
@spaces-X
@hffariel
@decster
@blackfox1983
@Astralidea
@morningman
@hf200012
@xbyang18
@Youngwb
@imay
@marising
@caoyang10

marising commented 4 years ago

Please merge the feature:

[Feature][Cache] Doris caches query results based on partition #2581

LiHaibo 2020-8-19

EmmyMiao87 commented 3 years ago

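New Feature

Query spill to disk

Doris supports spilling queries to disk for sorting and window functions. When the parameter enable_spilling is true and a query reaches its memory limit, the query spills to disk, which avoids query failures caused by a memory bottleneck. Version 0.13 mainly supports spilling for sorting and window functions.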

[#3820] [#4151] [#4152]

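Materialized views support bitmap_union, hll_union and count

Materialized views support a richer set of aggregate functions: bitmap_union, hll_union and count. In an "order" business scenario, users can create a count-type materialized view to analyze the number of orders along different dimensions. The bitmap and hll functions can also be precomputed for deduplication analyses, for example analyzing PV and UV data in website traffic. Doris automatically matches a user's query to the best materialized view to speed up the query.

[#3651] [#3677] [#3705] [#3873] [#4014]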

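Spark load

External Spark resources perform the ETL processing of imported data, which improves Doris's import performance for large data volumes and saves computing resources of the Doris cluster. It is mainly intended for importing large amounts of data into Doris during the initial migration.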

[#3418] [#3712] [#3715] [#3716]

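RoutineLoad and StreamLoad support a new data format: JSON

Data in JSON format is imported into Doris through the transform rules in the load statement. This is especially useful for log services whose raw data format is JSON. Users no longer need to convert the data to CSV in an outer layer.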

[#3553]

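Modify routine load

The properties of a routine load job, such as concurrency and Kafka consumption progress, can be modified with the ALTER ROUTINE LOAD statement. Note that only jobs in the PAUSED state can be modified. After the properties are modified, the new settings take effect when the routine load job is resumed.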

[#4158]

Doris on ES

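  1. Support fetching _id from ES and creating tables with ES wildcard or alias indexes. The _id field of a native ES document is the primary key of the ES index, and Doris on ES can fetch this field. Doris also supports creating external tables on alias or wildcard indexes (for example log_*), so users can match and search all the relevant indexes easily.
  2. Refactored and enhanced the logic for reading ES metadata
  3. Added a docvalue limit for doc_values scans and enabled doc_values scans by default
  4. Ignore the _total node to improve efficiency, and fully trust the document count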

[#3900] [#3968]

Logstash plugin

The Logstash plugin outputs data from Logstash to Doris. It uses the HTTP protocol to interact with the Doris FE HTTP interface and loads the Logstash data through Doris's stream load.

[#3800]

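Support exporting query results to files

Doris currently supports exporting query results to third-party file systems such as HDFS, S3 and BOS. The syntax follows the MySQL syntax manual, and the export format is CSV. The exported results can be provided to other users for download or to other systems for further processing. This is helpful for queries whose result set is too large to return through the MySQL protocol, for example when bitmap_to_string yields a very large number of IDs.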

[#3584]

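Support IN predicates in the delete statement

The delete statement supports conditions with IN or NOT IN predicates. With this feature, users can delete rows that match several different values.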

[#4006]

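Enhancement

Compaction rules optimization

This optimization updates the strategy for triggering compaction. The new version merging strategy greatly reduces problems with write amplification, space amplification and read performance (it tends to merge files of adjacent sizes). For the same number of versions, fewer merges are performed and the total number of files is lower.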

[#4212]

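Simplify the delete process

The polling mechanism used during deletion is removed and replaced by a transaction callback, which reduces the response time of the delete command to the millisecond level.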

[#3191]

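Support simple transitivity in join predicate pushdown

When the columns in a query's filter predicate are the same as the columns in the join condition, the filter predicate can be propagated across the join and also filter the other table, which reduces the amount of data and speeds up the query.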

[#3453]

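Non-blocking OlapTableSink

In this optimization, the sending process and the row-adding process run concurrently in OlapTableSink, which improves load performance. In a test with a 56 GB broker load, the original version ran for 4 hours, and the new version halved that time.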

[#3143]

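Support transaction management at the database level and use ArrayDeque to improve transaction task performance

Transaction management is partitioned at the DB level, and DBs do not block one another, which improves the execution efficiency of transaction tasks.

[#3369]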

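Improve the performance of queries with IN predicates

A new configuration, max_pushdown_conditions_per_column, limits the number of conditions on a single column that can be pushed down to the storage engine. It is separate from the earlier configuration for splitting scan keys. Its default value is 1024. After the two configurations were separated, Doris's QPS improved and CPU usage decreased.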

[#3694]

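Optimized the speed of reading parquet files

A cache buffer is added to the broker read path for parquet files. When the broker needs to seek to a position and fetch data from a remote parquet file, it first tries to read that position from the cache buffer. If the expected data is in the cache buffer, no time is spent reading from the remote parquet file. In tests, the time to load parquet files through the broker was halved. This applies to both Spark load and Broker load.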

[#3878]

New Built-in Functions

  1. bitmap_intersect [#3571]
  2. orthogonal_bitmap_intersect in UDAF [#4198]
  3. orthogonal_bitmap_intersect_count in UDAF [#4198]
  4. orthogonal_bitmap_union_count in UDAF [#4198]

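API Change

  1. [SegmentV2] The default storage format for new tables is now Segment V2 (#4387)
  2. [License] Fixed some of Doris's current license issues (#4371). When compiling Doris, the MySQL client and LZO libraries are no longer used by default. If these dependencies are needed, enable them by changing the build configuration.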

Credits

@ZhangYu0123 @wfjcmcb @Fullstop000 @sduzh @stalary @worker24h @chaoyli @vagetablechicken @jmk1011 @funyeah @wutiangan @gengjun-git @xinghuayu007 @EmmyMiao87 @songenjie @acelyc111 @yangzhg @Seaven @hexian55 @ChenXiaoFei @WingsGo @kangpinghuang @wangbo @weizuo93 @sdgshawn @skyduy @wyb @gaodayue @HappenLee @kangkaisen @wuyunfeng @HangyuanLiu @xy720 @liutang123 @caiconghui @liyuance @spaces-X @hffariel @decster @blackfox1983 @Astralidea @morningman @hf200012 @xbyang18 @Youngwb @imay @marising @caoyang10

EmmyMiao87 commented 3 years ago

Apache incubator Doris 0.13 has been released. Welcome to try it~