apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

Release Note 1.2.0 #14461

Open morningman opened 1 year ago

morningman commented 1 year ago

[Chinese Version. See below]

Feature

Highlight

  1. Full vectorized engine support, greatly improved performance

    In the standard ssb-100-flat benchmark, 1.2 is 2 times faster than 1.1; in the more complex TPCH 100 benchmark, 1.2 is 3 times faster than 1.1.

  2. Merge-on-Write Unique Key

    Supports Merge-on-Write on the Unique Key model. This mode marks the data that needs to be deleted or updated at write time, which avoids the overhead of Merge-on-Read at query time and greatly improves read efficiency on the updatable data model.
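
As a minimal sketch of enabling this mode at table creation: the property name "enable_unique_key_merge_on_write" is taken from the Doris documentation as we understand it and is not stated in this note, and the database, table, and column names are hypothetical.

```sql
-- Hypothetical table; names, bucket count, and replication_num are illustrative only.
CREATE TABLE example_db.user_profile (
    user_id BIGINT NOT NULL,
    city VARCHAR(64),
    last_login DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    -- Assumed property name; check the CREATE TABLE documentation for your version.
    "enable_unique_key_merge_on_write" = "true"
);
```

Rows loaded later with the same user_id would then replace the earlier row at write time instead of being merged at query time.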

  3. Multi Catalog

    The multi-catalog feature gives Doris the ability to quickly connect to and access external data sources. Users can connect to an external data source with the CREATE CATALOG command, and Doris automatically maps the database and table information of that source. After that, users can access the data in these external sources just like ordinary tables, without having to manually create an external mapping for each table.

    Currently this feature supports the following data sources:

    1. Hive Metastore: You can access data tables including Hive, Iceberg, and Hudi. It can also be connected to data sources compatible with Hive Metastore, such as Alibaba Cloud's DataLake Formation. Supports data access on both HDFS and object storage.
    2. Elasticsearch: Access ES data sources.
    3. JDBC: Access MySQL through the JDBC protocol.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/multi-catalog

    Note: The corresponding permission level will also be changed automatically, see the "Upgrade Notes" section for details.
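
The following sketch shows what connecting a Hive Metastore catalog might look like. The property keys "type" and "hive.metastore.uris" follow the multi-catalog documentation as we understand it; the catalog name, metastore address, and the database and table names are hypothetical.

```sql
-- Hypothetical catalog name and metastore address.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

-- Databases and tables are mapped automatically, so no per-table mapping is needed.
SWITCH hive_catalog;
SHOW DATABASES;

-- Fully qualified access: catalog.database.table (names are placeholders).
SELECT * FROM hive_catalog.some_db.some_table LIMIT 10;
```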

  4. Light table structure changes

    In the new version, adding or dropping columns on a table no longer requires changing the data files synchronously; only the metadata in FE needs to be updated, so Schema Change operations complete at the millisecond level. This capability makes it possible to synchronize DDL from upstream CDC data. For example, users can use Flink CDC to synchronize both DML and DDL from an upstream database to Doris.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

    To enable it, set "light_schema_change"="true" in the table properties when creating the table.
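
A minimal sketch of the property in context, followed by the kind of column changes it is meant to speed up. Only "light_schema_change" comes from this note; the table, column names, and other properties are hypothetical.

```sql
-- Hypothetical table with light schema change enabled.
CREATE TABLE example_db.events (
    event_id BIGINT NOT NULL,
    event_time DATETIME,
    payload VARCHAR(256)
)
DUPLICATE KEY(event_id)
DISTRIBUTED BY HASH(event_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "light_schema_change" = "true"
);

-- Adding and dropping columns should then only update FE metadata.
ALTER TABLE example_db.events ADD COLUMN source VARCHAR(32) DEFAULT "unknown";
ALTER TABLE example_db.events DROP COLUMN payload;
```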

  5. JDBC external table

    Users can connect to external data sources through JDBC. Currently supported:

    • MySQL
    • PostgreSQL
    • Oracle
    • SQL Server
    • Clickhouse

    Documentation: https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/jdbc-of-doris/

    Note: The ODBC external table feature will be removed in a later version; please switch to JDBC external tables where possible.
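
For orientation, here is a rough sketch of a JDBC external table backed by MySQL. The resource and table property keys reflect our reading of the JDBC external table documentation for this release and should be verified there; every name, address, credential, and driver location below is a placeholder.

```sql
-- Assumed property keys; all values are placeholders.
CREATE EXTERNAL RESOURCE jdbc_mysql_resource
PROPERTIES (
    "type" = "jdbc",
    "user" = "example_user",
    "password" = "example_password",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/example_db",
    "driver_url" = "file:///path/to/mysql-connector-java.jar",
    "driver_class" = "com.mysql.jdbc.Driver"
);

-- Hypothetical external table mapped onto a MySQL table named "orders".
CREATE EXTERNAL TABLE example_db.orders_mysql (
    order_id BIGINT,
    amount DECIMAL(12, 2)
) ENGINE = JDBC
PROPERTIES (
    "resource" = "jdbc_mysql_resource",
    "table" = "orders",
    "table_type" = "mysql"
);

SELECT * FROM example_db.orders_mysql LIMIT 10;
```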

  6. Java UDF

    Supports writing UDFs and UDAFs in Java, making it convenient to use custom functions within the Java ecosystem. In addition, techniques such as off-heap memory and Zero Copy greatly improve the efficiency of cross-language data access.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/ecosystem/udf/java-user-defined-function

    Example: https://github.com/apache/doris/tree/master/samples/doris-demo
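
A sketch of registering and calling a Java UDF, modeled on the pattern in the Java UDF documentation as we understand it. The jar path, class name, and function name are placeholders, and the property keys should be checked against the documentation for your version.

```sql
-- The jar path and the "symbol" class are placeholders for your own build artifact.
CREATE FUNCTION java_udf_add_one(INT) RETURNS INT PROPERTIES (
    "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar",
    "symbol" = "org.apache.doris.udf.AddOne",
    "always_nullable" = "true",
    "type" = "JAVA_UDF"
);

-- Once registered, the function is called like a built-in function.
SELECT java_udf_add_one(1);
```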

  7. Remote UDF

    Supports accessing remote user-defined function services through RPC, which removes the language restrictions on writing UDFs. Users can implement custom functions in any programming language to carry out complex data analysis.

    Documentation: https://doris.apache.org/zh-CN/docs/ecosystem/udf/remote-user-defined-function

    Example: https://github.com/apache/doris/tree/master/samples/doris-demo

  8. More data type support

    • Array type

      Array types are supported, including nested arrays. In scenarios such as user profiles and tagging, the Array type fits business needs better. The new version also implements a large number of array-related functions to better support the type in real-world use.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/ARRAY

    Related functions: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/array-functions/array_max

    • Jsonb type

      Supports a binary JSON data type: Jsonb. This type provides a more compact JSON encoding and allows data to be accessed directly in the encoded format. Compared with JSON data stored as strings, it offers a several-fold performance improvement.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/JSONB

    Related functions: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/json-functions/jsonb_parse

    • Date V2

      Scope of impact:

      1. Users need to specify datev2 and datetimev2 explicitly when creating a table; the date and datetime columns of existing tables are not affected.
      2. When datev2 and datetimev2 are computed together with the original date and datetime types (for example, in an equi-join), the original type is cast to the new type for the computation.
      3. See the documentation for examples.

      Documentation: https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Types/DATEV2

More

  1. A new memory management framework

    Documentation: https://doris.apache.org/zh-CN/docs/dev/admin-manual/maint-monitor/memory-management/memory-tracker

  2. Table Valued Function

    Doris implements a set of Table Valued Functions (TVFs). A TVF can be treated as an ordinary table and can appear anywhere a table can appear in SQL.

    For example, we can use S3 TVF to implement data import on object storage:

    insert into tbl select * from s3("s3://bucket/file.*", "ak" = "xx", "sk" = "xxx") where c1 > 2;

    Or directly query data files on HDFS:

    select * from hdfs("hdfs://bucket/file.*") where c1 > 2;

    TVF can help users make full use of the rich expressiveness of SQL and flexibly process various data.

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/s3

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/hdfs

  3. A more convenient way to create partitions

    Supports creating multiple partitions within a time range in a single step using the FROM ... TO syntax.
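
As an illustrative sketch, the clause below should create one partition per month across the given range. The INTERVAL form follows the CREATE TABLE documentation as we understand it; the table and column names are hypothetical.

```sql
-- Hypothetical table; the FROM ... TO ... INTERVAL clause batch-creates range partitions.
CREATE TABLE example_db.daily_sales (
    sale_date DATE NOT NULL,
    store_id INT,
    amount DECIMAL(12, 2)
)
DUPLICATE KEY(sale_date)
PARTITION BY RANGE(sale_date) (
    FROM ("2022-01-01") TO ("2022-07-01") INTERVAL 1 MONTH
)
DISTRIBUTED BY HASH(store_id) BUCKETS 4
PROPERTIES ("replication_num" = "1");

-- Inspect the partitions that were generated.
SHOW PARTITIONS FROM example_db.daily_sales;
```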

  4. Column renaming

    For tables with Light Schema Change enabled, column renaming is supported.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-RENAME
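
A short sketch of the rename statement, assuming a hypothetical table example_db.events that was created with "light_schema_change"="true":

```sql
-- Rename column "payload" to "body"; table and column names are placeholders.
ALTER TABLE example_db.events RENAME COLUMN payload body;
```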

  5. Richer permission management

  6. Import

  7. Support viewing the contents of the catalog recycle bin with the SHOW CATALOG RECYCLE BIN statement.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Show-Statements/SHOW-CATALOG-RECYCLE-BIN

  8. Support SELECT * EXCEPT syntax.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/data-table/basic-usage
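
For illustration, assuming a hypothetical table example_db.users with a column password_hash that should be left out of the result:

```sql
-- Returns every column of the table except password_hash.
SELECT * EXCEPT(password_hash) FROM example_db.users;
```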

  9. OUTFILE supports exporting in ORC format and supports multi-byte delimiters.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE

  10. Support configuring the number of Query Profiles that can be saved.

    Search the documentation for the FE configuration item: max_query_profile_num

  11. The DELETE statement supports IN predicates and partition pruning.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/DELETE
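
A sketch of both forms, using a hypothetical partitioned table example_db.orders keyed on order_id. Whether a given column can appear in a DELETE condition depends on the table model, so treat the column choice as an assumption.

```sql
-- IN predicate with an explicit partition (partition name is a placeholder).
DELETE FROM example_db.orders PARTITION p202212
WHERE order_id IN (1001, 1002, 1003);

-- Without a PARTITION clause, partition pruning is expected to narrow the
-- affected partitions based on the condition on the partition column.
DELETE FROM example_db.orders
WHERE order_date = '2022-12-01' AND order_id IN (1001, 1002, 1003);
```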

  12. Time columns can use CURRENT_TIMESTAMP as their default value

    Search for "CURRENT_TIMESTAMP" in the documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE
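
A minimal sketch with a hypothetical table: the created_at column is filled with the load time whenever an insert omits it.

```sql
-- Hypothetical table; created_at defaults to the current timestamp.
CREATE TABLE example_db.audit_log (
    id BIGINT NOT NULL,
    message VARCHAR(256),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES ("replication_num" = "1");

-- created_at is not listed, so the default applies.
INSERT INTO example_db.audit_log (id, message) VALUES (1, 'first entry');
```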

  13. Add two system tables: backends, rowsets

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/backends

    https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/rowsets
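
As we understand the linked pages, both tables live in the information_schema database, so they can be queried like any other table. The sketch below selects all columns because the exact column set is not listed in this note.

```sql
-- Assumes the system tables are exposed under information_schema.
SELECT * FROM information_schema.backends;
SELECT * FROM information_schema.rowsets LIMIT 10;
```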

  14. Backup and restore

    • The Restore job supports the reserve_replica parameter, so that the number of replicas of the restored table is the same as that of the backup.

    • The Restore job supports reserve_dynamic_partition_enable parameter, so that the restored table keeps the dynamic partition enabled.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/RESTORE

    • Backup and restore operations can run through the built-in libhdfs and no longer depend on a broker.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/CREATE-REPOSITORY

  15. Support data balance between multiple disks on the same machine

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-REBALANCE-DISK

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-CANCEL-REBALANCE-DISK
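
A brief sketch of the two statements named in the links above. The backend address is a placeholder, and the ON clause form is our reading of the linked documentation.

```sql
-- Ask a specific backend to balance data across its disks (address is hypothetical).
ADMIN REBALANCE DISK ON ("192.168.1.1:9050");

-- Cancel the disk rebalance request.
ADMIN CANCEL REBALANCE DISK;
```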

  16. Routine Load supports subscribing to Kerberos-authenticated Kafka services.

    Search for kerberos in the documentation: https://doris.apache.org/zh-CN/docs/dev/data-operate/import/import-way/routine-load-manual

  17. New built-in functions

    Added the following built-in functions:

    • cbrt
    • sequence_match/sequence_count
    • mask/mask_first_n/mask_last_n
    • elt
    • any/any_value
    • group_bitmap_xor
    • ntile
    • nvl
    • uuid
    • initcap
    • regexp_replace_one/regexp_extract_all
    • multi_search_all_positions/multi_match_any
    • domain/domain_without_www/protocol
    • running_difference
    • bitmap_hash64
    • murmur_hash3_64
    • to_monday
    • not_null_or_empty
    • window_funnel
    • group_bit_and/group_bit_or/group_bit_xor
    • outer combine
    • and all array functions

Upgrade Notice

Known Issues

Behavior Changed

During Upgrade

  1. Upgrade preparation

    • The lib and bin directories must be replaced (the start/stop scripts have been modified).

    • JAVA_HOME must now also be configured for BE, which supports JDBC Table and Java UDF.

    • The default JVM Xmx parameter in fe.conf is changed to 8GB.

  2. Possible errors during the upgrade process

    • The repeat function cannot be used and reports the error: vectorized repeat function cannot be executed. You can turn off the vectorized execution engine before upgrading. (#13868)

    • Schema change fails with the error: desc_tbl is not set. Maybe the FE version is not equal to the BE (#13822)

    • Vectorized hash join cannot be used and reports the error: vectorized hash join cannot be executed. You can turn off the vectorized execution engine before upgrading. (#13753)

    The errors above go away once the upgrade is fully complete.
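
One way to turn off the vectorized execution engine before upgrading is through the session variable below. The variable name enable_vectorized_engine is an assumption based on earlier Doris releases and is not stated in this note, so confirm it against the documentation for the version you are upgrading from.

```sql
-- Assumed variable name; run before starting the rolling upgrade.
SET GLOBAL enable_vectorized_engine = false;

-- After every FE and BE has been upgraded, turn it back on.
SET GLOBAL enable_vectorized_engine = true;
```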

Performance Impact

API change

Big Thanks

Thanks to ALL who contributed to this release! (alphabetically)

@924060929 @a19920714liou @adonis0147 @Aiden-Dong @aiwenmo @AshinGau @b19mud @BePPPower @BiteTheDDDDt @bridgeDream @ByteYue @caiconghui @CalvinKirs @cambyzju @caoliang-web @carlvinhust2012 @catpineapple @ccoffline @chenlinzhong @chovy-3012 @coderjiang @cxzl25 @dataalive @dataroaring @dependabot[bot] @dinggege1024 @DongLiang-0 @Doris-Extras @eldenmoon @EmmyMiao87 @englefly @FreeOnePlus @Gabriel39 @gaodayue @geniusjoe @gj-zhang @gnehil @GoGoWen @HappenLee @hello-stephen @Henry2SS @hf200012 @huyuanfeng2018 @jacktengg @jackwener @jeffreys-cat @Jibing-Li @JNSimba @Kikyou1997 @Lchangliang @LemonLiTree @lexoning @liaoxin01 @lide-reed @link3280 @liutang123 @liuyaolin @LOVEGISER @lsy3993 @luozenglin @luzhijing @madongz @morningman @morningman-cmy @morrySnow @mrhhsg @Myasuka @myfjdthink @nextdreamblue @pan3793 @pangzhili @pengxiangyu @platoneko @qidaye @qzsee @SaintBacchus @SeekingYang @smallhibiscus @sohardforaname @song7788q @spaces-X @ssusieee @stalary @starocean999 @SWJTU-ZhangLei @TaoZex @timelxy @Wahno @wangbo @wangshuo128 @wangyf0555 @weizhengte @weizuo93 @wsjz @wunan1210 @xhmz @xiaokang @xiaokangguo @xinyiZzz @xy720 @yangzhg @Yankee24 @yeyudefeng @yiguolei @yinzhijian @yixiutt @yuanyuan8983 @Yulei-Yang @zbtzbtzbt @zenoyang @zhangboya1 @zhangstar333 @zhannngchen @ZHbamboo @zhengshiJ @zhenhb @zhqu1148980644 @zuochunwei @zy-kkk

morningman commented 1 year ago

Feature

Highlight

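  1. Full vectorization support, greatly improved performance

    In the standard ssb-100 wide-table benchmark, 1.2 is 2 times faster than 1.1; in the complex TPCH scenario, 1.2 is 3 times faster than 1.1.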

  2. Merge-on-Write Unique Key

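    The existing Unique Key data model now supports a Merge-on-Write update mode. This mode marks the data that needs to be deleted or updated at write time, which avoids the overhead of Merge-on-Read at query time and greatly improves read efficiency on the updatable data model.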

  3. Multi Catalog

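    The multi-catalog feature gives Doris the ability to quickly connect to and access external data sources. Users can connect to an external data source with the CREATE CATALOG command, and Doris automatically maps the database and table information of that source. After that, users can access the data in these external sources just like ordinary tables, without the previous complexity of manually creating an external table mapping for each table.

    Currently this feature supports the following data sources:

    1. Hive Metastore: Access tables including Hive, Iceberg, and Hudi. It can also connect to data sources compatible with Hive Metastore, such as Alibaba Cloud's DataLake Formation. Data on both HDFS and object storage is supported.
    2. Elasticsearch: Access ES data sources.
    3. JDBC: Access MySQL and other databases through the JDBC protocol.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/multi-catalog

    Note: The corresponding permission levels will also change automatically; see the "Upgrade Notes" section for details.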

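  4. Light schema change

    In the new version, adding or dropping columns on a table no longer requires changing the data files synchronously; only the metadata in FE needs to be updated, so Schema Change operations complete at the millisecond level. This capability makes it possible to synchronize DDL from upstream CDC data. For example, users can use Flink CDC to synchronize both DML and DDL from an upstream database to Doris.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

    To enable it, set "light_schema_change"="true" in the table properties when creating the table.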

  5. JDBC external table

    In the new version, users can connect to external data sources that support JDBC. Currently supported:

    • MySQL
    • PostgreSQL
    • Oracle
    • SQLServer
    • Clickhouse

    文档:https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/jdbc-of-doris/

    Note: The ODBC external table feature will be removed in a later version; please switch to JDBC external tables where possible.

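  6. Java UDF

    Supports writing UDFs and UDAFs in Java, making it convenient to use custom functions within the Java ecosystem. In addition, techniques such as off-heap memory and Zero Copy greatly improve the efficiency of cross-language data access.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/ecosystem/udf/java-user-defined-function

    Example: https://github.com/apache/doris/tree/master/samples/doris-demo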

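  7. Remote UDF

    Supports accessing remote user-defined function services through RPC, which removes the language restrictions on writing UDFs. Users can implement custom functions in any programming language to carry out complex data analysis.

    Documentation: https://doris.apache.org/zh-CN/docs/ecosystem/udf/remote-user-defined-function

    Example: https://github.com/apache/doris/tree/master/samples/doris-demo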

  8. More data type support

More

  1. A new memory management framework

    Documentation: https://github.com/apache/doris/blob/master/docs/zh-CN/docs/admin-manual/memory-management/memory-tracker.md

  2. Table Valued Function

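    Doris implements a set of Table Valued Functions (TVFs). A TVF can be treated as an ordinary table and can appear anywhere a table can appear in SQL.

    For example, we can use the S3 TVF to import data from object storage: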

    insert into tbl select * from s3("s3://bucket/file.*", "ak" = "xx", "sk" = "xxx") where c1 > 2;

    Or directly query data files on HDFS:

    select * from hdfs("hdfs://bucket/file.*") where c1 > 2;

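    TVFs help users take full advantage of the expressiveness of SQL and process all kinds of data flexibly.

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/s3

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/hdfs

  3. A more convenient way to create partitions

    Supports creating multiple partitions within a time range in a single step using the FROM ... TO syntax.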

  4. Column renaming

    For tables with Light Schema Change enabled, column renaming is supported.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-RENAME

  5. Richer permission management

  6. Import-related

  7. Support viewing the contents of the recycle bin with the SHOW CATALOG RECYCLE BIN statement.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Show-Statements/SHOW-CATALOG-RECYCLE-BIN

  8. Support SELECT * EXCEPT syntax.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/data-table/basic-usage

  9. OUTFILE supports exporting in ORC format and supports multi-byte delimiters.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE

  10. Support configuring the number of Query Profiles that can be saved.

    Search the documentation for the FE configuration item: max_query_profile_num

  11. The DELETE statement supports IN predicates and partition pruning.

    Documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/DELETE

  12. Time columns can use CURRENT_TIMESTAMP as their default value

    Search for "CURRENT_TIMESTAMP" in the documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

  13. Add two system tables: backends, rowsets

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/backends

    https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/rowsets

  14. Backup and restore

  15. Support data balancing between multiple disks on the same machine

    Documentation:

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-REBALANCE-DISK

    https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-CANCEL-REBALANCE-DISK

  16. Routine Load supports subscribing to Kerberos-authenticated Kafka services.

    Search for kerberos in the documentation: https://doris.apache.org/zh-CN/docs/dev/data-operate/import/import-way/routine-load-manual

  17. New built-in functions

    Added the following built-in functions:

    • cbrt
    • sequence_match/sequence_count
    • mask/mask_first_n/mask_last_n
    • elt
    • any/any_value
    • group_bitmap_xor
    • ntile
    • nvl
    • uuid
    • initcap
    • regexp_replace_one/regexp_extract_all
    • multi_search_all_positions/multi_match_any
    • domain/domain_without_www/protocol
    • running_difference
    • bitmap_hash64
    • murmur_hash3_64
    • to_monday
    • not_null_or_empty
    • window_funnel
    • group_bit_and/group_bit_or/group_bit_xor
    • outer combine

    and all array functions

Upgrade Notice

Known Issues

Compiling and running FE and BE with JDK 11 causes BE to crash occasionally. Please use JDK 8.

Behavior Changed

During Upgrade

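  1. Upgrade preparation

    • The lib and bin directories must be replaced (the start/stop scripts have been modified).

    • JAVA_HOME must now also be configured for BE, which supports JDBC Table and Java UDF.

    • The default JVM Xmx parameter in fe.conf is changed to 8GB.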

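  2. Possible errors during the upgrade process

    • The repeat function cannot be used and reports the error: vectorized repeat function cannot be executed. You can turn off the vectorized execution engine before upgrading. (#13868)

    • Schema change fails with the error: desc_tbl is not set. Maybe the FE version is not equal to the BE (#13822)

    • Vectorized hash join cannot be used and reports the error: vectorized hash join cannot be executed. You can turn off the vectorized execution engine before upgrading. (#13753)

    The errors above go away once the upgrade is fully complete.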

Performance Impact

API change