StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.17k stars 1.82k forks source link

StarRocks Roadmap 2024 #39686

Open Dshadowzh opened 10 months ago

Dshadowzh commented 10 months ago

Refer to roadmap 2023 2022

Shared-data & StarOS

Performance

Easy to use

Data lake analytics

Lake Query Insert DDL Update/Delete/Merge into MV
Hive 1.18 3.2     2.5
Iceberg 2.1 3.1 3.0
Hudi 2.2       3.0
Paimon 3.0       3.2
Delta lake 3.0       3.2

Data warehousing(batch and streaming)

Batch processing & ETL improvement

All-in-one scenarios

Release

arsenalzp commented 10 months ago

Hello, Any chance to have good-first-issue feature among those tasks?

Dshadowzh commented 10 months ago

Hello, Any chance to have good-first-issue feature among those tasks?

Welcome, you can check this https://github.com/StarRocks/starrocks/issues/13300 first, we'll update more good-first-issues in 2024 later. Particularly regarding external catalog and connectors.

Zhangg7723 commented 10 months ago

How about incremental refresh materialized view for external table like Iceberg or Hudi? I think this feature can reduce the cost of refresh mv

Dshadowzh commented 10 months ago

How about incremental refresh materialized view for external table like Iceberg or Hudi? I think this feature can reduce the cost of refresh mv

Yes. We are considering about it, there is a Incremental refresh for iceberg/Hudi/Paimon materialized view. By the way, Iceberg and Hudi, which do you perfer?

Zhangg7723 commented 10 months ago

How about incremental refresh materialized view for external table like Iceberg or Hudi? I think this feature can reduce the cost of refresh mv

Yes. We are considering about it, there is a Incremental refresh for iceberg/Hudi/Paimon materialized view. By the way, Iceberg and Hudi, which do you perfer?

We prefer Iceberg, for better interface design and less bugs. incremental snapshot refresh is useful for non-partition table.

MatthewH00 commented 9 months ago

On yesterday 2024 roadmap meeting,mention that will support tag on BE in shared nothing mode,it like multi warehouse mechanism like in shared data mode,could split into load data warehouse\adhoc query warehouse\ETL warehouse...? And when will release?

Dshadowzh commented 9 months ago

On yesterday 2024 roadmap meeting,mention that will support tag on BE in shared nothing mode,it like multi warehouse mechanism like in shared data mode,could split into load data warehouse\adhoc query warehouse\ETL warehouse...? And when will release?

https://github.com/StarRocks/starrocks/pull/38833 It has finished already, will be published in the next version.

trikker commented 9 months ago

On yesterday 2024 roadmap meeting,mention that will support tag on BE in shared nothing mode,it like multi warehouse mechanism like in shared data mode,could split into load data warehouse\adhoc query warehouse\ETL warehouse...? And when will release?

38833 It has finished already, will be published in the next version.

I think there are different issues. Multi-warehouse enables different users to see different machines so as to get resource isolation at machine level. https://github.com/StarRocks/starrocks/pull/38833 is about the replica location. In share-nothing deployment the data is all on HDFS/S3, we don't have replicas but we still need multi-warehouse cabability to isolate different machines to different resource group.

trikker commented 9 months ago

Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:

  1. Datalake Query (1) support more types of catalogs, like oracle, tidb, oceanbase and greenplum(don't know if it is fully compatible with postgresql) (2) support catalog metadata cache for JDBC MySQL, JDBC postgresql and the above catalogs (3) support automatic sample and histogram statistic collection for Hive, Iceberg

  2. Materialized View (1) support rewrite for SQLs with UNION, ORDER BY and LIMIT (2) support rewrite for nested-aggregation SQLs, aggregation-then-join SQLs and related-subquery SQLs with one MV (3) enable view based mv rewrite to rewrite a query without needing user to query the view (4) support incremental refresh of MV(like flink stream computing), rather than refresh a whole table or partition (5) support materiailized view recommendation

  3. Query Plan (1) support query plan cache and plan binding for SQLs, like Oracle

  4. Resource Management (1) support CPU real hard limit for resource group, currently it is actually a soft limit; (2) suport multi-warehouse for users or resource group, different users can only see only different part of the machines when executing queries

  5. Stability (1) memory spill still don't work for some SQLs when enable_spill is true and spill_mode is force, see issue: https://github.com/StarRocks/starrocks/issues/40936

chengyi3192 commented 9 months ago

Expected to support deletion queries in the Parquet format of Iceberg

Dshadowzh commented 9 months ago

Expected to support deletion queries in the Parquet format of Iceberg

We plan implementing it in v3.3

Dshadowzh commented 9 months ago

Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve JDBC catalog for more databases. It's a community driven project, It's welcome if you want to participate and we'll list some detail about this project later.
  2. New statistics collection framework for hive and iceberg is WIP
  3. View-based mv rewrite is already finished(v3.3).You can create some independent issue for UNION, ORDER BY, LIMIT and nested-aggregation SQLs rewrite. We have some work arounds for these cases, if you have some specific scenarios, we'd like to prioritize these issues.
  4. Query plan cache is a good idea, if you have some basic design, we'd like to discuss in details.
  5. Hard limit for resource group and multi-warehouse are in our roadmap.
trikker commented 9 months ago

Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve JDBC catalog for more databases. It's a community driven project, It's welcome if you want to participate and we'll list some detail about this project later.
  2. New statistics collection framework for hive and iceberg is WIP
  3. View-based mv rewrite is already finished(v3.3).You can create some independent issue for UNION, ORDER BY, LIMIT and nested-aggregation SQLs rewrite. We have some work arounds for these cases, if you have some specific scenarios, we'd like to prioritize these issues.
  4. Query plan cache is a good idea, if you have some basic design, we'd like to discuss in details.
  5. Hard limit for resource group and multi-warehouse are in our roadmap.

OK, hope we have a detailed communication later. Happy Spring Festival! ^_^

inviscid commented 9 months ago

Top priorities for our ability to migrate to StarRocks from Greenplum. These two issues:

Plus, Time Travel natively in StarRocks without using an external Data Lake.

The 2024 Roadmap looks great. Hoping we can get migrated so we can contribute to the journey.

ericsun2 commented 8 months ago

This roadmap is exciting.

Let's collaborate on supporting Databricks Unity Catalog as well as MAP & STRUCT types as well. Thanks a million.

motto1314 commented 8 months ago

Connector: Directly reading data files instead of reading from BE.

Do you have a more detailed plan or issue for this feature?

thanks.

melin commented 7 months ago

Materialized view support kafka source && db cdc,Reduce dependencies on external components such as kafka connect, flink cdc, etc

Using starrocks as the lake warehouse is to build a lightweight data platform. If you rely on flink to import data in real time, flink needs to run on yarn or k8s, it will no longer be lightweight. Refer to redshift or risingwave to complete real-time lake entry through mv and advocate the concept of NOETL. Is a very attractive feature.

We have a lot of offshore business customers using aws. if user need to write redshift from kafka, use mv。ie: https://aws.amazon.com/cn/blogs/china/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/

chulucninh09 commented 7 months ago

Any plan to support Arrow flight SQL protocol for better data transportation? As data engineer & data scientist, we expect less overhead of converting from SQL (currently Mysql) protocol to pandas/pyarrow table

giovannibonetti commented 7 months ago

Thanks for the explanation, @Dshadowzh. I have just a question about this part:

3. View-based mv rewrite is already finished(v3.3).You can create some independent issue for UNION, ORDER BY...

I think the documentation is confusing in this regard.

On one hand, the limitations section of the Query rewrite with materialized views page says:

Limitations In terms of materialized view-based query rewrite, StarRocks currently has the following limitations: ...

  • Materialized views defined with statements containing LIMIT, ORDER BY... cannot be used for query rewrite. ...

On the other hand, the Set extra sort keys section of the Synchronous materialized view page says:

Suppose that the base table tableA contains columns k1, k2 and k3, where only k1 and k2 are sort keys. If the query including the sub-query where k3=x must be accelerated, you can create a synchronous materialized view with k3 as the first column.

CREATE MATERIALIZED VIEW k3_as_key AS
SELECT k3, k2, k1
FROM tableA

If I understood it correctly, ORDER BY k3, k2, k1 is implicit in the materialized view definition. So, it seems like ORDER BY is already working for materialized view query rewriting, isn't it?

malthe commented 4 months ago

@chulucninh09 there's an open issue tracking support for Apache Arrow Flight SQL, but it would be great to get it on the 2024 roadmap.

Interestingly, Apache Doris has this supported since 2.1, released on March 8th 2024.

Dshadowzh commented 4 months ago

@chulucninh09 there's an open issue tracking support for Apache Arrow Flight SQL, but it would be great to get it on the 2024 roadmap.

Interestingly, Apache Doris has this supported since 2.1, released on March 8th 2024.

We'll add this to the roadmap and launch the project soon. Thank you for your attention.

gaigaikuaipao commented 1 month ago

Is this year not the time for the functional development plan of a high-speed data transmission link based on Arrow Flight SQL。

Dshadowzh commented 3 weeks ago

@gaigaikuaipao Yes, we want to release it in 3.4. You can follow this PR https://github.com/StarRocks/starrocks/pull/50199