Dshadowzh commented 10 months ago

Refer to the 2023 and 2022 roadmaps.

Shared-data & StarOS

Align all functionality with shared-nothing
- [ ] Sync materialized view
- [x] Generated column
- [x] Partial update with column mode
- [x] Optimize table and manual compaction
Better cache system
- [ ] Multi-layer cache
- [ ] Global cache
- [x] Cache warmup
- [ ] Cache black/whitelist
- [x] Refine evict algorithm
StarOS internal optimization
- [ ] Multi-replicas for shard management
- [ ] Shard schedule optimization for large scale (more than 10M shards)
- [ ] Local storage for StarOS
[ ] Decoupled storage for FE
[ ] Open API for StarRocks table format (sink and source)
[ ] Time Travel
[ ] Backup support

Performance

[x] Full columnar JSON index (Flat JSON)
[x] Cost model with primary key and foreign key constraints
[x] Arm optimization for codecs
[ ] Adaptive DOP and adaptive query engine
[ ] Global dictionary encoding
[ ] Enhance IO schedule framework
[x] JIT / Codegen
[ ] Fine-grained FE lock (from DB level to table level)

Easy to use

[x] Online optimize table
[x] List partition optimization
[ ] Arrow flight interface https://github.com/StarRocks/starrocks/issues/22944
Improve files table function
- [x] Improve schema inference
- [x] CSV and json format support
- [ ] Other format: Avro, Arrow, Protobuf
- [ ] Better performance for read, predicates pushdown
Insert statement improvement (on duplicate key, insert properties)
Unified data ingestion with Pipe
- [ ] Pipe for continuous ingestion from Kafka
- [ ] Read from external stream table(Kafka)
- [ ] Continuous data ingestion from SQS with Pipe
[ ] Out-of-the-box parameters

Data lake analytics

Better file format support
- [x] Parquet reader tuning
- [ ] ORC reader tuning
Better table format support

| Lake | Query | Insert | MV |
| --- | --- | --- | --- |
| Hive | 1.18 | 3.2 | 2.5 |
| Iceberg | 2.1 | 3.1 | 3.0 |
| Hudi | 2.2 | | 3.0 |
| Paimon | 3.0 | | 3.2 |
| Delta lake | 3.0 | | 3.2 |

[x] Iceberg metadata optimization https://github.com/StarRocks/starrocks/issues/43460
Materialized view improvement
- [x] Improve partition mapping (list partition, expression partition)
- [ ] Task scheduler DAG & Lineage
- [x] Better query rewrite
[x] JDBC catalog improvement
[ ] Enhance JNI reader and implement JNI writer
[ ] Text File format support
[ ] Presto/Trino/Spark/Hive SQL compatibility
[ ] Presto/Trino/Spark/Hive UDF compatibility
[ ] Automatic cooldown to lake format

Data warehousing(batch and streaming)

Batch processing & ETL improvement

[x] Enable spilling to GA
[ ] Multi-statement transaction
[x] Temporary table
[x] Group execution https://github.com/StarRocks/starrocks/pull/42352
[ ] Task auto retry
Streaming processing & real-time update
[ ] Schemaless partial update
[ ] Merge into statement
[ ] Binlog to flink and spark streaming
[ ] Transaction level incremental refresh in materialized view (Aggregation, Join, functions)
[ ] Incremental refresh for iceberg/Hudi/Paimon materialized view

All-in-one scenarios

[x] Search: Optimize full-text inverted index (inverted_index)
[x] Row store: Optimize row store for high-concurrency point lookups (hybrid row-column store)
[ ] Time series DB: ASOF join, high-concurrency ingestion
[x] Vector database: vector index https://github.com/StarRocks/starrocks/issues/46678

Release

arsenalzp commented 10 months ago

Hello, Any chance to have good-first-issue feature among those tasks?

Dshadowzh commented 10 months ago

> Hello, Any chance to have good-first-issue feature among those tasks?

Welcome! You can check https://github.com/StarRocks/starrocks/issues/13300 first. We'll add more good-first-issues later in 2024, particularly for external catalogs and connectors.

Zhangg7723 commented 10 months ago

How about incremental refresh of materialized views on external tables such as Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Dshadowzh commented 10 months ago

> How about incremental refresh of materialized views on external tables such as Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Yes, we are considering it; the roadmap includes an item, "Incremental refresh for iceberg/Hudi/Paimon materialized view". By the way, which do you prefer, Iceberg or Hudi?

Zhangg7723 commented 10 months ago

> > How about incremental refresh of materialized views on external tables such as Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.
>
> Yes, we are considering it; the roadmap includes an item, "Incremental refresh for iceberg/Hudi/Paimon materialized view". By the way, which do you prefer, Iceberg or Hudi?

We prefer Iceberg, for its better interface design and fewer bugs. Incremental snapshot refresh is useful for non-partitioned tables.
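
To illustrate why incremental refresh can cost less than a full refresh, here is a minimal Python sketch. It is not StarRocks code: the `full_refresh` and `incremental_refresh` helpers and the row layout are hypothetical, and it assumes an append-only source where each new snapshot only adds rows and the view is a simple SUM aggregate.

```python
from collections import defaultdict

def full_refresh(all_rows):
    """Recompute SUM(amount) GROUP BY key from every row in the table."""
    totals = defaultdict(int)
    for key, amount in all_rows:
        totals[key] += amount
    return dict(totals)

def incremental_refresh(totals, new_rows):
    """Fold only the rows added since the last snapshot into the existing result."""
    updated = dict(totals)
    for key, amount in new_rows:
        updated[key] = updated.get(key, 0) + amount
    return updated

# Hypothetical data: snapshot 1 of the source table, then rows appended by snapshot 2.
snapshot_1 = [("a", 10), ("b", 5)]
appended = [("a", 1), ("c", 7)]

mv = full_refresh(snapshot_1)           # initial build reads the whole table
mv = incremental_refresh(mv, appended)  # later refresh reads only the appended rows
```

Under these assumptions the incremental result equals a full recomputation over both snapshots, while the refresh touches only the appended rows. Deletes, updates, and non-additive aggregates would need more machinery than this sketch shows.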

MatthewH00 commented 9 months ago

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode. Is this like the multi-warehouse mechanism in shared-data mode, so that BEs could be split into a data-loading warehouse, an ad hoc query warehouse, an ETL warehouse, and so on? And when will it be released?

Dshadowzh commented 9 months ago

> At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode. Is this like the multi-warehouse mechanism in shared-data mode, so that BEs could be split into a data-loading warehouse, an ad hoc query warehouse, an ETL warehouse, and so on? And when will it be released?

https://github.com/StarRocks/starrocks/pull/38833 It is already finished and will be published in the next version.

trikker commented 9 months ago

> > At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode. Is this like the multi-warehouse mechanism in shared-data mode, so that BEs could be split into a data-loading warehouse, an ad hoc query warehouse, an ETL warehouse, and so on? And when will it be released?
>
> https://github.com/StarRocks/starrocks/pull/38833 It is already finished and will be published in the next version.

I think these are different issues. Multi-warehouse lets different users see different machines, which gives resource isolation at the machine level. https://github.com/StarRocks/starrocks/pull/38833 is about replica location. In a shared-nothing deployment the data is all on HDFS/S3; we don't have replicas, but we still need multi-warehouse capability to isolate different machines into different resource groups.

trikker commented 9 months ago

Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:

Datalake Query (1) support more types of catalogs, such as Oracle, TiDB, OceanBase and Greenplum (I don't know whether it is fully compatible with PostgreSQL) (2) support catalog metadata cache for JDBC MySQL, JDBC PostgreSQL and the above catalogs (3) support automatic sample and histogram statistics collection for Hive and Iceberg
Materialized View (1) support rewrite for SQLs with UNION, ORDER BY and LIMIT (2) support rewrite for nested-aggregation SQLs, aggregation-then-join SQLs and correlated-subquery SQLs with one MV (3) enable view-based MV rewrite to rewrite a query without requiring the user to query the view (4) support incremental refresh of MVs (like Flink stream computing), rather than refreshing a whole table or partition (5) support materialized view recommendation
Query Plan (1) support query plan cache and plan binding for SQLs, like Oracle
Resource Management (1) support a real CPU hard limit for resource groups; currently it is actually a soft limit (2) support multi-warehouse for users or resource groups, so that different users see only different parts of the machines when executing queries
Stability (1) memory spill still doesn't work for some SQLs when enable_spill is true and spill_mode is force, see issue: https://github.com/StarRocks/starrocks/issues/40936

chengyi3192 commented 9 months ago

Expected to support deletion queries in the Parquet format of Iceberg

Dshadowzh commented 9 months ago

> Expected to support deletion queries in the Parquet format of Iceberg

We plan to implement it in v3.3.

Dshadowzh commented 9 months ago

> Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

1. We want to improve the JDBC catalog to support more databases. It's a community-driven project, so you are welcome to participate; we'll list more details about this project later.
2. A new statistics collection framework for Hive and Iceberg is WIP.
3. View-based MV rewrite is already finished (v3.3). You can create separate issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrite. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize these issues.
4. Query plan cache is a good idea. If you have a basic design, we'd like to discuss it in detail.
5. Hard limit for resource groups and multi-warehouse are on our roadmap.

trikker commented 9 months ago

> > Thanks, the following are our wanted features and improvements based on my tests on StarRocks and my company's business scenarios:
>
> Thanks a lot for the extensive feedback:
>
> 1. We want to improve the JDBC catalog to support more databases. It's a community-driven project, so you are welcome to participate; we'll list more details about this project later.
> 2. A new statistics collection framework for Hive and Iceberg is WIP.
> 3. View-based MV rewrite is already finished (v3.3). You can create separate issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrite. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize these issues.
> 4. Query plan cache is a good idea. If you have a basic design, we'd like to discuss it in detail.
> 5. Hard limit for resource groups and multi-warehouse are on our roadmap.

OK, hope we have a detailed communication later. Happy Spring Festival! ^_^

inviscid commented 9 months ago

These two issues are the top priorities for our ability to migrate from Greenplum to StarRocks:

Special Character Handling: https://github.com/StarRocks/starrocks/issues/38854
Case Sensitive Column Names: https://github.com/StarRocks/starrocks/issues/40292

Plus, Time Travel natively in StarRocks without using an external Data Lake.

The 2024 Roadmap looks great. Hoping we can get migrated so we can contribute to the journey.

ericsun2 commented 8 months ago

This roadmap is exciting.

Let's collaborate on supporting Databricks Unity Catalog as well as MAP & STRUCT types. Thanks a million.

motto1314 commented 8 months ago

Connector: Directly reading data files instead of reading from BE.

Do you have a more detailed plan or issue for this feature?

thanks.

melin commented 7 months ago

Materialized views should support Kafka sources and DB CDC, to reduce dependencies on external components such as Kafka Connect, Flink CDC, etc.

Using StarRocks as the lakehouse is meant to build a lightweight data platform. If you rely on Flink to import data in real time, Flink needs to run on YARN or K8s, and the platform is no longer lightweight. Following Redshift or RisingWave, real-time ingestion into the lake could be done through MVs, promoting the NoETL concept. It would be a very attractive feature.

We have many overseas business customers using AWS. If a user needs to write from Kafka into Redshift, they use an MV, e.g.: https://aws.amazon.com/cn/blogs/china/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/

chulucninh09 commented 7 months ago

Is there any plan to support the Arrow Flight SQL protocol for better data transport? As data engineers and data scientists, we would like less overhead when converting from the SQL protocol (currently MySQL) to pandas/pyarrow tables.

giovannibonetti commented 7 months ago

Thanks for the explanation, @Dshadowzh. I have just a question about this part:

> 3. View-based mv rewrite is already finished(v3.3).You can create some independent issue for UNION, ORDER BY...

I think the documentation is confusing in this regard.

On one hand, the limitations section of the Query rewrite with materialized views page says:

Limitations In terms of materialized view-based query rewrite, StarRocks currently has the following limitations: ...

Materialized views defined with statements containing LIMIT, ORDER BY... cannot be used for query rewrite. ...

On the other hand, the Set extra sort keys section of the Synchronous materialized view page says:

Suppose that the base table tableA contains columns k1, k2 and k3, where only k1 and k2 are sort keys. If the query including the sub-query where k3=x must be accelerated, you can create a synchronous materialized view with k3 as the first column.
CREATE MATERIALIZED VIEW k3_as_key AS
SELECT k3, k2, k1
FROM tableA

If I understood it correctly, ORDER BY k3, k2, k1 is implicit in the materialized view definition. So, it seems like ORDER BY is already working for materialized view query rewriting, isn't it?
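
As a rough illustration of the sort-key idea in the quoted documentation, here is a small Python model. It is not StarRocks internals: the `rows` data and the `scan_k3` and `lookup_k3` helpers are hypothetical, and the sketch only assumes that data stored in sort-key order can be searched by a leading key. It says nothing about whether query rewrite honors an explicit ORDER BY.

```python
from bisect import bisect_left, bisect_right

# Hypothetical base table rows as (k1, k2, k3), stored sorted by (k1, k2).
rows = [(1, 1, 9), (1, 2, 3), (2, 1, 3), (2, 2, 7)]

def scan_k3(table, x):
    """Base table: k3 does not lead the sort order, so every row is checked."""
    return [r for r in table if r[2] == x]

# Stand-in for the view: the same rows reordered as (k3, k2, k1) and sorted.
view = sorted((k3, k2, k1) for k1, k2, k3 in rows)

def lookup_k3(view, x):
    """View: k3 leads the sort order, so matching rows form one contiguous range."""
    # The key list is rebuilt per call for simplicity; a real engine would keep an index.
    keys = [r[0] for r in view]
    lo, hi = bisect_left(keys, x), bisect_right(keys, x)
    return [(k1, k2, k3) for k3, k2, k1 in view[lo:hi]]
```

Both functions return the same set of rows for a given `x`; the difference is that the view lookup narrows the search with a binary search on its leading column instead of filtering every row.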

malthe commented 4 months ago

@chulucninh09 there's an open issue tracking support for Apache Arrow Flight SQL, but it would be great to get it on the 2024 roadmap.

Interestingly, Apache Doris has this supported since 2.1, released on March 8th 2024.

Dshadowzh commented 4 months ago

> @chulucninh09 there's an open issue tracking support for Apache Arrow Flight SQL, but it would be great to get it on the 2024 roadmap.
>
> Interestingly, Apache Doris has this supported since 2.1, released on March 8th 2024.

We'll add this to the roadmap and launch the project soon. Thank you for your attention.

gaigaikuaipao commented 1 month ago

Is the high-speed data transfer link based on Arrow Flight SQL not part of this year's development plan?

Dshadowzh commented 3 weeks ago

@gaigaikuaipao Yes, we want to release it in 3.4. You can follow this PR https://github.com/StarRocks/starrocks/pull/50199

StarRocks Roadmap 2024 #39686
