Dshadowzh opened this issue 10 months ago
Hello, Any chance to have good-first-issue feature among those tasks?
Welcome! You can check https://github.com/StarRocks/starrocks/issues/13300 first; we'll publish more good-first-issues later in 2024, particularly around the external catalog and connectors.
How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.
Yes, we are considering it; there is an issue for incremental refresh of Iceberg/Hudi/Paimon materialized views. By the way, which do you prefer, Iceberg or Hudi?
We prefer Iceberg, for its better interface design and fewer bugs. Incremental snapshot refresh is useful for non-partitioned tables.
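For context, an asynchronous MV over an external Iceberg table is created roughly like this today (a hedged sketch based on the documented `CREATE MATERIALIZED VIEW ... REFRESH ASYNC` syntax; the catalog, database, table, and column names are illustrative). Each refresh currently re-reads whole tables or partitions, which is the cost that incremental snapshot refresh would avoid:

```sql
-- Sketch: scheduled async refresh over an Iceberg external catalog table.
-- Today a refresh rescans the affected partitions; incremental snapshot
-- refresh would instead process only the data added between snapshots.
CREATE MATERIALIZED VIEW orders_daily_mv
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT order_date, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
FROM iceberg_catalog.sales_db.orders
GROUP BY order_date;
```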
In yesterday's 2024 roadmap meeting, it was mentioned that BE tags will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse, an ad-hoc query warehouse, an ETL warehouse, and so on? And when will it be released?
https://github.com/StarRocks/starrocks/pull/38833 It is already finished and will be published in the next version.
I think these are different issues. Multi-warehouse lets different users see different machines, providing resource isolation at the machine level. https://github.com/StarRocks/starrocks/pull/38833 is about replica location. In a shared-data deployment the data is all on HDFS/S3 and there are no replicas, but we still need the multi-warehouse capability to isolate different machines into different resource groups.
Thanks. The following are the features and improvements we want, based on my tests of StarRocks and my company's business scenarios:
Datalake Query
(1) Support more types of catalogs, such as Oracle, TiDB, OceanBase, and Greenplum (not sure if it is fully compatible with PostgreSQL).
(2) Support catalog metadata caching for JDBC MySQL, JDBC PostgreSQL, and the catalogs above.
(3) Support automatic sampling and histogram statistics collection for Hive and Iceberg.

Materialized View
(1) Support rewrite for SQL with UNION, ORDER BY, and LIMIT.
(2) Support rewriting nested-aggregation SQL, aggregate-then-join SQL, and correlated-subquery SQL with one MV.
(3) Enable view-based MV rewrite so a query can be rewritten without the user having to query the view.
(4) Support incremental refresh of MVs (like Flink stream computing), rather than refreshing a whole table or partition.
(5) Support materialized view recommendation.

Query Plan
(1) Support query plan cache and plan binding for SQL, like Oracle.

Resource Management
(1) Support a real hard CPU limit for resource groups; currently it is actually a soft limit.
(2) Support multi-warehouse for users or resource groups, so that different users can only see different subsets of the machines when executing queries.

Stability
(1) Memory spill still doesn't work for some SQL when enable_spill is true and spill_mode is force; see issue: https://github.com/StarRocks/starrocks/issues/40936
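For reference, the soft CPU limit mentioned under Resource Management is configured through resource groups today. A hedged sketch based on the documented `CREATE RESOURCE GROUP` syntax (the group and user names, and the limit values, are illustrative):

```sql
-- cpu_core_limit today is a soft limit: the group is only throttled
-- when CPU is contended, which is why a real hard limit is requested.
CREATE RESOURCE GROUP etl_group
TO (user = 'etl_user', query_type IN ('select'))
WITH (
    'cpu_core_limit' = '8',
    'mem_limit'      = '30%'
);
```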
Hoping for support for delete queries on Iceberg tables in the Parquet format.
We plan to implement it in v3.3.
Thanks a lot for the extensive feedback:
- We want to improve the JDBC catalog to cover more databases. It's a community-driven project; you're welcome to participate, and we'll publish more details about it later.
- A new statistics collection framework for Hive and Iceberg is WIP.
- View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY, LIMIT, and nested-aggregation SQL rewrite. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize those issues.
- Query plan cache is a good idea; if you have a basic design, we'd like to discuss the details.
- A hard limit for resource groups and multi-warehouse are on our roadmap.
OK, I hope we can have a detailed discussion later. Happy Spring Festival! ^_^
These two issues are the top priorities for our ability to migrate from Greenplum to StarRocks:
Plus, Time Travel natively in StarRocks without using an external Data Lake.
The 2024 Roadmap looks great. Hoping we can get migrated so we can contribute to the journey.
This roadmap is exciting.
Let's collaborate on supporting Databricks Unity Catalog, as well as the MAP & STRUCT types. Thanks a million.
Connector: directly reading data files instead of reading through the BE.
Do you have a more detailed plan or issue for this feature?
Thanks.
Materialized views should support Kafka sources and DB CDC, reducing dependencies on external components such as Kafka Connect, Flink CDC, etc.
Using StarRocks as the lakehouse is about building a lightweight data platform. If you rely on Flink to ingest data in real time, Flink needs to run on YARN or Kubernetes, and the platform is no longer lightweight. Refer to Redshift or RisingWave, which do real-time lake ingestion through MVs and advocate the NoETL concept. This would be a very attractive feature.
We have a lot of offshore customers using AWS. When a user needs to write to Redshift from Kafka, they use an MV, i.e.: https://aws.amazon.com/cn/blogs/china/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/
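For illustration, the Redshift pattern referenced above looks roughly like this (a sketch based on Redshift's documented streaming ingestion for MSK; the schema, topic, and role ARNs are illustrative and the ARNs are elided):

```sql
-- Map a Kafka (MSK) cluster to an external schema.
CREATE EXTERNAL SCHEMA kafka_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::...'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:...';

-- A materialized view over the topic becomes the ingestion pipeline:
-- refreshing it pulls new records, with no Flink or Kafka Connect involved.
CREATE MATERIALIZED VIEW clicks_mv AUTO REFRESH YES AS
SELECT kafka_partition,
       kafka_offset,
       refresh_time,
       JSON_PARSE(kafka_value) AS payload
FROM kafka_schema."clicks_topic";
```

This is the NoETL shape the comment above asks StarRocks to adopt: the MV definition itself carries the stream-to-table mapping.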
Any plan to support the Arrow Flight SQL protocol for better data transport? As data engineers & data scientists, we expect less overhead converting from the SQL (currently MySQL) protocol to pandas/pyarrow tables.
Thanks for the explanation, @Dshadowzh. I have just a question about this part:
3. View-based mv rewrite is already finished(v3.3).You can create some independent issue for UNION, ORDER BY...
I think the documentation is confusing in this regard.
On one hand, the limitations section of the Query rewrite with materialized views page says:
Limitations In terms of materialized view-based query rewrite, StarRocks currently has the following limitations: ...
- Materialized views defined with statements containing LIMIT, ORDER BY... cannot be used for query rewrite. ...
On the other hand, the Set extra sort keys section of the Synchronous materialized view page says:
Suppose that the base table tableA contains columns k1, k2 and k3, where only k1 and k2 are sort keys. If the query including the sub-query where k3=x must be accelerated, you can create a synchronous materialized view with k3 as the first column.
CREATE MATERIALIZED VIEW k3_as_key AS SELECT k3, k2, k1 FROM tableA;
If I understood it correctly, ORDER BY k3, k2, k1 is implicit in the materialized view definition. So it seems like ORDER BY is already working for materialized view query rewriting, isn't it?
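To make the docs example concrete: the synchronous MV changes the sort key rather than adding an ORDER BY clause to the query result, so a filter on k3 can be rewritten against it (a sketch; the literal value 100 is illustrative):

```sql
-- With k3 as the MV's first (sort-key) column, this filter can be
-- answered from k3_as_key instead of scanning tableA by k1, k2.
SELECT k1, k2
FROM tableA
WHERE k3 = 100;
```

That is the distinction the two docs pages draw: an implicit sort key in the MV is used for rewrite, while an explicit ORDER BY in the MV's defining statement is what the limitations section rules out.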
@chulucninh09 there's an open issue tracking support for Apache Arrow Flight SQL, but it would be great to get it on the 2024 roadmap.
Interestingly, Apache Doris has this supported since 2.1, released on March 8th 2024.
We'll add this to the roadmap and launch the project soon. Thank you for your attention.
Is there no plan this year to develop a high-speed data transfer link based on Arrow Flight SQL?
@gaigaikuaipao Yes, we want to release it in 3.4. You can follow this PR https://github.com/StarRocks/starrocks/pull/50199
- Shared-data & StarOS
- Performance
- Easy to use
- files table function
- Data lake analytics
- Data warehousing (batch and streaming)
- Batch processing & ETL improvement
- Streaming processing & real-time update
- All-in-one scenarios
- Release