Routine Load for Iceberg tables

StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.

https://starrocks.io

Apache License 2.0


Routine Load for Iceberg tables #49956

Open Samrose-Ahmed opened 1 month ago

Samrose-Ahmed commented 1 month ago

Feature request

Support Routine Load for loading data into Iceberg tables.

Is your feature request related to a problem? Please describe.

Use StarRocks directly to write to Iceberg from Kafka, without having to use Kafka Connect or a separate write fleet.

Describe the solution you'd like

Reuse the Routine Load infrastructure and adapt it for Iceberg tables.

Describe alternatives you've considered

Additional context

jaogoy commented 1 month ago

It would be good to have this implemented. But if every batch is too small, there will be too many versions, and therefore query performance on Iceberg tables will not be good, IMO.

Also, can you share your scenario with me? Do you just want data lake analytics, where query performance does not need to be at the second level?

Samrose-Ahmed commented 1 month ago

Yes, you need to avoid committing too often. I think commit intervals of around 1 to 5 minutes are reasonable (that range is often used as the checkpoint interval with Flink/Iceberg).
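To make the idea concrete, here is a minimal sketch of interval-gated committing: records are buffered and flushed as one batch only once the commit interval has elapsed. `IntervalCommitter`, `commit_fn`, and `clock` are hypothetical names for illustration; this is not the StarRocks Routine Load implementation or an Iceberg API.

```python
import time
from typing import Callable, List


class IntervalCommitter:
    """Buffers records and commits at most once per interval (hypothetical helper)."""

    def __init__(
        self,
        commit_interval_s: float,
        commit_fn: Callable[[List[dict]], None],  # stand-in for a real table commit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if commit_interval_s <= 0:
            raise ValueError("commit_interval_s must be positive")
        self.commit_interval_s = commit_interval_s
        self.commit_fn = commit_fn
        self.clock = clock
        self._buffer: List[dict] = []
        self._last_commit = clock()

    def add(self, record: dict) -> bool:
        """Buffer one record; return True if this call triggered a commit."""
        self._buffer.append(record)
        if self.clock() - self._last_commit >= self.commit_interval_s:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Commit everything buffered so far as a single batch."""
        if self._buffer:
            self.commit_fn(list(self._buffer))
            self._buffer.clear()
        self._last_commit = self.clock()
```

With `commit_interval_s=60`, every record that arrives within the same minute lands in one batch, so the table gets one commit per minute instead of one per record or per small Kafka poll.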

Second-level intervals are not necessary and would generate too many files with Iceberg. In general, data and metadata files get compacted away, so a few new files don't affect performance much as long as the commit interval is reasonable.
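As a rough back-of-the-envelope check, the number of commits per day follows directly from the commit interval. The sketch below assumes a steady stream and a hypothetical `files_per_commit` value; real file counts depend on partitioning and writer parallelism.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(commit_interval_s: float) -> int:
    """Number of commits in one day for a fixed commit interval."""
    if commit_interval_s <= 0:
        raise ValueError("commit_interval_s must be positive")
    return int(SECONDS_PER_DAY // commit_interval_s)


def files_per_day(commit_interval_s: float, files_per_commit: int = 1) -> int:
    """Data files written per day; files_per_commit is an assumed constant."""
    return commits_per_day(commit_interval_s) * files_per_commit


for label, interval in [("1 s", 1), ("1 min", 60), ("5 min", 300)]:
    print(f"{label:>5}: {commits_per_day(interval):>6} commits/day")
```

This gives 86400 commits per day at a 1 second interval, versus 1440 at 1 minute and 288 at 5 minutes, which is the gap between second-level and minute-level commits that compaction has to absorb.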