StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.28k stars 1.67k forks source link

Optimize Iceberg table count #46525

Open Samrose-Ahmed opened 1 month ago

Samrose-Ahmed commented 1 month ago

Enhancement

One can obtain the count(*) for an iceberg table from the Iceberg metadata without having to do a full scan of the data. Currently, Starrocks performs a full scan of the Iceberg Table data when doing a count(*) query on external Iceberg lake table. This should be optimized to just use the Iceberg metadata (this is already available via the statistics).

E.g.

StarRocks > explain select count(*) as cnt from tbl1;
+-------------------------------------------+
| Explain String                            |
+-------------------------------------------+
| PLAN FRAGMENT 0                           |
|  OUTPUT EXPRS:19: count                   |
|   PARTITION: UNPARTITIONED                |
|                                           |
|   RESULT SINK                             |
|                                           |
|   4:AGGREGATE (merge finalize)            |
|   |  output: count(19: count)             |
|   |  group by:                            |
|   |                                       |
|   3:EXCHANGE                              |
|                                           |
| PLAN FRAGMENT 1                           |
|  OUTPUT EXPRS:                            |
|   PARTITION: RANDOM                       |
|                                           |
|   STREAM DATA SINK                        |
|     EXCHANGE ID: 03                       |
|     UNPARTITIONED                         |
|                                           |
|   2:AGGREGATE (update serialize)          |
|   |  output: count(*)                     |
|   |  group by:                            |
|   |                                       |
|   1:Project                               |
|   |  <slot 21> : 1                        |
|   |                                       |
|   0:IcebergScanNode                       |
|      TABLE: iceberg.db.tbl1 |
|      cardinality=13219153                 |
|      avgRowSize=2.0                       |
+-------------------------------------------+

The cardinality in the IcebergScanNode already has the result it does not need to perform any scan.

Samrose-Ahmed commented 1 month ago

Related: https://github.com/StarRocks/starrocks/issues/44387 and https://github.com/StarRocks/starrocks/issues/43460