StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

Bitmap indexes on subcolumns of Flat JSON. #52187

Open bhaskarshashank99 opened 1 day ago

bhaskarshashank99 commented 1 day ago

Feature request

The attached document, "Bitmap indexes on flat json sub-columns.txt", proposes enabling bitmap indexes on subcolumns of a JSON column by extending the Flat JSON feature in version 3.3.4 and later.

Describe the solution you'd like

By treating the subcolumns identified by the Flat JSON feature as regular columns, we can build bitmap indexes on them. The attached document highlights a similar solution and the current first-level blockers that need to be fixed.
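To make the idea concrete, here is a minimal Python sketch of a bitmap index built over one subcolumn extracted from JSON rows. It is only an illustration under simplifying assumptions, not StarRocks code: the function names `build_subcolumn_bitmap` and `matching_rows` are hypothetical, the bitmaps are plain Python integers, and the sample rows are made up.

```python
import json
from collections import defaultdict


def build_subcolumn_bitmap(rows, key):
    """Map each distinct value of `key` to an integer bitmask of row ids.

    Rows that lack `key` do not appear in any bitmap.
    """
    index = defaultdict(int)
    for row_id, raw in enumerate(rows):
        doc = json.loads(raw)
        if key in doc:
            index[doc[key]] |= 1 << row_id  # set the bit for this row
    return dict(index)


def matching_rows(index, value):
    """Return the row ids whose bit is set in the bitmap for `value`."""
    bits = index.get(value, 0)
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Made-up sample data: one JSON document per row.
rows = ['{"name": "a", "age": 1}', '{"name": "b"}', '{"name": "a"}']
name_index = build_subcolumn_bitmap(rows, "name")
print(matching_rows(name_index, "a"))  # [0, 2]
```

An equality predicate on the subcolumn then becomes a single bitmap lookup instead of parsing every JSON value.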

Seaven commented 9 hours ago

This is a good suggestion, but I think adding bitmap indexes to sub-columns of flat JSON is quite complex:

  1. Bitmap indexes are currently designed for whole columns, in both the FE metadata design and the BE implementation. Supporting sub-columns may therefore require a new bitmap index design.
  2. In the current design of flat JSON, the FE can't perceive the sub-columns of flat JSON, and the sub-columns may differ from one segment file to another. For example, some files may have a "name" sub-column while others may not, which makes creating bitmap indexes significantly more complex. Compared to flat JSON, struct is simpler in this scenario because the schema of a struct is fixed and known to the FE.
  3. Regarding predicates on flat JSON, we have found that evaluating complex-type expression predicates in the segment iterator doesn't necessarily improve performance, because executing them requires pre-reading a large amount of unnecessary complex-type data. In contrast, executing simple data type predicates first can avoid a lot of unnecessary data access. Of course, this assumes that complex types don't support bitmap indexes, as is currently the case.

So this needs a very detailed discussion if we want to support bitmap indexes on subcolumns.
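Point 2 can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the BE implementation: `Segment`, `build_segment`, and `filter_segment` are hypothetical names, each segment decides on its own which keys are flattened, and a lookup falls back to a scan when the segment has no index for the requested sub-column.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a segment file; not a StarRocks structure."""
    rows: list  # each row is a dict parsed from the JSON column
    # sub-column name -> {value: set of row ids}
    bitmap_indexes: dict = field(default_factory=dict)


def build_segment(rows, flattened_keys):
    """Build a segment, indexing only the keys flattened in this segment."""
    seg = Segment(rows=rows)
    for key in flattened_keys:
        index = {}
        for row_id, doc in enumerate(rows):
            if key in doc:
                index.setdefault(doc[key], set()).add(row_id)
        seg.bitmap_indexes[key] = index
    return seg


def filter_segment(seg, key, value):
    """Return (matching row ids, access path used: 'index' or 'scan')."""
    index = seg.bitmap_indexes.get(key)
    if index is not None:
        return sorted(index.get(value, set())), "index"
    # No index for this sub-column in this segment: scan every row.
    hits = [i for i, doc in enumerate(seg.rows) if doc.get(key) == value]
    return hits, "scan"


# Two segments of the same table with different flattened sub-columns.
seg_a = build_segment([{"name": "x", "id": 1}, {"name": "y", "id": 2}], ["name"])
seg_b = build_segment([{"id": 3}, {"id": 4, "name": "x"}], ["id"])

print(filter_segment(seg_a, "name", "x"))  # ([0], 'index')
print(filter_segment(seg_b, "name", "x"))  # ([1], 'scan')
```

The same predicate takes a different path in each segment, which is the kind of per-file variation a table-level index definition in the FE would have to account for.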