apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support no forward index for column #6473

Closed kkrugler closed 1 year ago

kkrugler commented 3 years ago

Currently a text column can be created without any forward index, which is useful when using the column only for filtering. In this situation, the raw (original) text data is not needed, only the text index (see https://github.com/apache/incubator-pinot/pull/6284/).

There are other situations for non-text columns where this same functionality is useful to reduce the size of the column. In our particular use case, we're generating unique terms for a (large) string field, which we save as a multi-value STRING column. We need an inverted index for fast filtering, but we do not need the forward index, which (leaving aside the inverted index, which is built at load time) accounts for more than 80% of the total segment size.

@kishoreg suggested "having a empty forward Index reader impl" as a way of implementing this.
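
A minimal sketch of that idea, in Python rather than Pinot's Java and using made-up class names (`EmptyForwardIndexReader`, `FilterOnlyColumn`, `ForwardIndexDisabledError` are all hypothetical, not Pinot APIs). It only illustrates the shape: the column keeps a dictionary and an inverted index, filtering never touches the forward index, and any attempt to read raw values fails loudly.

```python
from collections import defaultdict


class ForwardIndexDisabledError(Exception):
    """Raised when a caller asks for raw values of a column with no forward index."""


class EmptyForwardIndexReader:
    """Hypothetical placeholder reader: load paths get an object to hold,
    but every read fails instead of returning data."""

    def get_dict_ids(self, doc_id):
        raise ForwardIndexDisabledError(
            f"forward index is disabled; cannot read doc {doc_id}")


class FilterOnlyColumn:
    """Multi-value string column that keeps only a dictionary and an inverted index."""

    def __init__(self, rows):
        # Sorted dictionary: term -> dict ID.
        terms = sorted({term for row in rows for term in row})
        self.dictionary = {term: i for i, term in enumerate(terms)}
        # Inverted index: dict ID -> sorted list of doc IDs containing that term.
        postings = defaultdict(set)
        for doc_id, row in enumerate(rows):
            for term in row:
                postings[self.dictionary[term]].add(doc_id)
        self.inverted = {d: sorted(docs) for d, docs in postings.items()}
        self.forward = EmptyForwardIndexReader()

    def docs_matching(self, term):
        """Answer an equality filter using only the dictionary and inverted index."""
        dict_id = self.dictionary.get(term)
        return [] if dict_id is None else self.inverted[dict_id]


col = FilterOnlyColumn([["red", "shoe"], ["blue", "shoe"], ["red", "hat"]])
shoe_docs = col.docs_matching("shoe")  # doc IDs 0 and 1
```

Raising on read, rather than returning empty values, is one possible design choice here: it makes a query that tries to select or group by the column fail visibly instead of silently returning wrong results.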

We could possibly handle the configuration of this via a new noFwdIndexColumns table config field, similar to the noDictionaryColumns config setting.

There would be situations where specifying no forward index for a column would trigger a table config error, for example doing this for a metrics column (or so I assume).
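
To make the config idea and the error cases concrete, here is a rough sketch of what such a check could look like. `noFwdIndexColumns` is only the field name proposed above, the function name is hypothetical, and the three rules (reject metric columns, require an inverted index, require a dictionary) are assumptions for illustration rather than Pinot's actual validation logic. The schema is simplified to a plain mapping from column name to field type.

```python
def validate_no_fwd_index_columns(table_config, schema):
    """Return a list of error strings; an empty list means the config passed.

    `noFwdIndexColumns` is the proposed (not existing) field; the rules
    below are illustrative assumptions.
    """
    indexing = table_config.get("tableIndexConfig", {})
    no_fwd = indexing.get("noFwdIndexColumns", [])
    inverted = set(indexing.get("invertedIndexColumns", []))
    no_dict = set(indexing.get("noDictionaryColumns", []))

    errors = []
    for col in no_fwd:
        field_type = schema.get(col)
        if field_type is None:
            errors.append(f"{col}: not present in schema")
            continue
        if field_type == "METRIC":
            errors.append(f"{col}: metric columns need a forward index")
        if col not in inverted:
            errors.append(f"{col}: no inverted index, column could not be filtered")
        if col in no_dict:
            errors.append(f"{col}: no dictionary, inverted index cannot be built")
    return errors


schema = {"creativeText_terms": "DIMENSION", "clicks": "METRIC"}
table_config = {
    "tableIndexConfig": {
        "invertedIndexColumns": ["creativeText_terms"],
        "noFwdIndexColumns": ["creativeText_terms", "clicks"],
    }
}
problems = validate_no_fwd_index_columns(table_config, schema)
```

With this input the dimension column passes and the metric column is reported twice, once for being a metric and once for lacking an inverted index.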

I'm also not sure whether it would be valid to have a column that has no index/dictionary/forward index; does this mean ignore the field in the input data?

siddharthteotia commented 3 years ago

What's the size of the forward index for the multi-value column? Dictionary IDs in the forward index are bit-encoded. It looks like the column has very high cardinality, and the segment must have several million rows for the forward index to add that much size overhead.

kkrugler commented 3 years ago

Hi @siddharthteotia - yes, one example segment is 2,637,935 rows, and metadata.properties for the column of interest (creativeText_terms) has cardinality of 48,591 (though that's lower than what I was expecting).

column.creativeText_terms.cardinality = 48591
column.creativeText_terms.totalDocs = 2637935
column.creativeText_terms.dataType = STRING
column.creativeText_terms.bitsPerElement = 16
column.creativeText_terms.lengthOfEachEntry = 60
column.creativeText_terms.columnType = DIMENSION
column.creativeText_terms.isSorted = false
column.creativeText_terms.hasNullValue = false
column.creativeText_terms.hasDictionary = true
column.creativeText_terms.textIndexType = NONE
column.creativeText_terms.hasInvertedIndex = true
column.creativeText_terms.hasFSTIndex = false
column.creativeText_terms.hasJsonIndex = false
column.creativeText_terms.isSingleValues = false
column.creativeText_terms.maxNumberOfMultiValues = 49
column.creativeText_terms.totalNumberOfEntries = 14628086
column.creativeText_terms.isAutoGenerated = false
column.creativeText_terms.minValue = 0.01
column.creativeText_terms.maxValue = \u1EE9ng
column.creativeText_terms.defaultNullValue = null

The dictionary is 2.9MB, and the forward index is 31MB:

creativeText_terms.dictionary.startOffset = 1648876
creativeText_terms.dictionary.size = 2915468
creativeText_terms.forward_index.startOffset = 4564344
creativeText_terms.forward_index.size = 31110427
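
As a sanity check on these numbers, the sketch below recomputes the size of the bit-packed dictionary IDs from the metadata above. The helper names are made up for illustration, and the fixed-width packing model is a simplification; what the leftover bytes hold (for example per-document offsets for the multi-value layout) is an assumption, not something stated in the thread.

```python
def bits_per_element(cardinality):
    """Bits needed to encode dict IDs 0..cardinality-1 at a fixed width."""
    return max(1, (cardinality - 1).bit_length())


def packed_dict_id_bytes(total_entries, cardinality):
    """Bytes for the bit-packed dict IDs alone, rounded up to a whole byte."""
    total_bits = total_entries * bits_per_element(cardinality)
    return (total_bits + 7) // 8


# Values copied from metadata.properties and the index map above.
cardinality = 48591
total_entries = 14628086
reported_forward_index_size = 31110427

bits = bits_per_element(cardinality)                        # 16
packed = packed_dict_id_bytes(total_entries, cardinality)   # 29256172
remainder = reported_forward_index_size - packed            # 1854255
```

The computed width matches the reported bitsPerElement of 16, and the packed IDs alone come to about 29.3MB of the 31MB forward index, so the size is driven by the 14.6 million entries rather than by the dictionary.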

siddharthteotia commented 2 years ago

Related issue https://github.com/apache/pinot/issues/7870

siddharthteotia commented 2 years ago

@somandal is working on this.

somandal commented 2 years ago

I'm going to start working on this

siddharthteotia commented 2 years ago

Part 1 to add support for skipping forward index (during segment generation) and making all other code paths (load, query processing) aware of it has been merged in https://github.com/apache/pinot/pull/9333

Subsequent PRs will focus on changes to support regeneration of forward index from dict and inverted index and toggling this feature.
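
For readers unfamiliar with the regeneration idea, a simplified sketch follows. It is not the code from any of the PRs; `rebuild_forward_index` is a hypothetical name and the data is a toy example. It inverts the postings (dict ID to doc IDs) back into per-document lists of dict IDs.

```python
def rebuild_forward_index(inverted, num_docs):
    """Invert dict ID -> doc IDs postings into doc ID -> sorted dict IDs."""
    forward = [[] for _ in range(num_docs)]
    # Visiting dict IDs in ascending order leaves each doc's list sorted.
    for dict_id in sorted(inverted):
        for doc_id in inverted[dict_id]:
            forward[doc_id].append(dict_id)
    return forward


# Toy multi-value column; doc 0 originally held ["shoe", "red", "red"].
dictionary = ["blue", "hat", "red", "shoe"]
original_rows = [["shoe", "red", "red"], ["blue", "shoe"], ["hat"]]
inverted = {0: [1], 1: [2], 2: [0], 3: [0, 1]}

forward = rebuild_forward_index(inverted, num_docs=3)
rebuilt_rows = [[dictionary[d] for d in ids] for ids in forward]
```

In this toy model the rebuilt doc 0 is ["red", "shoe"]: the set of values survives, but the original ordering within the document and the repeated "red" do not, because a postings list records only which documents contain a value.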

somandal commented 2 years ago

Here's a document which discusses the reload problem and how to solve it for forwardIndexDisabled columns. Please take a look and leave your feedback. cc @Jackie-Jiang @siddharthteotia @vvivekiyer

Just a note that a few details still need to be figured out and I will update the document as and when we figure them out.

siddharthteotia commented 2 years ago

User docs - https://docs.pinot.apache.org/basics/indexing/forward-index#disabling-the-forward-index (thanks @somandal)

siddharthteotia commented 2 years ago

Part 2 to disable / delete forward index for an existing column on the reload path has been merged in

https://github.com/apache/pinot/pull/9740

Part 3 will regenerate (re-enable) the forward index for an existing column on the reload path, using the dictionary and inverted index.

FYI - @walterddr @Jackie-Jiang

siddharthteotia commented 1 year ago

With the latest PR getting merged, support for the following is completed

Support for derived columns and duplicates is still pending and will be done in follow-ups as needed.

siddharthteotia commented 1 year ago

@somandal - I think you may want to update user docs and open follow up issues for the pending work and link here.

somandal commented 1 year ago

Opened issues: https://github.com/apache/pinot/issues/9972 and https://github.com/apache/pinot/issues/9973

somandal commented 1 year ago

User docs updated: https://docs.pinot.apache.org/basics/indexing/forward-index#disabling-the-forward-index