Closed kkrugler closed 1 year ago
What's the size of the forward index for the multi-value column? Dictionary IDs in the forward index are bit-encoded. It looks like the column has very high cardinality, and there must be several million rows per segment to result in a sizeable overhead.
Hi @siddharthteotia - yes, one example segment is 2,637,935 rows, and metadata.properties for the column of interest (creativeText_terms) has cardinality of 48,591 (though that's lower than what I was expecting).
column.creativeText_terms.cardinality = 48591
column.creativeText_terms.totalDocs = 2637935
column.creativeText_terms.dataType = STRING
column.creativeText_terms.bitsPerElement = 16
column.creativeText_terms.lengthOfEachEntry = 60
column.creativeText_terms.columnType = DIMENSION
column.creativeText_terms.isSorted = false
column.creativeText_terms.hasNullValue = false
column.creativeText_terms.hasDictionary = true
column.creativeText_terms.textIndexType = NONE
column.creativeText_terms.hasInvertedIndex = true
column.creativeText_terms.hasFSTIndex = false
column.creativeText_terms.hasJsonIndex = false
column.creativeText_terms.isSingleValues = false
column.creativeText_terms.maxNumberOfMultiValues = 49
column.creativeText_terms.totalNumberOfEntries = 14628086
column.creativeText_terms.isAutoGenerated = false
column.creativeText_terms.minValue = 0.01
column.creativeText_terms.maxValue = \u1EE9ng
column.creativeText_terms.defaultNullValue = null
The dictionary is 2.9MB, and the forward index is 31MB:
creativeText_terms.dictionary.startOffset = 1648876
creativeText_terms.dictionary.size = 2915468
creativeText_terms.forward_index.startOffset = 4564344
creativeText_terms.forward_index.size = 31110427
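As a rough sanity check on those numbers, the sketch below recomputes the sizes from the metadata values quoted above. It assumes the forward index is dominated by bit-packed dictionary IDs and that the dictionary stores one fixed-width padded entry per distinct value; the function names are only for illustration, and whatever is left over (per-document offsets, headers) is not modelled.

```python
# Values copied from the metadata.properties and index map excerpts above.
CARDINALITY = 48_591
TOTAL_ENTRIES = 14_628_086
LENGTH_OF_EACH_ENTRY = 60
DICTIONARY_SIZE = 2_915_468
FORWARD_INDEX_SIZE = 31_110_427


def bits_per_element(cardinality: int) -> int:
    """Smallest bit width able to hold every dictionary ID in 0..cardinality-1."""
    return max(1, (cardinality - 1).bit_length())


def packed_ids_bytes(total_entries: int, bits: int) -> int:
    """Bytes needed for the bit-packed dictionary IDs alone (rounded up)."""
    return (total_entries * bits + 7) // 8


bits = bits_per_element(CARDINALITY)
ids_bytes = packed_ids_bytes(TOTAL_ENTRIES, bits)
dict_values_bytes = CARDINALITY * LENGTH_OF_EACH_ENTRY

print("bits per element:", bits)
print("packed dictionary IDs:", ids_bytes, "bytes")
print("share of forward index:", round(ids_bytes / FORWARD_INDEX_SIZE, 3))
print("unexplained forward index bytes:", FORWARD_INDEX_SIZE - ids_bytes)
print("dictionary values:", dict_values_bytes, "bytes")
print("unexplained dictionary bytes:", DICTIONARY_SIZE - dict_values_bytes)
```

Under these assumptions the packed IDs account for roughly 94% of the forward index, so the size follows almost directly from totalNumberOfEntries times bitsPerElement rather than from the cardinality itself.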
Related issue https://github.com/apache/pinot/issues/7870
@somandal is working on this.
I'm going to start working on this
Part 1 to add support for skipping forward index (during segment generation) and making all other code paths (load, query processing) aware of it has been merged in https://github.com/apache/pinot/pull/9333
Subsequent PRs will focus on changes to support regeneration of forward index from dict and inverted index and toggling this feature.
Here's a document which discusses the reload problem and how to solve it for forwardIndexDisabled columns. Please take a look and leave your feedback. cc @Jackie-Jiang @siddharthteotia @vvivekiyer
Just a note that a few details still need to be figured out and I will update the document as and when we figure them out.
User docs - https://docs.pinot.apache.org/basics/indexing/forward-index#disabling-the-forward-index (thanks @somandal)
Part 2 to disable / delete forward index for an existing column on the reload path has been merged in https://github.com/apache/pinot/pull/9740
Part 3 will be to regenerate / re-enable the forward index for an existing column on the reload path using the dictionary and inverted index.
FYI - @walterddr @Jackie-Jiang
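To illustrate the idea behind Part 3, here is a minimal sketch of rebuilding per-document values from a dictionary and an inverted index. It is not the Pinot implementation; the data structures (a plain list as the dictionary, a dict of posting lists as the inverted index) and the function name are assumptions made for the example.

```python
from typing import Dict, List


def rebuild_forward_index(inverted_index: Dict[int, List[int]],
                          num_docs: int) -> List[List[int]]:
    """Invert posting lists (dict ID -> doc IDs) into doc ID -> dict IDs."""
    forward: List[List[int]] = [[] for _ in range(num_docs)]
    # Walk dict IDs in ascending order so each document's IDs come out sorted.
    for dict_id in sorted(inverted_index):
        for doc_id in inverted_index[dict_id]:
            forward[doc_id].append(dict_id)
    return forward


# Toy multi-value column: the dictionary is sorted, list index = dict ID.
dictionary = ["ad", "buy", "now"]
original_docs = [["buy", "now", "buy"], ["ad"], ["now", "ad"]]

# Build posting lists the way a bitmap-style inverted index would:
# a doc ID is recorded at most once per value.
inverted: Dict[int, List[int]] = {}
for doc_id, values in enumerate(original_docs):
    for value in set(values):
        inverted.setdefault(dictionary.index(value), []).append(doc_id)

rebuilt = [
    [dictionary[dict_id] for dict_id in dict_ids]
    for dict_ids in rebuild_forward_index(inverted, len(original_docs))
]
print(rebuilt)
```

In this toy model the first document comes back as buy, now rather than buy, now, buy: the inverted index records only that a value occurs in a document, so duplicates within a multi-value entry and the original value order cannot be recovered.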
With the latest PR getting merged, support for the following is completed
Support for derived columns and duplicates is still pending and will be handled in follow-ups as needed.
@somandal - I think you may want to update user docs and open follow up issues for the pending work and link here.
Currently a text column can be created without any forward index, which is useful when using the column only for filtering. In this situation, the raw (original) text data is not needed, only the text index (see https://github.com/apache/incubator-pinot/pull/6284/).
There are other situations for non-text columns where this same functionality is useful to reduce the size of the column. In our particular use case, we're generating unique terms for a (large) string field, which we save as a multi-value STRING column. We need an inverted index for fast filtering, but we do not need the forward index, which (leaving aside the inverted index, which is built at load time) accounts for more than 80% of the total segment size.
@kishoreg suggested "having a empty forward Index reader impl" as a way of implementing this.
We could possibly handle the configuration of this via a new noFwdIndexColumns table config field, similar to the noDictionaryColumns config setting. There would be situations where specifying no forward index for a column would trigger a table config error, for example doing this for a metrics column (or so I assume).
I'm also not sure whether it would be valid to have a column that has no index/dictionary/forward index; does this mean ignore the field in the input data?
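A minimal sketch of what such a config check could look like is below. The noFwdIndexColumns field is the hypothetical name proposed above, the config is flattened into a plain dict for brevity, and the rules (reject metric columns, require an inverted index so the column is still usable for filtering) are assumptions for illustration rather than settled behaviour.

```python
from typing import Dict, List


def validate_no_fwd_index_columns(table_config: Dict[str, List[str]],
                                  column_types: Dict[str, str]) -> List[str]:
    """Return a list of config errors; an empty list means the config passes."""
    errors: List[str] = []
    inverted = set(table_config.get("invertedIndexColumns", []))
    # "noFwdIndexColumns" is the hypothetical field proposed in this issue.
    for column in table_config.get("noFwdIndexColumns", []):
        column_type = column_types.get(column)
        if column_type is None:
            errors.append(f"{column}: not present in the schema")
        elif column_type == "METRIC":
            # Assumed rule: metric values are read back, so they need a forward index.
            errors.append(f"{column}: metric columns need a forward index")
        elif column not in inverted:
            # Assumed rule: with no forward index and no inverted index,
            # nothing could serve the column at query time.
            errors.append(f"{column}: no forward index and no inverted index")
    return errors


config = {
    "noFwdIndexColumns": ["creativeText_terms", "clicks"],
    "invertedIndexColumns": ["creativeText_terms"],
}
types = {"creativeText_terms": "DIMENSION", "clicks": "METRIC"}
errors = validate_no_fwd_index_columns(config, types)
print(errors)
```

Treating a column with no usable index as an error is only one possible answer to the question above; the alternative would be to accept the config and simply ignore the field at ingestion time.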