Currently, for computeValueIds, VectorHasher consumes the entire input: either it successfully hashes all values, or, if an unhashable value is encountered, it still consumes the rest of the input in "analyzeValue" mode, so that a workable new hash mode can subsequently be picked and the input rehashed.
In the "fail to hash, analyze the remaining values" code path, the current implementation is not optimal because it calls analyzeValue on every value of the input vector. The optimal algorithm analyzes the remaining values of the input only until all unique values of the inner-most vector ("decoded.base()") have been visited.
In low-cardinality, large-batch-size workloads, this yields an Nx improvement, where N = B / C for cardinality C and batch size B.
In all other cases, this optimization is no worse than the current implementation.
Real-life workload example:
An age column (say 0 to 80 years old) of a large population (say 20 million). Even with a batch size of 1024, this optimization results in a 12.8x (1024 / 80) speed-up of computeValueIds.
@mbasmanova @oerling