Open kagamiori opened 1 year ago
A more general solution could be to introduce a cap on the amount of memory an operator can use for caching (VectorPool and similar). The operator can then decide how to use that budget.
A more practical solution, though, might be to introduce the cap per Task rather than per operator.
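A per-Task cap could be sketched roughly as a byte budget shared by every operator in the task. This is only an illustration: TaskCacheBudget, tryReserve, and release are hypothetical names, not part of the library.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical byte budget shared by all operator-level caches in one Task.
// Not an existing library class; names are illustrative only.
class TaskCacheBudget {
 public:
  explicit TaskCacheBudget(std::size_t maxBytes) : maxBytes_(maxBytes) {}

  // Tries to reserve 'bytes' for caching. Returns false if that would
  // exceed the task-wide cap, in which case the caller should not cache.
  bool tryReserve(std::size_t bytes) {
    std::size_t used = usedBytes_.load();
    while (true) {
      if (bytes > maxBytes_ - used) {
        return false;
      }
      // On failure, 'used' is refreshed with the current value and we retry.
      if (usedBytes_.compare_exchange_weak(used, used + bytes)) {
        return true;
      }
    }
  }

  // Returns previously reserved bytes when a cached vector is reused or freed.
  void release(std::size_t bytes) { usedBytes_.fetch_sub(bytes); }

  std::size_t usedBytes() const { return usedBytes_.load(); }

 private:
  const std::size_t maxBytes_;
  std::atomic<std::size_t> usedBytes_{0};
};
```

Each operator's cache would call tryReserve before keeping a vector and release when handing it back out, so one busy operator can use most of the budget while idle ones use none.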
CC: @xiaoxmeng
@xiaoxmeng and I discussed a few possibilities.
Maintain currentByte in the VectorPool and update it with vector->estimateFlatSize() whenever a vector is added or removed. However, estimateFlatSize() has its own overhead, which can diminish the savings from VectorPool.
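A minimal sketch of that bookkeeping, using toy types rather than the real vector classes. ToyStringVector, estimateBytes, and ByteTrackingPool are assumptions made for illustration. The sketch stores the estimate next to each cached vector, so the size walk runs once on insertion and not again on removal; that is one possible way to limit the overhead, not something the discussion settled on.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a string vector; not the library's class.
struct ToyStringVector {
  std::vector<std::string> values;

  // Walks every row, so the cost grows with the number of rows.
  std::size_t estimateBytes() const {
    std::size_t bytes = values.size() * sizeof(std::string);
    for (const auto& value : values) {
      bytes += value.size();
    }
    return bytes;
  }
};

// Toy pool that keeps a running total of the bytes it caches.
class ByteTrackingPool {
 public:
  void release(std::unique_ptr<ToyStringVector> vector) {
    const std::size_t bytes = vector->estimateBytes();  // computed once
    currentBytes_ += bytes;
    entries_.push_back({std::move(vector), bytes});
  }

  // Returns a cached vector, or nullptr when the pool is empty.
  std::unique_ptr<ToyStringVector> get() {
    if (entries_.empty()) {
      return nullptr;
    }
    Entry entry = std::move(entries_.back());
    entries_.pop_back();
    currentBytes_ -= entry.bytes;  // reuse the stored estimate
    return std::move(entry.vector);
  }

  std::size_t currentBytes() const { return currentBytes_; }

 private:
  struct Entry {
    std::unique_ptr<ToyStringVector> vector;
    std::size_t bytes;
  };

  std::vector<Entry> entries_;
  std::size_t currentBytes_ = 0;
};
```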
Description
VectorPool is used during expression evaluation to cache vectors that are no longer needed so they can be reused later. Today, the size of VectorPool is capped by the number of vectors it caches per type (i.e., 10), the number of types (i.e., only primitive types), and the size of vectors (i.e., less than 1024 * 64). VectorPool is owned by the ExecCtx in each Operator and is destroyed when the Operator is closed.
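The count-based policy described above can be pictured with a toy pool for a single type. ToyVector and CountCappedPool are hypothetical names, and this is a simplified sketch rather than the actual implementation; only the two limits (10 vectors, fewer than 1024 * 64 rows) come from the description.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a flat vector; not the library's class.
struct ToyVector {
  std::size_t numRows = 0;
};

// Toy pool for one type, capped by vector count and per-vector row count.
class CountCappedPool {
 public:
  static constexpr std::size_t kMaxPerType = 10;      // from the description
  static constexpr std::size_t kMaxRows = 1024 * 64;  // from the description

  // Returns true if the vector was cached, false if it was dropped.
  bool release(std::unique_ptr<ToyVector> vector) {
    if (vector == nullptr || vector->numRows >= kMaxRows ||
        cached_.size() >= kMaxPerType) {
      return false;
    }
    cached_.push_back(std::move(vector));
    return true;
  }

  // Returns a cached vector, or nullptr when the pool is empty.
  std::unique_ptr<ToyVector> get() {
    if (cached_.empty()) {
      return nullptr;
    }
    auto vector = std::move(cached_.back());
    cached_.pop_back();
    return vector;
  }

  std::size_t size() const { return cached_.size(); }

 private:
  std::vector<std::unique_ptr<ToyVector>> cached_;
};
```

Note that nothing here bounds bytes: ten cached vectors of wide strings can hold far more memory than ten vectors of integers.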
However, it has been observed that in a streaming engine, operators are never destroyed, and neither are their VectorPools. An experiment shows that caching 10 Varchar vectors of size 1024 * 64 of 1000-character strings can take 16MB. When a streaming query contains many operators, the sizes of their VectorPools can easily add up to a GB, which can be a problem.
To address this problem, we can cap VectorPool by memory usage instead of by the number of vectors, and allow library users to configure the VectorPool memory cap per operator.
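A memory-capped pool with a configurable limit might look roughly like the following. ToyBuffer and ByteCappedPool are hypothetical, and the cap is passed to the constructor only as a stand-in for whatever configuration mechanism would actually be used.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical cached object that already knows its size in bytes.
struct ToyBuffer {
  std::size_t bytes = 0;
};

// Toy pool capped by total cached bytes rather than by vector count.
class ByteCappedPool {
 public:
  // 'maxBytes' would come from a user-facing setting (an assumption here).
  explicit ByteCappedPool(std::size_t maxBytes) : maxBytes_(maxBytes) {}

  // Caches the buffer only if it fits under the cap; otherwise drops it.
  bool release(std::unique_ptr<ToyBuffer> buffer) {
    if (buffer == nullptr || buffer->bytes > maxBytes_ - currentBytes_) {
      return false;
    }
    currentBytes_ += buffer->bytes;
    cached_.push_back(std::move(buffer));
    return true;
  }

  // Returns a cached buffer, or nullptr when the pool is empty.
  std::unique_ptr<ToyBuffer> get() {
    if (cached_.empty()) {
      return nullptr;
    }
    auto buffer = std::move(cached_.back());
    cached_.pop_back();
    currentBytes_ -= buffer->bytes;
    return buffer;
  }

  std::size_t currentBytes() const { return currentBytes_; }

 private:
  const std::size_t maxBytes_;
  std::size_t currentBytes_ = 0;
  std::vector<std::unique_ptr<ToyBuffer>> cached_;
};
```

With a cap of 1000 bytes, two 400-byte buffers are cached and a third is dropped until one of the first two is taken back out.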
Experiment:
Experiment result: