facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Partial aggregation flushes after adding just 2 distinct keys #4577

Open gggrace14 opened 1 year ago

gggrace14 commented 1 year ago

Partial aggregation using a single group-by key may flush after adding just 2 distinct keys. Here is an example:

TEST_F(MergeJoinTest, yyy) {
  vector_size_t size = 10'000;
  auto data = makeRowVector({
      makeFlatVector<StringView>(
          size, [](auto row) { return row % 2 == 0 ? "web"_sv : "app"_sv; }),
      makeFlatVector<int64_t>(size, [](auto row) { return row; }),
  });

  auto plan = PlanBuilder()
                  .values({data, data})
                  .partialAggregation(
                      {"c0"}, {"approx_percentile(c1, 1, 0.9995, 0.001)"})
                  .finalAggregation()
                  .planNode();

  auto [cursor, results] = AssertQueryBuilder(plan).readCursor();

  LOG(ERROR) << printPlanWithStats(*plan, cursor->task()->taskStats(), true);
}

-- Aggregation[FINAL [c0] a0 := approx_percentile("a0")] -> c0:VARCHAR, a0:BIGINT
   Output: 2 rows (35.84KB, 1 batches), Cpu time: 12.26ms, Blocked wall time: 0ns, Peak memory: 15.00MB, Memory allocations: 9, Threads: 1
      hashtable.capacity         sum: 1840602, count: 1, min: 1840602, max: 1840602
      hashtable.numDistinct      sum: 2, count: 1, min: 2, max: 2
      hashtable.numRehashes      sum: 1, count: 1, min: 1, max: 1
      hashtable.numTombstones    sum: 0, count: 1, min: 0, max: 0
  -- Aggregation[PARTIAL [c0] a0 := approx_percentile(ROW["c1"],1,0.9995,0.001)] -> c0:VARCHAR, a0:ROW<"":ARRAY<DOUBLE>,"":BOOLEAN,"":DOUBLE,"":INTEGER,"":BIGINT,"":BIGINT,"":BIGINT,"":ARRAY<BIGINT>,"":ARRAY<INTEGER>>
     Output: 4 rows (306.64KB, 2 batches), Cpu time: 20.79ms, Blocked wall time: 0ns, Peak memory: 40.00MB, Memory allocations: 47, Threads: 1
        flushRowCount                               sum: 2, count: 1, min: 2, max: 2
        flushTimes                                  sum: 1, count: 1, min: 1, max: 1
        hashtable.capacity                          sum: 1840602, count: 1, min: 1840602, max: 1840602
        hashtable.numDistinct                       sum: 2, count: 1, min: 2, max: 2
        hashtable.numRehashes                       sum: 1, count: 1, min: 1, max: 1
        hashtable.numTombstones                     sum: 0, count: 1, min: 0, max: 0
        maxExtendedPartialAggregationMemoryUsage    sum: 32.00MB, count: 1, min: 32.00MB, max: 32.00MB
        partialAggregationPct                       sum: 0, count: 1, min: 0, max: 0
    -- Values[20000 rows in 2 vectors] -> c0:VARCHAR, c1:BIGINT
       Input: 0 rows (0B, 0 batches), Output: 20000 rows (575.62KB, 2 batches), Cpu time: 40.00us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1

Notice that there are only 2 distinct values of the group-by key: "web" and "app", but partial aggregation still flushes (note flushTimes and flushRowCount above).

Group-by keys that are short strings, i.e. strings of at most 7 bytes, are mapped to 64-bit integers using the following logic from VectorHasher.h:

// Maps a binary string of up to 7 bytes to int64_t. Each size maps
// to a different numeric range, so leading zeros are considered.
static inline int64_t stringAsNumber(const char* data, int32_t size) {
  int64_t word =
      bits::loadPartialWord(reinterpret_cast<const uint8_t*>(data), size);
  return size == 0 ? word : word + (1L << (size * 8));
}

This maps "web" to 23,225,719 and "app" to 24,146,017. After converting string keys to numbers, VectorHasher reports that the keys are in range [23,225,719, 24,146,017]. The size of that range is 920,298. HashTable::decideHashMode then chooses kArray as hash mode. It extends the range by adding 50% padding on both ends, which increases the range size to 1,840,596. This is still within limits for kArray as it allows up to 2M entries.

In array mode, we allocate an array of char pointers sized to the padded range. In this case, that array uses 9 * 1,840,596 bytes ~= 16MB of memory.

In addition, we use approx_percentile aggregation which allocates ~ 0.5MB for 2 accumulators. (CC: @Yuhta)

As a result, we hit the memory limit for partial aggregation (16MB) and trigger flushing.

--- Additional notes

Query config kMaxPartialAggregationMemory is the memory threshold checked to decide whether partial aggregation should flush. Its current default value is 16MB. https://github.com/facebookincubator/velox/blob/94484dff07180ecef5a651e3130f259af7ecefca/velox/exec/HashAggregation.cpp#L190-L193

The functionality to adaptively increase kMaxPartialAggregationMemory until it reaches kExtendedMaxPartialAggregationMemory is also in place to avoid future flushes. Currently, kExtendedMaxPartialAggregationMemory is also set to 16MB, so this functionality is effectively disabled. https://github.com/facebookincubator/velox/blob/94484dff07180ecef5a651e3130f259af7ecefca/velox/exec/HashAggregation.cpp#L245-L251

The threshold for the kArray hash mode of HashTable, kArrayHashMaxSize, is 2M entries or 16MB. https://github.com/facebookincubator/velox/blob/a50188efa8154aa7e948a94a2db9b260585775d0/velox/exec/HashTable.h#L69-L70 So the HashTable used by partial aggregation can be ~16MB in kArray mode and trigger flushes repeatedly. As shown above, this can happen even for a partial aggregation with a short string key that has only 2 distinct values.

Possible solutions

  1. Increase kMaxPartialAggregationMemory to 20MB and increase kExtendedMaxPartialAggregationMemory to enable adaptive relaxing of the memory threshold.
  2. Decrease kArrayHashMaxSize from 2M entries to 256K entries, or from 16MB to 14MB.
  3. Keep the current value of kMaxPartialAggregationMemory and kExtendedMaxPartialAggregationMemory for Velox, while setting them through Prestissimo or other system specific configs.
  4. [Update from Masha] Switch to normalizedKey or hash-based aggregation if we hit the memory limit using array-based aggregation and the number of unique keys is relatively small.
gggrace14 commented 1 year ago

@mbasmanova @spershin @xiaoxmeng

mbasmanova commented 1 year ago

CC: @oerling

@gggrace14 Ge, another solution is to switch to normalizedKey or hash-based aggregation if we hit the memory limit using array-based aggregation and number of unique keys is relatively small.

gggrace14 commented 1 year ago

> CC: @oerling
>
> @gggrace14 Ge, another solution is to switch to normalizedKey or hash-based aggregation if we hit the memory limit using array-based aggregation and number of unique keys is relatively small.

Masha, @oerling is looking at that now as a longer term solution.

mbasmanova commented 1 year ago

The same issue applies to a partial group-by over a single integer key when there is a small number of distinct key values spread across a large range. For example, 2 distinct keys: 460'000 and -460'000. CC: @oerling