facebook / hhvm

A virtual machine for executing programs written in Hack.
https://hhvm.com
Other
18.19k stars 3k forks source link

[ HSL ] Dict\chunk() behaves differently from doc block when given a KeyedTraversable with duplicate keys #9271

Open lexidor opened 2 years ago

lexidor commented 2 years ago

Describe the bug When given a KeyedTraversable with duplicate keys, Dict\chunk() will return chunks smaller than the chunk size. Some (but not all) duplicate keys will be lost.

 * Returns a vec containing the original dict split into chunks of the given
 * size.
 *
 * If the original dict doesn't divide evenly, the final chunk will be
 * smaller.

The vec will not contain all the elements of the KeyedTraversable ("original dict") and the chunks will not be of the "given size".

Standalone code, or other way to reproduce the problem

function keyed_traversable_with_duplicate_keys(
): KeyedTraversable<string, int> {
  // a => 0, b => 1, c => 2, a => 3, b => 4, c => 5, ...
  for ($i = 0; $i < 15; $i += 3) {
    yield 'a' => $i;
    yield 'b' => $i + 1;
    yield 'c' => $i + 2;
  }
}

<<__EntryPoint>>
function main(): void {
  echo \json_encode(\HH\Lib\Dict\chunk(keyed_traversable_with_duplicate_keys(), 5));
  // [{"a":3,"b":4,"c":2},{"c":8,"a":9,"b":7},{"b":13,"c":14,"a":12}]
}

Steps to reproduce the behavior:

  1. Run the repro listed above

Expected behavior

There is no "correct" behavior for a KeyedTraversable with duplicate keys that makes much sense.


My preferred solution would be to restrict the input type to KeyedContainer<Tk, Tv>. KeyedContainers can not have duplicate keys. This forces callers with a KeyedTraversable to call the function like this:

Dict\chunk(dict(keyed_traversable_with_duplicate_keys()), 5);

The loss of data will be more reasonable this way. Later values of duplicate keys overwrite previous occurrences. The order of the elements is maintained. The sub dicts will all be of the correct size, except for the trailing dict.

In order to support users who depended on the current behavior, I'd move the current implementation to the Legacy_FIXME namespace.

Actual behavior

[{"a":3,"b":4,"c":2},{"c":8,"a":9,"b":7},{"b":13,"c":14,"a":12}]

Environment

Additional context

This behavior has existed since the first public release of the hsl.

https://github.com/hhvm/hsl/blob/e3bb5e944f0639819f4d6416205449418cbc96c5/src/dict/transform.php#L20-L35

/**
 * Returns a vec containing the original dict split into chunks of the given
 * size. If the original dict doesn't divide evenly, the final chunk will be
 * smaller.
 */
function chunk<Tk, Tv>(
  KeyedTraversable<Tk, Tv> $traversable,
  int $size,
): vec<dict<Tk, Tv>> {
  invariant($size > 0, 'Chunk size must be positive.');
  $result = vec[];
  $ii = 0;
  foreach ($traversable as $key => $value) {
    if ($ii % $size === 0) {
      $result[] = dict[];
    }
    $result[(int)($ii / $size)][$key] = $value;
    $ii++;
  }
  return $result;
}

https://github.com/hhvm/hsl/blob/2615e1a30ced5518339a46f971c0434249ab5617/src/dict/transform.php#L15-L42

/**
 * Returns a vec containing the original dict split into chunks of the given
 * size.
 *
 * If the original dict doesn't divide evenly, the final chunk will be
 * smaller.
 *
 * Time complexity: O(n)
 * Space complexity: O(n)
 */
function chunk<Tk as arraykey, Tv>(
  KeyedTraversable<Tk, Tv> $traversable,
  int $size,
)[]: vec<dict<Tk, Tv>> {
  invariant($size > 0, 'Expected positive chunk size, got %d.', $size);
  $result = vec[];
  $ii = 0;
  $chunk_number = -1;
  foreach ($traversable as $key => $value) {
    if ($ii % $size === 0) {
      $result[] = dict[];
      $chunk_number++;
    }
    $result[$chunk_number][$key] = $value;
    $ii++;
  }
  return $result;
}