mkustermann opened this issue 1 year ago
Disclaimer: I do understand that semantics are a little bit different in that `DeepCollectionEquality` does allow a different equality on keys than the actual `Map` implementation will use. But this seems to be a rather rare use case (if it's useful at all?), and one shouldn't pay with 70x slower performance for it by default.
Might be something we can optimize, if we can find a safe algorithm.
The issue with `MapEquality` itself is that we don't know the equalities used by the maps, so we have to treat them as unordered sets of key/value pairs, and check that those sets are equal according to `_keyEquality`/`_valueEquality` pair comparison.
Not only do we not know that the key/value equality is default `==` equality (we could detect that and special-case it, and it might even be worth it), but we also don't know that the maps use that as their equality (they could be identity maps, could be splay-tree maps with a different ordering, could be anything).
We can't use keys from one map as lookups in the other. That's a serious restriction on what we can do, and it might be worth adding a more efficient `SameMapEquality` that assumes the two maps use the same key-equality, and uses that equality for the comparison too, so:
```dart
/// Equality on maps which assumes that both maps use `keyEquality` for key equality.
///
/// More efficient than [MapEquality] when doing `equals`, because [MapEquality.equals]
/// has to compare the key/value pairs of each map independently of the maps' own
/// equality behavior. This class assumes that compared maps use an equality which is
/// compatible with the `keyEquality`, and can therefore rely on map lookups
/// for comparing keys from different maps.
class SameMapEquality<K, V> extends MapEquality<K, V> {
  // MapEquality's actual constructor takes named `keys`/`values` parameters.
  const SameMapEquality({super.keys, super.values});

  @override
  bool equals(Map<K, V> map1, Map<K, V> map2) {
    if (map1.length != map2.length) return false;
    for (var key in map1.keys) {
      var v2 = map2[key];
      // Distinguish an absent key from a key with a `null` value.
      if (v2 == null && !map2.containsKey(key)) return false;
      if (!_valueEquality.equals(map1[key] as V, v2 as V)) return false;
    }
    return true;
  }
}
```
We could make the implementation of `MapEquality` be a `MapEquality<Never, Never>` which accepts `Map<Object?, Object?>` arguments, or do so only when provided with `DefaultEquality` or `IdentityEquality` as arguments, because those accept any arguments. (But I'd be worried we might shoot ourselves in the foot if we do this before getting variance annotations, because the equality classes are naturally contravariant, and we could just lean into that.)
For `DeepCollectionEquality`, the direct version is definitely simpler. It avoids doing `is List` and `is Set` checks, so it saves a little there, but that's probably not where the big difference is.

We can cache the nested `MapEquality(this, this)`, but it would require expandos, since the class has a `const` constructor.
More reasonably, we could special-case the default case (default values for both arguments) and optimize that. We will still run into nested map equalities with unknown map types, which have to do the full map key/value-pair multiset comparison, but at least we can use efficient `{}` maps for that when we know that the equality is `==`.

I think that's the most viable place to start: optimize the default case, if that's what everybody uses anyway.
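As a rough illustration, such an optimized default case could look like the sketch below. This is my sketch, not the package's code, and the name `defaultDeepEquals` is hypothetical; it assumes both equalities are plain `==` and that compared maps themselves use `==` as their key equality, so keys from one map can be looked up directly in the other:

```dart
// Hypothetical fast path for the default case of a deep equality:
// when key/value equality is plain `==` (and the maps are assumed to use
// `==` themselves), nested maps can be compared by direct lookup instead
// of building an intermediate HashMap of key/value pairs.
bool defaultDeepEquals(Object? e1, Object? e2) {
  if (e1 == e2) return true;
  if (e1 is Map) {
    if (e2 is! Map || e1.length != e2.length) return false;
    for (var key in e1.keys) {
      var v2 = e2[key];
      // Distinguish an absent key from a key with a `null` value.
      if (v2 == null && !e2.containsKey(key)) return false;
      if (!defaultDeepEquals(e1[key], v2)) return false;
    }
    return true;
  }
  if (e1 is List) {
    if (e2 is! List || e1.length != e2.length) return false;
    for (var i = 0; i < e1.length; i++) {
      if (!defaultDeepEquals(e1[i], e2[i])) return false;
    }
    return true;
  }
  if (e1 is Set) {
    // Membership checks delegate to the other set's own `==`-based lookup.
    return e2 is Set && e1.length == e2.length && e2.containsAll(e1);
  }
  return false;
}
```

Note this is exactly the kind of code that is unsound if one of the maps is, say, an identity map, which is why it can only be the special case, not the general algorithm.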
I created a (naive) benchmark repo for finding a more optimised version for a deep equality of JSON-like maps (where we assume the maps are of type `Map<String, dynamic>`, use the `==` equality for keys and values, and contain only nulls, bools, nums, Strings, or Lists and Maps thereof as values) here: https://github.com/knaeckeKami/json_equals

It uses `benchmark_harness` to compare the performance of different ways to compare two JSON-like Maps.
It shows that `DeepCollectionEquality()` has O(n^2) runtime, where n is the level of nesting in the map. I suspect that's because of the recursive re-hashing.

Here are the results of the benchmark: https://github.com/knaeckeKami/json_equals/blob/main/README.md

You can see that with a map that is nested 9 levels deep, `DeepCollectionEquality` is ~4000 times slower than a custom implementation, while with 1 level of nesting it's only ~10 times slower.
That quadratic complexity should definitely be fixed. I don't know if we can just cache the hash code (probably not, because we still call `equals` on the sub-objects repeatedly, and they don't cache). Maybe we can cache something else along the way, to avoid repeated computations. Worth checking.
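One possibility, purely a sketch of the "cache something along the way" idea with hypothetical names: memoize the deep hash of each container in an `Expando`, so nested structures are hashed once per object instead of once per enclosing lookup. An `Expando` also sidesteps the `const`-constructor problem mentioned earlier, since the cache lives outside the equality object:

```dart
// Hypothetical: cache deep hash codes per container identity in an Expando,
// avoiding repeated recursive re-hashing of nested maps and lists.
final Expando<int> _deepHashCache = Expando<int>('deepHash');

int deepHash(Object? value) {
  if (value is Map) return _deepHashCache[value] ??= _mapHash(value);
  if (value is List) return _deepHashCache[value] ??= _listHash(value);
  return value.hashCode;
}

int _listHash(List list) {
  var hash = list.length;
  for (var element in list) {
    hash = (hash * 31 + deepHash(element)) & 0x3fffffff;
  }
  return hash;
}

int _mapHash(Map map) {
  var hash = map.length;
  for (var entry in map.entries) {
    // Order-independent combination, since map iteration order may differ.
    hash = (hash + (deepHash(entry.key) ^ deepHash(entry.value))) & 0x3fffffff;
  }
  return hash;
}
```

The obvious caveat: the cache is only sound if the containers aren't mutated between calls, which is exactly the kind of assumption that would need discussion.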
We should perhaps also add a `JsonEquality` which is optimized for JSON structures, so it only needs to recognize `List` and `Map<String, ..>`, and it can assume normal string equality on the keys of the maps. That seems like a common enough use case to create an optimized variant for it. E.g.,
```dart
/// Deep equality on JSON-like object structures.
///
/// JSON-like objects can be:
/// * Lists of JSON-like objects.
/// * Maps with `String` keys and JSON-like objects as values.
/// * Other objects compared by normal `==`.
///
/// Lists are considered equal if they have the same length and the elements
/// at the same indices are equal.
/// Maps are considered equal if they have the same string keys, and their
/// values for each key are equal.
/// Such maps are assumed to be normal maps using `==` as key equality.
///
/// This equality treats any value which is not a `List<Object?>` or
/// `Map<String, Object?>` as a plain value to be compared using `==`.
/// Normal JSON structures only contain strings, numbers, booleans and `null`
/// as such values.
class JsonEquality implements Equality<Object?> {
  const JsonEquality();

  @override
  bool equals(Object? v1, Object? v2) {
    if (v1 == v2) return true;
    if (v1 is List) {
      if (v2 is! List) return false;
      return const ListEquality<Object?>(JsonEquality()).equals(v1, v2);
    }
    if (v1 is Map<String, Object?>) {
      if (v2 is! Map<String, Object?>) return false;
      // Assume normal string equality as map key equality.
      if (v1.length != v2.length) return false;
      for (var entry in v1.entries) {
        var value2 = v2[entry.key];
        // Distinguish an absent key from a key with a `null` value.
        if (value2 == null && !v2.containsKey(entry.key)) return false;
        if (!equals(entry.value, value2)) return false;
      }
      return true;
    }
    return false;
  }

  @override
  int hash(Object? value) {
    if (value is List) return const ListEquality(JsonEquality()).hash(value);
    if (value is Map<String, Object?>) {
      return const MapEquality(values: JsonEquality()).hash(value);
    }
    return value.hashCode;
  }

  @override
  bool isValidKey(_) => true;
}
```
Also, thinking about it more, `MapEquality` should perhaps be allowed to assume that the two maps use the same equality, whatever it is. Because, otherwise, are they really equal maps? Being "equal maps" is not just about having the same key/value pairs, it could also mean that using one instead of the other is a valid substitution. If not, what is "being equal" actually worth?

I'd be willing to change the definition of `MapEquality` to something actually useful, instead of theoretically pleasing. It might end up with false positives, but only when using every key of one map to look up values in the other map gives values that are equal.

The biggest issue with that change is that it makes `keyEquality` unnecessary, because it delegates that to the maps. That's an API change. Probably not one which matters much in practice.

WDYT?
> Also, thinking about it more, MapEquality should perhaps be allowed to assume that the two maps use the same equality, whatever it is.
>
> Because, otherwise, are they really equal maps? Being "equal maps" is not just about having the same keys/value pairs, it could also mean that using one instead of the other is a valid substitution. If not, what is "being equal" actually worth?
>
> WDYT?
Makes perfect sense. That's what I was hoping for :)
@lrhn Yeah, I think we should change things to assume that `Set` and `Map` instances that are being compared have the same equality behaviour, and document that as a requirement for the algorithm.
Alternatively, we could think about whether there is a way for us to add a fast path here, e.g. we could maybe add something like an interface:

```dart
abstract class EqualityRule {
  bool operator ==(Object other);
}

abstract class ContainerWithEqualityRule {
  EqualityRule get equalityRule;
}
```
have the underlying implementation classes for `Map` and `Set` implement this interface and then do:
```dart
if (map1 is ContainerWithEqualityRule &&
    map2 is ContainerWithEqualityRule &&
    map1.equalityRule == map2.equalityRule) {
  // go go fast path gadget
}
```
but maybe this is way too baroque.
I'd be a little more worried about changing `SetEquality` to use `contains`, because then it won't be using `elementEquality` at all. Well, it will for hashing, which may then differ (more easily) from equality.
For maps, we can check that each key of one map is also a key of the other map, and that the values are equal.
For that to get out of sync with `hash`, the second map needs to equate two keys of the first map which also have equal values in the first map, which makes equals asymmetric, and the hash codes likely different. Possible, but not as likely. (I don't want to iterate both maps and do lookups in both directions, because in practice the problem won't happen.)
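To make that failure mode concrete, here is a contrived, purely illustrative example (the helpers `lookupEquals` and `caseInsensitiveMap` are my own, not package code): the second map uses a case-insensitive key equality, so two distinct keys of the first map hit the same entry of the second, and one of the second map's keys is never looked at.

```dart
import 'dart:collection';

// Lookup-based map equality: iterate map1's keys and look them up in map2.
bool lookupEquals(Map<Object?, Object?> map1, Map<Object?, Object?> map2) {
  if (map1.length != map2.length) return false;
  for (var key in map1.keys) {
    var v2 = map2[key];
    // Distinguish an absent key from a key with a `null` value.
    if (v2 == null && !map2.containsKey(key)) return false;
    if (map1[key] != v2) return false;
  }
  return true;
}

// A map whose own key equality is case-insensitive, unlike a default map.
Map<String, int> caseInsensitiveMap(Map<String, int> entries) =>
    LinkedHashMap<String, int>(
      equals: (a, b) => a.toLowerCase() == b.toLowerCase(),
      hashCode: (s) => s.toLowerCase().hashCode,
    )..addAll(entries);
```

With `a = {'x': 1, 'X': 1, 'y': 2}` and `b = caseInsensitiveMap({'x': 1, 'y': 2, 'z': 3})`, both maps have length 3; `lookupEquals(a, b)` is true (both `'x'` and `'X'` find the same entry of `b`), but `lookupEquals(b, a)` is false (`a` has no key `'z'`), which is exactly the asymmetry described above.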
For sets, the equal-values clause disappears, and then it's much easier to have supposedly equal sets which are actually different (which again means different hash codes).
Still, it's likely that all sets in practice are `LinkedHashSet`s with default `==` equality and `hashCode`, so anything we do to handle the other cases is wasted effort.

Let's try and see what happens.
> For that to get out of sync with hash, the second map needs to equate two keys of the first map, which also have equal values in the first map, which makes equals asymmetric, and the hashcodes likely different.
Even now, `MapEquality` has `if (length != map2.length) return false;`. So in cases where two keys of map `a` map to one entry in map `b`, their lengths would differ. One could only maintain the same length if two keys of `a` map to one in `b` and two keys in `b` map to one in `a`. I think this is entirely hypothetical and will never happen in practice.
> I'd be a little more worried about changing SetEquality to use contains, because then it won't be using elementEquality at all. Well, it will for hashing, which may then differ (more easily) from equality. For sets, the equal-values clause disappears, and then it's much easier to have supposedly equal sets which are actually different (which again means different hash codes).
Not quite sure I follow here. We already check that the sets have an equal number of elements. One should only consider two sets equal if `a.contains(<some-key>)` and `b.contains(<some-key>)` agree for every key; then the sets are practically interchangeable/substitutable. The `.contains()` call internally also does equality checking of the keys (and may or may not use `.hashCode`).

This idea that one wants to consider two sets/maps equal (based on `elementEquality`) when they are in fact not equal when it comes to lookups doesn't seem very useful, since it allows sets to be equal but not interchangeable (i.e. the sets are considered equal, but `.contains(<some-key>)` on each gives different answers). I just don't see the use case for this.
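The substitutability notion argued for here could be sketched as follows (a hypothetical helper, assuming both sets use the same element equality, so either set's `contains` is trustworthy):

```dart
// Contains-based set equality: sets are equal when they are interchangeable
// for membership checks. Assumes both sets use the same element equality.
bool sameSetEquals(Set<Object?> a, Set<Object?> b) {
  if (identical(a, b)) return true;
  if (a.length != b.length) return false;
  // With equal sizes, a being a subset of b already implies equality.
  return b.containsAll(a);
}
```

This does a single pass of lookups with no allocation, compared to building a multiset of elements under a separate `elementEquality`.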
It would be nice to figure out our path forward with this - I noticed that some internal code also uses `MapEquality` in a few potentially performance-sensitive locations.
@lrhn Would you be interested in looking at this?
I'd love to get back to https://github.com/dart-lang/collection/pull/265/files, but right now I'm too busy adding class modifiers, and other 3.0 features, to platform libraries.
@lrhn friendly ping
Any updates on this?
Brace yourself - winter is coming. Would be nice to have a fix for this performance issue by Christmas Eve.
There are customers that use `DeepCollectionEquality().equals(a, b)` where `a` and `b` are e.g. `Map<String, dynamic>` (let's say JSON). It turns out this is very slow. See https://github.com/zino-hofmann/graphql-flutter/issues/1196 for an example use case suffering from this.

A very simple example that demonstrates the difference between a simple custom implementation and `DeepCollectionEquality().equals()`:

In JIT:

In AOT:

So `package:collection` is a whopping 70x slower than a manual version. A big reason is that when comparing two maps for equality, the responsible `MapEquality.equals` builds a new `HashMap` (of equal size). Building it requires inserting `_MapEntry(key, value)` objects as keys, and inserting each element into the `HashMap` requires (recursively) calculating a hash code (and possibly recursively doing equals calls). That's a lot of object allocations and repeated recursive hashing & equality checking (possibly on the same values).
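The costly strategy described above amounts to roughly the following (a paraphrase for illustration with a simplified, hypothetical `_Pair`; the real implementation recursively applies the configured key/value equalities in its `hashCode`/`==`):

```dart
import 'dart:collection';

// Paraphrase of the slow path: turn map2 into a multiset of key/value
// pairs keyed by freshly allocated wrapper objects, then consume it with
// map1's entries. Every pair allocation triggers hashing on insert/lookup.
class _Pair {
  final Object? key, value;
  _Pair(this.key, this.value);

  @override
  int get hashCode => key.hashCode ^ value.hashCode;

  @override
  bool operator ==(Object other) =>
      other is _Pair && key == other.key && value == other.value;
}

bool unorderedMapEquals(
    Map<Object?, Object?> map1, Map<Object?, Object?> map2) {
  if (map1.length != map2.length) return false;
  var counts = HashMap<_Pair, int>();
  for (var entry in map2.entries) {
    var pair = _Pair(entry.key, entry.value);
    counts[pair] = (counts[pair] ?? 0) + 1;
  }
  for (var entry in map1.entries) {
    var pair = _Pair(entry.key, entry.value);
    var count = counts[pair] ?? 0;
    if (count == 0) return false;
    counts[pair] = count - 1;
  }
  return true;
}
```

Even in this simplified form, the comparison allocates two wrapper objects per entry and hashes every key and value at least twice, which is where the direct-lookup approaches above win.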
Other things to note:

- `MapEquality<K, V>` / ... classes are generic, but in reality they're used with `dynamic`. => Using generics requires performing `as`-type checks on parameters with covariant types (such as `Map<K, V>`). => Especially considering that the `DeepCollectionEquality` class ignores the reified generics (it considers e.g. `<String, dynamic>{}` and `<Object, dynamic>{}` to be equal).
- `HashMap` over `{}` may be slower for many cases.
- Caching `MapEquality` / ... objects instead of creating new ones for every equality operation.

/cc @lrhn @mraleph