dart-lang / language

Design of the Dart language
Other
2.65k stars 202 forks source link

Add multimap literal #2405

Open ghost opened 2 years ago

ghost commented 2 years ago

Background

The four categories of associative abstract data types (ADTs) are:

It would then make sense that there should be a collection literal in Dart for each of these four kinds of ADTs since associative ADTs are by far the most common ADTs used in a typical program and lend themselves nicely to literal syntax. This is unlike other ADTs such as Stacks which arguably tend to be better off created empty and then pushed to.

Since typing a literal naturally requires an order to be specified, it also makes sense that the default kind of associative ADT chosen is one sorted by insertion order, rather than unordered or sorted by the intrinsic ordering of the objects (Comparable). If the default insertion order is not desired, then the literal could be converted using a dot operator (ex: <int>[1, 2, 3, 4].toSortedList(), assuming SortedList exists).

Current State

  1. Dart already supports List literals, which is an Ordered Multiset so that is good.
  2. Dart supports Set and Map literals. Set and Map are unordered ADTs but Dart seems to have made the unfortunate decision to have a leaky abstraction for Set and Map which are technically unordered but their default implementations maintain insertion order. This technically means that their implementations can be changed at any time to unordered data structures but I feel that is very unlikely since the leaky abstraction has undoubtedly resulted in many people relying on the insertion order. Having them be OrderedSet and OrderedMap would be a much better design but I digress since the point is that Dart technically has Ordered Sets and Ordered Maps covered with literals.
  3. Thus, an Ordered Multimap is the only missing associative ADT from Dart's list of collection literals. This is understandable since many people aren't familiar with Multimaps but I feel that they are useful, underappreciated and that they would fit very nicely into Dart's current literal syntax in a completely unintrusive way.

Proposal

My suggested syntax would be the following:[^1]

final myOrderedMultimap = <String, int>['key': 5, 'key': 3, 'key2': 5, 'key': 5];

This is logical because it uses the current List (Ordered Multiset) syntax and doesn't break any current code or change the rest of the syntax of the language in any way. It also nicely creates symmetry between the Set/Map literal syntax of {}. This would then make it clear that {} is for no-duplicate associative ADTs (regular and multi) and [] is for duplicate associative ADTs (regular and multi).

This symmetry can be seen here:

// Ordered Set (Unique Elements)
<E>{}
// Ordered Map (Unique Keys)
<K, V>{}
// Ordered Multiset/List (Duplicate Elements)
<E>[]
// Ordered Multimap (Duplicate Keys)
<K, V>[]

Of course, this proposal would also require an OrderedMultimap class to be added to the dart:core library as well. However, Google seems to have already created a usable (ordered?) Multimap in quiver.

[^1]: People familiar with common implementations of Multimap will know that there tend to be "List" and "Set" implementations, where "List" allows duplicate values for the same key and "Set" does not. I personally believe this is a misunderstanding of what a Multimap is and that the "List" version is the only true Multimap since a Multimap simply refers to duplicate keys being allowed. The value of each key is still scalar. The idea of a "Set" Multimap is really just a Set of Entries/Pairs. The Guava library for Java has a nice explanation of Multimap that despite implementing both the "List" and "Set" version of a Multimap, makes it clear that a Multimap is not: a → 1, 2 b → 3 but rather: a → 1 a → 2 b → 3 despite being potentially implemented as the former.

lrhn commented 2 years ago

The Dart platform libraries do not contain a multimap or multiset type to begin with, so adding a literal would be a little ahead of ourselves.

We could add a multimap/multiset interface and default (likely hash-based) implementation. It would require a little design work, but definitely doable. (Question: Can a mutlimap contain the same key/value combination more than once? Or will that require a multimultimap?) A List is not a good multiset, because set operations are inefficient. A Map<X, int> is a better implementation of a multiset, with add(X x) { var value = map[x]; map[x] = (value ?? 0) + 1; }.

Adding syntax for literal multisets/multimaps gets trickier. We could make any set with repeated values be a multiset, and any map with repeated keys be a multimap, but that's problematic because you can't create an empty multiset/multimap that way, or one not containing repeats from the start. We'd need some kind of marker on the literal, and Dart's syntax is pretty loaded around brace blocks already.

Instead of introducing a specific multiset syntax and a specific multimap syntax, I'd be more inclined towards a general "custom collection literal", like what C# has. Maybe MultiSet(){1, 1, 2, 3, 5, 8, 13} (which sadly looks like a function declaration, so we may have to disallow it at the beginning of an expression statement). It would be equivalent to MultiSet()..add(1)..add(1)..add(2)..add(3)..add(5)..add(8)..add(13). We can do the same for map syntax, where its ..[key]=value for each entry.

ghost commented 2 years ago

The Dart platform libraries do not contain a multimap or multiset type to begin with, so adding a literal would be a little ahead of ourselves.

Dart already has an Ordered Multiset since a "List" is just another name for an Ordered Multiset in respect to associative ADTs (insertion-ordered/duplicates allowed) and the [] notation already covers this case. I'm aware that people don't think of a List as an Ordered Multiset and that it lacks set operations in its API, but my point in mentioning it was simply to list out the four kinds of ordered associative ADTs and show that one is missing from the current Dart literal syntax. I'm not proposing a regular unordered Multiset (aka Bag) because when making a literal collection, people implicitly provide an insertion order and a "List" covers that. To clarify, "Multiset/List" here really just means "duplicate elements allowed".

It's only the Ordered Multimap that is missing and it seems that Google has already created a usable one in quiver that could be transferred over to dart:core. With that being said, a regular unordered Multiset and even Sorted Multiset (Sorted List) would be nice additions to Dart, but my proposal is just about the bare minimum so that a Multimap literal syntax can be created.

(Question: Can a mutlimap contain the same key/value combination more than once? Or will that require a multimultimap?)

Yes since Multimaps are really just concerned with having duplicate keys. The entries being unique would be more like a Set<Entry>. My proposal's footnote goes into more detail.

A List is not a good multiset, because set operations are inefficient. A Map<X, int> is a better implementation of a multiset, with add(X x) { var value = map[x]; map[x] = (value ?? 0) + 1; }.

The proposal is just about an Ordered Multimap since the List already covers the Ordered Multiset but if you want to add a regular unordered Multiset then I agree, but that would be a different issue.

Adding syntax for literal multisets/multimaps gets trickier. We could make any set with repeated values be a multiset, and any map with repeated keys be a multimap, but that's problematic because you can't create an empty multiset/multimap that way, or one not containing repeats from the start.

I proposed the following syntax:

final myOrderedMultimap = <String, int>['key': 5, 'key': 3, 'key2': 5, 'key': 5];

Considering that [] is already used for the "Ordered Multiset" (List), it makes perfect sense and creates symmetry with your current syntax without breaking anything. Since there already exists a Google implementation of a Multimap in quiver (and Guava for that matter) and this syntax is so elegant and unintrusive, I feel this proposal would be very easy to implement without breaking anything.

To further illustrate this symmetry:

// Ordered Set (Unique Elements)
<E>{}
// Ordered Map (Unique Keys)
<K, V>{}
// Ordered Multiset/List (Duplicate Elements)
<E>[]
// Ordered Multimap (Duplicate Keys)
<K, V>[]

Instead of introducing a specific multiset syntax and a specific multimap syntax, I'd be more inclined towards a general "custom collection literal", like what C# has. Maybe MultiSet(){1, 1, 2, 3, 5, 8, 13} (which sadly looks like a function declaration, so we may have to disallow it at the beginning of an expression statement). It would be equivalent to MultiSet()..add(1)..add(1)..add(2)..add(3)..add(5)..add(8)..add(13). We can do the same for map syntax, where its ..[key]=value for each entry.

For a general use case, I agree that this C#-like syntax could be useful for creating literals for any desired ADT, but I believe that would be more appropriate for another issue. Would you like me to make another issue regarding a generalized literal syntax? It seems like a good idea as well.

Also, would you like me to make a task list in this issue's proposal and have one of the tasks be a separate issue for adding a Multimap to dart:core?

lrhn commented 2 years ago

I'd definitely avoid copying MultiMap from package quiver. It is defined as a "Map to collection", where the collection can be list or set. That always leads you to wanting to define the kind of set, and then you end up with abstractions upon abstractions. (It's not created "by Google", it's just some people.)

I'd also consider whether the API for a multimap is really that different from a set of pairs of key and value, Set<(K, V)> with record types, Set<MapEntry<K, V>> without. At least that's how I'd define it if I were to design a multimap API from scratch.

I'd prefer a more direct API which only exposes iterables, like the Dart non-multi-maps.

Something like

/// A set of pairs of key and value, indexed by key.
///
/// The multimap can contain any combinations of key and value
/// at most once. Each *entry*, a pair of key and value, is considered
/// equivalent to another entry if keys and values are equal according to `==`.
/// There is efficient lookup of the values associated with each key,
/// but not efficient lookup by value, which is why the key is so named.
abstract class MultiMap<K, V> {
 /// Whether the map contains any entries.
 bool get isEmpty;  
 /// Whether the map does not contain any entries.
 bool get isNotEmpty;
 /// The individual keys of the multiset. Each key occurs precisely once.
 Iterable<K> get keys;
 /// The values associated to the keys of the multiset.
 ///
 /// The same value may occur more than once, if it's associated to more than one key.
 Iterable<V> get values;
 /// All pairs of key and value in the multiset.
 Iterable<MapEntry<K, V>> get entries;

 /// The values associated with [key].
 ///
 /// If no values are associated with [key], the
 /// returned iterable is empty.
 /// The iterable is lazy, and won't look at the collection
 /// until the returned iterable is iterated.
 Iterable<V> operator [](K key);
 /// Whether the multimap contains the key [key]
 /// 
 /// The multimap contains a key if any value is associated with the key.
 bool containsKey(K key);
 /// Whether the multimap contains an entry for [key] and [value].
 bool contains(K key, V value);
 /// Associates [key] with [value] in the multimap.
 ///
 /// Returns true if an entry was added, false if the entry
 /// was already in the multimap.
 bool add(K key, V value);
 /// Associates all the [values] with [key] in the multimap.
 void addValues(K key, Iterable<V> values);
 /// Adds all the key/value pairs of [entries] to the multimap.
 void addEntries(Iterable<MapEntry<K, V>> entries);
 /// Removes the entry for [key]/[value] from the multimap.
 ///
 /// Returns true if the entry was in the multimap, and was removed,
 /// and false if the entry wasn't in the multimap.
 bool remove(K key, V value);
 /// Removes all key/value pairs of [entries] from the multimap.
 void removeEntries(Iterable<MapEntry<K, V>> entries);
 /// Removes the [key] and all its values from the multimap.
 ///
 /// Returns an iterable containing the values.
 /// If the key was not in the multimap, the returned iterable is empty.
 Iterable<V> removeKey(K key); 
 /// Removes all entries from the multimap.
 void clear();
 /// Adds all entries of [other] map to this multimap.
 void addAll(MultiMap<K, V> other);
 /// Removes all entries of [other] map from this multimap.
 void removeAll(MultiMap<K, V> other);
 /// Removes all entries of this multimap that are not entries of [other].
 void retainAll(MultiMap<K, V> other);
 /// Creates a new multimap containing the entries of this multimap and [other].
 MultiMap<K, V> union(MultiMap<K, V> other);
 /// Creates a new multimap containing the entries that are in both this multimap and [other].
 MultiMap<K, V> intersection(MultiMap<K, V> other);
 /// Creates a new multimap containing the entries of this multimap, that are not in [other].
 MultiMap<K, V> disjunction(MultiMap<K, V> other);
}

(Or something like that, have to consider how efficient we can do a lookup including the value.)

The fact that I can disagree so fundamentally on the design of a multimap suggests that maybe there isn't just one valid design. And that's a risk factor.

I'm not opposed to adding a multimap to the platform libraries, but I'd honestly be more comfortable adding to to package:collection. The risk is that there are multiple different possible semantics for a multimap, and people might want to choose between them. In such cases, we prefer to not canonicalize one particular approach in the platform libraries, because we risk it being the wrong one. And providing language syntax for a literal of only one kind raises the stakes by a lot.

ghost commented 2 years ago

I'd definitely avoid copying MultiMap from package quiver. It is defined as a "Map to collection", where the collection can be list or set.

Right, I see what you mean.

(It's not created "by Google", it's just some people.)

I apologize, I saw that quiver was published by google.dev and assumed that was officially created by Google but I see that was an incorrect assumption.

I'd also consider whether the API for a multimap is really that different from a set of pairs of key and value, Set<(K, V)> with record types, Set<MapEntry<K, V>> without. At least that's how I'd define it if I were to design a multimap API from scratch.

Well if you're defining Multimap as a "SetMultimap", then it's not different, which is why I don't believe it is really correct to implement a Multimap with a Set but rather as a List since a Multimap is a flattened list of keys with scalar values as Guava's Multimap API doc illustrates.

A collection that maps keys to values, similar to Map, but in which each key may be associated with multiple values. You can visualize the contents of a multimap either as a map from keys to nonempty collections of values: a → 1, 2 b → 3 ... or as a single "flattened" collection of key-value pairs: a → 1 a → 2 b → 3 Important: although the first interpretation resembles how most multimaps are implemented, the design of the Multimap API is based on the second form. So, using the multimap shown above as an example, the [size()](https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html#size()) is 3, not 2, and the [values()](https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html#values()) collection is [1, 2, 3], not [[1, 2], [3]]. For those times when the first style is more useful, use the multimap's [asMap()](https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html#asMap()) view (or create a Map<K, Collection> in the first place).

Your API looks good but as mentioned, I think it should simply be defined as a map with "duplicate keys" as opposed to a set of "unique entries", in particular since the "multi" part of Multimap refers to duplicates. Multimap isn't an ADT that seems very formally defined despite being implemented in many different language libraries and thus it's hard to really show any references or formal definitions but as you say, a SetMultimap is basically just a set of entry records and thus defining a Multimap in that way is, in my opinion, redundant. Maybe we can get some different opinions on this but if I wanted unique entries I'd just use a Set of entries and see no point for Multimap to be used in that context.

I'm not opposed to adding a multimap to the platform libraries, but I'd honestly be more comfortable adding to to package:collection.

If it's not added to platform libraries, then can the literal syntax still be added?

The risk is that there are multiple different possible semantics for a multimap, and people might want to choose between them. In such cases, we prefer to not canonicalize one particular approach in the platform libraries, because we risk it being the wrong one.

In the worst-case scenario, could you not just have ListMultimap and SetMultimap as different implementations and default to using ListMultimap?

lrhn commented 2 years ago

To me, a multimap is not just a map of sets, or a map of lists. You can build those yourself, and then you get to choose the value collections yourself. And a "map with collection of values" design is precisely something I'd want to avoid canonicalizing. It's a pattern of designs more than a single design, and a good implementation should be parameterized on the collection strategy. (A set of queues is also valid, allowing you do do "remove first for key").

From the mathematical perspective, a traditional map is a set of entries satisfying that each key occurs only once, and indexed by key. Similarly, a multimap (to me) is a set of entries indexed by key. It would make no sense to say that (2, "a") occurs twice in that set. I can see that that is not what people are generally implementing, so maybe I'm just being a little too mathy.

Still, it is not even close to a settled design choice multimap design one to provide.

That's why I'd prefer, strongly, to not provide a canonical multimap. I would be much more interested in providing a generalizable collection literal syntax, so that ever choice of multimap can be a literal.

ghost commented 2 years ago

From the mathematical perspective, a traditional map is a set of entries satisfying that each key occurs only once, and indexed by key. Similarly, a multimap (to me) is a set of entries indexed by key. It would make no sense to say that (2, "a") occurs twice in that set.

My understanding was that from a mathematical perspective, a map is a set of keys, each with an associated value. And then by extension, a multimap is a multiset of keys, each with an associated value. By your definition, a multimap is no different than just a set of entries (since it lacks the extra constraint a Map has) and then I wouldn't see the point of its existence. Do you have a reference for a formal mathematical definition for a Multimap? I've been unable to find one but have been trying to contact academics on the subject.