JuliaCollections / DataStructures.jl

Julia implementation of Data structures
https://juliacollections.github.io/DataStructures.jl/latest/
MIT License
692 stars 244 forks source link

[Tracking] MultiDict and SortedMultiDict API updates before v1.0 #666

Open NHDaly opened 4 years ago

NHDaly commented 4 years ago

I wanted to create this meta-issue to track all the API changes that we want to do for MultiDict, which are currently scattered around various other threads:

And to use this thread to discuss other changes we might make.


Current suggested changes:

Open Questions:

StephenVavasis commented 4 years ago

Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.

NHDaly commented 4 years ago

What do you think about using a Dict{K,Set{V}} instead of a Dict{K,Vector{V}} for storing the values?

Properties that I like:

Properties that are maybe less good?:


After some quick checking, it looks like both Python's MultiDict and Golang's MultiMap support exact duplicates: https://pypi.org/project/multidict/

>>> d = multidict.CIMultiDict([("key", "one"), ("key2", "two"), ("key", 3)])
>>> d.items()
_ItemsView('key': 'one', 'key2': 'two', 'key': 3)

https://github.com/jwangsadinata/go-multimap

George: [Washington Bush Bush]

So I think we should probably stick with this here as well. However, for algorithmic reasons, I'm now thinking we should indeed then implement the backing Dict as a Dict{K, Set{Tuple{V, Int}}}.

This isn't exactly a breaking API change, except that if we do change to this, we may want to reconsider what we return from getindex(): returning maybe an iterator instead of a Vector (which users can of course always collect()).

NHDaly commented 4 years ago

Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.

Oh, thanks, agreed! I hadn't actually even noticed that one. Yes, i think you're right.

NHDaly commented 4 years ago

Oh, actually, to update what Golang does: they provide both a "setmultimap" package and a "slicemultimap" package, one of which backs their multimap with a Set, and the other with an array. Only the array-backed one supports duplicates:

It seems that a Set{Tuple{V,Int}} would get us the best of both worlds, but with slightly more bookkeeping overhead. I can see why they elected to offer both, but 🤷 i don't necessarily think we need to.

NHDaly commented 4 years ago

@oxinabox here's another question:

  • [ ] Should a MultiDict be <: AbstractDict?

i can kind of go either way on this. I think I'm leaning towards "no," since they're different than a Dictionary in a few ways. But if I read the description here, I feel like it more less holds?:

help?> AbstractDict
search: AbstractDict AbstractDisplay

  AbstractDict{K, V}

  Supertype for dictionary-like types with keys of type K and values of type V. Dict, IdDict and other types are subtypes of this. An AbstractDict{K, V}
  should be an iterator of Pair{K, V}.
NHDaly commented 3 years ago

After thinking more about this, and comparing implementations in other languages, I've come around to thinking it should be <: AbstractDict. I believe the MultiDict is a type of dictionary; it has almost exactly the same interface, and behavior. The only weird thing is that it holds multiple elements per key, so getindex returns a set instead of a single element.

push!, pop!, and insert! are all basically the same. One weird thing is delete!(), where delete!(MultiDict(1=>2, 1=>3), 1) deletes more than one element, while delete!(MultiDict(1=>2, 1=>3), 1, 2) deletes only one, which is something to watch out for.

The main difference is that right now we're not supporting setindex!(md, k, v), I think maybe because the semantics are a bit confusing? It seems maybe weird that if you write md[1] = 2, then you can turn around and call md[1] and you'll get back something that's not 2 (you'll get back a set containing at least 2). Instead you have to call insert!(md, 1, 2). So I do think it makes sense to not include this function, to call attention to this behavior, but the question is whether this is "still a dictionary". And I think the answer is yes.

I'll also added a checkbox question for whether or not to include setindex!() here:

  • [ ] Should we include setindex!(md::MultiDict, k, v)? See above.
quildtide commented 3 years ago

As a bit of an outsider to this conversion, but as a person who has used this library before, I think there's actually a legitimate case for the existence of 3 different types related to the current MultiDict. See me perhaps as less of an experienced developer and more of a naive end-user.

Imagine we have input data, perhaps from a csv, in the form of 2 columns, keys and values:

Case 1 - Set

We want to know all unique values that occur for each specific key. A MultiDict backed by a Set works well for this.

Case 2 - MultiSet or Vector

We want to know all values that occur for each specific key, including duplicates. Both MultiSets and Vectors are good here, depending on specifics.

Case 2a - MultiSet specific

After accomplishing the previous, we want to check if a value is present for a key. A MultiSet-backed MultiDict make sense here.

Case 2b - Vector specific

If we don't need to ever check if a single value is present for a key, and we instead expect to iterate across all values for a key, we might instead see better performance from a Vector-backed MultiDict.

Case 3 - Vector

We want to know all unique values that occur for each specific key while keeping them in the original order that they were inserted in. A Vector-backed MultiDict is the logical choice here.

Proposal

Create abstract type AbstractMultiDict <: AbstractDict end.

Some methods, like insert!(md, k, v) can be shared by AbstractMultiDict.

Begin with 3 different MultiDicts that are respectively:

It is conceivable that end users may want to make their own MultiDicts in the future using different backing data structures such as CuArrays or whatever. It may be possible to address this with a more generalized struct MultiDict{ktype, vtype, btype} <: AbstractMultiDict.

Additional opinion on setindex!

I am of the opinion that it wouldn't even be weird for setindex!(md::AbstractMultiDict, k, v) to replace the entirety of its contents for k with whatever v it's given, coerced into the proper type for md in question.

I'm willing to implement this proposal in a pull request if anyone else thinks it's worth exploring.

quildtide commented 3 years ago

After a bit of experimentation and thinking about this overnight, I've come to this personal conclusion:

Scope of the MultiDict's Purpose

The main purpose of a MultiDict is to abstract over certain very common code patterns, such as:

md::Dict{K,C{V}}

for (key, value) in data
    if key in md
        insert!(md[key], value)
    else
        md[key] = C([value])
    end
end

In this case, it is useful to handle this common, but ugly, chunk of code with a more convenient function, such as insert!(md::MultiDict{K, V}, key::K, value::V).

Out of Scope

However, a good Julian library should not make assumptions about why the user is using the MultiDict past this common interface.

There are many use cases for these extremely common chunks of code that a MultiDict can provide a convenient interface over, but these different use cases benefit from different internal containers.

Some use cases favor the MultiSet; some favor the Vector; some favor the Set. Furthermore, an end-user might want a MultiDict using a Stack, Queue, or DiBitVector from this library, or they might want something from a different library such as a CuArray.

It's worth noting, however, that all of these possible backends, assuming proper behavior, support some sort of insert!, pop!, and in functions. The performance and exact mechanisms of these function methods is irrelevant to the core MultiDict implementation. If there are some common cases where optimizations are possible, multiple dispatch can be utilized.

Proposal

I believe that the MultiDict should leverage Multiple Dispatch to allow for maximal flexibility in the MultiDict and not assume what backing collection type the end user wants.

I.e. Instead of struct MultiDict{K,V} d::Dict{K,Vector{V}} end or struct MultiDict{K,V} d::Dict{K,MultiSet{V}} end, I propose:

struct MultiDict{K,C} 
   d::Dict{K,C} 
end

It has also come to my attention that C, the container type, might not even be a parametric type, in a really bizarre use-case.

The parametric nature of C is therefore not important to the function of MultiDict.