andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
278 stars 28 forks source link

Allow syntactic sugar `d[a, b, c] == d[(a, b, c)]` #87

Open raphaelsaavedra opened 2 years ago

raphaelsaavedra commented 2 years ago

Base defines the following for convenience when working with Dicts:

# t[k1,k2,ks...] is syntactic sugar for t[(k1,k2,ks...)].  (Note
# that we need to avoid dispatch loops if setindex!(t,v,k) is not defined.)
getindex(t::AbstractDict, k1, k2, ks...) = getindex(t, tuple(k1,k2,ks...))
setindex!(t::AbstractDict, v, k1, k2, ks...) = setindex!(t, v, tuple(k1,k2,ks...))

In the spirit of Dictionaries.jl, which allows us to deal with dictionaries in an array-like manner, would this be something nice to add?

It would be very convenient in some of my code, as it deals with variables that can be either Arrays or Dictionaries, meaning I have to worry about using either d[a,b,c] or d[(a,b,c)].

andyferris commented 2 years ago

Yes, it's definitely something we should look at here.

What really needs to happen is to come up with a cohesive plan for multi-dimensional dictionaries. Sometimes the indices are Cartesian products of indices, and sometimes they could just be a subset. What interface do we want to support all this? What about the tokens for Cartesian products of indices - are they literally a Cartesian product of tokens and we have e.g. a matrix-like array backing the dictionary]? Do we use ragged arrays?

One obvious corner case with this syntax (which happens in Base) is the one-element version doesn't create tuples, but it makes sense to have Cartesian products of zero or one sets. Sometimes I wonder if this should have similar syntax rules to tuples, where d[] is like a zero-element tuple for the key and d[i,] (note the comma) produces a 1-element tuple.

mtfishman commented 2 years ago

I would love to have this, I have an immediate use case and have started designing a simplistic multi-dimensional dictionary type as an AbstractDictionary subtype, with the multidimensional indexing syntax proposed in the first post.

Indeed, I came across the same exact design questions you mention: how does slicing and subset indexing work? What are the underlying keys, and how do we map to them from a variety of inputs like subsets and slicing? What is the underlying data structure, if not a Vector? What happens in the case of a single dimension?

I started going with the design that the underlying keys are just Tuple (though probably better to use a special type like a generalized CartesianIndex) and the data structure is just a Vector, so indexing with the syntax d[a, b, c] just makes the tuple (a, b, c) and gets that element. Additionally, in the 1-dimensional case, d[a] gets mapped to the key (a,), for consistency. This design is mostly for simplicity of the implementation and piggybacking on the functionality of Dictionaries.jl to get things working.

I didn't go with a general underlying array data structure since for my case in general I won't have rectangular data (more like ragged arrays). The issue is that I want to support slicing like d[a, :, c], which becomes an O(length(d)) operation since you have to search each key to see if it in the subset specified by the range. For my use case I don't think the efficiency matters much and I just want it for a convenient high-level interface.

Additionally, there is ambiguity with other syntax like d[a, [b1, b3], c] where it is not clear if [b1, b2] should be treated as a key or a set of keys in that dimension. My plan was to make certain inputs like ranges and Vector special and treated as subsets of keys (like Base Array does), and just disallow those as being parts of individual keys of the dictionary, but I'm not sure if that is the right way to go. Maybe the getindices interface can help to disambiguate things?

If you want rectangular data, my conclusion seemed to be that the way to go would be to have an underlying array and then a Dictionary for each dimension mapping multidimensional keys to multidimensional cartesian indices which are linear in each dimension, like the design of NamedArrays:

julia> using NamedArrays

julia> n = NamedArray(rand(2, 4))
2×4 Named Matrix{Float64}
A ╲ B │         1          2          3          4
──────┼───────────────────────────────────────────
1     │  0.512214   0.700272     0.2982   0.211813
2     │  0.165471   0.912168  0.0165095   0.873511

julia> setnames!(n, ["one", "two"], 1)
(OrderedCollections.OrderedDict{Any, Int64}("one" => 1, "two" => 2), OrderedCollections.OrderedDict{Any, Int64}("1" => 1, "2" => 2, "3" => 3, "4" => 4))

julia> setnames!(n, ["a", "b", "c", "d"], 2)
(OrderedCollections.OrderedDict{Any, Int64}("one" => 1, "two" => 2), OrderedCollections.OrderedDict{Any, Int64}("a" => 1, "b" => 2, "c" => 3, "d" => 4))

julia> n
2×4 Named Matrix{Float64}
A ╲ B │         a          b          c          d
──────┼───────────────────────────────────────────
one   │  0.512214   0.700272     0.2982   0.211813
two   │  0.165471   0.912168  0.0165095   0.873511

julia> n.
array    dicts     dimnames
julia> n.array
2×4 Matrix{Float64}:
 0.512214  0.700272  0.2982     0.211813
 0.165471  0.912168  0.0165095  0.873511

julia> n.dicts
(OrderedCollections.OrderedDict("one" => 1, "two" => 2), OrderedCollections.OrderedDict("a" => 1, "b" => 2, "c" => 3, "d" => 4))

Not sure if there is a fundamental disadvantage to that design, but conceptually it seems very simple.

So my conclusion is that you needed two separate types, MultiDimDictionary designed like NamedArray and RaggedMultiDimDictionary with either some more complicated data structure like a ragged array to make slicing fast or a simple design like I've started working on with slow slicing.