andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
278 stars 28 forks source link

`Dictionary` that can be indexed either with a key or with an integer #113

Open ParadaCarleton opened 1 year ago

ParadaCarleton commented 1 year ago

Could be useful for something like DictTables--it'd be super useful if I could index the ith row of a DictTable with an integer, or instead index it with a symbol. (For this type, integer keys shouldn't be allowed, as that obviously causes ambiguity.)

andyferris commented 1 year ago

@ParadaCarleton I'm not 100% sure what the suggestion is - do you have a code example?

For the moment, with Dictionary you can use the token as an integer key. If you avoid deletions the tokens are guaranteed to start at 1 for the first element, 2 for the second, and so-on.

ParadaCarleton commented 1 year ago

@ParadaCarleton I'm not 100% sure what the suggestion is - do you have a code example?

As in, add a new data structure (NumDict, maybe?) such that the following would work:

julia> x = NumDict('a':'z', 'A':'Z')
26-element NumDict{Char, Char}
 'a' │ 'A'
 ⋮ │ ⋮
julia> x['a']
'A'
julia> x[1]
# returns 'A'

Alternatively, we could have integer indexing work for a normal Dictionary, as long as the key type isn't an integer; i.e. dispatch on the types of Dictionary and the key (so that integers index by order if the keys aren't integers). Ideally this would work even with deletions.

andyferris commented 1 year ago

So overloading getindex this way will never work because it would be ambiguous what to do with types like NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} - would the index for getindex be a dictionary key or an ordinal?

It is necessary to do one of these:

Does that make sense?

ParadaCarleton commented 1 year ago

So overloading getindex this way will never work because it would be ambiguous what to do with types like NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} - would the index for getindex be a dictionary key or an ordinal?

It is necessary to do one of these:

  • Transform the dictionary into some kind of array-like indexable structure with a generic function (e.g. collect(dict)[1])
  • Grab an array or similar from inside the dictionary with the data you want (e.g. dict.values[1])
  • Introduce a new generic function - so you get an element via get_num(dict, 1) or similar.

Does that make sense?

NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} should throw errors, as supertypes of Int; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.

andyferris commented 1 year ago

Hmm. While I can see why you could desire that and see it as a worthwhile tradeoff, I can't see that we'll be able to cleanly support a second meaning to getindex here. Even dict[:] isn't really possible to make work like an array since : is a value and not syntax. The "spare" space for indexing is what we might do with dict[a, b], dict[] and so-on.

For other meanings we need to use another interface. For example we have getindices from Indexing.jl. Functions like map, filter/etc, findall/etc can do interesting things with indices.

For a generic iterable (that supports iterate) I believe Julia should have get a new function like get_nth_element(iterable, n) (but obviously with a better name) that could operate the "slow" via iterate but be overloaded to be faster for e.g. arrays, ArrayDictionary, Dictionary with no holes, B-trees, etc. We could introduce such a function here and try and support it for all dictionaries, but this function will need to be something other than getindex. Maybe like get_nth_token? I am thinking of enforcing that iterate uses tokens for state, and perhaps even that tokens are ordered, so that you could then "resume" iterating, e.g. getting the 1000th to 2000th elements of a dictionary would be pretty speedy even in the worst case, since you'd zip past the first 999 as fast as possible and then use standard iterate from there. This is planned to support sorted dictionary indices with slices/range queries (e.g. get all values for keys greater than x, or between x and y, using just gettoken and iterate).

In the very short term, keep in mind that often with a DictTable you could simply do gettokenvalue(table, n) to get the nth row (that won't work in the presence of holes, lazily filtered indices, certain dictionary types, etc - but it can be valid shortcut when you know how the data was made). Also note that the support for Iterators.reverse is pretty robust and was added so you could e.g. easily peek into the end of a dictionary for show etc, and I recognize that you might want to look somewhere other than the front or back of a collection.

andyferris commented 1 year ago

NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} should throw errors, as supertypes of Int; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.

To be clear where my thinking is at, I think this is a case where it's cleaner to use a new generic function as the interface rather than a wrapper type. With the wrapper type you'd have to keep in mind that it is wrapped when you use getindex, so that you know what the meaning of getindex is. The wrapper type would have strange special cases like being weird for Any keys. It's just less for the programmer to keep in mind to use.

Also I did some digging - Iterators.drop can be used to get the nth element. We could endeavor to support first(Iterators.drop(dict, n-1)) and make it fast? It's a bit ugly, but it's a standard Base interface. (Maybe it is a bit too ugly...)

ParadaCarleton commented 1 year ago

NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} should throw errors, as supertypes of Int; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.

To be clear where my thinking is at, I think this is a case where it's cleaner to use a new generic function as the interface rather than a wrapper type. With the wrapper type you'd have to keep in mind that it is wrapped when you use getindex, so that you know what the meaning of getindex is. The wrapper type would have strange special cases like being weird for Any keys. It's just less for the programmer to keep in mind to use.

Also I did some digging - Iterators.drop can be used to get the nth element. We could endeavor to support first(Iterators.drop(dict, n-1)) and make it fast? It's a bit ugly, but it's a standard Base interface. (Maybe it is a bit too ugly...)

The main point of this issue was to let code written with the assumption that Vectors/Tables/etc. will be indexed by integers work when possible (and less importantly to let indexing with integers as easy as with keys). I don't think there's much mental overhead added as long as we throw a type error whenever users try constructing anything weird--"This dictionary can't hold integer keys" isn't much more complicated to deal with than "This dictionary can only hold keys of a single type."

ParadaCarleton commented 1 year ago

I think I've found a very nice solution in this package; would you be interested in a PR to support it?

mtfishman commented 9 months ago

I've been thinking about similar things in relation to designing labelled graph types, i.e. a graph with vertex labels that could be anything, as opposed to the SimpleGraph type in Graphs.jl which explicitly uses integer vertex labels from 1:nv(g). In NamedGraphs.jl I wrap a SimpleGraph as the underlying data structure and use a Dictionary from this package to map the vertex labels that the user interacts with to the linear "simple" vertex labels used by the underlying parent SimpleGraph. It's necessary for efficiency, at least internally, to have functions that work directly with those integer labels of the parent graph, where right now we use functions for that but it would be nice to have a simple external API.

Something I've considered is having an indexing type[^1], so then I could use syntax like g[SimpleVertex(1)] or maybe g[ParentVertex(1)], say for accessing data of a labelled graph on a vertex using the parent graph vertex label instead of the external vertex label. This is related to a proposal that has been made a few times^2^4 for introducing a LinearIndex type for arrays that support linear indexing, to be more explicit about whether you are using linear or cartesian indexing. Could something similar be used for dictionaries, such as d[Token(1)]? I suppose that's related to the suggestion above to use OrdinalIndexing.jl as syntax for providing a specialized type to use in indexing into the parent data structure.

Relatedly, in the case I describe above, the Dictionary has a very particular structure, which is that the keys can be anything but the values are always 1:n (and we also store a reverse map from simple vertices to vertex labels which is a Vector storing an ordered list of the vertex labels, so basically is a bijection^5 but simplified given one side of the map is simple integers). @andyferris do you know if that maps onto an existing dictionary abstraction in this package? Maybe related to your comment in #104 about creating a bijection based on the tokens of the dictionary, it seems to essentially just be an Indices object but I'd have to be careful about the ordering of the values.

[^1]: Maybe also a trait for defining different kinds of indexing, like the Base Julia IndexStyle trait.

ParadaCarleton commented 9 months ago

I've been thinking about similar things in relation to designing labelled graph types, i.e. a graph with vertex labels that could be anything, as opposed to the SimpleGraph type in Graphs.jl which explicitly uses integer vertex labels from 1:nv(g).

I think that ideally, the best combination would be something like OrdinalIndices.jl (for explicitly indexing at a fixed position) and DimensionalData.jl (for indexing with symbols, strings, etc.). (But in the case of graphs, I'm not sure what the natural ordering on those would be?)

mtfishman commented 9 months ago

Good point that this is related to DimensionalData.jl and other packages with named dimensions/indices (like AxisKeys.jl, which seems a bit more flexible?).

Here's a fun demonstration:

using AxisKeys
using BenchmarkTools
using SparseArrays
using Dictionaries
using UniqueVectors

struct IndicesVector{I} <: AbstractVector{I}
  inds::Indices{I}
end
IndicesVector(data::Vector) = IndicesVector(Indices(data))
Base.size(inds::IndicesVector) = (length(inds.inds),)
Base.getindex(inds::IndicesVector, index::Integer) = inds.inds.values[index]
Base.findfirst(pattern::Base.Fix2{typeof(isequal)}, inds::IndicesVector) = gettoken(inds.inds, pattern.x)[2][2]

Which allows us to do:

julia> A = KeyedArray(sprandn(4, 4, 0.5), (IndicesVector(["a", "b", "c", "d"]), IndicesVector(["a", "b", "c", "d"])))
2-dimensional KeyedArray(...) with keys:
↓   4-element IndicesVector{String}
→   4-element IndicesVector{String}
And data, 4×4 SparseMatrixCSC{Float64, Int64} with 7 stored entries:
         ("a")      ("b")       ("c")      ("d")
  ("a")   0.0        0.0         0.0       -0.510721
  ("b")   0.0        0.0         0.600002   0.653016
  ("c")   0.0        0.0592783   2.38575    0.0
  ("d")  -0.899354   0.0        -0.101834   0.0

julia> A("a", "d")
-0.5107207258723061

And a benchmark:

function main(d)
  inds_data = string.(1:d)
  data = sprandn(d, d, 0.1)

  inds = IndicesVector(inds_data)
  A = KeyedArray(data, (inds, inds))
  @btime $A($("50"), $("50"))

  inds = UniqueVector(inds_data)
  A = KeyedArray(data, (inds, inds))
  @btime $A($("50"), $("50"))

  A = KeyedArray(data, (inds_data, inds_data))
  @btime $A($("50"), $("50"))
end

which outputs:

julia> main(100)
  66.224 ns (0 allocations: 0 bytes)
  53.414 ns (0 allocations: 0 bytes)
  314.233 ns (0 allocations: 0 bytes)

Basically a labelled/named graph (i.e. a graph with general vertex labels or names) can be represented by the object I'm creating above, a sparse array encoding the adjacency matrix and then labelled or named dimensions.

UniqueVectors.jl seems to be closely related to the Indices type from this package but it has an AbstractVector interface and uses a Base.Dict for fast lookup. The IndicesVector type let's us use Indices as an index structure of a KeyedArray with fast lookup.

andyferris commented 9 months ago

Yes, sometimes you want an "array that can be either indexed with a integer or a key" rather than a "dictionary that can be either indexed with a key or an integer" :) Especially if you want the integer indices to be dense.

Another package in this space that you might consider is AcceleratedArrays.jl (shameless plug; note it hasn't received much attention for a while).

ParadaCarleton commented 9 months ago

Good point that this is related to DimensionalData.jl and other packages with named dimensions/indices (like AxisKeys.jl, which seems a bit more flexible?).

Both packages are almost identical, except that DimensionalData.jl is more fully-featured and fully-developed at this point; I generally recommend it because it's more widely used throughout the rest of the ecosystem. (And if you only need to label columns, DataFrames.jl and StructArrays.jl are better-supported.)

The difference between array-labeling and a Dictionary is whether you plan to be doing a lot of insertions at runtime. DimensionalData.jl et al. are fundamentally array packages, which means they're designed for (mostly) static axes. When you index an array with something like x[At(:a)], the compiler is doing the job of figuring out that :a means the 1st position, so you don't have to pay for that cost at runtime. But if you don't know what :a means at compile time, you need a dictionary instead.