Open ParadaCarleton opened 1 year ago
@ParadaCarleton I'm not 100% sure what the suggestion is - do you have a code example?
For the moment, with Dictionary
you can use the token as an integer key. If you avoid deletions the tokens are guaranteed to start at 1
for the first element, 2
for the second, and so-on.
@ParadaCarleton I'm not 100% sure what the suggestion is - do you have a code example?
As in, add a new data structure (NumDict
, maybe?) such that the following would work:
julia> x = NumDict('a':'z', 'A':'Z')
26-element NumDict{Char, Char}
'a' │ 'A'
⋮ │ ⋮
julia> x['a']
'A'
julia> x[1]
# returns 'A'
Alternatively, we could have integer indexing work for a normal Dictionary
, as long as the key type isn't an integer; i.e. dispatch on the types of Dictionary
and the key (so that integers index by order if the keys aren't integers). Ideally this would work even with deletions.
So overloading getindex
this way will never work because it would be ambiguous what to do with types like NumDict{Union{Int, Char}, Char}
or NumDict{Any, Any}
- would the index for getindex
be a dictionary key or an ordinal?
It is necessary to do one of these:
collect(dict)[1]
)dict.values[1]
)get_num(dict, 1)
or similar.Does that make sense?
So overloading
getindex
this way will never work because it would be ambiguous what to do with types likeNumDict{Union{Int, Char}, Char}
orNumDict{Any, Any}
- would the index forgetindex
be a dictionary key or an ordinal?It is necessary to do one of these:
- Transform the dictionary into some kind of array-like indexable structure with a generic function (e.g.
collect(dict)[1]
)- Grab an array or similar from inside the dictionary with the data you want (e.g.
dict.values[1]
)- Introduce a new generic function - so you get an element via
get_num(dict, 1)
or similar.Does that make sense?
NumDict{Union{Int, Char}, Char}
or NumDict{Any, Any}
should throw errors, as supertypes of Int
; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.
Hmm. While I can see why you could desire that and see it as a worthwhile tradeoff, I can't see that we'll be able to cleanly support a second meaning to getindex
here. Even dict[:]
isn't really possible to make work like an array since :
is a value and not syntax. The "spare" space for indexing is what we might do with dict[a, b]
, dict[]
and so-on.
For other meanings we need to use another interface. For example we have getindices
from Indexing.jl
. Functions like map
, filter
/etc, findall
/etc can do interesting things with indices.
For a generic iterable (that supports iterate
) I believe Julia should have get a new function like get_nth_element(iterable, n)
(but obviously with a better name) that could operate the "slow" via iterate
but be overloaded to be faster for e.g. arrays, ArrayDictionary
, Dictionary
with no holes, B-trees, etc. We could introduce such a function here and try and support it for all dictionaries, but this function will need to be something other than getindex
. Maybe like get_nth_token
? I am thinking of enforcing that iterate
uses tokens for state, and perhaps even that tokens are ordered, so that you could then "resume" iterating, e.g. getting the 1000th to 2000th elements of a dictionary would be pretty speedy even in the worst case, since you'd zip past the first 999 as fast as possible and then use standard iterate
from there. This is planned to support sorted dictionary indices with slices/range queries (e.g. get all values for keys greater than x
, or between x
and y
, using just gettoken
and iterate
).
In the very short term, keep in mind that often with a DictTable
you could simply do gettokenvalue(table, n)
to get the n
th row (that won't work in the presence of holes, lazily filtered indices, certain dictionary types, etc - but it can be valid shortcut when you know how the data was made). Also note that the support for Iterators.reverse
is pretty robust and was added so you could e.g. easily peek into the end of a dictionary for show
etc, and I recognize that you might want to look somewhere other than the front or back of a collection.
NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} should throw errors, as supertypes of Int; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.
To be clear where my thinking is at, I think this is a case where it's cleaner to use a new generic function as the interface rather than a wrapper type. With the wrapper type you'd have to keep in mind that it is wrapped when you use getindex
, so that you know what the meaning of getindex
is. The wrapper type would have strange special cases like being weird for Any
keys. It's just less for the programmer to keep in mind to use.
Also I did some digging - Iterators.drop
can be used to get the nth
element. We could endeavor to support first(Iterators.drop(dict, n-1))
and make it fast? It's a bit ugly, but it's a standard Base
interface. (Maybe it is a bit too ugly...)
NumDict{Union{Int, Char}, Char} or NumDict{Any, Any} should throw errors, as supertypes of Int; the idea of a new type would be to enforce the requirements needed to be sure there's no ambiguity.
To be clear where my thinking is at, I think this is a case where it's cleaner to use a new generic function as the interface rather than a wrapper type. With the wrapper type you'd have to keep in mind that it is wrapped when you use
getindex
, so that you know what the meaning ofgetindex
is. The wrapper type would have strange special cases like being weird forAny
keys. It's just less for the programmer to keep in mind to use.Also I did some digging -
Iterators.drop
can be used to get thenth
element. We could endeavor to supportfirst(Iterators.drop(dict, n-1))
and make it fast? It's a bit ugly, but it's a standardBase
interface. (Maybe it is a bit too ugly...)
The main point of this issue was to let code written with the assumption that Vectors/Tables/etc. will be indexed by integers work when possible (and less importantly to let indexing with integers as easy as with keys). I don't think there's much mental overhead added as long as we throw a type error whenever users try constructing anything weird--"This dictionary can't hold integer keys" isn't much more complicated to deal with than "This dictionary can only hold keys of a single type."
I think I've found a very nice solution in this package; would you be interested in a PR to support it?
I've been thinking about similar things in relation to designing labelled graph types, i.e. a graph with vertex labels that could be anything, as opposed to the SimpleGraph
type in Graphs.jl which explicitly uses integer vertex labels from 1:nv(g)
. In NamedGraphs.jl I wrap a SimpleGraph
as the underlying data structure and use a Dictionary
from this package to map the vertex labels that the user interacts with to the linear "simple" vertex labels used by the underlying parent SimpleGraph
. It's necessary for efficiency, at least internally, to have functions that work directly with those integer labels of the parent graph, where right now we use functions for that but it would be nice to have a simple external API.
Something I've considered is having an indexing type[^1], so then I could use syntax like g[SimpleVertex(1)]
or maybe g[ParentVertex(1)]
, say for accessing data of a labelled graph on a vertex using the parent graph vertex label instead of the external vertex label. This is related to a proposal that has been made a few times^2^4 for introducing a LinearIndex
type for arrays that support linear indexing, to be more explicit about whether you are using linear or cartesian indexing. Could something similar be used for dictionaries, such as d[Token(1)]
? I suppose that's related to the suggestion above to use OrdinalIndexing.jl
as syntax for providing a specialized type to use in indexing into the parent data structure.
Relatedly, in the case I describe above, the Dictionary has a very particular structure, which is that the keys can be anything but the values are always 1:n
(and we also store a reverse map from simple vertices to vertex labels which is a Vector
storing an ordered list of the vertex labels, so basically is a bijection^5 but simplified given one side of the map is simple integers). @andyferris do you know if that maps onto an existing dictionary abstraction in this package? Maybe related to your comment in #104 about creating a bijection based on the tokens of the dictionary, it seems to essentially just be an Indices
object but I'd have to be careful about the ordering of the values
.
[^1]: Maybe also a trait for defining different kinds of indexing, like the Base Julia IndexStyle trait.
I've been thinking about similar things in relation to designing labelled graph types, i.e. a graph with vertex labels that could be anything, as opposed to the
SimpleGraph
type in Graphs.jl which explicitly uses integer vertex labels from1:nv(g)
.
I think that ideally, the best combination would be something like OrdinalIndices.jl (for explicitly indexing at a fixed position) and DimensionalData.jl (for indexing with symbols, strings, etc.). (But in the case of graphs, I'm not sure what the natural ordering on those would be?)
Good point that this is related to DimensionalData.jl and other packages with named dimensions/indices (like AxisKeys.jl, which seems a bit more flexible?).
Here's a fun demonstration:
using AxisKeys
using BenchmarkTools
using SparseArrays
using Dictionaries
using UniqueVectors
struct IndicesVector{I} <: AbstractVector{I}
inds::Indices{I}
end
IndicesVector(data::Vector) = IndicesVector(Indices(data))
Base.size(inds::IndicesVector) = (length(inds.inds),)
Base.getindex(inds::IndicesVector, index::Integer) = inds.inds.values[index]
Base.findfirst(pattern::Base.Fix2{typeof(isequal)}, inds::IndicesVector) = gettoken(inds.inds, pattern.x)[2][2]
Which allows us to do:
julia> A = KeyedArray(sprandn(4, 4, 0.5), (IndicesVector(["a", "b", "c", "d"]), IndicesVector(["a", "b", "c", "d"])))
2-dimensional KeyedArray(...) with keys:
↓ 4-element IndicesVector{String}
→ 4-element IndicesVector{String}
And data, 4×4 SparseMatrixCSC{Float64, Int64} with 7 stored entries:
("a") ("b") ("c") ("d")
("a") 0.0 0.0 0.0 -0.510721
("b") 0.0 0.0 0.600002 0.653016
("c") 0.0 0.0592783 2.38575 0.0
("d") -0.899354 0.0 -0.101834 0.0
julia> A("a", "d")
-0.5107207258723061
And a benchmark:
function main(d)
inds_data = string.(1:d)
data = sprandn(d, d, 0.1)
inds = IndicesVector(inds_data)
A = KeyedArray(data, (inds, inds))
@btime $A($("50"), $("50"))
inds = UniqueVector(inds_data)
A = KeyedArray(data, (inds, inds))
@btime $A($("50"), $("50"))
A = KeyedArray(data, (inds_data, inds_data))
@btime $A($("50"), $("50"))
end
which outputs:
julia> main(100)
66.224 ns (0 allocations: 0 bytes)
53.414 ns (0 allocations: 0 bytes)
314.233 ns (0 allocations: 0 bytes)
Basically a labelled/named graph (i.e. a graph with general vertex labels or names) can be represented by the object I'm creating above, a sparse array encoding the adjacency matrix and then labelled or named dimensions.
UniqueVectors.jl seems to be closely related to the Indices
type from this package but it has an AbstractVector
interface and uses a Base.Dict
for fast lookup. The IndicesVector
type let's us use Indices
as an index structure of a KeyedArray
with fast lookup.
Yes, sometimes you want an "array that can be either indexed with a integer or a key" rather than a "dictionary that can be either indexed with a key or an integer" :) Especially if you want the integer indices to be dense.
Another package in this space that you might consider is AcceleratedArrays.jl (shameless plug; note it hasn't received much attention for a while).
Good point that this is related to DimensionalData.jl and other packages with named dimensions/indices (like AxisKeys.jl, which seems a bit more flexible?).
Both packages are almost identical, except that DimensionalData.jl is more fully-featured and fully-developed at this point; I generally recommend it because it's more widely used throughout the rest of the ecosystem. (And if you only need to label columns, DataFrames.jl and StructArrays.jl are better-supported.)
The difference between array-labeling and a Dictionary is whether you plan to be doing a lot of insertions at runtime. DimensionalData.jl et al. are fundamentally array packages, which means they're designed for (mostly) static axes. When you index an array with something like x[At(:a)]
, the compiler is doing the job of figuring out that :a
means the 1st position, so you don't have to pay for that cost at runtime. But if you don't know what :a
means at compile time, you need a dictionary instead.
Could be useful for something like
DictTables
--it'd be super useful if I could index thei
th row of aDictTable
with an integer, or instead index it with a symbol. (For this type, integer keys shouldn't be allowed, as that obviously causes ambiguity.)