Define a token API for Associative types

kmsquire commented 6 years ago

Consider

d = Dict{String, Int}()
...
if haskey(d, key)
    d[key] += 1
end

In this code, the hash of the key is calculated 3 times. Ideally, this would happen only once.

Tokens provide one way to handle this. Two possible interfaces:

#######
# API 1
#######
d = Dict{String, Int}()
...
token = gettoken1(d, key) # returns a DictToken object
d[token] += 1             # O(1) get and O(1) set

#######
# API 2
#######
d = Dict{String, Int}()
...
token = keytoken(d, key)        # essentially the same as gettoken1 in API 1
value = gettoken2(d, token)
settoken!(d, token, value + 1)

(gettoken1 and gettoken2 are named as such to distinguish them here. The actual name for whatever API is chosen should be gettoken)

I have a slight preference for API 1. API 2 was proposed by @StefanKarpinski here. (@StefanKarpinski, please fix if incorrect!)

Related solutions/issues:

using tokens in SortedDicts in DataStructures.jl, to allow O(1) access to a value that was looked up
using get!(d, key, default) to store and return a default value, if the key is not in the dictionary
using DefaultDicts in DataStructures.jl, for a similar purpose
proposal to use ?= as a get-or-set operation (#2108)
proposal to use Refs to access elements (#12157)

Cc: @StephenVavasis @andyferris

xiaodaigh commented 6 years ago

Why not just

updateindex!(dict, key, function_to_apply_to_value)

the outcome of the above should be the same as

dict[key] = get(dict, key, default_value) |> function_to_apply_to_value

so it's in one line and faster?

StephenVavasis commented 6 years ago

The "token" feature of SortedDict is meant to address two problems: multiple key lookups for an update operation (the topic of this issue) and also iterating over a sorted container in the sort-order of the keys. The token serves as the state-variable for this iteration. The updateindex! proposal in the above comment would solve the first problem for both Dict and SortedDict, but tokens or something similar would still be needed to solve the second problem for the sorted containers.

andyferris commented 6 years ago

I favor API 2.

I've been hoping that we can copy the IndexStyle type of idea from arrays and have iteration and methods like map (of values) using tokens automatically, without knowing the details of the tokens of a given dictionary. (I'm hoping this will work together with the semantics of similar in #24390 to make functions like map (of values) really fast using simple, generic code).

I'd also like to point out that linear indexing is a kind-of token system for dense arrays.

mbauman commented 6 years ago

Wouldn't that be API 1, then? That'd more transparently give you a "fast" key to use with indexing syntaxes. The downside with API 1 is that then dictionaries can no longer support all types in the language as keys — the DictToken type (or whatever) would be special. I think that's fine. It's also worth noting that it'd be easy to define a gettoken1 fallback for ::Associative — you just return the (likely slow) key itself by default.

andyferris commented 6 years ago

I guess I was imagining methods of map that use gettoken and settoken! and a separate slower (default) method that uses getindex and setindex!.

Wrapping in a DictToken would be fine, also.

In any case, the tricky bit is when you do operations like map! you need to figure out if the tokens of the input and output match, and what to do when they don't, etc.

JeffBezanson commented 6 years ago

I favor adding update!. It's simple and handles the most common case. Tokens are more complex since you have the problem of invalidating them if the Dict changes.

stevengj commented 6 years ago

API 1 seems a lot clearer to me.

I don't think it's correct that API 1 wouldn't be able to support all types in the language, because you would parameterize the token type on the dictionary type.

That is, for d::Dict{K,V}, the gettoken(d, k) function would return an AssociativeToken{Dict,K,V} instance.

If you then wanted a new dictionary d2 whose keys were AssociativeToken{Dict,K,V}, then d2 would be of type Dict{AssociativeToken{Dict,K,V}, V}. gettoken on d2 would return an Associative{Dict, AssociativeToken{Dict,K,V}, V} instance. And so on.

stevengj commented 6 years ago

Note that you'd want the function to be the first argument of update! so that you could use it with do syntax. But how would you avoid the redundant hashing in the following pattern:

if haskey(dict, k)
    .... do something...
else
    ... do something else with dict[k] ....
end

?

stevengj commented 6 years ago

Also, without tokens, how would you swap two elements, i.e.

d[k1], d[k2] = d[k2], d[k1]

without hashing the keys twice? The proposed update! would only give you access to a single element, no?

JeffBezanson commented 6 years ago

update! could have an optional default value to use if the key doesn't exist yet.

API 1 is ambiguous for Dicts of Any, e.g. for memoization.

It's true, tokens are more powerful, but therefore more difficult to support correctly.

andyferris commented 6 years ago

I'm still not convinced that update! is sufficient for the kinds of things I was envisaging, though it is certainly helpful.

One obvious but tricky thing is that map!(f, dict1, dict2) should be fast whenever dict1 and dict2 share compatible tokens (which will hopefully be likely after calling similar). Ideally, we can also implement map! with generic code like we do for AbstractArray which as I understand will use linear index "tokens" whenever it can.

nalimilan commented 6 years ago

update! can be useful on its own, so we could start with that and see whether people still need a lower-level system based on tokens.

chethega commented 6 years ago

Regarding update!, my favorite API would be

merge!(combine, d::Dict{K,V}, kv::Pair{K,V})
_merge!(combine, d::Dict{K,V}, k::K,v::V)

The _merge! is necessary for types where the pair formation is expensive (e.g. causes an allocation).

JuliaLang / julia

Define a token API for Associative types #24454