JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

julep/rfc: proposal for unified Dict get/insert/has interface #12157

Open vtjnash opened 9 years ago

vtjnash commented 9 years ago

problem statement:

a common idiom when working with Associative objects is to do a set of has / insert / lookup operations:

d = Dict()
if !(key in keys(d))
  d[key] = initial_value
end
value = d[key]

this syntax has the advantage of only using basic operations, for clarity, but it's also 3x slower than necessary since the dict lookup gets repeated three times. that concern has led to the introduction of a get! method that accepts either an object or a function, to insert a value if the lookup fails, so that the above code can be rewritten as:

d = Dict()
value = get!(d, key, initial_value)

or

d = Dict()
value = get!(() -> initial_value, d, key)

however, this implementation is still not necessarily faster, since it involves precomputing initial_value (fine if it's just a constant, but bad if it's expensive to compute or creates unnecessary garbage) or invoking a lambda.

the code to support this one pattern isn't exactly short, requires a fair amount of code duplication, and a macro definition is even provided (https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L632-L690)

proposal

define the Ref operation on a Dict to return an intelligent indexer object. the above code pattern could then be written as:

d = Dict()
bp = Ref(d, key)
if isempty(bp)
  bp[] = initial_value
end
value = bp[]
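
for comparison, get! itself could then be layered on top of this indexer. a minimal sketch using only the interface proposed above (the helper name get_via_ref! is just illustrative):

function get_via_ref!(f::Function, d::Dict, key)
  bp = Ref(d, key)      # single hash lookup, cached inside the indexer
  if isempty(bp)
    bp[] = f()          # the default is only computed when the key is absent
  end
  return bp[]
end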

implementation sketch

type RefDict{K, V, D <: Dict} <: Ref{V}
  d::D
  k::K
  last_htindex::Int
  last_state::Int
end
function RefDict{K,V}(d::Dict{K,V}, k)
  key = convert(K, k)
  isequal(key, k) || throw(ArgumentError("k is not a usable key"))
  return RefDict{K,V,typeof(d)}(d, key, 0, 0)   # store the converted key
end
function isempty(r::RefDict)
  if r.last_state != r.d.changecounter   # changecounter: hypothetical modification counter on Dict
    r.last_htindex = ht_keyindex2(r.d, r.k)
    r.last_state = r.d.changecounter
  end
  return r.last_htindex <= 0             # no valid slot found => key is absent
end
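
the accessors themselves aren't spelled out above; a rough continuation of the same sketch (still assuming the hypothetical changecounter field, the internal ht_keyindex2 helper, and Dict's internal vals array):

function getindex(r::RefDict)
  isempty(r) && throw(KeyError(r.k))     # recomputes the slot if the dict changed
  return r.d.vals[r.last_htindex]        # reuse the cached hash-table slot
end
function setindex!(r::RefDict, v)
  if r.last_state == r.d.changecounter && r.last_htindex > 0
    r.d.vals[r.last_htindex] = v         # fast path: cached slot is still valid
  else
    r.d[r.k] = v                         # graceful fallback: ordinary dict assignment
    r.last_state = 0                     # force a re-lookup on the next access
  end
  return v
end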

ref https://github.com/JuliaLang/julia/pull/12035#issuecomment-120614758

yuyichao commented 9 years ago

Is this the one you briefly mentioned on Saturday night during JuliaCon?

carnaval commented 9 years ago

Yep. And yes please (again) !

(btw s/in/haskey/ in your first example)

yuyichao commented 9 years ago

Also, would this still have a benefit over the get! version if we have fast anonymous functions?

vtjnash commented 9 years ago

Is this the one you briefly mentioned on Saturday night during JuliaCon?

yes. i figured i should finally try to see what it would look like

carnaval commented 9 years ago

Also, would this still have a benefit over the get! version if we have fast anonymous functions?

It is more general since you can use the hash table before the insertion. Like

if isempty(bp)
    println("Inserting into ", d)
    bp[] = x
end

We could also still have the lambda version for compactness I guess.

carnaval commented 9 years ago

(well in this example I guess you could capture d in the lambda, but my point is you can carry the ref around for some time before committing to the insertion)

JeffBezanson commented 9 years ago

Good idea!

I think my main concern would be allocating an extra object for each dictionary operation. It could get quite expensive. Also unfortunately the code for the common idiom is not shorter.

yuyichao commented 9 years ago

(well in this example I guess you could capture d in the lambda,

Was just going to say that.

but my point is you can carry the ref around for some time before committing to the insertion)

I see. (Not sure if it is good to bring up, but I guess it should be as useful as a C++ reference...)

JeffBezanson commented 9 years ago

Related: #2108

kmsquire commented 9 years ago

I'll point out that @StephenVavasis implemented a Token interface for SortedDicts in DataStructures.jl, which plays a similar role. This one sounds better, however.

(He even started to use Ref in a similar way, before I discouraged him because it wasn't used yet in Base (sorry, Stephen).)

kmsquire commented 9 years ago

my point is you can carry the ref around for some time before committing to the insertion

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

yuyichao commented 9 years ago

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

I think in the example above, ref[] = value basically falls back to ref.dict[ref.key] = value if it detects that the dictionary has changed, so we should only need to be as careful as with a regular dict assignment?

There might be one additional race possibility but IMHO it shouldn't need more care than the dict itself.

carnaval commented 9 years ago

Yes, Jameson's proposal handles the single-threaded rehash case by graceful performance degradation. Multi-threaded modification of the same data structure will likely be defined as a race condition => undefined behavior. We'll have to implement a family of concurrent data structures. (The Java std lib has state-of-the-art implementations of those things, probably good inspiration.)

StephenVavasis commented 9 years ago

Since Kevin pinged me on this topic, I guess I can add my two cents about tokens and Ref.

(1) The primary purpose of tokens in my code is to provide a mechanism for stepping through a SortedDict (or SortedSet or SortedMultiSet) in the sort-order of the keys. Obviously, this is not an issue for Dict.

(2) A second purpose of a token is to provide a handle to an entry that can be dereferenced in O(1) time. Again, this is not so much an issue for Dict because the hash-lookup is already supposed to be O(1).

(3) Using a Ref object to point to a spot in the structure where an item would go if it would be inserted later seems like a reasonable idea; as mentioned earlier, there is a performance hit because the Ref object is heap-allocated. However, I am thinking of changing my balanced-tree implementation so that the most recent look-up is cached because often the next lookup is nearby. Couldn't Dict do the same with caching?

(4) Going off on a tangent, I have an issue with the equivalence between haskey(d,k) and in(k, keys(d)) ... see my recent posting here:

https://github.com/JuliaLang/DataStructures.jl/issues/103

(5) Finally, with regard to the problem of the awkward if-block versus the need to evaluate 'initial_value' in a get! function, wouldn't it be OK to use a macro and to ask anyone who writes an Associative type to support a macro version of get! that solves this problem?
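
For illustration, a minimal sketch of such a macro (not Base's actual @get! from dict.jl, which is more involved because it reuses the computed hash slot; the macro name here is made up):

macro get_or_insert!(d, key, default)
  quote
    local dict = $(esc(d))
    local k = $(esc(key))
    haskey(dict, k) ? dict[k] : (dict[k] = $(esc(default)))   # default evaluated only on a miss
  end
end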

StefanKarpinski commented 9 years ago

I don't really like the syntax much. I've proposed adding a syntax for get! like d[k] ?= v. However, that begs the question of what we should lower that to. If we lower it to the form that uses a closure, then we haven't solved the performance issue (although I'm not entirely convinced that this matters since it will be fast eventually, right?), so maybe it could lower to something like what you're proposing. There certainly are cases within the implementation of Dict where having a more usable "intelligent index" type would be helpful in implementing higher level operations.

dbeach24 commented 8 years ago

I have an alternate proposed solution to this which I (unwittingly) opened in a separate ticket #18282.

TLDR: What about building a function that handles the "update this value in a dictionary" case, as follows:

update!(f::Function, a::Associative, key, default) = (a[key] = f(get(a, key, default)))

(except that it would be specialized to avoid recomputation of the hash/slot.)

This would allow code like:

update!(x -> x+1, my_counters, key, 0)  # where 0 is the default value for a counter, or
update!(x -> x+value, my_totals, key, 0.0) # where value is accumulated

KristofferC commented 8 years ago

Sounds like https://github.com/JuliaLang/julia/issues/15630

KristofferC commented 8 years ago

See also https://github.com/KristofferC/UpdateIndex.jl

dbeach24 commented 8 years ago

@KristofferC Pardon my ignorance. Do LinearSlow AbstractArrays and Associatives share a common interface / behavior pattern?

EDIT: I think I understand a bit better now from seeing UpdateIndex.jl

KristofferC commented 8 years ago

It is similar since you want to "hold on" to the result of the lookup that has to be made when indexing.

dbeach24 commented 8 years ago

@KristofferC Thanks for sharing these links. FWIW, I agree that Julia would benefit from a way to directly overload the various ?= operators. (Python has this capability, and numpy leverages it heavily. Why would Julia want to skip over this functionality?) However, I suppose that discussion belongs on #15630.

I think the dictionary case differs slightly because, unlike arrays, dictionaries are not guaranteed to have values over any particular range of keys, and there is no default value that can be assumed for missing keys, at least not in the general case.

In your proposal:

updateindex!(A, op, s, i...)

Could these be combined into a single function, f(x) = op(x, s)? By placing this first, we can permit "do" syntax, and this makes our proposals look very similar, indeed.

updateindex!(f, A, i...)
update!(f, dict, key, default)

Not sure how to deal with the fact that arrays need multiple indices and dicts have just one. Can the array indices be grouped into a tuple, or will this create overhead?

eschnett commented 8 years ago

In general, creating the new entry (i.e. what is called default here) might be expensive, or might be a transaction that should only occur when the new value is actually needed. get has a form that accepts a function instead of a default value, and that function is only called when needed. This function is also the first argument. So we can do this:

update!(op, mkdefault, coll, ids...)

where both op and mkdefault are functions, coll is a collection, and ids are indices or keys.
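
A minimal generic fallback for that signature could look like the following (sketch only, single-key Associative case shown; a Dict-specialized method would reuse the hash slot instead of looking the key up twice):

# generic fallback; mkdefault() runs only when the key is missing
function update!(op, mkdefault, coll::Associative, key)
    old = haskey(coll, key) ? coll[key] : mkdefault()
    return coll[key] = op(old)
end

update!(x -> x + 1, () -> 0, my_counters, key)   # e.g. counting occurrences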

dbeach24 commented 8 years ago

@eschnett That's a good point.

  1. Is it confusing to have update! accept two functions?
  2. Which one of them should be available for "do" syntax?
  3. When the default value is not supplied by a function, it typically comes last. Should the two functions really be the first two arguments?

Current proposal would look like this:

update!(init_counter, counters, key) do x
    x + 1
end

What about moving the default function to the end?

update!(counters, key, init_counter) do x
    x + 1
end

Edit: I like the way update!(counters...) reads in this last example. It clearly states which collection is being updated. (But of course that only holds if you use the "do" syntax.)

Edit 2: Could we support both cases where the default is an object or a function? i.e.

update!(f::Function, coll::Associative, key, default)
update!(f::Function, coll::Associative, key, default::Function)

...or is that an abuse of overloading?

martinholters commented 8 years ago

Could we support both cases where the default is an object or a function

For a dict holding functions as values, that might give surprising/undesired behavior.

dbeach24 commented 8 years ago

@martinholters Thank you for speaking up, I was looking for someone who might disagree.

Does that use case compel us to move the default function to another argument position? (arguably less clear and less consistent?)

Of course, if someone was keeping a dict of functions, they could still write:

update!(mydict, key, () -> my_default_value_which_is_a_function) do x
    decorate_function(x)
end

algorithmx commented 3 years ago

See also https://github.com/KristofferC/UpdateIndex.jl

Hi Kristoffer, are there any follow-up improvements on this? I am hitting a performance barrier now. The bottleneck is counting the number of occurrences of various patterns in a "string". I used a Dict{pattern_type,Int} to do the job, and I wish to make it faster. What is the state-of-the-art technique for that? Otherwise I'll just patch my code with this one.
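
For reference, the counting loop in question typically looks like the following (names are illustrative); each iteration hashes the pattern twice, once for the lookup and once for the assignment, which is exactly the overhead the proposals above aim to remove:

counts = Dict{String,Int}()            # stand-in for Dict{pattern_type,Int}
for p in patterns                      # `patterns` is a placeholder iterable
    counts[p] = get(counts, p, 0) + 1  # lookup + insert: two hashes of p
end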