andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
282 stars 28 forks source link

Shared indices not properely updated? #68

Closed phipsgabler closed 3 years ago

phipsgabler commented 3 years ago

Awesome package!

I was trying out how sharing of the indices works, and came upon the following weird example:

julia> d1 = Dictionary([:a, :b], 1:2)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2

julia> d2 = d1 .+ 10
2-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12

julia> set!(d2, :c, 20)
3-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12
 :c │ 20

julia> d1
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1[:c]
0

julia> set!(d1, :c, 42)
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1[:c]
42

What's going on there? Is it just printing?

andyferris commented 3 years ago

Thanks @phipsgabler!

The behaviour is unfortunate, but it is expected. What is happening is when you modify a dictionary you add to both the Dictionary's internal value storage Vector as well as modify the underlying Indices. The other dictionary will have an internal value storage vector which has become out of sync with the Indices. Luckily show will check for isassigned but at this point the data structure is corrupted and is unsafe to use. (It is not just the printing).

This works similarly to views of arrays via SubArray. If the indices and parent arrays are modified, they can get out of sync, even potentially resulting in segfaults. In a way, a Dictionary can be thought of a Vector seen through a view with (the tokens of) an Indices, and in the end we suffer the exact same issue as with SubArray. (I'm not sure if that makes sense at all?).

The way to solve these problems is to use more sophisticated tracking of mutable state, such as freeze/unfreeze, copy-on-write, borrow checking, or other techniques that aren't (yet) available in Julia. We could make the system create many more copies by default but then we'd loose most of the performance gains. We could maybe possibly try to implement copy-on-write semantics with the current functionality available in Julia, but I'm worried about the complexity, and I haven't had time to invest into that.

phipsgabler commented 3 years ago

Alright, thanks -- when I think once more about it, it really makes sense that the non-updated dictionary with shared indices is in undefined state because there is nothing at the new key.

That was exactly what I was testing -- having two dictionaries with shared keys, and updating them in parallel, and I naively expected copy-on-write or something consistent.

Can you say a few words about up to which point the behaviour is consistent? I know too little about the internals to answer this myself.

julia> d1 = Dictionary([:a, :b], 1:2)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2

julia> d2 = d1 .+ 10
2-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12

julia> insert!(d2, :c, 20)
3-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12
 :c │ 20

Here I expect d2 to be in a consistent state, and d1 undefined. I now expected that I could "patch" d1 to contain a new value and make the values arrays align again, but it only works half-way:

julia> set!(d1, :c, -20)
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1[:c]
-20

And here I can't even see where the -20 comes from, because

julia> d1.values
2-element Vector{Int64}:
 1
 2

Where is it stored? Can you even patch d1?

(Sorry for nagging, I don't even have a real use case here, it's just to become familiar with the internals. I'm thinking of using a similar design in a data structure of my own.)

andyferris commented 3 years ago

Let me prefix all this by saying - keeping multiple dictionaries in sync is an important problem, and the best way to do that is to use DictTable from TypedTables.jl. This stores multiple dictionaries in different, named columns, and you can add rows to the table safely and easily.

Here I expect d2 to be in a consistent state, and d1 undefined.

Correct.

I now expected that I could "patch" d1 to contain a new value and make the values arrays align again

Not via any officially exposed API (like set!).

Can you even patch d1?

What you could do is push!(d1.values, -20) but that only works because of specifics of how the internals work, and since you are using set! not insert! there's a chance the key already existed and then you'd need to find the token and use setindex! instead of push!. Furthermore, everything could still get out of sync because occassionally rehash! will get called and if there are any "holes" caused by deletion, everything would get out of sync regardless.

it's just to become familiar with the internals

Awesome! I'm happy to explain how it all works.

To support keeping multiple dictionaries in sync (e.g. for DictTable), I have an undocumented, semi-complete API as a part of gettoken! and deletetoken! interface. There is an optional third parameter for gettoken! on Indices that gives a collection of values storage of any corresponding dictionaries, and these are resized here. The Dictionary passes its values vector through here. Afterwards, for insert! (or set!), the abstract interface takes over and fills the value in e.g. here. This way the concrete containers only need implementations at the token level - all the user-facing interface is generic code and does not need to rewritten for each new dictionary (it's actually really easy to prototype new indices and dictionaries!).

In the future it would be good to stabilize the API. There are a wide range of possible dictionary implementations and I don't think we've been exposed to enough variations to lock this in yet, so things might change. In practice, I will be keeping TypedTables.jl in sync with any changes here, so if anyone is looking for the "official" solution please use DictTable.

Does that make sense?

phipsgabler commented 3 years ago

Thanks @andyferris -- yeah, you helped my sense making process a lot! I'll close this now since the original question is resolved.

In case you're interested, what I'm really trying to design is essentially a dictionary with hierarchical keys (namely, Setfield.jl lenses), for usage in Turing.jl (https://github.com/TuringLang/AbstractPPL.jl/blob/5347a71e1d08069ec130d6262b68fa84a5adf80a/varinfo-interface.md#varname-based-axioms).

IMO there is nothing fundamentally complicated there from the perspective of a dictionary interface -- you just allow more keys in get/set than you inserted, namely prefixes and postfixes -- but the rest stays the same. And most of the work would consist of replacing the indices hash table with an appropriate trie-like thing. My interest in sharing comes from the fact that we might want to have multiple such dictionaries, with the hierarchy shared between them.

I'd be happy to hear any insights you might have, if you care :) (You can also DM me on Zulip, or Slack.)