Closed ebelnikola closed 3 days ago
Hi Nikolay,
Great questions! The issue with broadcast (and map
and some constructors) is that, for efficiency, the source and destination dictionaries share exactly the same indices, with no copying.
The dictionary object is essentially like:
struct Indices{I}
values::Vector{I}
# other hash lookup stuff
end
struct Dictionary{I, T}
indices::Indices{I}
values::Vector{T}
end
One invariant is that the dictionary values
vector and the indices values
need to be the same length. The same "token" is used as an index to both. In your example, if you insert or delete keys of d2
then d.values
and d.indices.values
will have different lengths and different sets of valid tokens, hence we say that d
is corrupt. Lookups in d
might result in reading past the end of the array in d.values
, for example, and that uses @inbounds
since we have already checked the index (token) is valid in d.indices
.
We save a lot of time not copying all the keys and hash information in indices
on broadcast and map, where the common case for doing so is in bulk analytical pipelines. In my software I've noticed you tend to have "transactional" components with insertions and deletions of single elements (perhaps interactively), and "analytic" components where you are doing bulk calculations, and generally you don't go back to interactive/transactional workloads with those results (as they are derived data not source data).
Note that d2 .*= 2
is ok because it doesn't change d2.indices
(and therefore d.indices
) at all.
Note that we have the same issue with views in Base
, where various lazy operations assume that the set of valid indices to arrays do not change:
v = ["a", "b", "c"]
v2 = @view v[1:3]
pop!(v);
v2[3] # what does this do?
Does that help? I agree the documentation could spell this out more clearly.
Also note that you can use copy
to copy both the indices and values of a dictionary, so that it does become safe to insert/delete indices.
Thank you very much for your answer! It does clarify what is going on. I have, however, one more question, if you don't mind. You said that the issue may arise, in particular, for broadcasting and for mapping. For mapping, I see the issue clearly, as the following code returns true:
a=Dictionary([1,2,3],[1,2,3])
b=map(x->x^2,a)
b.indices===a.indices # returns true
For broadcasting, however, the analogous code returns false:
a=Dictionary([1,2,3],[1,2,3])
b=a .^2
b.indices===a.indices # returns false
How can broadcasting then lead to the corruption of the initial dictionary?
Yes, good catch. I actually noticed this the other day - possibly something changed with Julia’s internal broadcast machinery and we need to adjust the code here (so that it behaves like map).
I see, thanks a lot!
The broadcasting issue is fixed in 0.4.3 - it once again references the existing indices, as intedended. (The issue was I had made an unrelated change to copy
).
Hi!
First of all, thank you for providing this nice package.
I have some, perhaps naive, questions regarding how broadcasting works. I think it would be nice to have this clarified in the documentation. I've read the documentation, but if I missed something, please let me know.
In the current documentation, it is written that the dictionary obtained through broadcasting is not insertable, and attempts to insert values into it may lead to the corruption of the original dictionary. Could you please clarify what "corruption" means in this context? For example, suppose I defined dictionaries
d=Dictionary(1,1)
andd2=2.*d
. What exactly can happen withd
if I start to insert indices intod2
? Also, if I don't care aboutd
anymore, willd2
work fine on its own?Finally, what if I want to mutate
d
using broadcasting, e.g., byd.*=2
. Will it be safe to insert new indices intod
after this?Kind regards, Nikolay