Documentation clarification request: broadcasting and data corruption

ebelnikola commented 1 week ago

Hi!

First of all, thank you for providing this nice package.

I have some, perhaps naive, questions regarding how broadcasting works. I think it would be nice to have this clarified in the documentation. I've read the documentation, but if I missed something, please let me know.

In the current documentation, it is written that the dictionary obtained through broadcasting is not insertable, and attempts to insert values into it may lead to the corruption of the original dictionary. Could you please clarify what "corruption" means in this context? For example, suppose I defined dictionaries d=Dictionary(1,1) and d2=2.*d. What exactly can happen with d if I start to insert indices into d2? Also, if I don't care about d anymore, will d2 work fine on its own?

Finally, what if I want to mutate d using broadcasting, e.g., by d.*=2. Will it be safe to insert new indices into d after this?

Kind regards, Nikolay

andyferris commented 1 week ago

Hi Nikolay,

Great questions! The issue with broadcast (and map and some constructors) is that, for efficiency, the source and destination dictionaries share exactly the same indices, with no copying.

The dictionary object is essentially like:

struct Indices{I}
    values::Vector{I}
    # other hash lookup stuff
end

struct Dictionary{I, T}
     indices::Indices{I}
     values::Vector{T}
end

One invariant is that the dictionary values vector and the indices values need to be the same length. The same "token" is used as an index to both. In your example, if you insert or delete keys of d2 then d.values and d.indices.values will have different lengths and different sets of valid tokens, hence we say that d is corrupt. Lookups in d might result in reading past the end of the array in d.values, for example, and that uses @inbounds since we have already checked the index (token) is valid in d.indices.

We save a lot of time not copying all the keys and hash information in indices on broadcast and map, where the common case for doing so is in bulk analytical pipelines. In my software I've noticed you tend to have "transactional" components with insertions and deletions of single elements (perhaps interactively), and "analytic" components where you are doing bulk calculations, and generally you don't go back to interactive/transactional workloads with those results (as they are derived data not source data).

Note that d2 .*= 2 is ok because it doesn't change d2.indices (and therefore d.indices) at all.

Note that we have the same issue with views in Base, where various lazy operations assume that the set of valid indices to arrays do not change:

v = ["a", "b", "c"]
v2 = @view v[1:3]
pop!(v);
v2[3] # what does this do?

Does that help? I agree the documentation could spell this out more clearly.

andyferris commented 1 week ago

Also note that you can use copy to copy both the indices and values of a dictionary, so that it does become safe to insert/delete indices.

ebelnikola commented 4 days ago

Thank you very much for your answer! It does clarify what is going on. I have, however, one more question, if you don't mind. You said that the issue may arise, in particular, for broadcasting and for mapping. For mapping, I see the issue clearly, as the following code returns true:

a=Dictionary([1,2,3],[1,2,3])
b=map(x->x^2,a)
b.indices===a.indices # returns true

For broadcasting, however, the analogous code returns false:

a=Dictionary([1,2,3],[1,2,3])
b=a .^2
b.indices===a.indices # returns false

How can broadcasting then lead to the corruption of the initial dictionary?

andyferris commented 3 days ago

Yes, good catch. I actually noticed this the other day - possibly something changed with Julia’s internal broadcast machinery and we need to adjust the code here (so that it behaves like map).

ebelnikola commented 3 days ago

I see, thanks a lot!

andyferris commented 1 day ago

The broadcasting issue is fixed in 0.4.3 - it once again references the existing indices, as intedended. (The issue was I had made an unrelated change to copy).

andyferris / Dictionaries.jl

Documentation clarification request: broadcasting and data corruption #149