andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
278 stars 28 forks source link

iterate over hashes #120

Open maartensandbox opened 1 year ago

maartensandbox commented 1 year ago

I have two dictionaries. I iterate over the keyvalue pairs of the first one, and if the second dict also contains that key then I assign it a new value.

It would be nice if I was able (maybe I already am?) to iterate over the (key,hash,value) pairs, and have a faster lookup in the second dict that avoids rehashing the key.

andyferris commented 1 year ago

Hmm, yes that is an interesting use case. The "inner join" and "left join" patterns are pretty common.

Note this is getting very low level. Maybe something like the "tokens" dictionary defined for hash-based dictionaries, that iterates (token, hash), could be helpful, together with an extension of gettoken where you provide the key and precomputed hash. For example, if keyword arguments are fast we enough we could try for a signature like gettoken(key; hash), or else create a new generic functions.

We also overwrite and utilize a bit or two of the hash for internal purposes, so we need to take care with that (ideally this would work between Dictionary and UnorderedDictionary without problems).

Do you see any other nice ways of implementing this?

maartensandbox commented 1 year ago

One could also add a specialized gettoken that dispatches on Pair{I,Uint} for Indices{I}? It will anyway probably only be something implemented for Dictionary, as those are the only indices that currently store the hash. Although one could probably do something similar for unordereddictionary, the index reveals the first n bits of the hash, and a lookup in a different unordereddictionary can then sometimes avoid having to recompute the hash, but I don't think it's worth the hassle.

The internal bits can be somewhat confusing when iterating (token,hash), and then not actually having hash(token) == hash, but I don't think that's a real problem. When users start supplying their own hash to get faster lookups, then they probably will also have read the documentation, and the lookup function will anyway "&hashmask" whatever gets passed in.