JuliaText / CorpusLoaders.jl

A variety of loaders for various NLP corpora.

String interning idea #4

Closed oxinabox closed 6 years ago

oxinabox commented 6 years ago

String interning would be very useful to decrease the memory used to hold tokens. One existing option is PooledStrings. Another is to make a string type backed by Symbols.

PooledStrings are probably better, but I haven't checked that they still work with Julia 0.6. Newer, maintained packages, e.g. http://juliadata.github.io/CategoricalArrays.jl/stable/, want your pooled data type to definitely be in an array, which we don't want. Also, IIRC, garbage collecting any PooledString requires garbage collecting the whole pool, i.e. it ain't going to happen.

So here is the proposal, for what I will call InternedStrings. (Weakly interned strings might be a better name.)

There exists a global pool (per type, so if only considering pooled strings then just one). That global pool holds a collection of weak references to every string. (I think maybe a weak-keyed dictionary to strong refs for the actual values?)

When constructing an InternedString, the pool is checked to see if it already has one with this value. If it doesn't, then it is added. The InternedString contains a strong reference to the data that is in the pool.

So once all InternedStrings for a particular string have been garbage collected, the copy (that is to say, the reference) in the pool is also garbage collected.

This gives us interning, but without giving up garbage collection, i.e. no leaking memory like crazy.

oxinabox commented 6 years ago

Could call it StrongRefStrings, because it would be the opposite of https://github.com/quinnj/WeakRefStrings.jl/blob/master/src/WeakRefStrings.jl

But also very similar, since both allow the creation of strings without copying.

But in the interned case each "copy" is a strong reference, and it is only the backing pool that is weak. Whereas in WeakRefStrings each "copy" is weak, and the backing store (which is the original string, and normally not a pool at all) is strong.

oxinabox commented 6 years ago

Started work, see: https://github.com/oxinabox/StringInterning.jl

oxinabox commented 6 years ago

Done