chengchingwen / DoubleArrayTries.jl

MIT License
4 stars 0 forks source link

Question #1

Open PallHaraldsson opened 2 years ago

PallHaraldsson commented 2 years ago

there are many real examples that the size of string dictionaries becomes critical problems for very large datasets

This seems very intriguing, but would it also be helpful for small[er] string dictionaries (e.g. for Julia Base and the Julia compiler)?

chengchingwen commented 2 years ago

It depends on what operation we need for a string dictionaries. As their paper mentioned, the goal of the xcdat implementation is to support both string-to-unique-index and index-to-string lookup with acceptable memory-speed trade-off, and you also get those operation supported by a Trie (like the common prefix). FWIW it is already faster than the normal DataStructures.Trie with small dataset.

As for string dictionaries in Julia Base/Compiler, I'm not sure what we would need there. If the dataset is really small, DAT would probably allocate more than the Base.Dict and a single hash could still be faster. Another issue I can think of is this implementation does not support insertion/deletion, so you need the full dataset to construct a DAT and the use case might be limited.