Open fgregg opened 7 years ago
lib | alphabet size | custom iteration | static |
---|---|---|---|
datrie |
255 | True | False |
marissa-trie |
136755* | False | True |
DAWG |
136755* | False | True |
Hat-Trie |
136755* | False | False |
*Current size of unicode
Of the libs in https://github.com/pytries/, only datrie
provides custom traversal, but datrie
only has an alphabet of 255 characters, which is much too small for my purposes.
Since I don't need a dynamic data structure, marissa-trie
or DAWG
would seem to be the candidates for extending to support our case.
@kmike, is this a fair characterization? If so, which would you recommend marissa-trie
or DAWG
?
@thatdatabaseguy explained that the 255 char alphabet for datrie
can be hacked around.
The elements of my sets are all integers. If I encode the set like so
encode = lambda x : '-'.join(str(each) for each in sorted(x))
s = {1, 10, 100}
trie = datrie.Trie("01234567890-")
trie[encoded(s)] = 'foo'
state = datrie.State(trie)
state.walk(encode({1}))
it = datrie.Iterator(state)
element = ''
while it.next():
node = it.key()
if node != '-':
element += node
else:
print(int(element))
element = ''
The main issue with datrie is that its building speed is in fact O(N^2). There are "tails" which sometimes need to be moved during insertion, and in the worst case (which is not too uncommon) insertion time is linear in the size of the trie.
With marisa-trie arbitrary walking (as in a trie) is not possilbe AFAIK, it is a limitation of a data structure, not just something which is not implemented (https://code.google.com/archive/p/marisa-trie/issues/9).
If static data is OK, DAWG can be a good choice. Custom iteration is not exposed to Python, but nothing prevents it from being implemented, it is straightforward once you understand data format. If you need something custom it could be better to implement the whole thing in Cython, to make it fast; that's the reason custom iteration was low priority, and we were adding more methods to DAWG class itself instead, implementing them in Cython. That said, custom iterator objects like in datrie are a welcome addition to DAWG package. There is also https://github.com/pytries/DAWG-Python package which implements a pure-Python reader for DAWG structure; it can be easier to hack on.
I think walking can be also implemented for HAT-Trie, but it can be a bit less straightforward. HAT-Trie is not really a trie, there are levels of small hash maps which are sorted on demand when needed.
There is no vanilla Trie in pytries :) You can also check trie from BioPython; it is also not vanilla though.
Alphabet size is 255 (or 254) in all of these tries; unicode data is encoded to UTF-8 for all tries other than datrie.
If you're storing binary data null bytes can be a problem, as they have special meaning in some of the trie implementations.
@kmike, thanks for your comprehensive reply. Very much appreciated.
Right now the slowest step seems to be recursing through the tries. One way to speed that up would be to use a C or C++ trie lib for that, instead of the pure Python trie we make now.
@kmike and @superbobry have a whole mess of python bindings to different trie libs.
Next step would be to investigate them and see if they