datamade / datamade-pysettrie

Set-Tries
GNU Lesser General Public License v3.0
1 stars 0 forks source link

Back with a C(++) trie lib #7

Open fgregg opened 7 years ago

fgregg commented 7 years ago

Right now the slowest step seems to be recursing through the tries. One way to speed that up would be to use a C or C++ trie lib for that, instead of the pure Python trie we make now.

@kmike and @superbobry have a whole mess of python bindings to different trie libs.

Next step would be to investigate them and see if they

  1. Allow mapping where the values can be a sequence of arbitrary python objects
  2. Allow control over how to traverse the tries.
  3. (Nice to have) control over how to construct the tries.
fgregg commented 7 years ago
lib alphabet size custom iteration static
datrie 255 True False
marissa-trie 136755* False True
DAWG 136755* False True
Hat-Trie 136755* False False

*Current size of unicode

Of the libs in https://github.com/pytries/, only datrie provides custom traversal, but datrie only has an alphabet of 255 characters, which is much too small for my purposes.

Since I don't need a dynamic data structure, marissa-trie or DAWG would seem to be the candidates for extending to support our case.

@kmike, is this a fair characterization? If so, which would you recommend marissa-trie or DAWG?

fgregg commented 7 years ago

@thatdatabaseguy explained that the 255 char alphabet for datrie can be hacked around.

The elements of my sets are all integers. If I encode the set like so

encode = lambda x : '-'.join(str(each) for each in sorted(x)) 
s = {1, 10, 100}
trie = datrie.Trie("01234567890-")
trie[encoded(s)] = 'foo'
state = datrie.State(trie)
state.walk(encode({1}))
it = datrie.Iterator(state)
element = ''
while it.next():
    node = it.key()
    if node != '-':
        element += node
    else:
         print(int(element))
         element = ''
kmike commented 7 years ago

The main issue with datrie is that its building speed is in fact O(N^2). There are "tails" which sometimes need to be moved during insertion, and in the worst case (which is not too uncommon) insertion time is linear in the size of the trie.

With marisa-trie arbitrary walking (as in a trie) is not possilbe AFAIK, it is a limitation of a data structure, not just something which is not implemented (https://code.google.com/archive/p/marisa-trie/issues/9).

If static data is OK, DAWG can be a good choice. Custom iteration is not exposed to Python, but nothing prevents it from being implemented, it is straightforward once you understand data format. If you need something custom it could be better to implement the whole thing in Cython, to make it fast; that's the reason custom iteration was low priority, and we were adding more methods to DAWG class itself instead, implementing them in Cython. That said, custom iterator objects like in datrie are a welcome addition to DAWG package. There is also https://github.com/pytries/DAWG-Python package which implements a pure-Python reader for DAWG structure; it can be easier to hack on.

I think walking can be also implemented for HAT-Trie, but it can be a bit less straightforward. HAT-Trie is not really a trie, there are levels of small hash maps which are sorted on demand when needed.

There is no vanilla Trie in pytries :) You can also check trie from BioPython; it is also not vanilla though.

Alphabet size is 255 (or 254) in all of these tries; unicode data is encoded to UTF-8 for all tries other than datrie.

If you're storing binary data null bytes can be a problem, as they have special meaning in some of the trie implementations.

fgregg commented 7 years ago

@kmike, thanks for your comprehensive reply. Very much appreciated.