Open codepr opened 5 years ago
Best solution so far seems to be a `char children_idx[ALPHABET_SIZE]` which stores, for each child, its index in the main vector. Still thinking.
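A minimal sketch of how that index table could look (my own naming, not from the codebase; `NO_CHILD` is an assumed sentinel value):

```c
#include <stddef.h>

#define ALPHABET_SIZE 96
#define NO_CHILD ((char) -1)

/* Hypothetical node layout: instead of 96 child pointers (768 bytes
 * on 64-bit), a 96-byte index table maps each character to the
 * position of its child in a small, packed `children` vector. */
struct idx_node {
    void *data;                       /* value stored at this node, if any */
    char children_idx[ALPHABET_SIZE]; /* slot in `children`, or NO_CHILD */
    struct idx_node **children;       /* packed, grows as children are added */
    int nchildren;
};

static void idx_node_init(struct idx_node *n) {
    n->data = NULL;
    n->children = NULL;
    n->nchildren = 0;
    for (int i = 0; i < ALPHABET_SIZE; i++)
        n->children_idx[i] = NO_CHILD;
}

/* O(1) child lookup; `c` is assumed to be in the [' ', '~'] range. */
static struct idx_node *idx_child_of(const struct idx_node *n, char c) {
    char slot = n->children_idx[c - ' '];
    return slot == NO_CHILD ? NULL : n->children[(int) slot];
}
```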
First attempt in cf25fb0bf54f1a04183a5b5950419e7f2f63e05f: simply replace the fixed-length array on each node with a singly-linked list, kept sorted on each insertion. This way a search costs O(n/2) on average, which is the best case possible with the linked-list data structure.
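The sorted-list layout might look roughly like this (a sketch, not the actual commit's code); since children are sorted by character, a search can bail out as soon as it passes the slot where the character would be:

```c
#include <stddef.h>
#include <stdlib.h>

struct child;

/* Hypothetical node: children as a sorted singly-linked list
 * instead of a 96-slot pointer array. */
struct list_node {
    void *data;
    struct child *children;  /* sorted by `chr`, ascending */
};

struct child {
    char chr;
    struct list_node *node;
    struct child *next;
};

/* Stops as soon as it passes where `c` would be, so on average
 * only half the children are visited. */
static struct list_node *list_child_of(const struct list_node *n, char c) {
    for (struct child *it = n->children; it && it->chr <= c; it = it->next)
        if (it->chr == c)
            return it->node;
    return NULL;
}

/* Insertion keeps the list sorted. */
static void list_add_child(struct list_node *n, char c, struct list_node *child) {
    struct child **pp = &n->children;
    while (*pp && (*pp)->chr < c)
        pp = &(*pp)->next;
    struct child *nw = malloc(sizeof(*nw));
    nw->chr = c;
    nw->node = child;
    nw->next = *pp;
    *pp = nw;
}
```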
After some tests with a few million keys of various lengths, between 6 and 30 chars each, the performance loss seems negligible, while the reduction of the memory footprint is considerable.
Looking forward to an even better solution, maybe a B-Tree or something similar.
As a middle step, it could be interesting to replace the linked list with a vector, in order to leverage spatial locality and decrease cache misses/CPU cycles per operation: contiguous memory is much more performant.
As a basic idea, each node should have a `char` array or a bitmap, maybe an array of 4 `unsigned long`, to track the indexes inside the vector for each child pointer.
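One way the bitmap variant could work (a sketch under my own naming, assuming a 96-symbol alphabet starting at `' '` and the GCC/Clang `__builtin_popcountl` builtin): a child's slot in the packed vector equals the number of set bits before its character, so the vector only holds children that actually exist:

```c
#include <stddef.h>

#define ALPHABET_SIZE 96
#define WORD_BITS (sizeof(unsigned long) * 8)

/* Hypothetical node: the bitmap marks which children exist (4 words
 * cover 96 bits even on 32-bit platforms), and `children` stores
 * only those children, packed together. */
struct bm_node {
    void *data;
    unsigned long bitmap[4];
    struct bm_node **children;  /* one entry per set bit */
};

/* Slot of character `c`'s child inside the packed vector: the
 * popcount of all bitmap bits before `c`'s bit. */
static int bm_slot(const struct bm_node *n, char c) {
    unsigned pos = (unsigned) (c - ' ');
    int slot = 0;
    for (unsigned w = 0; w < pos / WORD_BITS; w++)
        slot += __builtin_popcountl(n->bitmap[w]);
    unsigned long mask = (1UL << (pos % WORD_BITS)) - 1;
    slot += __builtin_popcountl(n->bitmap[pos / WORD_BITS] & mask);
    return slot;
}

static struct bm_node *bm_child_of(const struct bm_node *n, char c) {
    unsigned pos = (unsigned) (c - ' ');
    if (!(n->bitmap[pos / WORD_BITS] & (1UL << (pos % WORD_BITS))))
        return NULL;
    return n->children[bm_slot(n, c)];
}
```

With 4 words of bitmap plus a packed vector, a node with k children costs 32 bytes of bitmap plus 8k bytes of pointers instead of a flat 768.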
A trie has great advantages in average performance over hashtables or B-Trees, giving worst-case O(L) insert, delete and search time complexity, where L is the length of the key. This comes at a cost: the main drawback is that the structure itself is really memory hungry. In TriteDB's case, with an alphabet of size 96, starting from the `<space>` character and ending with `~`, each node has 96 NULL pointers to its children; this means that on a 64-bit arch machine, with 8 bytes per pointer, we effectively have 768 bytes of allocated space per node.

Let's briefly analyze a case: we have the root `f`, which has 1 non-null pointer `o`; that child has another 1 non-null pointer `o`, where our first value for key `foo` lies; and the last child `o` has 1 non-null pointer for `t`, which will also store our second value for the key `foot`. So we have a total of 4 nodes, which means 4 * 96 = 384 pointers, of which only 4 are used. Now that's a lot of wasted space! There are some techniques to mitigate this humongous amount of wasted bytes while maintaining good time-complexity performance, called compressed tries and adaptive tries.
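For reference, the naive layout criticized above can be spelled out explicitly (names here are illustrative, not from the codebase); the 768-byte figure is just `96 * sizeof(void *)` on a 64-bit machine:

```c
#include <stddef.h>

#define ALPHABET_SIZE 96

/* Naive node layout from the analysis above: a full pointer table
 * per node, regardless of how many children actually exist. */
struct naive_node {
    struct naive_node *children[ALPHABET_SIZE];
    void *data;
};

/* Bytes taken by the child-pointer table alone. */
static size_t naive_table_bytes(void) {
    return ALPHABET_SIZE * sizeof(struct naive_node *);
}
```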