Reduce memory footprint of the Trie

codepr commented 5 years ago

Trie has great advantages in terms of average performances over hashtables or B-Trees giving on worst case O(L) insert, delete and search time complexity where L is the length of the key. This comes at a cost, the main drawback is that the structure itself is really memory hungry. In TriteDB case with an alphabet of size 96, starting from the <space> character and ending with the ~ each node has 96 NULL pointer to their children, this means that on a 64 bit arch machine with 8 bytes per pointer we effectively have 768 bytes of allocated space per node. Let's briefly analyze a case:

insert key foo
insert key foot

So we have the root f which have 1 non-null pointer o, the children have another 1 non-null pointer o, here lies our first value for key foo, and the last children o have 1 non-null pointer for t, which will also store our second value for foot key. So we have a total of 4 nodes, that means 4 * 96 = 384 pointers, of which only 4 are used. Now that's a lot of wasted space! There's some techniques to mitigate this homongous amount of wasting bytes while maintaining good time-complexity performances, called compressed trie and adaptive trie.

codepr commented 5 years ago

Best solutions so far seems two:

Use a single dynamic array (vector) in the Trie structure, each node must have a pointer to that vector and an array char children_idx[ALPHABET_SIZE] which store the index in the main vector for each children;
Use sized node based on the number of children and adapting lookup algorithm accordingly e.g. with # children <= 4 use a fixed length array of 4 pointers and linear search for children, growing up set fixed steps of size and use a different mapping for characters on the array of children pointers.

Still thinking.

codepr commented 5 years ago

First attempt with cf25fb0bf54f1a04183a5b5950419e7f2f63e05f by simply replacing the fixed length array on each node with a singly-linked linked list, maintained sorted on each insertion, this way there's an average performance of O(n/2) on each search, which is the best case possible with the linked list data structure.

After some tests with little millions of keys of various length between 6 and 30 chars each the performance loss seems negligible, while the reduction of the memory footprint is considerably high. Looking forward for an even better handling, maybe a BTree or something similiar.

codepr commented 5 years ago

As a middle step, it could be interesting to replace the linked list with a vector, in order to leverage spatial locality and decrease cache miss/CPU cycles per operations, contiguous memory is much more performant.

As a basic idea each node should have a char array or a bitmap, maybe an array of 4 unsigned long to track the indexes inside the vector for each child pointer.

codepr / triedb

Reduce memory footprint of the Trie #2