codepr / triedb

Key-value store based on a trie data structure to handle the keyspace.
BSD 2-Clause "Simplified" License
12 stars 2 forks source link

Reduce memory footprint of the Trie #2

Open codepr opened 5 years ago

codepr commented 5 years ago

Trie has great advantages in terms of average performances over hashtables or B-Trees giving on worst case O(L) insert, delete and search time complexity where L is the length of the key. This comes at a cost, the main drawback is that the structure itself is really memory hungry. In TriteDB case with an alphabet of size 96, starting from the <space> character and ending with the ~ each node has 96 NULL pointer to their children, this means that on a 64 bit arch machine with 8 bytes per pointer we effectively have 768 bytes of allocated space per node. Let's briefly analyze a case:

So we have the root f which have 1 non-null pointer o, the children have another 1 non-null pointer o, here lies our first value for key foo, and the last children o have 1 non-null pointer for t, which will also store our second value for foot key. So we have a total of 4 nodes, that means 4 * 96 = 384 pointers, of which only 4 are used. Now that's a lot of wasted space! There's some techniques to mitigate this homongous amount of wasting bytes while maintaining good time-complexity performances, called compressed trie and adaptive trie.

codepr commented 5 years ago

Best solutions so far seems two:

Still thinking.

codepr commented 5 years ago

First attempt with cf25fb0bf54f1a04183a5b5950419e7f2f63e05f by simply replacing the fixed length array on each node with a singly-linked linked list, maintained sorted on each insertion, this way there's an average performance of O(n/2) on each search, which is the best case possible with the linked list data structure.

After some tests with little millions of keys of various length between 6 and 30 chars each the performance loss seems negligible, while the reduction of the memory footprint is considerably high. Looking forward for an even better handling, maybe a BTree or something similiar.

codepr commented 5 years ago

As a middle step, it could be interesting to replace the linked list with a vector, in order to leverage spatial locality and decrease cache miss/CPU cycles per operations, contiguous memory is much more performant.

As a basic idea each node should have a char array or a bitmap, maybe an array of 4 unsigned long to track the indexes inside the vector for each child pointer.