Implement a prefix tree for fast string search

MatthiasKillat commented 3 years ago

Brief feature description

Prefix Tree (Trie) Allows fast searching of strings, to be used in Service Registry (but can generally used everywhere where data is indexed by strings). String search is then performed in time proportional to the string to be searched.

General Constraints

relocatable - we want to use it in shared memory
not thread safe - too much effort / runtime cost
keys are always strings
values are generic
multiple values for a key (hence it is not really a unique key, but the user is free to ensure keys are unique)
self contained memory, static memory

Operations

Insert key, value O(#key + #values to key)
Find all values to a key O(#key + #values to key)
Remove all values to a key O(#key + #values removed)
Remove a specific value to a key O(#key + #values to key)

# denotes the size of the specific quantity, i.e. #key the length of the key and #values to key the number of values corresponding to a key (some abuse of notation for simplicity)
generally the effort is proportional to the length of the key plus the size of the result set (which is theoretically optimal)
this is a substantial improvement compared to (multi)maps with O(log(#elements in map)) where each key comparison is also a string comparison of cost O(#key)

Detailed information

string type for API? - probably template parameterized cxx::string, internally we should use char* for efficiency
uses tree tree structure internally for fast search
tree nodes must be available to support insert operations up to specified capacity (this leads to wasted memory but is inevitable for static memory setup)
tree nodes must be as compact as possible to reduce memory overhead

elfenpiff commented 3 years ago

@MatthiasKillat why not implement a cxx::map which is more generic and can be used in more places in iceoryx?

MatthiasKillat commented 3 years ago

@elfenpiff I will do that, but a good map is more complicated and does not sufficiently solve the problems ahead. A trie fits the problem better and apart from static memory and some interface convenience I already have it implemented (still a way to go, but doable).

The prefix tree is actually very useful also for introspection (where we also currently use a map) and other applications, I would say. It is useful whenever we map strings (theoretically any kind of array with equality-comparable symbols like characters, but I will not generalize this for now) to something else, like ids or any object that describes the data to the key. It locates this data much faster than a map (in theory and I will measure it also in practice against std::map).

As for use cases, there are many more than we may think. In ROS or other communication standards (AUTOSAR), user ids are often defined as strings on user level and this cannot be changed (without changing the standard). Working with those ids internally is problematic from a storage and lookup point of view. This is even more so in the memory static case, where everything has to be pre-allocated for some worst-case. I remember how this was in some previous internal project with nested vectors, until I changed this with some shared pooling, still not very good. I want to avoid this, but static memory is a headache, however you turn it.

But you can use a trie as some kind of very fast string lookup mechanism, which maps to internal ids (such as system-wide unique numbers) or the data itself. It can be used to build a system-wide entity-dictionary for fast lookup like this. The service registry is exactly this, but only for services. It also comes with prefix search for free, unlike maps (find all entities that begin with a certain prefix).

I think in the future we must seriously think about organizing the data in Roudi in a topic/service-centric way, with those having ports attached (not the other way around, essentially as it is now). This allows us to internally find specific services and ports associated with them very fast (say to delete them, or add a publisher port to a topic). To do so, the service registry must be efficient. The reason for this is mainly that it can help clean-up inconsistencies in the multi-publisher case (dealing with the history and others) and also is more in line with the entity management in other systems, like CycloneDDS.

Currently it only has one drawback in our memory static environment: it has to reserve a relatively large amount of memory (but it would be worse with nested maps from my calculation, depending on how the capacities are chosen). Some of this can also be done by a hash-map (not a tree-based map), but this requires a good hash function, also has considerable memory-consumption and has other problems (and is not faster in general).

From this I concluded that whenever we want to build some dictionary from strings, this is the way to go. If we want to map more general keys with an ordering, an ordered map is required and I will build this afterwards (and a set, as a by-product). This is more difficult though, unless we use something entirely new (unlikely), a balanced tree like AVL or red-black tree is needed. If we do not care for ordering and have non-string keys, a hashmap is the way to go (I am unfamiliar with how to choose hash-functions though). Curiously enough, a trie can be improved by using a hashmap internally, but this is only optimization.

eclipse-iceoryx / iceoryx