eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0
1.62k stars 383 forks source link

Implement a prefix tree for fast string search #859

Open MatthiasKillat opened 3 years ago

MatthiasKillat commented 3 years ago

Brief feature description

Prefix Tree (Trie) Allows fast searching of strings, to be used in Service Registry (but can generally used everywhere where data is indexed by strings). String search is then performed in time proportional to the string to be searched.

General Constraints

Operations

  1. Insert key, value O(#key + #values to key)
  2. Find all values to a key O(#key + #values to key)
  3. Remove all values to a key O(#key + #values removed)
  4. Remove a specific value to a key O(#key + #values to key)

Detailed information

elfenpiff commented 3 years ago

@MatthiasKillat why not implement a cxx::map which is more generic and can be used in more places in iceoryx?

MatthiasKillat commented 3 years ago

@elfenpiff I will do that, but a good map is more complicated and does not sufficiently solve the problems ahead. A trie fits the problem better and apart from static memory and some interface convenience I already have it implemented (still a way to go, but doable).

The prefix tree is actually very useful also for introspection (where we also currently use a map) and other applications, I would say. It is useful whenever we map strings (theoretically any kind of array with equality-comparable symbols like characters, but I will not generalize this for now) to something else, like ids or any object that describes the data to the key. It locates this data much faster than a map (in theory and I will measure it also in practice against std::map).

As for use cases, there are many more than we may think. In ROS or other communication standards (AUTOSAR), user ids are often defined as strings on user level and this cannot be changed (without changing the standard). Working with those ids internally is problematic from a storage and lookup point of view. This is even more so in the memory static case, where everything has to be pre-allocated for some worst-case. I remember how this was in some previous internal project with nested vectors, until I changed this with some shared pooling, still not very good. I want to avoid this, but static memory is a headache, however you turn it.

But you can use a trie as some kind of very fast string lookup mechanism, which maps to internal ids (such as system-wide unique numbers) or the data itself. It can be used to build a system-wide entity-dictionary for fast lookup like this. The service registry is exactly this, but only for services. It also comes with prefix search for free, unlike maps (find all entities that begin with a certain prefix).

I think in the future we must seriously think about organizing the data in Roudi in a topic/service-centric way, with those having ports attached (not the other way around, essentially as it is now). This allows us to internally find specific services and ports associated with them very fast (say to delete them, or add a publisher port to a topic). To do so, the service registry must be efficient. The reason for this is mainly that it can help clean-up inconsistencies in the multi-publisher case (dealing with the history and others) and also is more in line with the entity management in other systems, like CycloneDDS.

Currently it only has one drawback in our memory static environment: it has to reserve a relatively large amount of memory (but it would be worse with nested maps from my calculation, depending on how the capacities are chosen). Some of this can also be done by a hash-map (not a tree-based map), but this requires a good hash function, also has considerable memory-consumption and has other problems (and is not faster in general).

From this I concluded that whenever we want to build some dictionary from strings, this is the way to go. If we want to map more general keys with an ordering, an ordered map is required and I will build this afterwards (and a set, as a by-product). This is more difficult though, unless we use something entirely new (unlikely), a balanced tree like AVL or red-black tree is needed. If we do not care for ordering and have non-string keys, a hashmap is the way to go (I am unfamiliar with how to choose hash-functions though). Curiously enough, a trie can be improved by using a hashmap internally, but this is only optimization.