Closed by nevali 9 years ago
A cluster:name configuration setting should be added, which will be used consistently across all members of the crawl cluster.
A directory named ${cluster:name} will be created in the registry service.
Keys within the directory will be generated by each instance. The key name will be a UUID generated at process start, and the key value will be the number of threads in that instance.
Each key will carry a TTL, and each crawler instance will refresh its own key frequently (e.g., every 60 seconds) so that the key only expires once the instance has stopped.
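The registration and refresh behaviour described so far can be sketched roughly as follows. This is only an illustration: `InMemoryRegistry` is a hypothetical stand-in for the registry service (it is not the etcd API), the clock is passed in explicitly so the example is deterministic, and the TTL of three refresh intervals and the thread count of 8 are assumptions rather than anything specified here.

```python
import uuid


class InMemoryRegistry:
    """Hypothetical stand-in for the registry service; not the etcd API."""

    def __init__(self):
        self._entries = {}  # key -> (value, expiry time in seconds)

    def put(self, key, value, ttl, now):
        # Writing a key (again) pushes its expiry out to now + ttl.
        self._entries[key] = (value, now + ttl)

    def live_entries(self, now):
        # Only keys whose TTL has not yet elapsed are visible.
        return {k: v for k, (v, expiry) in self._entries.items() if expiry > now}


REFRESH_INTERVAL = 60            # seconds, matching the example interval
KEY_TTL = 3 * REFRESH_INTERVAL   # assumption: tolerate two missed refreshes

registry = InMemoryRegistry()
instance_id = str(uuid.uuid4())  # generated once, at process start
thread_count = 8                 # assumption: threads in this instance

# Initial registration, then one refresh a minute later.
registry.put(instance_id, thread_count, KEY_TTL, now=0)
registry.put(instance_id, thread_count, KEY_TTL, now=REFRESH_INTERVAL)

# The last refresh at t=60 keeps the key alive until t=240.
alive_at_200 = registry.live_entries(now=200)
alive_at_300 = registry.live_entries(now=300)
```

If the instance stops refreshing, its key disappears from the directory once the TTL elapses, which is what lets the remaining nodes notice the departure.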
When the directory changes (either because a new entry has been added, or the TTL expires on an entry), each node in the cluster will fetch the directory contents.
The list of keys will be sorted lexicographically. The instance will loop through the keys sequentially, adding the thread count (the key value) to an accumulator, until the key whose name is the UUID of the instance is encountered. When it's found, the value of the accumulator is the base ID used by the instance (with each crawler thread incrementing that ID sequentially).
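As a rough sketch of the base ID calculation just described, assuming the directory contents have already been fetched into a mapping of key name to thread count: the function name `base_id` and the shortened key names are illustrative only (real keys would be full UUIDs).

```python
def base_id(entries, own_uuid):
    """Return the base thread ID for the instance registered as own_uuid.

    entries maps each instance's key name to its thread count.
    """
    accumulator = 0
    # sorted() orders strings by code point, which is lexicographic
    # order for the ASCII characters used in UUIDs.
    for key in sorted(entries):
        if key == own_uuid:
            return accumulator
        accumulator += entries[key]
    raise KeyError(f"instance {own_uuid!r} is not registered in the directory")


# Hypothetical directory contents: three instances with 2, 8, and 4 threads.
directory = {"c3": 4, "a1": 2, "b2": 8}

base_a = base_id(directory, "a1")  # first in sorted order
base_b = base_id(directory, "b2")  # after a1's 2 threads
base_c = base_id(directory, "c3")  # after a1's 2 and b2's 8 threads

# Each thread in an instance takes the next ID after the base.
thread_ids_b = [base_b + i for i in range(directory["b2"])]
```

Because every node sorts the same key list in the same way, each node arrives at non-overlapping ID ranges without any further coordination, provided all nodes see the same directory contents.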
Currently, a cluster must be reconfigured and restarted before adding or removing nodes has the desired effect. In a configuration-managed environment, this results in very long lead times for scaling events.
Anansi should be able to make use of etcd, a distributed key-value store, in order to dynamically reconfigure and rebalance itself when nodes are added and removed.
Work to support this (in clusters of up to 256 nodes, via tinyhash) was added in b470d1ee