bbcarchdev / anansi

A Linked Open Data Web crawler
https://bbcarchdev.github.io/anansi/
Apache License 2.0

Enable use of etcd for on-the-fly rebalancing #9

Closed by nevali 9 years ago

nevali commented 9 years ago

Currently, a cluster must be reconfigured and restarted before adding or removing nodes takes effect. In a configuration-managed environment, this results in very long lead times for scaling events.

Anansi should be able to make use of etcd, a distributed key-value store, in order to dynamically reconfigure and rebalance itself when nodes are added and removed.

Support for this (in clusters of up to 256 nodes, via tinyhash) was added in b470d1ee.

nevali commented 9 years ago

A cluster:name configuration setting should be added, which will be used consistently across all members of the crawl cluster.

A directory named ${cluster:name} will be created in the registry service.

Keys within the directory will be generated by each instance. The key name will be a UUID generated at process start, and the key value will be the number of threads in that instance.

Each key will have a TTL, and each crawler instance will refresh its own key frequently (e.g., every 60 seconds), so that a key persists only while its instance is still running.
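As a rough illustration of the registration and refresh scheme described above, here is a minimal Python sketch. It does not talk to etcd: InMemoryRegistry is a hypothetical stand-in for the registry service, and the cluster name, thread count and the choice of a TTL three times the refresh interval are all assumptions made for the example.

```python
import uuid


class InMemoryRegistry:
    """Hypothetical stand-in for the registry service (not an etcd client)."""

    def __init__(self):
        # directory name -> {key: (value, expiry time in seconds)}
        self._dirs = {}

    def put(self, directory, key, value, ttl, now):
        """Create or refresh a key; it expires `ttl` seconds after `now`."""
        self._dirs.setdefault(directory, {})[key] = (value, now + ttl)

    def list(self, directory, now):
        """Return the unexpired {key: value} entries of a directory."""
        entries = self._dirs.get(directory, {})
        return {k: v for k, (v, expiry) in entries.items() if expiry > now}


REFRESH_INTERVAL = 60            # seconds, as in the example above
KEY_TTL = 3 * REFRESH_INTERVAL   # assumed: tolerate a couple of missed refreshes

registry = InMemoryRegistry()
cluster_name = "example-crawl"   # hypothetical cluster:name value
instance_id = str(uuid.uuid4())  # generated once, at process start
thread_count = 4                 # hypothetical number of crawler threads

# Register at start-up (t = 0), then refresh once (t = 60).
registry.put(cluster_name, instance_id, thread_count, KEY_TTL, now=0)
registry.put(cluster_name, instance_id, thread_count, KEY_TTL, now=REFRESH_INTERVAL)
```

If the instance stops refreshing, its entry disappears from the directory listing once the TTL has elapsed, which is what would signal the remaining nodes to rebalance.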

When the directory changes (either because a new entry has been added, or the TTL expires on an entry), each node in the cluster will fetch the directory contents.
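To make the two kinds of change concrete, the following Python sketch compares two snapshots of the directory and reports which keys appeared and which vanished. The function name and the short key names are hypothetical; a real implementation would presumably be driven by change notifications from the registry service rather than by comparing snapshots.

```python
def diff_membership(previous, current):
    """Compare two {key: thread_count} snapshots of the cluster directory.

    Returns (added, removed): keys that are new in `current`, and keys
    that were in `previous` but are gone (for example, TTL expired).
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed


# Hypothetical snapshots: node-b's key expired and node-c joined.
before = {"node-a": 4, "node-b": 2}
after = {"node-a": 4, "node-c": 3}

added, removed = diff_membership(before, after)
# added == ["node-c"], removed == ["node-b"]
```

Either kind of change would trigger the same response: fetch the full directory again and recompute the instance's position in it.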

The list of keys will be sorted lexicographically. The instance will loop through the keys sequentially, adding the thread count (the key value) to an accumulator, until it encounters the key whose name is its own UUID. At that point the accumulator holds the total thread count of all preceding instances, and that value is the base ID used by the instance: the first crawler thread takes the base ID, and each subsequent thread takes the next ID in sequence.
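The base ID calculation can be sketched in Python as follows. This is only an illustration of the procedure described above, not the code from the commit: the function names are hypothetical, the directory is passed in as a plain dict, the UUIDs are made up, and thread counts are assumed to arrive as strings and are converted with int().

```python
def base_id_for(instance_id, entries):
    """Return the base ID for `instance_id`.

    `entries` maps key name (an instance UUID) to its thread count.
    Keys are visited in lexicographic order, accumulating thread
    counts until the instance's own key is reached.
    """
    accumulator = 0
    for key in sorted(entries):
        if key == instance_id:
            return accumulator
        accumulator += int(entries[key])
    raise KeyError(f"instance {instance_id} is not in the directory")


def thread_ids_for(instance_id, entries):
    """Return the IDs assigned to each crawler thread of the instance."""
    base = base_id_for(instance_id, entries)
    return list(range(base, base + int(entries[instance_id])))


# Hypothetical directory contents: UUID -> thread count.
directory = {
    "6fa459ea-ee8a-4ca4-894e-db77e160355e": "2",
    "1b4e28ba-2fa1-41d2-883f-0016d3cca427": "4",
    "e902893a-9d22-4c7e-a7b8-d6e313b71d9f": "3",
}

print(base_id_for("6fa459ea-ee8a-4ca4-894e-db77e160355e", directory))     # 4
print(thread_ids_for("6fa459ea-ee8a-4ca4-894e-db77e160355e", directory))  # [4, 5]
```

Because every node sorts the same list in the same way, each one arrives at a distinct, non-overlapping range of thread IDs without any further coordination.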