jasonkeene opened 6 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/157208717
The labels on this github issue will be updated when the story is started.
@jasonkeene How do other systems manage this? Some form of a gossip protocol?
@apoydence I was thinking of doing something similar to what the loggregator agent does: resolve a DNS name to discover the addresses of the log-cache nodes.
This would be compatible with a lot of service discovery systems, including kube-dns and bosh-dns.
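As a rough sketch of what that could look like (the hostname below is hypothetical and would depend on how the records are published), the scheduler just resolves a single name and treats each returned IP as a candidate node:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Hypothetical hostname; both kube-dns and bosh-dns can publish a
	// record per instance under a single service name.
	ips, err := net.LookupIP("log-cache.service.internal")
	if err != nil {
		log.Fatalf("failed to resolve log-cache nodes: %s", err)
	}
	for _, ip := range ips {
		// Each IP is a candidate log-cache node the scheduler can talk to.
		fmt.Println(ip.String())
	}
}
```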
Alternatively, you can query for SRV records, which allow the port, weight, transport, and service name to be discovered. kube-dns supports this, but I am not sure about bosh-dns.
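For comparison, an SRV lookup also returns the port and weight for each target. A minimal sketch, assuming the service/proto/name shown here (they are hypothetical and depend on how the records are published):

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Looks up _log-cache._tcp.service.internal.
	_, srvs, err := net.LookupSRV("log-cache", "tcp", "service.internal")
	if err != nil {
		log.Fatalf("SRV lookup failed: %s", err)
	}
	for _, s := range srvs {
		// Each record carries a target, port, weight, and priority.
		fmt.Printf("%s:%d (weight=%d priority=%d)\n", s.Target, s.Port, s.Weight, s.Priority)
	}
}
```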
@jasonkeene This sounds reasonable.
I would want to run more experiments with multiple schedulers after this is completed to see whether there is any thrashing. I suspect each scheduler would be looking at a different subset of log-cache nodes and would be instructing them differently.
That is a possibility. We could use an algorithm that is resistant to sudden changes, so a node that drops out and comes right back is not immediately descheduled. Something like a TTL before a node stops being scheduled to. This would help with thrashing, allow nodes that are truly gone to expire, and still allow new nodes to come online.
This would allow log-cache instances to come and go and the cluster to adapt dynamically to scaling events, outages, etc.
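A rough sketch of the TTL idea, with hypothetical names: the scheduler refreshes a timestamp for every node returned by discovery, keeps scheduling to a node that briefly disappears until its TTL elapses, and drops nodes that stay gone past the TTL.

```go
package scheduler

import "time"

// nodeRegistry tracks when each node was last seen in a discovery lookup.
// Names and structure are illustrative only.
type nodeRegistry struct {
	ttl      time.Duration
	lastSeen map[string]time.Time
}

func newNodeRegistry(ttl time.Duration) *nodeRegistry {
	return &nodeRegistry{ttl: ttl, lastSeen: make(map[string]time.Time)}
}

// Observe refreshes the timestamp for every node returned by discovery.
func (r *nodeRegistry) Observe(addrs []string) {
	now := time.Now()
	for _, a := range addrs {
		r.lastSeen[a] = now
	}
}

// Schedulable returns nodes seen within the TTL. A node that briefly drops
// out of discovery keeps receiving work until the TTL elapses; a node that
// stays gone expires and is removed.
func (r *nodeRegistry) Schedulable() []string {
	now := time.Now()
	var nodes []string
	for addr, seen := range r.lastSeen {
		if now.Sub(seen) > r.ttl {
			delete(r.lastSeen, addr)
			continue
		}
		nodes = append(nodes, addr)
	}
	return nodes
}
```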