cloudfoundry / cloud_controller_ng

Cloud Foundry Cloud Controller
Apache License 2.0

CC doesn't properly manage nats server failure #403

Closed aalbanes closed 9 years ago

aalbanes commented 9 years ago

It seems to me that CC has the same issue as the one I reported for DEA: https://github.com/cloudfoundry/dea_ng/issues/169

cf-gitbot commented 9 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/99159450.

ships commented 9 years ago

Good afternoon @aalbanes ,

Thanks for raising this issue. We are currently working on phasing NATS out of Cloud Foundry as part of larger architectural changes. A more extensive explanation was posted in the story on our public Tracker project (see below). While this does mean your problem will disappear once these changes have rolled out, it also prevents us from investing time in solving your issue directly.

Best wishes @Quintaminant && @rmasand from CF CAPI

Issues have been observed with NATS servers synchronizing subscriptions and delivery on lossy networks.
It has also been observed that when NATS servers desynchronize, it can take anywhere from seconds to thirty minutes for them to synchronize again.
The most stable NATS deployment is a single server. However, this prevents zero-downtime rolling deploys, so we generally deploy with two NATS servers. Although it can take a long time for NATS servers to synchronize, on a healthy network they normally synchronize on the order of seconds.
The multi-AZ issues for this component are being mitigated by work towards removing it. There is work on a routing-api that will remove the reliance on NATS for component route registration. Diego will remove the reliance on NATS for app management and route registration. Metron metrics will remove the collector's reliance on NATS to provide system metrics.
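
As an illustration of why a second server helps during a failure or rolling deploy, here is a minimal Python sketch of client-side failover across a list of servers. It is not how CC or any NATS client library is implemented: `connect_with_failover`, its parameters, and the injected `connect` callable are all hypothetical names chosen for this example, and the retry and backoff policy is an assumption.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    servers: Sequence[str],
    connect: Callable[[str], T],  # hypothetical: opens a connection or raises ConnectionError
    max_rounds: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each server in order; back off between full rounds of failures."""
    last_error: Optional[Exception] = None
    for round_no in range(max_rounds):
        for url in servers:
            try:
                return connect(url)
            except ConnectionError as exc:
                # Remember the failure and move on to the next server.
                last_error = exc
        # Every server failed in this round: wait before retrying,
        # doubling the delay each round (no wait after the final round).
        if round_no < max_rounds - 1:
            sleep(base_delay * (2 ** round_no))
    raise ConnectionError(
        f"no server reachable after {max_rounds} rounds"
    ) from last_error
```

With two servers in the list, a client using a loop like this keeps working while one server is down, which is the property a single-server deployment cannot offer. It does nothing about the subscription desynchronization between servers described above.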

https://www.pivotaltracker.com/story/show/99159450

Amit-PivotalLabs commented 9 years ago

Hi @aalbanes,

Just a follow-up to what @Quintaminant and @rmasand said: while we are making larger architectural changes (Diego, Routing API, internal service discovery) to phase out NATS, further investigation is needed to understand the current impact of the issue you raised and the cost of solving it, especially insofar as solving it would delay the longer-term solution of phasing NATS out. @fraenkel and @sykesm will investigate further and open new Issues/PRs as appropriate.

This comment is cross-posted here

Thanks, Amit, CF Release Integration team