cloudfoundry-attic / etcd-release

Apache License 2.0

Feedback only: Infrastructure team common issues w/ etcd? #24

Closed markstgodard closed 8 years ago

markstgodard commented 8 years ago

Hi @cloudfoundry-incubator/cf-infrastructure team

This is Mark from the CF Container Networking team and (as you may know) we are currently using flannel as our default “batteries included” overlay network. Flannel uses etcd and I’m looking at a story in our backlog about possible failure scenarios or common issues that may arise when using an etcd cluster: https://www.pivotaltracker.com/story/show/121713061

I talked to Amit on Slack and he suggested I open a story to capture this work.

Basically, I'd like to get feedback from your team about any common issues you have faced w/ etcd. If there are issues that we have not covered in our story, then I'd like to incorporate them into our testing scenarios.

cc @Amit-PivotalLabs

Cheers

cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/129873963

The labels on this github issue will be updated when the story is started.

Amit-PivotalLabs commented 8 years ago

Hey @markstgodard

I spoke with a couple @cloudfoundry-incubator/cf-container-networking folks in Santa Monica to better understand your use case.

  1. Do not rely on data persisting in the etcd cluster. Of course, etcd generally shouldn't lose data in a production configuration, but your system needs to be resilient to cases where it does. We are likely to change etcd so that rolling a 1-node cluster (e.g. in BOSH-Lite) will definitely lose data, and the failure recovery steps recommended for etcd suggest removing data. You can see the README for more details.
  2. I'd recommend not having a separate etcd cluster; it would be nicer to re-use the existing CF etcd cluster. Users don't want to have to maintain 2 or 3 separate etcd clusters, plus a consul cluster, to run CF.
  3. Your stuff should work when talking to etcd over TLS. While TLS is optional in the current OSS CF deployment of etcd, it's strongly recommended.
  4. Following on from the second point, if you use the shared etcd cluster, keep in mind that you'll have to make sure your keys don't collide with those of other clients of the etcd cluster.
  5. Since you are not operating your own etcd cluster, you don't need to worry about complexities around orchestrating (i.e. BOSH-ifying) etcd.
  6. etcd has issues under heavy write load, and even read load when storing large amounts of data. It sounds like you're reading/writing very rarely (once a day), and hardly putting any data in (one "record" per Cell, maybe at most 1MB of data) so this should not be a concern.
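
Points 1 and 4 above can be sketched roughly as follows. This is only an illustration, not how flannel or any CF component actually does it: `InMemoryStore` is a hypothetical stand-in for a real etcd client, and the `/container-networking` prefix and the subnet values are made-up names. The idea is to put every key under a client-specific prefix, and to treat a missing key as something to rebuild rather than as a fatal error.

```python
from typing import Callable, Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for an etcd client (not a real etcd API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def wipe(self) -> None:
        # Simulates the cluster losing its data, e.g. after a 1-node roll.
        self._data.clear()


class NamespacedStore:
    """Keeps one client's keys under a single prefix in a shared store."""

    def __init__(self, store: InMemoryStore, prefix: str) -> None:
        cleaned = prefix.strip("/")
        if not cleaned:
            raise ValueError("prefix must not be empty")
        self._store = store
        self._prefix = "/" + cleaned

    def full_key(self, key: str) -> str:
        # Every key is rooted at the prefix, so it cannot collide with
        # keys written by other clients that use a different prefix.
        return f"{self._prefix}/{key.lstrip('/')}"

    def get_or_rebuild(self, key: str, rebuild: Callable[[], str]) -> str:
        # A missing key is treated as recoverable: recompute the value
        # from the caller's own source of truth and write it back.
        value = self._store.get(self.full_key(key))
        if value is None:
            value = rebuild()
            self._store.put(self.full_key(key), value)
        return value
```

With something like this, a caller would do `NamespacedStore(store, "/container-networking").get_or_rebuild("subnets/cell-1", allocate_subnet)`, where `allocate_subnet` is whatever (hypothetical) function can regenerate the record if etcd no longer has it.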

markstgodard commented 8 years ago

Thanks @Amit-PivotalLabs