Open · GETandSELECT opened this issue 7 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/149166051
The labels on this github issue will be updated when the story is started.
Hi @GETandSELECT,
Nope, there are almost no "administrative" API options through Switchboard. That's because Switchboard is intended to have all of the necessary logic for making decisions about how to route traffic built-in. This is possible because Switchboard relies on the galera-healthcheck service that runs on the MySQL nodes themselves.
So far, people have been pretty satisfied to let Switchboard make decisions about where to send traffic. Switchboard makes those decisions based on the Galera state of each node (OPEN, JOINED, SYNCED, etc.), which changes dynamically. If we introduced management APIs, we'd need to introduce a "non-automatic" mode, which doesn't exist in Switchboard today.
Can you tell me a little bit about what you are trying to do that requires more control(s)?
I should note that there are configuration options for making Switchboard aware of a load-balancer above it, so that it can be intelligent about the LB's health-check intervals.
There is a single undocumented API for cutting off traffic to the entire cluster; it's meant only as a panic switch to stop clients from accessing the cluster while trying to do advanced debugging.
Marco Nicosia Product Manager Pivotal Software, Inc.
Hi @menicosia
In the MariaDB/MySQL world, ProxySQL is the hyped and well-liked open source project. I don't have ops experience with ProxySQL, but its feature list is very long. Before cf-mysql-release we used HAProxy with our own MariaDB/Galera automation. I'm not really convinced by the "Why switchboard?" arguments, sorry. To put it directly: why did you develop your own proxy?
We run cf-mysql-release on OpenStack. Sometimes a single hypervisor becomes very slow, which means one otherwise healthy Galera node is very slow, and I would like to exclude that healthy node from the proxy. We have the same situation with MongoDB, but MongoDB doesn't use a proxy and can switch the primary on its own.
Another situation is when a customer reports issues with our DBaaS offering. Sometimes the customer doesn't understand MariaDB, and sometimes we have a suboptimal parameter value. I'm a lazy guy and try to reproduce the issue in production (with the customer's schema and data). During that time I don't want any other customers on the same node.
Can you please post the "single undocumented API for cutting off traffic to the entire cluster"? This is very useful for maintenance windows. At the moment I do a monit stop on the proxy VMs.
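Roughly, that workaround looks like this (a sketch only; the deployment name, instance group, and monit job name below are placeholders and will differ per environment):

```sh
# Sketch of the current workaround: stop the proxy job on every proxy VM to
# cut off all client traffic during a maintenance window.
# Deployment name, instance group, and monit job name are placeholders.
bosh -d cf-mysql ssh proxy -c 'sudo /var/vcap/bosh/bin/monit stop switchboard'

# Re-enable traffic once the maintenance window is over.
bosh -d cf-mysql ssh proxy -c 'sudo /var/vcap/bosh/bin/monit start switchboard'
```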
thanks
Hi @GETandSELECT,
I got the same question from @jriguera over on cf-mysql-release the other day, and it reminded me that I never followed up here.
I'll expand on the "Why switchboard?" explanation. We noticed that HAProxy and friends all had a flaw: When they detected that a back-end had gone bad, they'd start directing new connections to a different back-end. But they'd allow the existing connections to die off organically. We didn't like that, because we found that there were conditions which could linger for a long time in a non-functional state. Instead, we wrote switchboard to aggressively kill all connections upon detecting a back-end isn't healthy.
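To make that concrete, here is a simplified sketch of the idea in Go (illustrative only, not Switchboard's actual code): the moment the health check flips to unhealthy, every tracked client connection is closed, forcing clients to reconnect and be routed to a healthy node.

```go
// Package proxysketch illustrates the "sever connections on unhealthy" idea
// described above. This is a simplified sketch, not Switchboard's real code.
package proxysketch

import (
	"net"
	"sync"
)

// Backend tracks the client connections currently being proxied to one MySQL node.
type Backend struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

// Track registers a new client connection routed to this backend.
func (b *Backend) Track(c net.Conn) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.conns == nil {
		b.conns = make(map[net.Conn]struct{})
	}
	b.conns[c] = struct{}{}
}

// SeverAll is invoked the moment the health check reports the node unhealthy:
// rather than letting existing sessions die off organically, it closes them
// all so clients reconnect immediately and land on a healthy node.
func (b *Backend) SeverAll() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for c := range b.conns {
		c.Close() // close errors ignored in this sketch
	}
	b.conns = make(map[net.Conn]struct{})
}
```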
Although it's purpose-built for MySQL, ProxySQL doesn't seem to have this feature either:
While ProxySQL does not offer support failover as a feature, it collaborates smoothly with the existing tools that enable it. It monitors the health of the backends it communicates with and is able to temporarily shun them based on configurable error rates. [1]
Further, Switchboard is purpose-built to talk to Galera clusters provisioned by cf-mysql-release. Switchboard talks to the galera-healthcheck job, which is co-located on the MySQL servers. galera-healthcheck provides an out-of-band mechanism for Switchboard to detect the state of each individual back-end. This allowed us to build in additional logic to make smart decisions about which back-end to direct traffic to.
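For example, each MySQL node can be probed out-of-band along these lines (a sketch; 9200 is the usual galera-healthcheck port and the node IP is a placeholder, both may differ per deployment):

```sh
# Probe galera-healthcheck on one MySQL node (IP and port are placeholders).
# A 200 response indicates the node is synced and safe to route to; a 503
# indicates the node should be avoided.
curl -i http://10.0.0.11:9200/
```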
This is really nice because it means you can deploy multiple instances of Switchboard, and be relatively assured that when sending traffic to both of them, they'll be using the same back-end.
Why is this important? We've learned that you can avoid many of Galera's inherent limitations if you guarantee that all instances of an application are talking to the same back end. One of the side-effects is that we can loosen the limitations usually required when talking with Galera, see the poll I included at the bottom of the v36 release notes. [2]
You're right that we could probably build all of this around ProxySQL, but for now, we're pretty confident that we understand Switchboard's behavior, and it meets the needs of our existing users. If there's sufficient interest in moving from Switchboard to ProxySQL, we'd be open to considering it.
Does this explanation help?
I'll leave this issue open, and try to get to the documentation of the API soon.
Marco Nicosia Product Manager Pivotal Software, Inc.
Update:
It does look like the world has gotten better at this since we wrote Switchboard. HAProxy now has a feature to actively cut connections [1], and ProxySQL has a Galera-specific script. [2]
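As a sketch of the HAProxy option (server names, addresses, and the health-check port below are illustrative):

```
# Illustrative HAProxy backend: "on-marked-down shutdown-sessions" closes
# existing sessions as soon as a server is marked down, instead of letting
# them drain. Names, addresses, and ports are placeholders.
backend galera
    mode tcp
    option httpchk GET /
    server mysql0 10.0.0.11:3306 check port 9200 on-marked-down shutdown-sessions
    server mysql1 10.0.0.12:3306 check port 9200 on-marked-down shutdown-sessions backup
```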
So, if we had to do it all over from scratch, we might look at one of these options.
Again, though, what we have is working well for cf-mysql-release's purposes. So unless there's a set of use cases coming from the community that would make it worth the investment, we'll stay on Switchboard for the time being.
Marco Nicosia Product Manager Pivotal Software, Inc.
Hi @menicosia,
(This comes from https://github.com/cloudfoundry/cf-mysql-release/issues/175)
My questions regarding the proxy are not exactly in the same direction as this thread. I understand the reasons that drove you to create the Switchboard project, but I still think we can contribute a (good) use case for cf-mysql-boshrelease. I think the release offers people everything needed to deploy (and maintain) a production MySQL (MariaDB) cluster, which is really good. The topic behind the issue I opened is being able to use a different (in this case, external) proxy implementation. Our external implementation is an F5 BigIP device, but if a user wants to use ProxySQL instead of Switchboard, that should be possible (they could make use of BOSH links, probably inter-deployment).
I will try to explain our situation to see if we can find the best approach. I think the architecture we describe is quite common in a lot of companies, at least those that have their own data centers with some kind of load-balancing network device (like an F5 BigIP).
We are deploying a MySQL-Galera cluster using cf-mysql-boshrelease across different VMware clusters treated as availability zones. We are still testing it, but so far everything works fine. In our company we have a cluster of F5 BigIP devices that provides load balancing with HA, to avoid having a SPOF. I have to admit that the F5 works quite well, and we are also using it for CF (on top of the go-routers). With this architecture we do not need to set up additional components like Pacemaker/Keepalived for the endpoints, because the F5 takes care of those situations. To set up the F5 configuration automatically we use Ansible (we created a kind of add-on release that configures it automatically with BOSH from the back-end nodes). Initially we deployed the MySQL cluster with 2 proxies, 2 MySQL nodes, and 1 arbitrator node, and we pointed the F5 back-end pool at the proxies (we can also define an HTTP health check pointing to the proxies' health-check endpoint). Everything keeps running fine, but since we do not have production traffic yet, we are still analyzing the architecture.
Our main question is about what Switchboard does and the documentation in cf-mysql-boshrelease. We understood the term "consistent routing" as something deterministic across all Switchboard proxies, in the same way the graphite carbon proxy works (the idea is that the same metric always ends up on the same back-end node, because all proxies use the same hashing algorithm, so a pool of carbon proxies all redirect a given metric to the same back-end node). In this case, though, the meaning is more like "persistent routing", so it seems there is no guarantee that 2 instances of the proxy will be redirecting traffic to the same MySQL node. With transactions, or some (bad/weird) applications, this behavior can lead to deadlocks. That's why you recommend only one proxy instance, right? ... but with only one we cannot use it for production purposes (I have not analyzed the code of Switchboard, I just had a quick look).
We decided to get rid of the proxies (it simplifies the architecture: fewer VMs) by pointing the F5 back-end pool directly at the MySQL nodes, pointing the F5 health check at the HTTP galera-healthcheck endpoint running on each MySQL node, and defining sticky persistence in the F5 (we have set the same parameters in the F5 vserver as the ones in the Switchboard configuration). The persistence is based on a source-ip iRule (see the sketch below), so all requests coming from the same IP are sent to the same back-end node (it could be a bit more flexible and use the source-ip:port pair). We are also not allowing any kind of smart features to maintain sessions if one MySQL node goes down; all connections are reset/closed and the client has to reconnect. We also have another option, an active-passive setup: making one node active to receive all connections, and only switching to the other one when the first is down.
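Roughly, the source-ip persistence part looks like this (a simplified sketch; the real F5 configuration also defines the pool, the health monitor, and timeouts, and syntax can vary by TMOS version):

```tcl
# Illustrative iRule: pin all connections from the same client IP to the same
# back-end MySQL node, matching the source-ip persistence described above.
when CLIENT_ACCEPTED {
    persist source_addr 255.255.255.255 1800
}
```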
Sorry, it is a bit long, but I hope it is clear why we opened the issue in cf-mysql-boshrelease. Also, if you guys have other ideas, maybe they can be used for both solutions.
Thanks!
Hi @jriguera @GETandSELECT,
I've been pretty busy and haven't had a chance to catch up on this thread. While looking into a different issue, however, I did come across the original story that specifies the start/stop API for Switchboard. So, I haven't had a chance to turn this into docs yet, but here's the raw material: #124896531.
Hope that helps in the meanwhile.
Marco Nicosia Product Manager Pivotal Software, Inc.
Hi,
Is there really only one read-only command available (an API with stats) and a dashboard?
As an operator I need more docs. For example:
Thanks