autopilotpattern / mongodb

A robust and highly-scalable implementation of MongoDB in Docker using the Autopilot Pattern
Mozilla Public License 2.0

[WIP] #1

Closed: tgross closed this 7 years ago

tgross commented 8 years ago

@yosifkit @misterbisson I'm opening this PR with the work in the wip branch so that we have a place for review and comment.

yosifkit commented 8 years ago

The two most recent commits address the following:

yosifkit commented 8 years ago

Improved mongo_update_replset_config; added a preStop function so that a primary can step down.

yosifkit commented 8 years ago

So I have a case that I think we need to handle:

  1. have a replica set of 3 or more nodes
  2. docker kill -s TERM current-primary
  3. primary does stepdown and the other nodes begin election
  4. primary removes itself from consul and exits
  5. ContainerPilot delivers onChange
    1. there is not yet a primary node so it is ignored
  6. primary is elected
  7. mongo replica config still contains stopped node
    1. this might affect a vote later (since a majority of voting members is needed to elect a primary)

So the question is, what should we do about the mongo config being out of sync with what is in consul?

The worst case is scaling a cluster down from, say, 5 nodes to 3, where one of the stopped containers is the primary and the other is stopped while there is still no elected primary: the end result is a 3-node cluster with 5 members in its replica config. If another node then goes down, the cluster can no longer elect a primary (only 2 of the 5 configured members remain, short of a majority). If a new node is added, or a node is removed while there is an active primary, the replica config will become consistent with the config in consul again.

One solution is to simply add mongo_update_replset_config to the health check. A second is a periodic task that runs mongo_update_replset_config on the primary node, just like onChange but with a longer period.
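
Roughly what either option would run, as a sketch only (assuming pymongo and python-consul are available; the real mongo_update_replset_config in manage.py may differ in its details, and the host matching here is simplified):

```python
import consul as pyconsul
from pymongo import MongoClient

SERVICE_NAME = 'mongodb-replicaset'  # assumed Consul service name


def update_replset_config():
    local = MongoClient('localhost', 27017)
    status = local.admin.command('replSetGetStatus')
    if not any(m.get('self') and m['stateStr'] == 'PRIMARY'
               for m in status['members']):
        return  # only the primary may run replSetReconfig

    # members Consul currently reports as healthy
    _, services = pyconsul.Consul().health.service(SERVICE_NAME, passing=True)
    healthy = {'{}:{}'.format(s['Service']['Address'], s['Service']['Port'])
               for s in services}

    # drop replica set members that are no longer healthy in Consul
    config = local.admin.command('replSetGetConfig')['config']
    members = [m for m in config['members'] if m['host'] in healthy]
    if len(members) != len(config['members']):
        config['members'] = members
        config['version'] += 1
        local.admin.command('replSetReconfig', config)
```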

misterbisson commented 8 years ago

what should we do about the mongo config being out of sync with what is in consul? The worst case is if someone is scaling down [...] and one of the containers stopped is the primary...

Good question. Can we add a preStop behavior that, when the instance is the primary, blocks while it steps the primary down and waits for a new primary to be elected? If we set the stop timeout in Docker to a relatively long value (ten minutes, an hour?), we can give the cluster plenty of time to reconcile itself in this condition.
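
Something like the following is what I have in mind for that preStop, purely as a sketch (assuming pymongo; not the actual implementation in this PR):

```python
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError


def pre_stop(wait_seconds=600):
    client = MongoClient('localhost', 27017)
    try:
        # ask this node to step down; mongod drops connections on stepDown,
        # so an exception here is expected rather than fatal
        client.admin.command('replSetStepDown', 60)
    except PyMongoError:
        pass

    # block until some *other* member reports itself as PRIMARY
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        try:
            status = client.admin.command('replSetGetStatus')
            if any(m['stateStr'] == 'PRIMARY' and not m.get('self')
                   for m in status['members']):
                return True  # another node won the election; safe to stop
        except PyMongoError:
            pass
        time.sleep(3)
    return False
```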

It's also maybe fair to ticket that as a bug and return to it later. This is shaping up nicely and looks close enough (or nearly so) to MVP to be ready for merging.

yosifkit commented 8 years ago

@misterbisson, I implemented a "wait for election" after the primary node steps down (all within the pre_stop function, which I will push up shortly). But the problem is that the other mongodb containers get the on_change only ~200 milliseconds after the primary's pre_stop is called, before it has finished, so the election is also still unfinished. From my perspective, it looks like ContainerPilot calls pre_stop asynchronously and then immediately deregisters from consul. Am I doing something wrong or unexpected with pre_stop?

On a related note, I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop. Is the stopTimeout supposed to limit how long pre_stop takes?

tianon commented 8 years ago

The conversation in and around https://github.com/joyent/containerpilot/issues/200 seems to imply that deregistration prior to invoking preStop is intentional, and unlikely to change. Given that the service is still running at the moment of preStop, this seems kind of strange -- if the service is still running, shouldn't it still be registered? Could this behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

Additionally, given that SIGTERM is a signal, and by itself isn't actually terminal, wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service? This would mean that users who are simply using docker stop blindly would be potentially killing live services, but only in the rare edge case that the new primary election resulted in this node itself being re-elected (which is the main error case it'd be prudent to be able to cancel for here). Thus, users concerned strongly with the consistent, safe state of their cluster would use docker kill -s TERM instead of docker stop for giving their containers the chance to stop gracefully, but still allowing for the possibility that such a thing isn't currently possible and might require manual intervention of some kind (without Container Pilot force-killing their primary node simply because they asked it to shut down gracefully).

misterbisson commented 8 years ago

I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop.

stopTimeout is supposed to start after preStop completes and ContainerPilot attempts to SIGTERM the main process. That's implemented in ContainerPilot's core/app.go.

Your description (preStop doesn't time out) doesn't sound like a bug, but it probably needs clarification in the docs. Also to clarify: the total of preStop + main process stop + postStop must all complete before the daemon sends the SIGKILL. docker stop -t <seconds> is required for anything that takes time to stop here.

Could [preStop service deregistration] behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

This is a good use case to consider and it deserves more thought/discussion.

wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service [in the situation where the user calls docker stop naively]?

I'm putting words in your mouth with "naively" there, but is it a fair distillation of the concern? How does the application enforce the refusal to shut down back to the scheduler/daemon/infrastructure? Is there any way to tell docker stop to allow a container to talk back like that?

tgross commented 8 years ago

if the service is still running, shouldn't it still be registered?

In stateful applications in particular this means client applications will still be sending writes until their TTL expires. The preStop is your opportunity to wait until writes stop coming so that you don't have a cluster in an inconsistent state before you do an election. It does mean that the client applications need to know what to do if they have no place to send writes temporarily.
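
One way to approximate "writes have stopped coming", as a sketch only (assuming pymongo is available in the image; not prescribing this as the implementation):

```python
import time
from pymongo import MongoClient


def wait_for_writes_to_drain(quiet_seconds=10, timeout=120):
    client = MongoClient('localhost', 27017)

    def write_ops():
        # cumulative write counters from serverStatus
        counters = client.admin.command('serverStatus')['opcounters']
        return counters['insert'] + counters['update'] + counters['delete']

    last, last_change = write_ops(), time.time()
    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(1)
        current = write_ops()
        if current != last:
            last, last_change = current, time.time()
        elif time.time() - last_change >= quiet_seconds:
            return True  # no new writes for quiet_seconds; safe to elect
    return False
```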

Could this behavior perhaps become configurable

I'm certainly open to changing the behavior if we think we've got the wrong one, but another configuration option at this point has to really sell itself.

Should the "how should ContainerPilot behave?" (vs the "how does ContainerPilot behave?") portion of this discussion get moved over to a ContainerPilot issue?

yosifkit commented 8 years ago

It seems fair to take the discussion about consul registration elsewhere, though what it means for this image is that if you docker kill -s TERM your primary node, the mongodb replica config will be in an inconsistent state until there is another event that generates an on_change on the primary node.

The way to work around this is to first docker exec -it primary-node mongo --eval 'rs.stepDown()' before shutting down the primary node.

As for what is left here, there are two major parts I can think of:

I'll see if I can get the docs up this afternoon and then we can stick the backups into a new issue (especially since the recommended backup strategies are MongoDB Inc's "Cloud Manager", "Ops Manager", or file system snapshots).

yosifkit commented 8 years ago

Pushed my docs changes (might want a technical writer to make it better :wink:). Let me know if you want anything else in this PR.

One more thing to note: do not scale down below a majority of your configured voting members, or you will lose quorum and the remaining nodes will fail to elect a primary (they assume a netsplit). In other words, if you have 7 nodes and want only 3, you must first scale to 5, make sure the replica config has been updated to just 5 members, and only then scale to 3. It is probably best to scale down one node at a time, and if the current primary is a node you want to destroy, do the stepdown first.
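
For example, a check like this between steps would confirm the config has actually settled before removing the next node (a sketch assuming pymongo; the host name is illustrative):

```python
import time
from pymongo import MongoClient


def wait_for_member_count(primary_host, expected, timeout=300):
    client = MongoClient(primary_host, 27017)
    deadline = time.time() + timeout
    while time.time() < deadline:
        config = client.admin.command('replSetGetConfig')['config']
        if len(config['members']) == expected:
            return True
        time.sleep(5)
    return False

# e.g. after scaling 7 -> 5, confirm the config really has 5 members
# before scaling 5 -> 3:
#   wait_for_member_count('primary-node', 5)
```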

So maybe we need safe scaling down steps:

jonatanblue commented 8 years ago

This is awesome. Thank you for working on this!

At the MongoDB build step when running docker-compose -f local-compose.yml up, pip install is failing when building cryptography:

build/temp.linux-x86_64-2.7/_openssl.c:429:30: fatal error: openssl/opensslv.h: No such file or directory
 #include <openssl/opensslv.h>
                              ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Adding libssl-dev to the packages installed by apt-get resolves it.

yosifkit commented 8 years ago

@misterbisson and @tgross, this looks fine to me. Let me know if there is anything else you would like me to add.

tianon commented 8 years ago

@tgross @misterbisson anything else you'd like @yosifkit to update here? :smile: :innocent:

jasonpincin commented 8 years ago

The use of socket.gethostname() and relying on that same hostname being embedded in the Consul service name is creating problems when scaling to more than one replica set member on Triton. For example:

hostname (as reported by socket.gethostname()): cae35330f7c7
consul service name: mongodb-replicaset-cae35330f7c7

The manage.py script's consul_to_mongo_hostname function will convert this to the member address cae35330f7c7:27017, which resolves locally within the Docker container but will not resolve anywhere else.
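
Purely as an illustration of where this goes wrong (the real consul_to_mongo_hostname may differ in its details):

```python
SERVICE_PREFIX = 'mongodb-replicaset-'  # assumed service-name prefix


def consul_to_mongo_hostname(service_id, port=27017):
    # "mongodb-replicaset-cae35330f7c7" -> "cae35330f7c7:27017"
    short_hostname = service_id[len(SERVICE_PREFIX):]
    # this short name only resolves inside that one container, which is
    # the problem when replica set members are spread across Triton
    return '{}:{}'.format(short_hostname, port)
```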

If the Triton account has CNS enabled, there will be DNS records similar to: 31cfb857-c7a6-4c46-9d73-a8c26c3076ee.inst.<account id>.us-east-1.{cns.joyent.com,triton.zone}

This obviously cannot be inferred from the shortened version within the zone. Is there a suitable way for the current Mongo master to determine this FQDN prior to adding the member to the replica set?

The alternative is to use the host's IP address, but Mongo specifically discourages this for many reasons. In the event the IP changes, the replica set will not recover optimally.

Thoughts?