Two most recent commits to address the following:

- `mongo_update_replset_config` now uses `replSetGetStatus` instead of `is_primary`
- Improved `mongo_update_replset_config`; added a `preStop` function so that a primary can step down.
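For readers following along, here is a rough sketch of what checking `replSetGetStatus` for the primary can look like with pymongo; the connection details and function name are illustrative assumptions, not the actual `manage.py` code:

```python
# Sketch only: find the primary via replSetGetStatus rather than checking
# is_primary/isMaster on the local node. Connection details are assumptions.
from pymongo import MongoClient


def get_primary_host(client):
    """Return the host:port of the elected primary, or None if there is no primary."""
    status = client.admin.command('replSetGetStatus')
    for member in status['members']:
        if member.get('stateStr') == 'PRIMARY':
            return member['name']
    return None


if __name__ == '__main__':
    print(get_primary_host(MongoClient('localhost', 27017)))
```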
So I have a case that I think we need to handle: `docker kill -s TERM current-primary`. The remaining nodes get an `onChange`, but there is no elected primary yet to apply the replica config update. So the question is, what should we do about the mongo config being out of sync with what is in consul?

The worst case is someone scaling their cluster down from, say, 5 to 3 nodes, where one of the stopped containers is the primary and the other is stopped while there is still no elected primary; the end result is a 3-node cluster with 5 nodes in its replica config. If another node goes down, it will fail to have a primary anymore. If a new node is added, or if a node is removed while there is an active primary, then the replica config will become consistent with the config in consul.
One solution is to just add the `mongo_update_replset_config` to the health check. A second solution is to add a periodic task that does the `mongo_update_replset_config` on the primary node just like `onChange`, but with a longer period.
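As a sketch of the first option (reconciling from the health check), something along these lines could work; `mongo_update_replset_config` here is only a stand-in for the helper discussed in this thread, and everything else is assumed:

```python
# Sketch only: a health check that also reconciles the replica config when this
# node is the primary. mongo_update_replset_config() is a placeholder for the
# helper discussed in this PR; it is stubbed here so the sketch is self-contained.
from pymongo import MongoClient
from pymongo.errors import PyMongoError


def mongo_update_replset_config(client):
    pass  # placeholder for the real reconciliation against consul


def health_check():
    client = MongoClient('localhost', 27017)
    try:
        client.admin.command('ping')  # basic liveness check for the health handler
    except PyMongoError:
        return False
    try:
        status = client.admin.command('replSetGetStatus')
        if status.get('myState') == 1:  # 1 == PRIMARY
            mongo_update_replset_config(client)
    except PyMongoError:
        pass  # replica set may not be initiated yet; still report healthy
    return True
```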
> what should we do about the mongo config being out of sync with what is in consul? The worst case is if someone is scaling down [...] and one of the containers stopped is the primary...
Good question. Can we add a `preStop` behavior that blocks if the instance is the primary while it demotes the primary and waits for a new primary to be elected? If we set the stop timeout in Docker to a relatively long value (ten minutes, an hour?), we can give the cluster plenty of time to reconcile itself in this condition.
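A minimal sketch of that idea, assuming the handler can talk to the local mongod with pymongo; the timeouts and names are made up, not the PR's actual `pre_stop`:

```python
# Sketch only: demote the local primary and block until another member wins the
# election, so the Docker stop timeout (not this code) bounds the total wait.
import time

from pymongo import MongoClient
from pymongo.errors import PyMongoError


def pre_stop(wait_seconds=300):
    client = MongoClient('localhost', 27017)
    try:
        if client.admin.command('replSetGetStatus').get('myState') != 1:
            return  # not the primary; nothing to demote
    except PyMongoError:
        return  # replica set not initiated; nothing to do
    try:
        # replSetStepDown closes client connections, so an error here is expected
        client.admin.command('replSetStepDown', 60)
    except PyMongoError:
        pass
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        try:
            members = client.admin.command('replSetGetStatus')['members']
            if any(m.get('stateStr') == 'PRIMARY' and not m.get('self') for m in members):
                return  # another member has been elected primary
        except PyMongoError:
            pass
        time.sleep(2)
```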
It's also maybe fair to ticket that as a bug and return to it. This is shaping up nicely and looks close (or nearly close) enough to MVP to be ready for merging.
@misterbisson, I implemented a "wait for election" after the primary node steps down (all within the `pre_stop` function, which I will push up shortly). But the problem is that the other mongodb containers get the `on_change` only ~200 milliseconds after the primary's `pre_stop` is called and has not finished, so the election is also still unfinished. From my perspective, it looks like Container Pilot calls the `pre_stop` asynchronously and then immediately deregisters from consul. Am I doing something wrong or unexpected with `pre_stop`?
On a related note, I set `stopTimeout` to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just `pre_stop`. Is `stopTimeout` supposed to limit how long `pre_stop` takes?
The conversation in and around https://github.com/joyent/containerpilot/issues/200 seems to imply that deregistration prior to invoking `preStop` is intentional, and unlikely to change. Given that the service is still running at the moment of `preStop`, this seems kind of strange -- if the service is still running, shouldn't it still be registered? Could this behavior perhaps become configurable so that `preStop` can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their `onChange` triggered by the deregistration of the current "primary"?
Additionally, given that `SIGTERM` is a signal, and by itself isn't actually terminal, wouldn't it also make sense for `preStop` to be able to exit non-zero and cancel the termination of the service? This would mean that users who are simply using `docker stop` blindly would potentially be killing live services, but only in the rare edge case that the new primary election resulted in this node itself being re-elected (which is the main error case it'd be prudent to be able to cancel for here). Thus, users strongly concerned with the consistent, safe state of their cluster would use `docker kill -s TERM` instead of `docker stop` to give their containers the chance to stop gracefully, while still allowing for the possibility that such a thing isn't currently possible and might require manual intervention of some kind (without Container Pilot force-killing their primary node simply because they asked it to shut down gracefully).
> I set `stopTimeout` to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just `pre_stop`.
`stopTimeout` is supposed to start after `preStop` completes and ContainerPilot attempts to SIGTERM the main process. That's implemented in ContainerPilot's core/app.go.

Your description (`preStop` doesn't time out) doesn't sound like a bug, but it probably needs clarification in the docs. Also to clarify: the total of `preStop` + main process stop + `postStop` must all complete before the daemon sends the SIGKILL. `docker stop -t <seconds>` is required for anything that takes time to stop here.
> Could [`preStop` service deregistration] behavior perhaps become configurable so that `preStop` can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their `onChange` triggered by the deregistration of the current "primary"?
This is a good use case to consider and it deserves more thought/discussion.
> wouldn't it also make sense for `preStop` to be able to exit non-zero and cancel the termination of the service [in the situation where the user calls `docker stop` naively]?
I'm putting words in your mouth with "naively" there, but is it a fair distillation of the concern? How does the application enforce the refusal to shut down back to the scheduler/daemon/infrastructure? Is there any way to tell `docker stop` to allow a container to talk back like that?
> if the service is still running, shouldn't it still be registered?
In stateful applications in particular this means client applications will still be sending writes until their TTL expires. The `preStop` is your opportunity to wait until writes stop coming, so that you don't have a cluster in an inconsistent state before you do an election. It does mean that the client applications need to know what to do if they temporarily have no place to send writes.
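For illustration, one way to do that wait (a sketch assuming we can poll the local mongod's `serverStatus`; the thresholds are arbitrary):

```python
# Sketch only: block until no new writes have been observed for quiet_seconds,
# or give up after the overall timeout.
import time

from pymongo import MongoClient


def wait_for_writes_to_stop(quiet_seconds=10, timeout=300):
    client = MongoClient('localhost', 27017)

    def write_ops():
        counters = client.admin.command('serverStatus')['opcounters']
        return counters['insert'] + counters['update'] + counters['delete']

    deadline = time.time() + timeout
    last = write_ops()
    quiet_since = time.time()
    while time.time() < deadline:
        time.sleep(1)
        current = write_ops()
        if current != last:
            last, quiet_since = current, time.time()
        elif time.time() - quiet_since >= quiet_seconds:
            return True  # writes have stopped coming
    return False  # writes never settled before the timeout
```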
> Could this behavior perhaps become configurable
I'm certainly open to changing the behavior if we think we've got the wrong one, but another configuration option at this point has to really sell itself.
Should the "how should ContainerPilot behave?" (vs the "how does ContainerPilot behave?") portion of this discussion get moved over to a ContainerPilot issue?
It seems fair to take the discussion about consul registration elsewhere, though what it means for this image is that if you `docker kill -s TERM` your primary node, the mongodb replica config will be in an inconsistent state until there is another event that generates an `on_change` on the primary node.

The way to work around this is to run `docker exec -it primary-node mongo --eval 'rs.stepDown()'` before shutting down the primary node.
As for what is left here, there are two major parts I can think of: the documentation and backups.
I'll see if I can get the docs up this afternoon and then we can stick the backups into a new issue (especially since the recommended backup strategies are MongoDB Inc's "Cloud Manager", "Ops Manager", or file system snapshots).
Pushed my docs changes (might want a technical writer to make it better :wink:). Let me know if you want anything else in this PR.
One more thing to note: do not scale down below a majority of your configured voting members or you will lose quorum and the nodes will fail to elect a primary (it looks like a netsplit). In other words, if you have 7 nodes and want to end up with only 3, you must first scale to 5, make sure the replica config has been updated to just those 5 nodes, and then scale to 3. It is probably best to scale down one node at a time, and if the current primary is a node you want to destroy, you should do the stepdown first.
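To make the quorum arithmetic concrete (a small illustration, not code from this image):

```python
# A primary needs a strict majority of the *configured* voting members,
# regardless of how many members are currently running.
def votes_needed(configured_voting_members):
    return configured_voting_members // 2 + 1

print(votes_needed(7))  # 4 -- with a 7-node config, 3 surviving nodes cannot elect a primary
print(votes_needed(5))  # 3 -- once the config is down to 5, scaling to 3 running nodes still works
```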
So maybe we need safe scaling-down steps:

1. `docker exec -it node-to-kill mongo --quiet --eval 'rs.stepDown()'`

   Output will look like one of the following:

        2016-08-22T23:40:27.475+0000 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27017' :
        DB.prototype.runCommand@src/mongo/shell/db.js:135:1
        DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
        rs.stepDown@src/mongo/shell/utils.js:1182:12
        @(shell eval):1:1

        # or

        { "ok" : 0, "errmsg" : "not primary so can't step down", "code" : 10107 }

2. `docker kill -s TERM node-to-kill`

3. `docker rm` or whatever you want now.

This is awesome. Thank you for working on this!
At the MongoDB build step when running `docker-compose -f local-compose.yml up`, pip install is failing when building cryptography:

    build/temp.linux-x86_64-2.7/_openssl.c:429:30: fatal error: openssl/opensslv.h: No such file or directory
     #include <openssl/opensslv.h>
                                  ^
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Adding `libssl-dev` to the packages installed by `apt-get` resolves it.
@misterbisson and @tgross, this looks fine to me. Let me know if there is anything else you would like me to add.
@tgross @misterbisson anything else you'd like @yosifkit to update here? :smile: :innocent:
The use of `socket.hostname()`, and relying on that same hostname being embedded in the Consul service name, is creating problems when scaling to more than one replica set member on Triton. For example:

- hostname (as reported by `socket.hostname()`): `cae35330f7c7`
- consul service name: `mongodb-replicaset-cae35330f7c7`

The `manage.py` script's `consul_to_mongo_hostname` function will convert this to member `cae35330f7c7:27017`, which will resolve locally within the docker container, but will not resolve anywhere else.
If the Triton account has CNS enabled, there will be DNS records similar to: `31cfb857-c7a6-4c46-9d73-a8c26c3076ee.inst.<account id>.us-east-1.{cns.joyent.com,triton.zone}`. This obviously cannot be inferred from the shortened version within the zone. Is there a suitable way to arrive at this FQDN from the current Mongo master prior to adding it to the replica set?

The alternative is to use the host's IP address, but Mongo specifically discourages this for many reasons. In the event the IP changes, the replica set will not recover optimally.

Thoughts?
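For the sake of discussion, a sketch of composing the CNS FQDN instead of relying on the short hostname; the environment variable names here are hypothetical placeholders, not anything Triton or this image actually provides:

```python
# Sketch only: build a resolvable member name from assumed environment variables,
# falling back to the short hostname for local docker-compose runs.
import os
import socket


def mongo_member_hostname(port=27017):
    instance = os.environ.get('TRITON_INSTANCE_UUID')   # hypothetical placeholder
    cns_suffix = os.environ.get('TRITON_CNS_SUFFIX')    # hypothetical, e.g. 'inst.<account id>.us-east-1.triton.zone'
    if instance and cns_suffix:
        return '{}.{}:{}'.format(instance, cns_suffix, port)
    return '{}:{}'.format(socket.gethostname(), port)
```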
@yosifkit @misterbisson I'm opening this PR with the work in the `wip` branch so that we have a place for review and comment.