Closed gseshagirirao closed 7 years ago
Are you sure that nginx became unresponsive, or is it that you get a 502 response for a request to the offending container? The reload does validate, and nginx.conf will fail to reload if there is a problem with it.
Hmm, we currently use a HUP signal to reload the container (https://github.com/ehazlett/interlock/blob/master/ext/lb/nginx/nginx.go#L77). According to the docs, a HUP should try to reload and only replace the workers if successful:
From the docs: "What happens is that when NGINX receives the HUP signal, it tries to parse the configuration file (the specified one, if present, otherwise the default), and if successful, tries to apply a new configuration (i.e. re-open the log files and listen sockets)."
So according to that, it appears that Interlock should be reloading as desired. Do you have any logs that show the failure?
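To make the reload path concrete: Interlock triggers the reload by sending SIGHUP to the proxy container. Below is a minimal sketch of the equivalent via the docker CLI (the real code calls the Docker API's ContainerKill directly; reloadArgs and reloadProxy are illustrative names, not Interlock's):

```go
package main

import (
	"fmt"
	"os/exec"
)

// reloadArgs builds the docker CLI invocation that sends SIGHUP to a
// container. (reloadArgs/reloadProxy are illustrative names, not Interlock's;
// the real code calls the Docker API's ContainerKill directly.)
func reloadArgs(containerID string) []string {
	return []string{"docker", "kill", "-s", "HUP", containerID}
}

// reloadProxy sends the signal. The caveat that surfaces later in this
// thread: a nil error only means the signal was delivered, NOT that nginx
// accepted the new configuration, so a reload can silently no-op on a bad
// config.
func reloadProxy(containerID string) error {
	args := reloadArgs(containerID)
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("reload failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	fmt.Println(reloadArgs("e539d646ac21")) // prints "[docker kill -s HUP e539d646ac21]"
}
```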
@adpjay you are right, we started getting 502 for all the apps.
@ehazlett - sorry, we don't have logs since we restarted the containers.
@gseshagirirao - if you get a 502, that usually means that nginx is up and thinks it knows where to route the requests but the target containers are not available. A 503 means it doesn't know where to route the request. Can you give an example of the invalid syntax used for the interlock.hostname? It could be that the nginx.conf did fail to reload after the offending container came up in the swarm on a new ip address and port and therefore the nginx.conf was stale and routing to an IP:port that is not assigned to a container. With the 1.3.2 release of interlock, the nginx logs will contain the ip and port it is routing to for a request (thanks @ehazlett). You can then compare that to your containers in the swarm to see where the requests are going.
@adpjay, here is an example of the invalid syntax (two dots in the hostname). We are planning to upgrade to Interlock 1.3.2 soon.
interlock.hostname: "ad.app..dc2.iat"
@ehazlett - Seems like this is something which needs to be addressed within the nginx container itself. Would you please add it as a defect/feature request?
@gseshagirirao thx -- i'm going to get a repro case as this should be handled as part of the SIGHUP by nginx itself. How are you running the nginx proxy containers?
Ok so here's the issue. When issuing a docker kill -s HUP, it is not returning an error, so Interlock thinks everything worked. I'm not sure why the signal isn't propagating back, but I think I have a workaround.
@ehazlett - Actually, nginx does validate and will NOT reload when it finds a problem with the conf file. But the problem is that someone has to fix it (e.g. remove the problematic containers after figuring out what happened once someone reports it). Until that is done, any new deployments to the swarm will not be reflected in nginx (the nginx reload will fail because of the error caused by that single container). In this case at least the currently loaded nginx conf will continue to work.
But the worst case happens when the ops team tries to restart the nginx containers (this usually happens when they think something is wrong with the nginx reload, and it did happen to us): all the nginx containers will FAIL to come up, bringing down all the environments.
Is there a possibility of including logic in Interlock to revert to the previously generated nginx conf if the newly created one is invalid (validate using the nginx -t option and check whether it is valid; not sure if it is a good idea, just throwing it out there)?
@rajarajanpsj correct. that's what I said in my previous comment. The ContainerKill API call is returning without error, so it appears that nginx properly reloaded even with a bad config.
Effectively it does keep the old config since it doesn't reload. I've added reporting the error as well as the problem with the config:
DEBU[0325] poller tick
DEBU[0325] reloading proxy container: id=e539d646ac210f417b775010128bb834f898d6a09ab14709e72f59011c28cf75 ext=nginx
ERRO[0325] error reloading container, invalid proxy configuration: 2017/06/06 18:05:01 [emerg] 126#126: invalid server name or wildcard "ad.app..dc2.iat." on 0.0.0.0:80 ext=nginx
INFO[0325] reload duration: 1300.40ms ext=lb
Closing as 425c4dc25e adds this.
@ehazlett Hi Evan, I think a roll back is needed though. If there is no reload - you are correct, that particular application will not be impacted and nginx will stay up. However, this also entails that any NEW events (whether related to the originating application or not) will not be picked up unless the troublesome container is removed. Does that make sense?
It's even more severe: if we don't roll back the nginx.conf change and the container is restarted, we're left with an nginx that has no configuration, and all containers within the swarm become unreachable via DNS. @ehazlett do you have an image that has the error instrumentation in it? I'm not seeing nginx getting updated in some cases, and your change may give more insight.
@niroowns a rollback would be defined as rolling back to the previous configuration. if you want to respond to new updates, that's not a rollback. With the latest update, an error is logged, and once you remove that erroneous config, updates will flow again. It doesn't seem feasible to build out the entire nginx config validation logic. Logging and reporting seems the saner approach.
@adpjay the latest ehazlett/interlock:dev has this.
@ehazlett Agreed that interlock shouldn't duplicate the config validation that nginx is doing. A rollback would be just putting back what was there before. It makes the swarm unchangeable until the offending container is fixed, but at least it continues to be usable in the case where nginx is restarted. Do you think you'd be able to log the container that triggered the config update that caused an error?
@adpjay since nginx isn't reloading it is continuing to use the original config which is effectively the same as a rollback.
the way that the update works is that the config is completely built before triggering the reload so currently there is no way to tell what triggered the invalid config.
@ehazlett Right, but if the nginx container gets restarted while it has a bad config in it, it will use the bad config on restart. We'd feel better if it had the last-known good config in case the nginx container gets restarted, which seems to happen often.
As far as knowing the bad container - do you know the container associated with the event that triggered the config rewrite? Let me know what you think about something like the steps below. I'm taking a guess on how some of this works, so take them with a grain of salt:
@adpjay i don't really like managing backup/restore config in the container, but you have a point. i've updated ehazlett/interlock:dev with this functionality. Can you try it and see if that's what you are looking for? If so, I'll open a PR.
@ehazlett I'm not seeing the rollback behavior. The build looks ok to me: interlock_1 | time="2017-06-06T16:48:48Z" level=info msg="interlock 1.4.0-dev (18e36cc)"
Can you post the logs please?
@ehazlett OK, we figured out the issue. We'll log another issue to track it but we have a work-around. Your config restore is working very well and the error message printed out is very useful:
time="2017-06-06T20:55:20Z" level=error msg="error reloading container, invalid proxy configuration: 2017/06/06 20:55:20 [emerg] 47#47: invalid server name or wildcard "interlock-route-a-test..a.a.example.com" on 0.0.0.0:80" ext=nginx
The reason why I wasn't seeing the log message yesterday and why I've been struggling lately to get the reloads to be triggered by interlock at all in my environment is because I have old and not-running interlock and nginx containers. We couldn't quite follow the logic in the proxyContainersToRestart() function of lb.lb.go, but the end result is that 6 interlock containers were found and 5 nginx containers were found, and the reload was done for 0 nginx containers (even though the running nginx container did get its nginx.conf file updated). Note that of the 6 interlock containers found, only 1 was running and I also only had 1 nginx container running.
One of our developers had updated the docker-compose file with the wrong syntax for interlock.hostname and redeployed the container, which in turn updated the nginx.conf; as a result nginx became unresponsive. Is there a way to validate the nginx.conf file automatically?