Closed gseshagirirao closed 7 years ago
Are you sure that nginx became unresponsive, or is it that you get a 502 response for a request to the offending container? The reload does validate, and nginx.conf will fail to reload if there is a problem with it.
Hmm, we currently use a HUP signal to reload the container (https://github.com/ehazlett/interlock/blob/master/ext/lb/nginx/nginx.go#L77). According to the docs, a HUP should try to reload and only replace the workers if successful:
From the docs: "What happens is that when NGINX receives the HUP signal, it tries to parse the configuration file (the specified one, if present, otherwise the default), and if successful, tries to apply a new configuration (i.e. re-open the log files and listen sockets)."
So according to that, it appears that Interlock should be reloading as desired. Do you have any logs that show the failure?
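To make the reload path concrete: Interlock triggers the reload by sending SIGHUP to the proxy container. Below is a minimal sketch of the equivalent via the docker CLI (the real code calls the Docker API's ContainerKill directly; reloadArgs and reloadProxy are illustrative names, not Interlock's):

```go
package main

import (
	"fmt"
	"os/exec"
)

// reloadArgs builds the docker CLI invocation that sends SIGHUP to a
// container. (reloadArgs/reloadProxy are illustrative names, not Interlock's;
// the real code calls the Docker API's ContainerKill directly.)
func reloadArgs(containerID string) []string {
	return []string{"docker", "kill", "-s", "HUP", containerID}
}

// reloadProxy sends the signal. The caveat that surfaces later in this
// thread: a nil error only means the signal was delivered, NOT that nginx
// accepted the new configuration, so a reload can silently no-op on a bad
// config.
func reloadProxy(containerID string) error {
	args := reloadArgs(containerID)
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("reload failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	fmt.Println(reloadArgs("e539d646ac21")) // prints "[docker kill -s HUP e539d646ac21]"
}
```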
@adpjay you are right, we started getting 502 for all the apps.
@ehazlett - sorry, we don't have logs since we restarted the containers.
@gseshagirirao - if you get a 502, that usually means that nginx is up and thinks it knows where to route the requests but the target containers are not available. A 503 means it doesn't know where to route the request. Can you give an example of the invalid syntax used for the interlock.hostname? It could be that the nginx.conf did fail to reload after the offending container came up in the swarm on a new ip address and port and therefore the nginx.conf was stale and routing to an IP:port that is not assigned to a container. With the 1.3.2 release of interlock, the nginx logs will contain the ip and port it is routing to for a request (thanks @ehazlett). You can then compare that to your containers in the swarm to see where the requests are going.
@adpjay, here is an example of the invalid syntax (two dots in the hostname). We are planning to upgrade to Interlock 1.3.2 soon.
interlock.hostname: "ad.app..dc2.iat"
@ehazlett - Seems like this is something which needs to be addressed within the nginx container itself. Would you please add it as a defect/feature request?
@gseshagirirao thx -- i'm going to get a repro case as this should be handled as part of the SIGHUP by nginx itself. How are you running the nginx proxy containers?
Ok so here's the issue. When issuing a docker kill -s HUP, it is not returning an error, so Interlock thinks everything worked. I'm not sure why the signal isn't propagating back, but I think I have a workaround.
@ehazlett - Actually, nginx does validate and will NOT reload when it finds a problem with the conf file. But the problem is that someone has to fix it (e.g. remove the problematic containers after figuring out what happened once someone reports it). Until that is done, any new deployments to the swarm will not be reflected in nginx (the nginx reload will fail because of the error caused by that single container). In this case at least the currently loaded nginx conf will continue to work.
But the worst case happens when the ops team tries to restart the nginx containers (this usually happens when they think something is wrong with the nginx reload, and it did happen to us): all the nginx containers will FAIL to come up, bringing down all the environments.
Is there a possibility of including logic in Interlock to revert to the previously generated nginx conf if the newly created one is invalid (validate using the nginx -t option and check whether it is valid; not sure if it is a good idea, just throwing it out there)?
@rajarajanpsj correct. that's what I said in my previous comment. The ContainerKill API call is returning without error, so it appears that nginx properly reloaded even with a bad config.
Effectively it does keep the old config since it doesn't reload. I've added reporting the error as well as the problem with the config:
DEBU[0325] poller tick
DEBU[0325] reloading proxy container: id=e539d646ac210f417b775010128bb834f898d6a09ab14709e72f59011c28cf75 ext=nginx
ERRO[0325] error reloading container, invalid proxy configuration: 2017/06/06 18:05:01 [emerg] 126#126: invalid server name or wildcard "ad.app..dc2.iat." on 0.0.0.0:80 ext=nginx
INFO[0325] reload duration: 1300.40ms ext=lb
Closing as 425c4dc25e adds this.
@ehazlett Hi Evan, I think a roll back is needed though. If there is no reload - you are correct, that particular application will not be impacted and nginx will stay up. However, this also entails that any NEW events (whether related to the originating application or not) will not be picked up unless the troublesome container is removed. Does that make sense?
It's even more severe: if we don't roll back the nginx.conf change and the container is restarted, we're left with an nginx that has no configuration, and all containers within the swarm become unreachable via DNS. @ehazlett do you have an image that has the error instrumentation in it? I'm not seeing nginx getting updated in some cases, and your change may give more insight.
@niroowns a rollback would be defined as rolling back to the previous configuration. if you want to respond to new updates, that's not a rollback. With the latest update, an error is logged, and once you remove that erroneous config, updates will flow again. It doesn't seem feasible to build out the entire nginx config validation logic. Logging and reporting seems the saner approach.
@adpjay the latest ehazlett/interlock:dev has this.
@ehazlett Agreed that interlock shouldn't duplicate the config validation that nginx is doing. A rollback would be just putting back what was there before. It makes the swarm unchangeable until the offending container is fixed, but at least it continues to be usable in the case where nginx is restarted. Do you think you'd be able to log the container that triggered the config update that caused an error?
@adpjay since nginx isn't reloading it is continuing to use the original config which is effectively the same as a rollback.
the way that the update works is that the config is completely built before triggering the reload so currently there is no way to tell what triggered the invalid config.
@ehazlett Right, but if the nginx container gets restarted while it has a bad config in it, it will use the bad config on restart. We'd feel better if it had the last-known good config in case the nginx container gets restarted, which seems to happen often.
As far as knowing the bad container - do you know the container associated with the event that triggered the config rewrite? Let me know what you think about something like the steps below. I'm taking a guess on how some of this works, so take them with a grain of salt:
@adpjay i don't really like managing backup/restore config in the container, but you have a point. i've updated ehazlett/interlock:dev with this functionality. Can you try it and see if that's what you are looking for? If so, I'll open a PR.
@ehazlett I'm not seeing the rollback behavior. The build looks ok to me: interlock_1 | time="2017-06-06T16:48:48Z" level=info msg="interlock 1.4.0-dev (18e36cc)"
Can you post the logs please?
@ehazlett OK, we figured out the issue. We'll log another issue to track it but we have a work-around. Your config restore is working very well and the error message printed out is very useful:
time="2017-06-06T20:55:20Z" level=error msg="error reloading container, invalid proxy configuration: 2017/06/06 20:55:20 [emerg] 47#47: invalid server name or wildcard "interlock-route-a-test..a.a.example.com" on 0.0.0.0:80" ext=nginx
The reason why I wasn't seeing the log message yesterday and why I've been struggling lately to get the reloads to be triggered by interlock at all in my environment is because I have old and not-running interlock and nginx containers. We couldn't quite follow the logic in the proxyContainersToRestart() function of lb.lb.go, but the end result is that 6 interlock containers were found and 5 nginx containers were found, and the reload was done for 0 nginx containers (even though the running nginx container did get its nginx.conf file updated). Note that of the 6 interlock containers found, only 1 was running and I also only had 1 nginx container running.
One of our developers had updated the docker-compose file with the wrong syntax for interlock.hostname and redeployed the container, which in turn updated the nginx.conf; as a result nginx became unresponsive. Is there a way to validate the nginx.conf file automatically?