NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License
18.07k stars 14.08k forks source link

nixos-rebuild doesn't log root cause of fail when services fails during activation #20099

Open danbst opened 8 years ago

danbst commented 8 years ago

Issue description

I often do nixos-rebuild with wrong configuration. Malconfigured services (for example, nginx) output errors on system activation. Here is excerpt from one of my nixops deploy:

jolly.roger> ● nginx.service - Nginx Web Server
jolly.roger>    Loaded: loaded (/nix/store/k4nws8cksrzv2x7ns2qrhv8pb1xlxig8-unit-nginx.service/nginx.service; bad; vendor preset: enabled)
jolly.roger>    Active: failed (Result: start-limit-hit) since Wed 2016-11-02 15:24:40 UTC; 39s ago
jolly.roger>   Process: 21024 ExecStart=/nix/store/lsi5kh270laxx055isx0b6x25ih8fblx-nginx-1.10.1/bin/nginx -c /nix/store/5fq692qlk1hlwlrnw5h58y1ilj1idcg
jolly.roger>   Process: 21018 ExecStartPre=/nix/store/j18i59gli6w66khbn0dr5yh8d1sk9fa3-unit-script/bin/nginx-pre-start (code=exited, status=0/SUCCESS)
jolly.roger>  Main PID: 21024 (code=exited, status=1/FAILURE)
jolly.roger>
jolly.roger> Nov 02 15:24:29 jolly.roger systemd[1]: nginx.service: Unit entered failed state.
jolly.roger> Nov 02 15:24:29 jolly.roger systemd[1]: nginx.service: Failed with result 'exit-code'.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Service hold-off time over, scheduling restart.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: Stopped Nginx Web Server.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Start request repeated too quickly.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: Failed to start Nginx Web Server.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Unit entered failed state.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Failed with result 'start-limit-hit'.
jolly.roger> error: unable to activate new configuration
error: activation of 1 of 1 machines failed (namely on ‘jolly.roger’)

The problem is, I don't get the root cause in this output, but lots of duplication. Here is my transcript of above message:

jolly.roger> Nov 02 15:24:29 jolly.roger systemd[1]: nginx.service: failed!
jolly.roger> Nov 02 15:24:29 jolly.roger systemd[1]: nginx.service: failed!
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: failed, I'm going to restart it!
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: Stopped Nginx Web Server.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: failed!
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: Failed to start Nginx Web Server.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: failed!
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Failed with result 'start-limit-hit'.
jolly.roger> error: unable to activate new configuration

Would be much greater if I got the root cause in this output:

jolly.roger> Nov 02 15:24:29 jolly.roger nginx[21024]: nginx: [warn] server name "http://example.com" has suspicious symbols in /nix/store/5fq692qlk1hlwlrnw5h58y1ilj1idcgd-nginx.conf:60
jolly.roger> Nov 02 15:24:29 jolly.roger nginx[21024]: nginx: [emerg] invalid URL prefix in /nix/store/5fq692qlk1hlwlrnw5h58y1ilj1idcgd-nginx.conf:69
jolly.roger> Nov 02 15:24:29 jolly.roger systemd[1]: nginx.service: Unit entered failed state.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: nginx.service: Service hold-off time over, scheduling restart.
jolly.roger> Nov 02 15:24:40 jolly.roger systemd[1]: Stopped Nginx Web Server.
jolly.roger> error: unable to activate new configuration

This isn't specific to nginx, I've seen this to many other services (for example, for containers).

Here are two issues: 1) too many duplicated "failed!" logs outputed 2) stderr isn't redirected as err priority to journald

PS. I have a fix for 2), but it requires lots of polishing

Steps to reproduce

services.nginx.enable = true;
services.nginx.virtualHosts."bla".locations."/".proxyPass = "100";

Technical details

nixos 16.09

danbst commented 8 years ago

The patch for 2)

diff --git a/nixos/modules/system/boot/systemd.nix b/nixos/modules/system/boot/systemd.nix
index 338b068..14a1ef1 100644
--- a/nixos/modules/system/boot/systemd.nix
+++ b/nixos/modules/system/boot/systemd.nix
@@ -185,7 +185,14 @@ let

   makeJobScript = name: text:
     let mkScriptName =  s: (replaceChars [ "\\" ] [ "-" ] (shellEscape s) );
-        x = pkgs.writeTextFile { name = "unit-script"; executable = true; destination = "/bin/${mkScriptName name}"; inherit text; };
+        realScriptText = ''
+          #! ${pkgs.stdenv.shell} -e
+          (
+          _DUMMY_=
+          ${text}
+          ) 2> >( while read line; do echo "<5>''${line}"; done)
+        '';
+        x = pkgs.writeTextFile { name = "unit-script"; executable = true; destination = "/bin/${mkScriptName name}"; text = realScriptText; };
     in "${x}/bin/${mkScriptName name}";

   unitConfig = { name, config, ... }: {
@@ -231,37 +238,31 @@ let
         }
         (mkIf (config.preStart != "")
           { serviceConfig.ExecStartPre = makeJobScript "${name}-pre-start" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.preStart}
             '';
           })
         (mkIf (config.script != "")
           { serviceConfig.ExecStart = makeJobScript "${name}-start" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.script}
             '' + " " + config.scriptArgs;
           })
         (mkIf (config.postStart != "")
           { serviceConfig.ExecStartPost = makeJobScript "${name}-post-start" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.postStart}
             '';
           })
         (mkIf (config.reload != "")
           { serviceConfig.ExecReload = makeJobScript "${name}-reload" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.reload}
             '';
           })
         (mkIf (config.preStop != "")
           { serviceConfig.ExecStop = makeJobScript "${name}-pre-stop" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.preStop}
             '';
           })
         (mkIf (config.postStop != "")
           { serviceConfig.ExecStopPost = makeJobScript "${name}-post-stop" ''
-              #! ${pkgs.stdenv.shell} -e
               ${config.postStop}
             '';
           })

Current problems with patch: containers fail to start because of some fd error; some services log info messages to stderr; when stderr is appended to stdout line, we get ugly <5> in info logs.

Profpatsch commented 6 years ago

(triage) does this still happen?

danbst commented 6 years ago

Yes, however this should be tagged with "Feature request", I wouldn't call this a bug.

stale[bot] commented 4 years ago

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.