ochanism opened this issue 7 years ago
Thanks for the detailed report @ochanism. I've dug into this a bit but don't have a conclusive answer yet.
At one point we had a problem where we were dropping logs (this was at the root of #267 #294 #295 #296). We fixed that in #297 by making it so that we didn't close the logger for the health check until we stop polling entirely in Service.PollStop. And that shouldn't happen unless ContainerPilot gets its configuration reloaded or terminated (and in that case it's racy, but we can't really do much about that).
That being said, the error message definitely seems to be bubbling up from io.ErrClosedPipe, which, if you chase it down through the stdlib, is I think what we'd expect if the logger is closed when we try to write to it. I'm trying to write a repeatable repro now.
> Actually, Zookeeper health was fine and the health check script exited with code 0. But Zookeeper health on Consul was out-of-service.
Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single case.
I've broken this down a bit to confirm the error message is as expected, at least.
```go
package main

import (
    "os/exec"

    log "github.com/Sirupsen/logrus"
)

func main() {
    logger := log.StandardLogger().Writer()
    cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
    cmd.Stdout = logger
    cmd.Stderr = logger
    cmd.Start()
    // logger.Close() // uncomment this line to get expected err
    if err := cmd.Wait(); err != nil {
        log.Fatal(err)
    }
}
```
With the logger.Close() line commented out we get:
```
$ go run logbug.go
INFO[0000] stdout
INFO[0000] stderr
```
Whereas with that line uncommented we get:
```
$ go run logbug.go
FATA[0000] io: read/write on closed pipe
exit status 1
```
So I think the problem is now figuring out how or why we're closing the logger. With it only happening after a few days, getting a reliable repro will be key.
I think I've narrowed this down further with the following:
check.sh:
```bash
#!/bin/bash
while true; do
    echo 'check'
    sleep 1
done
```
logbug.go:
```go
package main

import (
    "os/exec"
    "time"

    log "github.com/Sirupsen/logrus"
)

func main() {
    logger := log.StandardLogger().Writer()
    cmd := exec.Command("./check.sh")
    cmd.Stdout = logger
    cmd.Stderr = logger
    err := cmd.Start()
    if err != nil {
        log.Fatalf("failed to start: %v", err)
    }
    go func() {
        timer := time.NewTimer(time.Second * 2)
        <-timer.C
        logger.Close()
    }()
    if err := cmd.Wait(); err != nil {
        log.Fatalf("failed: %v", err)
    }
}
```
output:
```
$ go run logbug.go
INFO[0000] check
INFO[0001] check
FATA[0003] failed: signal: broken pipe
exit status 1
```
Which means that in a case where the exec is running and we've had a chance to properly set up the loggers, we won't get the io: read/write on closed pipe error even if the logger is closed prematurely (we get a different error entirely). So the problem we're seeing is one where the logger is being closed before we call wait?
@ochanism do you have any other debug logs from the health check around this error? It might help to see if we got other messages.
I spent a bit of time trying to investigate whether it had to do with the timeout, which is the same as the poll interval, under the theory that subsequent runs were interfering with each other. I wrote a config that would exercise this area with a very tight interval and let it run for a good long while. But I still can't replicate the closing of the logger except artificially; there's no code path that hits that. I'm beginning to wonder whether the logger is a red herring, and whether it's another io.WriteCloser in the path that's getting closed when it shouldn't be.
I think I've figured out a way to bypass the issue entirely as well as solidify the fix in #297 and fix a regression in log formatting which I was planning on fixing for #393.
Instead of setting up individual loggers, I'm going to update Logrus to 1.0 and then use a context logger and attach the new logrus.Entry.Writer to the process stdout/stderr, rather than creating a new logger. I'll have a PR for v3 shortly (probably reviewed on Weds morning at this point, as it's a holiday here in the US tomorrow) and backport it to v2.
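For illustration, here's a rough sketch of that plan as I read it (not the actual PR; the field names are made up), using logrus 1.0's Entry.Writer and WriterLevel attached to the child process:

```go
package main

import (
    "os/exec"

    log "github.com/sirupsen/logrus" // logrus 1.0 uses the lowercase import path
)

func main() {
    // A context logger carrying job metadata; the fields here are hypothetical.
    ctxLog := log.WithFields(log.Fields{"job": "health-check"})

    // Entry.Writer/WriterLevel return *io.PipeWriter values that feed the
    // child's output into the shared context logger rather than a per-run logger.
    stdout := ctxLog.Writer()
    stderr := ctxLog.WriterLevel(log.ErrorLevel)
    defer stdout.Close()
    defer stderr.Close()

    cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
    cmd.Stdout = stdout
    cmd.Stderr = stderr

    if err := cmd.Run(); err != nil {
        ctxLog.Fatalf("health check failed: %v", err)
    }
}
```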
PR open for v3: https://github.com/joyent/containerpilot/pull/424
PR open for v2: https://github.com/joyent/containerpilot/pull/425
> Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single case.
@tgross, yes, I got the error every 10 sec.
> do you have any other debug logs from the health check around this error? It might help to see if we got other messages.
@tgross, I launched it with debug log level yesterday and I've been waiting for the error. I will provide the debug log to you soon if I get it.
Thanks for your fast support.
I got the debug logs concerning the error. It occurred 26 hours after the service started. According to the logs, the logger was not closed; it was broken by an oversized token. Do you have any idea about this log?
```
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run waiting for PID 73299:
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run waiting for PID 73424:
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run waiting for PID 73622:
2017/07/04 13:24:31 Error while reading from Writer: bufio.Scanner: token too long
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run waiting for PID 73772:
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run waiting for PID 73905:
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run#gofunc received quit
```
I found that Logrus 1.0 also has the same issue related to token size here. I don't think Logrus 1.0 would be safe from this error.
I think I figured out what caused it. The output string of the health check script did not contain a newline character at the end, so the output might have accumulated in the buffer.
The default value of bufio.MaxScanTokenSize is set to 64KB. In my case, the length of the output string was 7 bytes, the health check interval was 10 sec, and the service was in-service for 26 hours. I calculated the total length of the output during 26 hours like this: ((26 * 3600) / 10) * 7 = 65520. It is very close to 64KB.
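As a standalone sanity check of that threshold (a minimal sketch, not ContainerPilot code), feeding bufio.Scanner a stream just over bufio.MaxScanTokenSize with no newline fails with the same error seen in the logs:

```go
package main

import (
    "bufio"
    "fmt"
    "strings"
)

// Health-check output with no trailing '\n' accumulates as a single token.
// Once it crosses bufio.MaxScanTokenSize (64KB), Scan gives up.
func main() {
    unterminated := strings.Repeat("a", bufio.MaxScanTokenSize+1)

    scanner := bufio.NewScanner(strings.NewReader(unterminated))
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        fmt.Println(err) // bufio.Scanner: token too long
    }
}
```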
I added a newline character to the health check script to verify my hypothesis and I've been keeping an eye on it.
I confirmed that there was no error once the health check script printed a newline. You can easily reproduce the error with the health check script below.
```bash
#!/bin/bash
printf aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
```
That's some nice detective work there, @ochanism!
I'm glad you were able to find a workaround for now, but we should probably try to come up with something more robust to this failure. Next steps:
- Handling ErrTooLong (or ErrBufferFull) gracefully as suggested in that issue seems like a better approach.
- We could inject a newline at the end of every exec run so that we ensure the log entry is closed, but do so without adding a spurious empty log line in the case where the health check does exit with a newline.
- We could close the logger after wait, but we need to make sure we don't break all other logging in the process.

With respect to trying to handle the error in ContainerPilot, it looks like neither of the two approaches above is going to work (injecting a newline or capturing the error and handling it gracefully).
From the docs of the underlying stdlib interface (https://golang.org/pkg/bufio/#Scanner):
> Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token.
Once we've hit the error the logger is already broken. If it were possible to inject a newline we would reduce the risk of hitting the error, but we'll still hit it in the case where a single exec outputs more than 64KB without a newline. Handling this error within ContainerPilot would mean recreating the logger entirely, but even in that case we'd end up losing all the logs from any long-running jobs forever. This doesn't seem like a good approach, so we should try to get the fix done at the library level.
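As a sketch of what a library-level fix would roughly need to do (illustrative only; logLines is a hypothetical helper, not the logrus API): split the stream with something that degrades gracefully on over-long lines, such as bufio.Reader.ReadLine, instead of bufio.Scanner.

```go
package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
)

// logLines emits the stream line by line but never aborts on long lines:
// ReadLine returns over-long lines in buffer-sized chunks (flagged by its
// second return value) rather than failing the way bufio.Scanner does with
// ErrTooLong.
func logLines(r io.Reader, emit func(string)) error {
    br := bufio.NewReader(r)
    for {
        line, _, err := br.ReadLine()
        if len(line) > 0 {
            emit(string(line))
        }
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}

func main() {
    // 70KB of output with no trailing newline, like the failing health check.
    huge := strings.NewReader(strings.Repeat("a", 70*1024))
    if err := logLines(huge, func(s string) {
        fmt.Printf("log entry: %d bytes\n", len(s))
    }); err != nil {
        fmt.Println("error:", err)
    }
}
```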
I've pushed https://github.com/joyent/containerpilot/commit/3408f9539f52245af6f85eb6c5de9eddfef21794 to update the documentation to warn about this issue. I'm going to cut a release for 2.7.6 and 3.1.2 that gives us the logrus 1.0 update and closes any possibility of a race around log closing.
And then I'll circle back to look at https://github.com/sirupsen/logrus/issues/564, in particular what's happening there around the bufio.Scanner.Scan call.
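To make the failure chain easier to follow, here is a stripped-down model of the mechanism as I understand it (a sketch only; newLineWriter is a stand-in, not logrus's actual implementation): the writer is the write end of a pipe, a goroutine scans the read end line by line, and when Scan fails the goroutine closes the read end, after which every later write fails.

```go
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
)

// newLineWriter hands back the write end of a pipe; a goroutine scans the
// read end line by line. When Scan fails (e.g. a >64KB "line" with no
// newline), the goroutine stops and closes the read end, so every later
// write to the pipe returns io.ErrClosedPipe.
func newLineWriter(emit func(string)) (io.WriteCloser, <-chan struct{}) {
    pr, pw := io.Pipe()
    done := make(chan struct{})
    go func() {
        defer close(done)
        scanner := bufio.NewScanner(pr)
        for scanner.Scan() {
            emit(scanner.Text())
        }
        if err := scanner.Err(); err != nil {
            fmt.Println("Error while reading from Writer:", err)
        }
        pr.Close() // subsequent writes now fail with io.ErrClosedPipe
    }()
    return pw, done
}

func main() {
    w, done := newLineWriter(func(line string) { fmt.Println("INFO:", line) })

    // A health check that never prints '\n': once 64KB has accumulated, the
    // scanning goroutine gives up and closes its end of the pipe.
    big := bytes.Repeat([]byte("a"), 70*1024)
    if _, err := w.Write(big); err != nil {
        fmt.Println("first write:", err) // io: read/write on closed pipe
    }
    <-done

    // Every later health-check run writing to the same logger now fails too.
    if _, err := w.Write([]byte("check\n")); err != nil {
        fmt.Println("next write:", err) // io: read/write on closed pipe
    }
}
```

That sequence matches the logs above: one "token too long" error, then "io: read/write on closed pipe" on every subsequent run.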
I don't have the bandwidth right now to submit a fix to sirupsen/logrus. Ultimately this is an end-user issue that we've documented a warning for, and fixing the upstream problem doesn't have a ton of urgency. I'm going to mark this as blocked and help-wanted for the time being.
Another workaround option for someone running into this kind of problem will be to use the passthru logging option described in #462 and #437.
Recently, I updated the ContainerPilot version from 2.7.3 to 2.7.4 and then 2.7.5, and I got the error described below.
Actually, Zookeeper health was fine and the health check script exited with code 0. But Zookeeper health on Consul was out-of-service. When starting the service, it looked fine, but the problem occurred after a few days. I looked into the ContainerPilot code to figure out what caused it. The error came from here, and it might have been caused by closing the logger here. I couldn't figure out why the logger was closed or who closed it. When I sent SIGHUP to the ContainerPilot process, the logger pipe was recreated and the service health changed to in-service.
Anyway, in this situation, I think the service health should be in-service even if the logger is closed. I've attached some scripts below to help you debug this issue.
health check script
containerpilot configuration