ochanism opened this issue 7 years ago
Thanks for the detailed report @ochanism. I've dug into this a bit but don't have a conclusive answer yet.
At one point we had a problem where we were dropping logs (this was at the root of #267 #294 #295 #296). We fixed that in #297 by making it so that we didn't close the logger for the health check until we stop polling entirely in Service.PollStop. And that shouldn't happen unless ContainerPilot gets its configuration reloaded or terminated (and in that case it's racy, but we can't really do much about that).
That being said, the error message definitely seems to be bubbling up from io.ErrClosedPipe, which, if you chase it down through the stdlib, is I think what we'd expect if the logger is closed when we try to write to it. I'm trying to write a repeatable repro now.
> Actually, Zookeeper health was fine and the health check script exited with code 0. But Zookeeper health on Consul was out-of-service.
Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single case.
I've broken this down a bit to confirm the error message is as expected, at least.
```go
package main

import (
    "os/exec"

    log "github.com/Sirupsen/logrus"
)

func main() {
    logger := log.StandardLogger().Writer()
    cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
    cmd.Stdout = logger
    cmd.Stderr = logger
    cmd.Start()
    // logger.Close() // uncomment this line to get expected err
    if err := cmd.Wait(); err != nil {
        log.Fatal(err)
    }
}
```
With the logger.Close() line commented out we get:
```
$ go run logbug.go
INFO[0000] stdout
INFO[0000] stderr
```
Whereas with that line uncommented we get:
```
$ go run logbug.go
FATA[0000] io: read/write on closed pipe
exit status 1
```
So I think the problem is now figuring out how or why we're closing the logger. With it only happening after a few days, getting a reliable repro will be key.
I think I've narrowed this down further with the following:
check.sh:
```bash
#!/bin/bash
while true; do
    echo 'check'
    sleep 1
done
```
logbug.go:
```go
package main

import (
    "os/exec"
    "time"

    log "github.com/Sirupsen/logrus"
)

func main() {
    logger := log.StandardLogger().Writer()
    cmd := exec.Command("./check.sh")
    cmd.Stdout = logger
    cmd.Stderr = logger
    err := cmd.Start()
    if err != nil {
        log.Fatalf("failed to start: %v", err)
    }
    go func() {
        timer := time.NewTimer(time.Second * 2)
        <-timer.C
        logger.Close()
    }()
    if err := cmd.Wait(); err != nil {
        log.Fatalf("failed: %v", err)
    }
}
```
output:
```
$ go run logbug.go
INFO[0000] check
INFO[0001] check
FATA[0003] failed: signal: broken pipe
exit status 1
```
Which means that in a case where the exec is running and we've had a chance to properly set up the loggers, we won't get the io: read/write on closed pipe error even if the logger is closed prematurely (we get a different error entirely). So the problem we're seeing is one where the logger is being closed before we call wait?
@ochanism do you have any other debug logs from the health check around this error? It might help to see if we got other messages.
I spent a bit of time trying to investigate whether it had to do with the timeout, which is the same as the poll interval, under the theory that subsequent runs were interfering with each other. I wrote a config that would exercise this area with a very tight interval and let it run for a good long while. But I still can't replicate the closing of the logger except artificially; there's no code path that hits that. I'm beginning to wonder whether the logger is a red herring, and whether it's another io.WriteCloser in the path that's getting closed when it shouldn't be.
I think I've figured out a way to bypass the issue entirely as well as solidify the fix in #297 and fix a regression in log formatting which I was planning on fixing for #393.
Instead of setting up individual loggers, I'm going to update Logrus to 1.0 and then use a context logger and attach the new logrus.Entry.Writer to the process stdout/stderr, rather than creating a new logger. I'll have a PR for v3 shortly (probably reviewed on Weds morning at this point, as it's a holiday here in the US tomorrow) and backport it to v2.
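For illustration, here's a rough sketch of that plan as I read it (not the actual PR; the field names are made up), using logrus 1.0's Entry.Writer and WriterLevel attached to the child process:

```go
package main

import (
    "os/exec"

    log "github.com/sirupsen/logrus" // logrus 1.0 uses the lowercase import path
)

func main() {
    // A context logger carrying job metadata; the fields here are hypothetical.
    ctxLog := log.WithFields(log.Fields{"job": "health-check"})

    // Entry.Writer/WriterLevel return *io.PipeWriter values that feed the
    // child's output into the shared context logger rather than a per-run logger.
    stdout := ctxLog.Writer()
    stderr := ctxLog.WriterLevel(log.ErrorLevel)
    defer stdout.Close()
    defer stderr.Close()

    cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
    cmd.Stdout = stdout
    cmd.Stderr = stderr

    if err := cmd.Run(); err != nil {
        ctxLog.Fatalf("health check failed: %v", err)
    }
}
```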
PR open for v3: https://github.com/joyent/containerpilot/pull/424
PR open for v2: https://github.com/joyent/containerpilot/pull/425
> Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single case.
@tgross, yes, I got the error every 10 sec.
> do you have any other debug logs from the health check around this error? It might help to see if we got other messages.
@tgross, I launched it with debug log level yesterday and I've been waiting for the error. I will provide the debug log to you soon if I get it.
Thanks for your fast support.
I got the debug logs concerning the error. It occurred 26 hours after the service started. According to the logs, the logger was not closed; it was broken by an oversized token. Do you have any idea about this log?
```
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run waiting for PID 73299:
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run waiting for PID 73424:
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run waiting for PID 73622:
2017/07/04 13:24:31 Error while reading from Writer: bufio.Scanner: token too long
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run waiting for PID 73772:
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run waiting for PID 73905:
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run#gofunc received quit
```
I found that Logrus 1.0 also has the same issue related to token size here. I don't think Logrus 1.0 would be safe from this error.
I think I figured out what caused it. The output string of the health check script did not contain a newline character at the end, so the output might have accumulated in the buffer.
The default value of bufio.MaxScanTokenSize is set to 64KB. In my case, the length of the output string was 7 bytes, the health check interval was 10 sec, and the service was in-service for 26 hours. I calculated the total length of the output during 26 hours like this: ((26 * 3600) / 10) * 7 = 65520. It is very close to 64KB.
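As a standalone sanity check of that threshold (a minimal sketch, not ContainerPilot code), feeding bufio.Scanner a stream just over bufio.MaxScanTokenSize with no newline fails with the same error seen in the logs:

```go
package main

import (
    "bufio"
    "fmt"
    "strings"
)

// Health-check output with no trailing '\n' accumulates as a single token.
// Once it crosses bufio.MaxScanTokenSize (64KB), Scan gives up.
func main() {
    unterminated := strings.Repeat("a", bufio.MaxScanTokenSize+1)

    scanner := bufio.NewScanner(strings.NewReader(unterminated))
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        fmt.Println(err) // bufio.Scanner: token too long
    }
}
```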
I added a newline character to the health check script to verify my hypothesis and I've been keeping an eye on it.
I confirmed that there was no error once the health check script printed a newline. You can easily reproduce the error with the health check script below.
```bash
#!/bin/bash
printf aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
```
That's some nice detective work there, @ochanism!
I'm glad you were able to find a workaround for now, but we should probably try to come up with something more robust to this failure. Next steps:
- Handling ErrTooLong (or ErrBufferFull) gracefully as suggested in that issue seems like a better approach.
- We could inject a newline at the end of every exec run so that we ensure the log entry is closed, but do so without adding a spurious empty log line in the case where the health check does exit with a newline.
- We could close the logger after wait, but we need to make sure we don't break all other logging in the process.

With respect to trying to handle the error in ContainerPilot, it looks like neither of the two approaches above is going to work (injecting a newline or capturing the error and handling it gracefully).
From the docs of the underlying stdlib interface (https://golang.org/pkg/bufio/#Scanner):
> Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token.
Once we've hit the error the logger is already broken. If it were possible to inject a newline we would reduce the risk of hitting the error, but we'll still hit it in the case where a single exec outputs more than 64KB without a newline. Handling this error within ContainerPilot would mean recreating the logger entirely, but even in that case we'd end up losing all the logs from any long-running jobs forever. This doesn't seem like a good approach, so we should try to get the fix done at the library level.
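As a sketch of what a library-level fix would roughly need to do (illustrative only; logLines is a hypothetical helper, not the logrus API): split the stream with something that degrades gracefully on over-long lines, such as bufio.Reader.ReadLine, instead of bufio.Scanner.

```go
package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
)

// logLines emits the stream line by line but never aborts on long lines:
// ReadLine returns over-long lines in buffer-sized chunks (flagged by its
// second return value) rather than failing the way bufio.Scanner does with
// ErrTooLong.
func logLines(r io.Reader, emit func(string)) error {
    br := bufio.NewReader(r)
    for {
        line, _, err := br.ReadLine()
        if len(line) > 0 {
            emit(string(line))
        }
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}

func main() {
    // 70KB of output with no trailing newline, like the failing health check.
    huge := strings.NewReader(strings.Repeat("a", 70*1024))
    if err := logLines(huge, func(s string) {
        fmt.Printf("log entry: %d bytes\n", len(s))
    }); err != nil {
        fmt.Println("error:", err)
    }
}
```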
I've pushed https://github.com/joyent/containerpilot/commit/3408f9539f52245af6f85eb6c5de9eddfef21794 to update the documentation to warn about this issue. I'm going to cut a release for 2.7.6 and 3.1.2 that gives us the logrus 1.0 update and closes any possibility of a race around log closing.
And then I'll circle back to look at https://github.com/sirupsen/logrus/issues/564, in particular what's happening there around the bufio.Scanner.Scan call.
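To make the failure chain easier to follow, here is a stripped-down model of the mechanism as I understand it (a sketch only; newLineWriter is a stand-in, not logrus's actual implementation): the writer is the write end of a pipe, a goroutine scans the read end line by line, and when Scan fails the goroutine closes the read end, after which every later write fails.

```go
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
)

// newLineWriter hands back the write end of a pipe; a goroutine scans the
// read end line by line. When Scan fails (e.g. a >64KB "line" with no
// newline), the goroutine stops and closes the read end, so every later
// write to the pipe returns io.ErrClosedPipe.
func newLineWriter(emit func(string)) (io.WriteCloser, <-chan struct{}) {
    pr, pw := io.Pipe()
    done := make(chan struct{})
    go func() {
        defer close(done)
        scanner := bufio.NewScanner(pr)
        for scanner.Scan() {
            emit(scanner.Text())
        }
        if err := scanner.Err(); err != nil {
            fmt.Println("Error while reading from Writer:", err)
        }
        pr.Close() // subsequent writes now fail with io.ErrClosedPipe
    }()
    return pw, done
}

func main() {
    w, done := newLineWriter(func(line string) { fmt.Println("INFO:", line) })

    // A health check that never prints '\n': once 64KB has accumulated, the
    // scanning goroutine gives up and closes its end of the pipe.
    big := bytes.Repeat([]byte("a"), 70*1024)
    if _, err := w.Write(big); err != nil {
        fmt.Println("first write:", err) // io: read/write on closed pipe
    }
    <-done

    // Every later health-check run writing to the same logger now fails too.
    if _, err := w.Write([]byte("check\n")); err != nil {
        fmt.Println("next write:", err) // io: read/write on closed pipe
    }
}
```

That sequence matches the logs above: one "token too long" error, then "io: read/write on closed pipe" on every subsequent run.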
I don't have the bandwidth right now to submit a fix to sirupsen/logrus. Ultimately this is an end-user issue that we've documented a warning for, and fixing the upstream problem doesn't have a ton of urgency. I'm going to mark this as blocked and help-wanted for the time being.
Another workaround option for someone running into this kind of problem will be to use the passthru logging option described in #462 and #437.
Recently, I updated the ContainerPilot version from 2.7.3 to 2.7.4 and then 2.7.5, and I got the error described below.
Actually, Zookeeper health was fine and the health check script exited with code 0. But Zookeeper health on Consul was out-of-service. When starting the service, it looked fine, but the problem occurred after a few days. I looked into the ContainerPilot code to figure out what caused it. The error came from here, and it might have been caused by closing the logger here. I couldn't figure out why the logger was closed or who closed it. When I sent SIGHUP to the ContainerPilot process, the logger pipe was recreated and the service health changed to in-service.
Anyway, in this situation, I think the service health should be in-service even if the logger is closed. I've attached some scripts below to help you debug this issue.
health check script
containerpilot configuration