deis / logger

In-memory log buffer used by Deis Workflow.
https://deis.com
MIT License

Logger excessive CPU usage #102

Closed felixbuenemann closed 7 years ago

felixbuenemann commented 8 years ago

I'm running Deis Workflow v2.3.0 with logger v2.2.0 and I noticed that the logger process is consistently eating up over 60% CPU time on a dual-core m4.large instance.

Given that the whole cluster is a staging environment that is basically idle, this seems quite extreme…

The cluster consists of a t2.small controller (1 vCPU, 2GB) and two m4.large nodes (2 vCPU, 8GB) running CoreOS stable 1068.9.0 and K8s v1.3.4.

felixbuenemann commented 8 years ago

I attached strace to the logger process and it seems to be excessively calling clock_gettime():

clock_gettime(CLOCK_REALTIME, {1471469666, 739216323}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739343729}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739462448}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739588907}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739679038}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739772809}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739864743}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 739956176}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 740051386}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 740162804}) = 0
clock_gettime(CLOCK_REALTIME, {1471469666, 740262589}) = 0

This goes on forever. Sometimes it switches over to futex/select and then back to the above loop.

The logger itself appears to be working; I can fetch logs for running apps.

jchauncey commented 8 years ago

@krancour thoughts?


felixbuenemann commented 8 years ago

I also grabbed strace stats, not sure if it helps:

strace: Process 6404 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.67    1.677000       38114        44         5 futex
  5.27    0.093407           0    784501           clock_gettime
  0.04    0.000715           0      1973           write
  0.01    0.000215           9        23           select
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           close
  0.00    0.000000           0         3           sched_yield
  0.00    0.000000           0         1           epoll_wait
  0.00    0.000000           0         1           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.771337                786548         5 total

jchauncey commented 8 years ago

OK, so the problem is that we poll Redis every 1 second (configurable): https://github.com/deis/logger/blob/master/storage/redis_adapter.go#L126

My guess is that this is what is eating up CPU: https://github.com/deis/logger/blob/master/storage/redis_config.go#L19

Set that env var to something higher and see if the CPU usage comes down.
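
For reference, here's a minimal sketch of the kind of ticker-driven flush loop being discussed. The shape (a newMessagePipeliner-style constructor that starts a goroutine flushing on a time.Ticker at a configurable interval) follows the linked files; the type and method names below are simplified illustrations, not logger's actual code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// messagePipeliner buffers messages and flushes them on a timer,
// loosely mirroring the pattern in storage/redis_adapter.go.
type messagePipeliner struct {
	mu       sync.Mutex
	messages []string
	ticker   *time.Ticker
}

// newMessagePipeliner starts a flush loop that fires once per
// interval (1 second by default per the linked redis_config.go).
func newMessagePipeliner(interval time.Duration) *messagePipeliner {
	p := &messagePipeliner{ticker: time.NewTicker(interval)}
	go func() {
		for range p.ticker.C {
			p.flush()
		}
	}()
	return p
}

func (p *messagePipeliner) add(msg string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.messages = append(p.messages, msg)
}

func (p *messagePipeliner) flush() {
	p.mu.Lock()
	defer p.mu.Unlock()
	// The real adapter would issue a Redis pipeline here; this
	// sketch just reports and clears the buffer.
	fmt.Printf("flushing %d buffered messages\n", len(p.messages))
	p.messages = p.messages[:0]
}

func main() {
	p := newMessagePipeliner(time.Second)
	p.add("example log line")
	time.Sleep(3 * time.Second)
	p.ticker.Stop()
}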

felixbuenemann commented 8 years ago

I'm not sure. I wrote a simple example program that also uses time.NewTicker(), and it didn't make any calls to clock_gettime():

package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Second * time.Duration(2))
	go func() {
		for t := range ticker.C {
			fmt.Println("Tick at", t)
		}
	}()
	time.Sleep(time.Second * 30)
	ticker.Stop()
	fmt.Println("Ticker stopped")
}

$ strace -c ./ticktick
Tick at 2016-08-17 22:34:23.59722152 +0000 UTC
Tick at 2016-08-17 22:34:25.594551997 +0000 UTC
Tick at 2016-08-17 22:34:27.594866744 +0000 UTC
Tick at 2016-08-17 22:34:29.592562115 +0000 UTC
Tick at 2016-08-17 22:34:31.596463859 +0000 UTC
Tick at 2016-08-17 22:34:33.596571237 +0000 UTC
Tick at 2016-08-17 22:34:35.59338629 +0000 UTC
Tick at 2016-08-17 22:34:37.59699556 +0000 UTC
Tick at 2016-08-17 22:34:39.596993062 +0000 UTC
Tick at 2016-08-17 22:34:41.593949556 +0000 UTC
Tick at 2016-08-17 22:34:43.592599362 +0000 UTC
Tick at 2016-08-17 22:34:45.596517487 +0000 UTC
Tick at 2016-08-17 22:34:47.594338917 +0000 UTC
Tick at 2016-08-17 22:34:49.594084299 +0000 UTC
Ticker stopped
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0       114           rt_sigaction
  0.00    0.000000           0         8           rt_sigprocmask
  0.00    0.000000           0         3           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0        18         1 futex
  0.00    0.000000           0         1           sched_getaffinity
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   159         1 total

This doesn't show many calls to clock_gettime(); the program uses a futex call to wait on the timeout.

Looking at the Go sources, it seems that the calls to clock_gettime(CLOCK_REALTIME, ...) are caused by time.Now() instead.

Unless something goes totally wrong with the default for PipelineTimeout, which should be one second, I can't see how this would cause excessive CPU usage.
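
For anyone who wants to isolate the syscall pattern, a tiny busy loop over time.Now() (purely illustrative, not logger code) shows under strace -c whether clock_gettime is a real syscall on a given host; on instances where the vDSO fast path isn't available, each call goes through the kernel:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hammer time.Now() for five seconds; run the binary under `strace -c`.
	// If clock_gettime shows up with a huge call count, the host is not
	// using the vDSO fast path and every time.Now() is a real syscall.
	start := time.Now()
	calls := 0
	for time.Since(start) < 5*time.Second {
		_ = time.Now()
		calls++
	}
	fmt.Println("time.Now() calls in 5s:", calls)
}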

felixbuenemann commented 8 years ago

Could it be that the logger is accumulating lots of tickers because it never stops existing ones?

If I read the code correctly, each call to newMessagePipeliner() creates a new ticker.
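
To illustrate the hypothesis (hypothetical code that only mimics the pattern, not the actual newMessagePipeliner): a ticker that is created but never stopped keeps its goroutine and its runtime timer alive for the life of the process, so repeatedly creating pipeliners would pile up timer work:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyPipeliner mimics a pipeliner that never calls ticker.Stop():
// the flush goroutine and the underlying runtime timer stay alive
// for the life of the process.
func leakyPipeliner() {
	ticker := time.NewTicker(time.Second) // never stopped
	go func() {
		for range ticker.C {
			// pretend to flush a pipeline
		}
	}()
}

func main() {
	for i := 0; i < 1000; i++ {
		leakyPipeliner()
	}
	time.Sleep(5 * time.Second)
	// 1000 tickers each firing once per second keep the runtime's
	// timer machinery (and its clock reads) busy indefinitely.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}

If something in the log path created one of these per connection or per reconnect, the futex/clock_gettime churn in the strace summary above would add up quickly.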

krancour commented 8 years ago

@jchauncey that's not actually polling Redis, but I get where you were going...

I also would have guessed, as you did, that the ticker might have been the explanation for this issue.

@felixbuenemann the results of your test program are interesting.

My gut still says that a "ticker" of some sort (so, perhaps not a literal time.Ticker) is involved in this.

There are other spots in the code (or in libraries we use) where I'm certain things are happening on intervals. Consider, for instance, our usage of go-nsq for listening to the log topic.

felixbuenemann commented 8 years ago

@krancour There are lots of calls to time.Now() in go-nsq…

jchauncey commented 8 years ago

Added pprof in #108 to help debug this.

colemickens commented 8 years ago

I'm seeing this as well.

[screenshot attached: screen shot 2016-09-06 at 2 02 50 am]

I'm leaving it running in case there's anything useful to be gathered from the misbehaving instance.

jeroenvisser101 commented 8 years ago

I've had this too. We've changed the storage adapter as a workaround (from redis -> memcached), which solved the high CPU.

jchauncey commented 8 years ago

That's interesting. Did you need to make any changes to logger to support memcached?


jeroenvisser101 commented 8 years ago

@jchauncey oh wow, I think I need another coffee, I meant: redis -> memory

jchauncey commented 8 years ago

In the next Workflow release we will have a new logger image which adds a pprof debug endpoint. I am hoping that we can use it to pinpoint what is happening on clusters that are seeing high CPU usage as well as memory leaks.

felipejfc commented 7 years ago

I can confirm I'm also hitting this issue; my deis-logger is eating CPU here as well.

felixbuenemann commented 7 years ago

@felipejfc @jchauncey added additional profiling in release 2.5; maybe he can give instructions on how to use it to debug the issue.

bacongobbler commented 7 years ago

I'm currently looking into the CPU/memory issues, but you can profile yourself using pprof on port 8080 within the logger container. See https://golang.org/pkg/net/http/pprof/

felipejfc commented 7 years ago

@bacongobbler you mean 8099

bacongobbler commented 7 years ago

Thanks, yeah, 8080 was off the cuff; port 8099 is the correct port, and it's only accessible from within the container itself. I'm letting my cluster run overnight and will profile it in the morning to see what kind of data we get from the profiler.
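
For anyone else profiling: net/http/pprof only needs a blank import and an HTTP server to expose its handlers. A minimal sketch of how such a debug endpoint is typically wired up (the :8099 port follows the comment above; this is illustrative, not logger's actual wiring):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on the debug port. From inside the
	// container a CPU profile can then be fetched with, e.g.:
	//   go tool pprof http://localhost:8099/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:8099", nil))
}

By default, go tool pprof against the /debug/pprof/profile path collects a 30-second CPU profile.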

felixbuenemann commented 7 years ago

Switched the deis-logger rc to quay.io/deisci/logger:git-02d5435 from #113 and will see how it goes.

jeroenvisser101 commented 7 years ago

Going to do the same as @felixbuenemann 👍