Closed. marckamerbeek closed this issue 5 years ago.
@marckamerbeek have you checked the logs from fluent-bit and Elasticsearch?
Yes... nothing in the logs. Nothing in Elastic. I'm starting to build my own fluent-bit with extra debug logging to see if that sheds some light.
@marckamerbeek how's your investigation going?
We've encountered the same thing with our AWS Elasticsearch cluster: fluent-bit stops randomly on one node or another. We recently upgraded to 0.13.x to fix a memory usage issue that kept restarting the fluent-bit pods, but now we just keep having to restart them manually almost daily.
Hi jpbelanger, still no clue. There is nothing in the logs. They just stop logging without a warning. The fluent-bit pods don't crash, they just stop logging. Today I'm starting to make my own build of fluent-bit with some extra debug logging. My time is limited, but I have a cronjob running every 2 hours to restart fluent-bit on 4 clusters, so that takes away the hurry for me. I'll keep you posted.
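For anyone using the same stopgap, a minimal sketch of the restart command (the logging namespace and the k8s-app=fluent-bit label are assumptions, adjust to your DaemonSet), run from a cronjob or any scheduler:

# delete the fluent-bit pods so the DaemonSet recreates them
kubectl -n logging delete pod -l k8s-app=fluent-bit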
@marckamerbeek thanks for the update. We might just roll back to the previous release we were running (0.12.11). It was using an insane amount of memory, but at least it was sending logs and restarting by itself when it reached the kube memory limit. I wonder why more people aren't having this same problem.
That's not an option for me since version 0.13.x contains some critical functionality for the kubernetes plugin. @edsiper could you please shed some light on this problem? I have no clue where to start. I'm able to do some work on fluent-bit now and build my own release, but my C skills are still not that good (but learning... :-))
Would you please provide some metrics snapshots using Prometheus? Meaning: when you don't see any messages being ingested into Elasticsearch, what do the Fluent Bit metrics look like?
We use Prometheus in our cluster, although not in combination with fluent-bit. I was not aware of that possibility. I'll instrument fluent-bit right away with Prometheus. Give me a sec...
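For reference, a sketch of how the metrics can be exposed: Fluent Bit has a built-in HTTP server that serves Prometheus-format metrics, enabled in the [SERVICE] section (the listen address and port below are just the common defaults):

[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

# scrape manually or point Prometheus at the endpoint:
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus

Counters such as fluentbit_input_records_total, fluentbit_output_proc_records_total, fluentbit_output_retries_total and fluentbit_output_errors_total usually show whether the input stalled or the output started failing.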
Sorry, it took a while. My grafana skills are not optimal. This is the fluent-bit data in Grafana:
@edsiper Hi, I'm still restarting the fluent-bit pods every two hours. I provided the metrics. Do you see some unusual things?
I see a high number of retries, which could mean problems reaching Elasticsearch, or Elasticsearch errors. Do you see any anomaly in the Elastic logs?
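A quick way to check the Elasticsearch side for the usual cause of retries, bulk rejections (the host and port are placeholders; on older ES versions the relevant thread pool is named bulk rather than write):

curl -s 'http://xxx.xxx.xxx.xxx:9200/_cat/thread_pool?v'    # watch the rejected column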
The last time I looked I saw some parsing errors, but I got rid of 90% of them, I think. I'll check it again; I'll need to wait to get access to the Elastic cluster again. I'll come back on this.
This sounds very similar to an issue I'm currently experiencing with fluent-bit 0.14.7.
We're running in AWS on Elastic Beanstalk (shudder), starting up a fluent-bit container alongside our apps and using it as a Docker logging driver. It works for most of the day, then suddenly, around the same time each day, and not even when it's any busier than usual, it just completely stops logging.
We currently send our logs to AWS Elasticsearch, as do many of our other applications, yet only one application is having problems, and we have other applications that are noisier than the one that occasionally stops logging.
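For context, that kind of setup typically looks something like the sketch below; this is not the poster's exact configuration, just the usual way to point Docker's fluentd log driver at a local fluent-bit forward input (the tag and image name are placeholders):

[INPUT]
    Name    forward
    Listen  0.0.0.0
    Port    24224

# application container, sending its stdout/stderr to the local fluent-bit:
docker run --log-driver=fluentd --log-opt fluentd-address=127.0.0.1:24224 --log-opt tag=myapp myapp:latest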
One potential reason I've seen is the Docker log rotation. The first log file I see contains 0 errors. 6 lines into starting a new log file, I start seeing this error:
[2018/11/21 18:19:32] [ Error] timerfd_settime: Invalid argument, errno=22 at /tmp/fluent-bit-0.14.7/lib/monkey/mk_core/mk_event_epoll.c:190
This happens continuously from the point where the log rotates until the instance is rebooted or terminated. I'm a little tempted at the moment to explore using fluentd instead, but the reason we moved from fluentd to fluent-bit was fluentd's slow startup time causing issues with Elastic Beanstalk starting our apps (i.e. Docker complains the logging driver is not ready and the app just reboots continuously).
We had the same issue on our 1.12.2 cluster; in our case it was DNS related, so the pods were not able to look up the ES cluster and we saw the same error. Not saying that it is the same issue in your case, but there are some general DNS issues at the moment, see https://github.com/kubernetes/kubernetes/issues/56903
[2018/11/21 18:19:32] [ Error] timerfd_settime: Invalid argument, errno=22 at /tmp/fluent-bit-0.14.7/lib/monkey/mk_core/mk_event_epoll.c:190
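If DNS is a suspect, a quick check from inside the fluent-bit pod (the namespace, pod name and ES hostname are placeholders; if the fluent-bit image has no nslookup, run the same from a busybox/debug pod on the same node):

# kubectl -n logging exec fluent-bit-xxxxx -- nslookup my-es-cluster.example.com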
We have a similar issue where fluent-bit stops tracking a few applications after a few hours. We use fluent-bit in EKS to push logs to Stackdriver. On further investigation, this is what I found out.

Our setup (fluent-bit configs attached):

- fluent-bit is set up to tail all files matching /var/log/containers/*.log. Kubernetes takes care of setting up the symlinks in /var/log/containers/ to the logs in /var/lib/docker/containers/ (through /var/log/pods/).
- By default, AWS EKS sets up Docker with the log-opts setting {"max-size":"10m","max-file":"10"} (in /etc/docker/daemon.json).

Issues:

- Gaps in container logs: when logs for a container rotate too fast (either for fluent-bit to keep up or for Kubernetes to update the symlinks), fluent-bit is still tailing one of the rotated files (*.{1..9}). By the time fluent-bit gets around to using the current log file symlink for that container in /var/log/containers/, some rotated files in between are never tailed (because /var/log/containers only has symlinks pointing to the current log file for each container). I think this can also be confirmed by examining the sqlite db, where it keeps a rotated flag for every log file (see the sketch after this comment); in cases where the rotated file is .2 or higher, I think the logs in between were missed.
- Container logs stop after a few hours: when I inspected the file handles held by fluent-bit for a container where it stopped tailing after a few hours, I noticed that it was holding on to a file handle for the last rotated file (.9) of that container. I suspect Docker's logging system deleted the .9 file before fluent-bit was able to finish tailing it, and somehow fluent-bit is unable to recover from this for that particular container (I don't know if it's a corner case because of inode reuse).

# sudo lsof -nP -a -p $(pgrep -f fluent-bi[t]) | grep -F b6a4c1b82014
fluent-bi 16157 root 54r REG 259,1 10000149 814921332 /var/lib/docker/containers/b6a4c1b82014891feb521f973c2823f3d291595b5a28c04d437093425e61ba7a/b6a4c1b82014891feb521f973c2823f3d291595b5a28c04d437093425e61ba7a-json.log.9 (deleted)

Workaround (planned):

- Change the EKS Docker log-opts to {"max-size":"100m","max-file":"2"} to reduce the log file rotation rate.
If it helps to investigate: it does seem to recover at some point many hours later (although I missed capturing the exact state of things at the point when it recovered).
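A quick way to inspect that tail DB, as mentioned in the gaps issue above (a sketch; the DB path comes from the tail input's DB setting, and the in_tail_files table/column names match current fluent-bit versions but may differ in older ones):

# sqlite3 /var/log/containers/fluent-bit.db 'SELECT name, offset, inode, rotated FROM in_tail_files;'

Comparing the offset and inode columns against ls -li of the rotated files on disk shows which rotations were skipped.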
I have the same problem
FYI: I've pushed a fix for the root cause of the problem. Thanks to everybody who helped troubleshoot it.
ref: https://github.com/fluent/fluent-bit/commit/1ef74fb0674490f665c3139a406896453f4d4a53
Fixed in the v1.1.1 release.
I am facing this issue in 1.8.4
@edsiper I still have the problem using v1.8.9. I notice that there is a gap between the symlinked log file and the original log file:
# ls -l /var/log/pods/xxx/controller/0.log
lrwxrwxrwx 1 root root Nov 8 16:58 /var/log/pods/xxx/controller/0.log -> /home/docker/containers/xxx-json.log
# ls -l /home/docker/containers/xxx-json.log
-rw-r----- 1 root root Nov 9 17:14 /home/docker/containers/xxx-json.log
Does fluent-bit fetch the symlinked log file and conclude that the file hasn't been updated?
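One way to check what the symlink resolves to versus the target itself (using the placeholder paths from above):

# stat -L /var/log/pods/xxx/controller/0.log     # dereferences the symlink
# stat /home/docker/containers/xxx-json.log      # the target file

If the inode and modification time match, the symlink itself is fine and the problem is more likely elsewhere (for example the tail DB offset).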
I got the same issue on the latest version.
I got the same issue on the latest version.
@pnl0dg7k @xuchenCN @ndhcoder @FranciscoPimentel, I suggest you open a new issue detailing how to reproduce your specific problem more precisely. This has now been closed, and commenting here will lead nowhere.
Fluent-bit stops sending logs to Elasticsearch after several hours without telling me why. The fluent-bit Docker container is not crashing; it just stops doing what it should do. What could be the reason? I can't figure out why. It happens in every one of the 3 clusters.
Situation: 3x 14-node k8s clusters (each including 3 masters)
Fluent-bit version: 0.13.3
Kubernetes version: 1.8.4
Node OS: Red Hat 7.4 Maipo
Resource limits k8s:
"resources": { "limits": { "cpu": "100m", "memory": "1Gi" }, "requests": { "cpu": "100m", "memory": "512Mi" } }
Configuration:
[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    debug

[INPUT]
    Buffer_Chunk_Size  1MB
    Buffer_Max_Size    25MB
    DB                 /var/log/containers/fluent-bit.db
    Exclude_Path       kube-system.log
    Mem_Buf_Limit      25MB
    Name               tail
    Parser             json
    Path               /var/log/containers/*.log
    Refresh_Interval   5
    Skip_Long_Lines    On
    Tag                kube.services.*

[FILTER]
    K8S-Logging.Parser   On
    K8S-Logging.exclude  True
    Kube_URL             https://${KUBERNETES_SERVICE_HOST}:443
    Match                kube.*
    Merge_JSON_Log       On
    Name                 kubernetes
    tls.verify           Off

[PARSER]
    Decode_Field_As  escaped log
    Format           json
    Name             json
    Time_Format      %Y-%m-%dT%H:%M:%S %z
    Time_Key         time

[OUTPUT]
    Host                 xxx.xxx.xxx.xxx
    Include_Tag_Key      true
    Logstash_DateFormat  %G.%V
    Logstash_Format      On
    Logstash_Prefix      k8s-services-tst
    Match                kube.services.*
    Name                 es
    Port                 9200
    Retry_Limit          False
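Given the output configuration above, a couple of checks worth running from an affected node when ingestion stops (the host is the placeholder from the config; the index pattern follows the Logstash_Prefix):

curl -s 'http://xxx.xxx.xxx.xxx:9200/_cluster/health?pretty'
curl -s 'http://xxx.xxx.xxx.xxx:9200/_cat/indices/k8s-services-tst-*?v'

Also note that Retry_Limit False tells Fluent Bit to retry failed chunks indefinitely, so a broken output never drops data but can keep the pipeline backed up.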