fluent-plugins-nursery / fluent-plugin-systemd

This is a fluentd input plugin. It reads logs from the systemd journal.
Apache License 2.0

Journal rotation fix #30

Closed · errm closed this 7 years ago

errm commented 7 years ago

Should fix #26 #27

I have pushed 0.1.1.pre to rubygems...

@taskinen @evmin @dannyk81 would you be able to check if this fixes things for you?
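
If it helps, a pre-release gem needs an explicit version (or the --pre flag) when installing; on a td-agent based setup that would be something like:

td-agent-gem install --no-document fluent-plugin-systemd -v 0.1.1.pre

(or gem install fluent-plugin-systemd --pre on a plain Ruby install).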

taskinen commented 7 years ago

Thanks @errm. I installed 0.1.1.pre to my test environment and will report the outcome once the journal log rotates.

dannyk81 commented 7 years ago

Thank you! running on my test cluster now, will update.

dannyk81 commented 7 years ago

@errm

I'm seeing some strange behavior: a single line of log just keeps repeating over and over. It seems to be the last line in the journal for each source tag.

read_from_head is set to false, and I also tried deleting all the *.pos files and restarting all the fluentd pods, but the result looks the same.

_time   HOSTNAME    source  MESSAGE
2017-02-08 18:22:20.255 rndv-micro03    system.kube-proxy   I0208 18:22:20.255743 16914 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:20.255 rndv-micro03    system.kube-proxy   I0208 18:22:20.255743 16914 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:21:30.236 rndv-micro01    system.kube-proxy   I0208 18:21:30.236319 21596 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:21:30.236 rndv-micro01    system.kube-proxy   I0208 18:21:30.236319 21596 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:05.207 rndv-micro02    system.kube-proxy   I0208 18:22:05.207468 12527 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:05.207 rndv-micro02    system.kube-proxy   I0208 18:22:05.207468 12527 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:05.207 rndv-micro02    system.kube-proxy   I0208 18:22:05.207468 12527 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:21:30.236 rndv-micro01    system.kube-proxy   I0208 18:21:30.236319 21596 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:21:30.236 rndv-micro01    system.kube-proxy   I0208 18:21:30.236319 21596 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:20.255 rndv-micro03    system.kube-proxy   I0208 18:22:20.255743 16914 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2017-02-08 18:22:20.255 rndv-micro03    system.kube-proxy   I0208 18:22:20.255743 16914 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600

errm commented 7 years ago

Hmm, ok. My strategy here was to reinitialise the journal reading after all the entries have been read. It seems that the implementation of seek is not quite right here, so we are seeking to the entry we have just read rather than to the next one. We accepted a pull request a few weeks ago that made a change to this logic, so I will take a look at that....
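
For anyone following along, this is the same distinction journalctl exposes with its cursor options: seeking to a saved cursor positions the read pointer at that entry, not after it. A quick way to see it on any systemd host (this is just the journal's behaviour, not the plugin's code):

# Take the cursor of the most recent journal entry
CURSOR=$(journalctl -n 1 -o export | sed -n 's/^__CURSOR=//p')

# Seeking *to* that cursor returns the same entry again...
journalctl -q --cursor="$CURSOR" -o cat

# ...whereas seeking *after* it only returns entries written since, which is
# what the plugin needs to do when it re-initialises its read position.
journalctl -q --after-cursor="$CURSOR" -o cat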

errm commented 7 years ago

Hi @dannyk81 and @taskinen, I have just pushed 0.1.1.pre2 to rubygems... I think this should now work correctly for you :crossed_fingers:

dannyk81 commented 7 years ago

💯 Deploying now, will update :)

dannyk81 commented 7 years ago

@errm

This fixed the issue with seek :+1: I will update once journal rotation happens; hopefully we'll be able to close this.

dannyk81 commented 7 years ago

@errm I'm afraid I have bad news on my end. I'm collecting logs from 5 systemd services on 6 nodes, and exactly at midnight I stopped receiving events from 4 out of the 5 services across all 6 nodes (the exception is the docker service). See the graph below, which shows the count of events per source tag over time.

[graph: count of events per source tag over time]

When I took a closer look at the docker events that supposedly continued to come in, I noticed that they are in fact the same event, repeated over and over.

taskinen commented 7 years ago

I am seeing the exact same issue. After the journal rotates, the only logs I successfully get into Graylog are repeated old ones.

errm commented 7 years ago

Ok yuck... I am going to keep plugging at this then ...

reevoo-samuel commented 7 years ago

Er, hello?

This Pull Request is too damn old! Merge or close this, sucka.

errm commented 7 years ago

So I still have this on my radar. I did get some help here: https://github.com/ledbettj/systemd-journal/issues/70 so I will look at getting a new version together tomorrow...

dannyk81 commented 7 years ago

@errm looking forward to it, ready to test when available.

errm commented 7 years ago

So I have done some more testing once I found a dev machine with systemd... it seems that this fix is working for me ...

I suspect that the fix is not working inside of a docker container, for some reason ... checking it out now with the journal in a mount...

errm commented 7 years ago

So I have tried running fluentd in a container and can't seem to reproduce the problem... Could you provide some more info about your host and the fluentd container you are using... e.g. systemd versions etc.?

dannyk81 commented 7 years ago

That's odd indeed, here are the versions I use:

OS details:

# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

# cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core)

systemd:

# systemctl --version
systemd 219

docker:

# docker info|grep Version
Server Version: 1.12.5
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Kernel Version: 3.10.0-514.6.1.el7.x86_64

fluentd:

2017-02-17 15:20:29 +0000 [info]: starting fluentd-0.14.12

This is my Dockerfile for fluentd container:

FROM gcr.io/google_containers/ubuntu-slim:0.5

# Ensure there are enough file descriptors for running Fluentd.
RUN ulimit -n 65536

# Disable prompts from apt.
ENV DEBIAN_FRONTEND noninteractive

# Install prerequisites.
RUN apt-get update && \
    apt-get install -y -q curl make g++ sudo libsystemd0 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Install Fluentd.
RUN /usr/bin/curl -sSL https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent2.sh | sh

# Change the default user and group to root.
# Needed to allow access to /var/log/docker/... files.
RUN sed -i -e "s/USER=td-agent/USER=root/" -e "s/GROUP=td-agent/GROUP=root/" /etc/init.d/td-agent

# Install the Kubernetes, systemd and Splunk Fluentd plug-ins.
# Note: net-http-persistent v2.9.4 library is required by splunk plugin, but not automatically installed
RUN td-agent-gem install --no-document net-http-persistent -v 2.9.4
RUN td-agent-gem install --no-document fluent-plugin-systemd -v 0.1.1.pre2
RUN td-agent-gem install --no-document fluent-plugin-kubernetes_metadata_filter fluent-plugin-splunkapi

# Remove docs and postgres references
RUN rm -rf /opt/td-agent/embedded/share/doc \
    /opt/td-agent/embedded/share/gtk-doc \
    /opt/td-agent/embedded/lib/postgresql \
    /opt/td-agent/embedded/bin/postgres \
    /opt/td-agent/embedded/share/postgresql

RUN apt-get remove -y make g++
RUN apt-get autoremove -y
RUN apt-get clean -y

RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Copy the Fluentd configuration file.
COPY td-agent.conf /etc/td-agent/td-agent.conf

# Environment variables for configuration
# FLUENTD_ARGS cannot be empty, so a placeholder is used. It should not have any effect because it is a default.
ENV LD_PRELOAD /opt/td-agent/embedded/lib/libjemalloc.so
ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR 0.9
ENV SPLUNK_SERVER localhost:8089
ENV SPLUNK_AUTH admin:changme
ENV SPLUNK_INDEX main
ENV FLUENTD_ARGS --use-v1-config
ENV KUBE_VERIFY_SSL true

# Run the Fluentd service.
ENTRYPOINT "exec" "td-agent" "$FLUENTD_ARGS"

errm commented 7 years ago

I am not sure if this is the issue, but your container more than likely has systemd 229 while your host is still on systemd 219. To get around a similar issue the GCP folks bind mount libsystemd from the host and then copy it into place... It's a bit of an ugly hack, but perhaps worth a try for the sake of confirming / denying this theory...

See https://github.com/kubernetes/kubernetes/commit/a1ec542d7c89b202c3a976d2a10c867f11d30fa4 for an example of this...
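
The shape of the workaround is roughly: bind mount the host's library directory into the container and copy libsystemd over the container's own copy before starting td-agent. A rough sketch (paths are assumptions, not taken verbatim from that commit, so adjust for your image):

# sketch only: /host/lib is a hostPath mount of the host's library dir
# (/usr/lib64 on CentOS); the destination is where an Ubuntu-based image
# keeps its own libsystemd
cp /host/lib/libsystemd.so.0* /lib/x86_64-linux-gnu/
exec td-agent $FLUENTD_ARGS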

The only other thing I can think of is perhaps an issue with privileges. I am assuming you shouldn't have any problems since you can read these files ok, but perhaps something more fine-grained is being set up?
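
For example, POSIX ACLs on the journal directory (they show up as a trailing + in ls -l output) might restrict reads more than the plain mode bits suggest; something like getfacl would show them:

getfacl /var/log/journal /var/log/journal/*/system.journal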

btw I just pushed 0.1.1.pre3, with a few changes, but since I can't reproduce the issue I am fairly sure it will make no odds to you...

dannyk81 commented 7 years ago

Interesting! let me try this out and report back :smile:

EDIT: Regarding permissions, there's nothing out of the ordinary:

# pwd
/var/log
# ls -lsa | grep journal
    0 drwxr-sr-x.  3 root   systemd-journal         45 Dec  7 15:34 journal
# ls -lsaR journal/
journal/:
total 16
 0 drwxr-sr-x.  3 root systemd-journal   45 Dec  7 15:34 .
 4 drwxr-xr-x. 12 root root            4096 Feb 12 03:10 ..
12 drwxr-sr-x+  2 root systemd-journal 4096 Feb 15 15:47 21964f7e436e499a8cf0b977ab1fcb30

journal/21964f7e436e499a8cf0b977ab1fcb30:
total 1859792
    12 drwxr-sr-x+ 2 root systemd-journal      4096 Feb 15 15:47 .
     0 drwxr-sr-x. 3 root systemd-journal        45 Dec  7 15:34 ..
106504 -rw-r-x---+ 1 root root            109051904 Jan 18 08:35 system@a5a25f67f50249f58a17bd20f105d487-00000000002d640a-00054642e7b06e4f.journal
106504 -rw-r-x---+ 1 root root            109051904 Jan 19 09:25 system@a5a25f67f50249f58a17bd20f105d487-0000000000300bdd-0005465a4d3092e1.journal
114696 -rw-r-x---+ 1 root root            117440512 Jan 20 08:23 system@a5a25f67f50249f58a17bd20f105d487-000000000032aeea-0005466f1ec9fb81.journal
114696 -rw-r-x---+ 1 root root            117440512 Jan 21 08:12 system@a5a25f67f50249f58a17bd20f105d487-0000000000355228-000546825f277e9d.journal
131080 -rw-r-x---+ 1 root systemd-journal 134217728 Jan 29 00:54 system@a5a25f67f50249f58a17bd20f105d487-000000000037dfd8-00054696547da493.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  1 05:12 system@a5a25f67f50249f58a17bd20f105d487-00000000003a2cf5-000547312532c10f.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  2 19:17 system@a5a25f67f50249f58a17bd20f105d487-00000000003cb6f3-0005477118a28812.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  4 11:51 system@a5a25f67f50249f58a17bd20f105d487-00000000003f53b6-0005479104758ff4.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  6 03:34 system@a5a25f67f50249f58a17bd20f105d487-000000000041f927-000547b303c8ca11.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  7 19:23 system@a5a25f67f50249f58a17bd20f105d487-0000000000449ec7-000547d44eafb926.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb  9 09:58 system@a5a25f67f50249f58a17bd20f105d487-0000000000474323-000547f5ae3118c1.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb 10 23:35 system@a5a25f67f50249f58a17bd20f105d487-000000000049e628-00054816045dfa89.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb 12 13:08 system@a5a25f67f50249f58a17bd20f105d487-00000000004c8bf6-000548358e736b2f.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb 14 02:49 system@a5a25f67f50249f58a17bd20f105d487-00000000004f3206-000548550820ed66.journal
106504 -rw-r-----+ 1 root systemd-journal 109051904 Feb 15 15:47 system@a5a25f67f50249f58a17bd20f105d487-000000000051d812-000548749cab5a30.journal
 81928 -rw-r-----+ 1 root systemd-journal  83886080 Feb 17 16:41 system.journal
  8196 -rw-r-x---+ 1 root root              8388608 Jan 19 09:25 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000032027d-0005466a09fb8752.journal
  8196 -rw-r-x---+ 1 root root              8388608 Jan 21 08:12 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000036399d-00054688f983b164.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Jan 29 00:54 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000037efe5-00054696cbe4c272.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  1 05:12 user-1000@80b290d668e84ab88f67c0e3698d2ed7-00000000003a2e29-0005473202bf6cae.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  2 19:17 user-1000@80b290d668e84ab88f67c0e3698d2ed7-00000000003f0c1b-0005478d7cf7e0a9.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  4 11:51 user-1000@80b290d668e84ab88f67c0e3698d2ed7-0000000000412a92-000547a8db8c0d23.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  7 19:23 user-1000@80b290d668e84ab88f67c0e3698d2ed7-0000000000458dd5-000547e025642331.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  9 09:58 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000048baf4-000548086f68273b.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 14 02:49 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000051d74d-00054874949a9fc2.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 15 15:47 user-1000@80b290d668e84ab88f67c0e3698d2ed7-000000000052723c-0005487bb7f98b64.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 17 15:12 user-1000.journal
  8196 -rw-r-x---+ 1 root root              8388608 Feb  1 05:12 user-16777218@9199bbabe044403dbb9f2b7305342aa5-00000000003b4248-000547619848ad5f.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 15 15:50 user-16777218.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb  2 19:17 user-16777222@1450d32b88fe4b9b80cbfe5fe2bf882d-00000000003eff2e-0005478cd86ec97e.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 15 15:50 user-16777222.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 15 15:47 user-16777227@d699eac4978247418fc1fabdc7f9146c-0000000000544742-00054891392d2779.journal
  8196 -rw-r-x---+ 1 root systemd-journal   8388608 Feb 15 15:50 user-16777227.journal

and SELinux is set to Permissive

# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28

dannyk81 commented 7 years ago

No luck on my end :(

I implemented the hack of replacing libsystemd with the one from the host and also used 0.1.1.pre3; unfortunately the result is the same... I wonder if this is a fundamental issue caused by systemd 219?

@taskinen can you provide the details of your environment? Does it match my spec?

errm commented 7 years ago

I will try spinning up a CentOS 7 VM to see if I can work out what's going on ... I may get to it on Monday or Tuesday...

So far I have just been testing on a Ubuntu Host + Ubuntu container...

errm commented 7 years ago

I have tried with a CentOS 7 host (a Vagrant machine) with Docker 1.12.5, using your Dockerfile. I can reproduce the issue with version 0.1.0 of the gem, but it does not occur on 0.1.1.pre3.

I am wondering if there is some issue relating to Kubernetes launching the pod... could you share the pod config you are using so I can try and debug the issue?

I am going to merge this branch, as it does fix the issue on the systems I have tested on, and I think it is a valuable update to release because the current version 0.1.0 is broken everywhere. I will leave #26 and #27 open in order to continue working on this issue...

dannyk81 commented 7 years ago

@errm Thanks for all your efforts to get this sorted!!! It's quite frustrating that I keep hitting this issue despite all these fixes...

Here's my Pod definition. I run it as a DaemonSet (so that each node has a fluentd instance collecting logs), and each pod has two containers; the second one is the Splunk universal forwarder.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-splunk-uf
  namespace: kube-system
  labels:
    name: fluentd-logging
spec:
  template:
    metadata:
      labels:
        k8s-app: fluentd-logging
    spec:
      containers:
      - name: fluentd
        image: <local-docker>/fluentd:v0.0.3.pre2
        imagePullPolicy: Always
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        env:
        - name: "FLUENTD_ARGS"
          valueFrom:
            configMapKeyRef:
              name: fluentd-splunk-uf-config
              key: fluentd.args
        - name: "KUBE_NODE_NAME"
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: td-agent-config-volume
          mountPath: /etc/td-agent
        - name: var-log
          mountPath: /var/log
        - name: var-lib-docker-containers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: lib-systemd-dir
          mountPath: /host/lib
          readOnly: true
      - name: splunk-uf
        image: <local-docker>/splunk-universalforwarder:6.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        env:
        - name: "SPLUNK_START_ARGS"
          valueFrom:
            configMapKeyRef:
              name: fluentd-splunk-uf-config
              key: splunk.start.args
        - name: "SPLUNK_FORWARD_SERVER"
          valueFrom:
            configMapKeyRef:
              name: fluentd-splunk-uf-config
              key: splunk.forward.server
        - name: "SPLUNK_USER"
          valueFrom:
            configMapKeyRef:
              name: fluentd-splunk-uf-config
              key: splunk.user
        volumeMounts:
        - name: opt-splunk-etc
          mountPath: /opt/splunk/etc
        - name: opt-splunk-var
          mountPath: /opt/splunk/var
      terminationGracePeriodSeconds: 30
      volumes:
      - name: var-log
        hostPath:
          path: /var/log
      - name: var-lib-docker-containers
        hostPath:
          path: /var/lib/docker/containers
      - name: opt-splunk-etc
        emptyDir: {}
      - name: opt-splunk-var
        emptyDir: {}
      - configMap:
          name: fluentd-splunk-uf-config
        name: td-agent-config-volume
      - name: lib-systemd-dir
        hostPath:
          path: /usr/lib64

Here's my ConfigMap which also stores td-agent.conf:

kind: ConfigMap
apiVersion: v1
metadata:
  name: fluentd-splunk-uf-config
  namespace: kube-system
data:
  fluentd.args: -q
  splunk.start.args: --accept-license --answer-yes --no-prompt
  splunk.forward.server: <splunk-server>:9997
  splunk.user: root
  td-agent.conf: |-
    # This configuration file for Fluentd / td-agent is used
    # to watch changes to Docker log files. The kubelet creates symlinks that
    # capture the pod name, namespace, container name & Docker container ID
    # to the docker logs for pods in the /var/log/containers directory on the host.
    # If running this fluentd configuration in a Docker container, the /var/log
    # directory should be mounted in the container.
    #
    # These logs are then submitted to Splunk which assumes the
    # installation of the fluent-plugin-splunkapi & the
    # fluent-plugin-kubernetes_metadata_filter plugins.
    # See https://github.com/k24d/fluent-plugin-splunkapi &
    # https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter for
    # more information about the plugins.
    #
    # Example
    # =======
    # A line in the Docker log file might look like this JSON:
    #
    # {
    #   "log":"2014/09/25 21:15:03 Got request with path wombat\n",
    #   "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z"
    # }
    #
    # The time_format specification below makes sure we properly
    # parse the time format produced by Docker. This will be
    # submitted to Splunk and should appear like:
    # $ curl 'http://splunk-logging:8089/services/receivers/simple?sourcetype=fluentd&index=main&source=kubernetes.var.log.containers.synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log'
    # ...
    #    {
    #      "_index" : "logstash-2014.09.25",
    #      "_type" : "fluentd",
    #      "_id" : "VBrbor2QTuGpsQyTCdfzqA",
    #      "_score" : 1.0,
    #      "_source":{"log":"2014/09/25 22:45:50 Got request with path wombat\n",
    #                 "stream":"stderr","tag":"docker.container.all",
    #                 "@timestamp":"2014-09-25T22:45:50+00:00"}
    #    },
    # ...
    #
    # The Kubernetes fluentd plugin is used to write the Kubernetes metadata to the log
    # record & add labels to the log record if properly configured. This enables users
    # to filter & search logs on any metadata.
    # For example a Docker container's logs might be in the directory:
    #
    #  /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b
    #
    # and in the file:
    #
    #  997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    #
    # where 997599971ee6... is the Docker ID of the running container.
    # The Kubernetes kubelet makes a symbolic link to this file on the host machine
    # in the /var/log/containers directory which includes the pod name and the Kubernetes
    # container name:
    #
    #    synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #    ->
    #    /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    #
    # The /var/log directory on the host is mapped to the /var/log directory in the container
    # running this instance of Fluentd and we end up collecting the file:
    #
    #   /var/log/containers/synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # This results in the tag:
    #
    #  var.log.containers.synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # The Kubernetes fluentd plugin is used to extract the namespace, pod name & container name
    # which are added to the log message as a kubernetes field object & the Docker container ID
    # is also added under the docker field object.
    # The final tag is:
    #
    #   kubernetes.var.log.containers.synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # And the final log record look like:
    #
    # {
    #   "log":"2014/09/25 21:15:03 Got request with path wombat\n",
    #   "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z",
    #   "kubernetes": {
    #     "namespace": "default",
    #     "pod_name": "synthetic-logger-0.25lps-pod",
    #     "container_name": "synth-lgr"
    #   },
    #   "docker": {
    #     "container_id": "997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b"
    #   }
    # }
    #
    # This makes it easier for users to search for logs by pod name or by
    # the name of the Kubernetes container regardless of how many times the
    # Kubernetes pod has been restarted (resulting in several Docker container IDs).

    # Do not directly collect fluentd's own logs to avoid infinite loops.
    <match fluent.**>
      @type null
    </match>

    # Container logs
    # {"log":"[info:2016-02-16T16:04:05.930-08:00] Some log text here\n","stream":"stdout","time":"2016-02-17T00:04:05.931087621Z"}
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/splunk-containers.log.pos
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      tag raw.kubernetes.*
      format json
      time_key time
      read_from_head false
    </source>

    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>

    # Docker Service logs
    # time="2016-02-04T06:51:03.053580605Z" level=info msg="GET /containers/json"
    # time="2016-02-04T07:53:57.505612354Z" level=error msg="HTTP Error" err="No such image: -f" statusCode=404
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "docker.service" }]
      pos_file /var/log/splunk-docker.log.pos
      tag system.docker
      read_from_head false
      strip_underscores true
    </source>

    # etcd logs [Master nodes only]
    # 2016/02/04 06:52:38 filePurge: successfully removed file /var/etcd/data/member/wal/00000000000006d0-00000000010a23d1.wal
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "etcd.service" }]
      pos_file /var/log/splunk-etcd.log.pos
      tag system.etcd
      read_from_head false
      strip_underscores true
    </source>

    # Kubelet logs
    # I0204 07:32:30.020537    3368 server.go:1048] POST /stats/container/: (13.972191ms) 200 [[Go-http-client/1.1] 10.244.1.3:40537]
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      pos_file /var/log/splunk-kubelet.log.pos
      tag system.kubelet
      read_from_head false
      strip_underscores true
    </source>

    # Kube-proxy logs
    # I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "kube-proxy.service" }]
      pos_file /var/log/splunk-kube-proxy.log.pos
      tag system.kube-proxy
      read_from_head false
      strip_underscores true
    </source>

    # flanneld logs
    # I0106 23:34:48.734594 15350 vxlan.go:356] AddL3 succeeded
    #<source>
    #  type systemd
    #  path /var/log/journal
    #  filters [{ "_SYSTEMD_UNIT": "flanneld.service" }]
    #  pos_file /var/log/splunk-flanneld.log.pos
    #  tag system.flanneld
    #  read_from_head false
    #  strip_underscores true
    #</source>

    # Kube apiserver logs [Master nodes only]
    # I0204 07:00:19.604280       5 handlers.go:131] GET /api/v1/nodes: (1.624207ms) 200 [[kube-controller-manager/v1.1.3 (linux/amd64) kubernetes/6a81b50] 127.0.0.1:38266]
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "kube-apiserver.service" }]
      pos_file /var/log/splunk-kube-apiserver.log.pos
      tag system.kube-apiserver
      read_from_head false
      strip_underscores true
    </source>

    # Kube controller manager logs [Master nodes only]
    # I0204 06:55:31.872680       5 servicecontroller.go:277] LB already exists and doesn't need update for service kube-system/kube-ui
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "kube-controller-manager.service" }]
      pos_file /var/log/splunk-kube-controller-manager.log.pos
      tag system.kube-controller-manager
      read_from_head false
      strip_underscores true
    </source>

    # Kube scheduler logs [Master nodes only]
    # W0204 06:49:18.239674       7 reflector.go:245] pkg/scheduler/factory/factory.go:193: watch of *api.Service ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [2578313/2577886]) [2579312]
    <source>
      @type systemd
      path /var/log/journal
      filters [{ "_SYSTEMD_UNIT": "kube-scheduler.service" }]
      pos_file /var/log/splunk-kube-scheduler.log.pos
      tag system.kube-scheduler
      read_from_head false
      strip_underscores true
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
      verify_ssl "#{ENV['KUBE_VERIFY_SSL']}"
    </filter>

    <filter system.**>
      @type record_transformer
      renew_record true
      keep_keys HOSTNAME,MESSAGE
    </filter>

    <match **>
      # Deliver events to the local forwarder.
      @type splunkapi
      protocol rest
      server 127.0.0.1:8089
      verify false
      auth admin:changeme

      # Convert fluent tags to Splunk sources.
      # If you set an index, "check_index false" is required.
      host "#{ENV['KUBE_NODE_NAME']}"
      index main
      check_index false
      source "{TAG}"
      sourcetype fluentd

      # TIMESTAMP: key1="value1" key2="value2" ...
      time_format none
      format json
      time_key time

      # Memory buffer with a short flush internal.
      buffer_type memory
      buffer_queue_limit 16
      buffer_chunk_limit 8m
      flush_interval 2s
    </match>

errm commented 7 years ago

Nothing looks out of the ordinary here... The only thing I can think of at this stage is to double-check that you are really running version 0.1.1.pre3 of the plugin; you should be able to see something like 2017-02-20 05:12:15 +0000 [info]: gem 'fluent-plugin-systemd' version '0.1.1.pre3' in the log from the fluentd container...
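
Another quick way to confirm it from inside the running fluentd container would be something like:

# e.g. via docker exec / kubectl exec into the fluentd container
td-agent-gem list fluent-plugin-systemd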

Otherwise all I can suggest perhaps is to jump on skype or somesuch and try debugging the issue together ...

dannyk81 commented 7 years ago

Indeed, I can see the correct plugin version:

2017-02-20 03:14:13 +0000 [info]: gem 'fluent-plugin-systemd' version '0.1.1.pre3'

Hope we can squash it together :)