aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Kubernetes data enrichment problems #137

Open vladarts opened 3 years ago

vladarts commented 3 years ago

Hi! I have an issue with Kubernetes data enrichment: it only works for part of the logs. For example, the fluent-bit image itself produces the following logs:

* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/01/19 15:36:50] [ info] [engine] started (pid=1)
[2021/01/19 15:36:50] [ info] [storage] version=1.0.6, initializing...
[2021/01/19 15:36:50] [ info] [storage] in-memory
[2021/01/19 15:36:50] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/01/19 15:36:50] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/01/19 15:36:50] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/01/19 15:36:50] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/01/19 15:36:50] [ info] [filter:kubernetes:kubernetes.0] API server connectivity OK
[2021/01/19 15:36:50] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/01/19 15:36:50] [ info] [sp] stream processor started
[2021/01/19 15:36:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=52429428 watch_fd=1 name=/var/log/containers/aws-node-c9kjf_kube-system_aws-node-67cea6537d9d775091cc78803c6c9a567c157d9e48d5d6fbf41a00f0773fcea2.log
[2021/01/19 15:36:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=62915779 watch_fd=2 name=/var/log/containers/aws-node-c9kjf_kube-system_aws-vpc-cni-init-6ba2c829d003e1572de57a86de12a97270ee8275598bcfff0d626410363372c6.log

while the following part is not enriched:

* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

A similar issue happens on other pods. I could not find any pattern to when the bug appears.
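
For reference, this is a minimal sketch of the kind of pipeline involved (assumed shape only; the actual chart-rendered configuration, parsers, and output plugin are not shown in this issue):

[INPUT]
    Name      tail
    Tag       kube.*
    Path      /var/log/containers/*.log
    Parser    docker

[FILTER]
    Name      kubernetes
    Match     kube.*
    Kube_URL  https://kubernetes.default.svc:443
    Merge_Log On

The kubernetes filter is the piece that queries the API server and attaches pod metadata to each record matched by the kube.* tag.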

Checked on aws-for-fluent-bit versions:

And Helm charts:

Thanks!

vladarts commented 3 years ago

Also, most of these cases happen on self-managed nodes rather than managed node groups, but I cannot figure out why.

PettitWesley commented 3 years ago

You mean some of the logs from a container do not have metadata? Like the earliest part of the logs, the first logs from the container?

I think that is normal, expected, and a bit random (if that is what is happening). Fluent Bit has to get the metadata from the API server, and that can take time, so the initial logs might not have metadata. That's one hypothesis.

Also, please enable debug logging for Fluent Bit. This came up in another issue recently: the kubernetes filter prints some key information about errors at the debug level, and some sorts of failures can only be seen with debug logging. I created an upstream issue for fixing this: https://github.com/fluent/fluent-bit/issues/2934
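
For reference, debug logging can be enabled through the service-level Log_Level setting (a minimal sketch; in the Helm charts this is usually exposed as a configurable value rather than edited directly):

[SERVICE]
    Log_Level    debug

Alternatively, verbosity can be raised with the -v/-vv flags on the fluent-bit command line.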

vladarts commented 3 years ago

Hi! I tried with the debug log level for fluent-bit and did not find any error-like messages; I see blocks like:

[2021/01/19 12:50:26] [debug] [http_client] header=GET /api/v1/namespaces/gitlab-runner-jobs-team/pods/runner-e6wuacxt-project-1337-concurrent-2tkl6w HTTP/1.1
Host: kubernetes.default.svc
Content-Length: 0
User-Agent: Fluent-Bit
Connection: close
Authorization: Bearer <TOKEN>

[2021/01/19 12:50:26] [debug] [filter:kubernetes:kubernetes.0] API Server (ns=gitlab-runner-jobs-team, pod=runner-e6wuacxt-project-1337-concurrent-2tkl6w) http_do=0, HTTP Status: 200

but none of the logs from such pods are enriched with Kubernetes metadata. These pods can live for up to an hour and produce logs constantly.
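
For clarity on what enrichment should produce: when the filter succeeds, each record gains a kubernetes key with pod metadata, roughly of this shape (values here are illustrative, taken from the pod named in the debug output above):

{
    "log": "<original log line>",
    "kubernetes": {
        "namespace_name": "gitlab-runner-jobs-team",
        "pod_name": "runner-e6wuacxt-project-1337-concurrent-2tkl6w",
        "container_name": "<container>",
        "host": "<node>",
        "labels": { ... },
        "annotations": { ... }
    }
}

In the cases described above, the kubernetes key is missing entirely.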

PettitWesley commented 3 years ago

@xxxbobrxxx Sorry I missed your reply. Can you open an issue on the upstream repo for this? I am not sure what is going wrong here.

zhonghui12 commented 2 years ago

Hi @xxxbobrxxx, could you please upgrade to the latest image and see if you still have this issue? Thanks.