Closed · alexo1088 closed this issue 1 year ago
So, after using the newer versions of the addon modules, I'm no longer sure whether this is a misconfiguration or an actual bug. I'm following the documentation here. I recently found it and decided to deploy the fluent-bit addon following the process outlined in the docs. According to those docs, fluent-bit should indeed be creating the log streams I see missing, since finding those log groups is part of the validation process. I'm no longer seeing any errors in the logs after updating, either. Here's my code:
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.7.0"
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_version = module.eks.cluster_version
oidc_provider_arn = module.eks.oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 7
}
aws_for_fluentbit = {
chart_version = "0.1.28"
}
}
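For reference, this is how I've been checking whether the expected log groups exist (assuming the AWS CLI is pointed at the same account and region as the cluster):

aws logs describe-log-groups \
  --log-group-name-prefix "/aws/containerinsights/" \
  --query 'logGroups[].logGroupName'

Nothing comes back for /application, /host, or /dataplane.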
Considering that the docs pertaining to this project indicate that those log groups should be generated upon deployment, I'm now more inclined to assume this may be a bug. Can I please get some guidance on this?
I've since tried a different chart_version, reverting from 0.1.28 and using the exact same config as in the documentation (0.1.24). While the configmap is different, it still has no indication that the /application, /host, or /dataplane log groups should exist. This is the CM I'm seeing:
apiVersion: v1
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File /fluent-bit/parsers/parsers.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Docker_Mode       On
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Merge_Log_Key       data
        Keep_Log            On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Buffer_Size         32k
    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-east-1
        log_group_name      /aws/eks/xxx-xxx-xxx-xx/aws-fluentbit-logs
        log_stream_prefix   fluentbit-
        log_stream_template $kubernetes['pod_name'].$kubernetes['container_name']
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: aws-for-fluent-bit
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-08-25T17:09:45Z"
  labels:
    app.kubernetes.io/instance: aws-for-fluent-bit
    app.kubernetes.io/managed-by: Helm
I've also updated my deployment code to match exactly what is in the examples:
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.7.0"
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_version = module.eks.cluster_version
oidc_provider_arn = module.eks.oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 7
}
aws_for_fluentbit = {
name = "aws-for-fluent-bit"
namespace = "kube-system"
repository = "https://aws.github.io/eks-charts"
chart_version = "0.1.24"
}
}
It really just seems like the configmaps are missing the necessary configuration to generate and export the desired logs.
EDIT: Attempts to replace the deployed configmap with the AWS-provided configmap, which we can get from here:
https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
do provide the expected results. In other words, if I replace the configmap that this chart defaults to with the below, all of the expected log groups are created:
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/fluent-bit*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME hostname
        Rename              _SYSTEMD_UNIT systemd_unit
        Rename              MESSAGE message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Key                 message
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host
        log_stream_prefix   ${HOST_NAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ
---
I think this proves that this is a matter of the configmap not being configured correctly, as the documentation clearly indicates that we should be expecting those log groups to exist. Can we please get that CM updated?
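For anyone wanting to reproduce the swap: I didn't apply the upstream file as-is (it targets the amazon-cloudwatch namespace). Instead, I copied the data keys above into the chart-managed ConfigMap and recycled the pods, roughly:

kubectl -n kube-system edit cm aws-for-fluent-bit    # replace the data block with the config above
kubectl -n kube-system rollout restart ds/aws-for-fluent-bit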
I'm seeing the same behavior (first when using the "old way" of addons with 4.32.1, and then with the "new way" of addons with 1.7.0).
Fluent Bit was unable to create the log groups despite the create = true parameter.
Here's how I installed aws-for-fluent-bit:
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "1.7.0"
cluster_name = var.name
cluster_endpoint = module.eks_blueprints.eks_cluster_endpoint
cluster_version = module.eks_blueprints.eks_cluster_version
oidc_provider_arn = module.eks_blueprints.eks_oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 90
}
}
When I applied this workaround things started working as expected:
kubectl edit cm/aws-for-fluent-bit -n kube-system
# add ` auto_create_group \ true\n` (including whitespace but not backticks) to the data block
kubectl rollout restart ds/aws-for-fluent-bit -n kube-system
Please advise if there's a better solution to this, or if this issue can be replicated.
Any update on this?
Hi @alexo1088 and @thekozak!
Have you tried creating the add-on with cloudWatchLogs.autoCreateGroup set to true? If we use the configuration shown in main.tf, this is set to false by default. Can you try setting it to true and see if it behaves as you expect?
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.7.0"
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_version = module.eks.cluster_version
oidc_provider_arn = module.eks.oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 7
}
aws_for_fluentbit = {
chart_version = "0.1.28"
set = {
name = "cloudWatchLogs.autoCreateGroup"
value = true
}
}
}
Let me know so we can work on this.
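If it helps, you can confirm the override actually landed with something like the following (assuming the default release name and namespace):

helm -n kube-system get values aws-for-fluent-bit --all | grep -i autocreate
kubectl -n kube-system get cm aws-for-fluent-bit -o yaml | grep auto_create_group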
Hi @rodrigobersa,
So, I attempted to redeploy using the example you gave above, along with the same chart version, and had no luck. Even with the additional parameters suggested in your example, the log groups are still not being created.
Just an FYI, this is what the CM looks like using 0.1.28 as the chart version:
apiVersion: v1
data:
fluent-bit.conf: "[SERVICE]\n HTTP_Server On\n HTTP_Listen 0.0.0.0\n HTTP_PORT
\ 2020\n Health_Check On \n HC_Errors_Count 5 \n HC_Retry_Failure_Count
5 \n HC_Period 5 \n \n Parsers_File /fluent-bit/parsers/parsers.conf\n[INPUT]\n
\ Name tail\n Tag kube.*\n Path /var/log/containers/*.log\n
\ DB /var/log/flb_kube.db\n Parser docker\n Docker_Mode
\ On\n Mem_Buf_Limit 5MB\n Skip_Long_Lines On\n Refresh_Interval
\ 10\n[FILTER]\n Name kubernetes\n Match kube.*\n
\ Kube_URL https://kubernetes.default.svc.cluster.local:443\n Merge_Log
\ On\n Merge_Log_Key data\n Keep_Log On\n K8S-Logging.Parser
\ On\n K8S-Logging.Exclude On\n Buffer_Size 32k\n[OUTPUT]\n Name
\ cloudwatch_logs\n Match *\n region us-east-1\n
\ log_group_name /aws/eks/xxxxxxxx/aws-fluentbit-logs\n
\ log_stream_prefix fluentbit-\n log_stream_template $kubernetes['pod_name'].$kubernetes['container_name']\n
\ auto_create_group true\n"
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: aws-for-fluent-bit
meta.helm.sh/release-namespace: kube-system
creationTimestamp: "2023-09-12T15:03:03Z"
labels:
app.kubernetes.io/instance: aws-for-fluent-bit
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: aws-for-fluent-bit
app.kubernetes.io/version: 2.31.11
helm.sh/chart: aws-for-fluent-bit-0.1.28
name: aws-for-fluent-bit
namespace: kube-system
resourceVersion: "1013"
uid: 0200c4c3-4016-4b3e-acbc-20e0d674aa9b
I'm not sure why it's being formatted that way; other versions of the chart seem to render the CM more cleanly.
Hi @alexo1088, @thekozak and @aospinaLW.
I was able to figure out the issue better now; I misunderstood what was going on at first.
To get the following log groups, what needs to be enabled is Container Insights through Fluent Bit, which is not covered by this module:
/aws/containerinsights/Cluster_Name/dataplane
/aws/containerinsights/Cluster_Name/host
/aws/containerinsights/Cluster_Name/application
I'll work on something and try to make it available here, but I can't guarantee it, since it's not the main goal of terraform-aws-eks-blueprints-addons. I'll also submit a PR to fix the documentation.
Thanks @rodrigobersa. In our particular situation, we are also enabling Container Insights through this module. Is the creation of those log groups something that needs to be done via the addon config for Container Insights?
Hi @aospinaLW!
Yes. I just added the option to enable Container Insights through this module, with the following configuration.
aws_for_fluentbit = {
enable_containerinsights = true
}
This will use this template as base configuration and will create the requested log groups.
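After a terraform apply with that flag, the ConfigMap should include the Container Insights pipelines; a quick way to confirm (assuming the default release name and namespace):

kubectl -n kube-system get cm aws-for-fluent-bit -o yaml | grep -E '@INCLUDE|log_group_name'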
Thanks so much @rodrigobersa. It does look like the log groups are now being created, but I'm not quite sure this is working as intended.
Interestingly, if I deploy fluent-bit using the example you provided as a fresh install, it does not create the log groups. However, if I kill the fluent-bit pods and have them recreated, the log groups are created. Unfortunately, this also results in errors flooding the pod logs that say the below:
kubelet upstream connection error
My CM looks like this now:
apiVersion: v1
data:
  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      Off

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/fluent-bit*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      Off

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      Off

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              us-east-1
        log_group_name      /aws/containerinsights/xxxxx-xxxx-xxt/application
        log_stream_prefix   ${HOSTNAME}-
        auto_create_group   true
        extra_user_agent    container-insights
        workers             1

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      On

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      Off

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME hostname
        Rename              _SYSTEMD_UNIT systemd_unit
        Rename              MESSAGE message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v2

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              us-east-1
        log_group_name      /aws/containerinsights/xxx-xx-xx/dataplane
        log_stream_prefix   ${HOSTNAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Key                 message
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      Off

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      Off

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      Off

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v2

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              us-east-1
        log_group_name      /aws/containerinsights/xxx-xxx/host
        log_stream_prefix   ${HOSTNAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: aws-for-fluent-bit
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-09-15T16:35:44Z"
  labels:
    app.kubernetes.io/instance: aws-for-fluent-bit
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-for-fluent-bit
    app.kubernetes.io/version: 2.31.11
    helm.sh/chart: aws-for-fluent-bit-0.1.28
  name: aws-for-fluent-bit
  namespace: kube-system
  resourceVersion: "11286"
  uid: 5ba54edc-39ae-4d53-80d8-afc584a18939
Unfortunately, it seems like this latest configuration results in a huge CPU spike, pushing the cluster to its limits. The k9s output below shows a very large CPU spike for the fluent-bit pods. Wondering if you might be able to replicate this on your end?
NAMESPACE          NAME                                                        PF  READY  RESTARTS  STATUS   CPU   MEM  %CPU/R  %CPU/L  /R
amazon-cloudwatch  aws-cloudwatch-metrics-j6slb                                ●   1/1    0         Running  7     20   3       3       10
amazon-cloudwatch  aws-cloudwatch-metrics-rr2w8                                ●   1/1    0         Running  5     23   2       2       11
default            nginx-deployment-cbdccf466-fnhqd                            ●   1/1    0         Running  0     1    n/a     n/a     /a
default            nginx-deployment-cbdccf466-s9dtq                            ●   1/1    0         Running  0     1    n/a     n/a     /a
kube-system        aws-for-fluent-bit-4rflv                                    ●   1/1    0         Running  1042  27   2084    n/a     55
kube-system        aws-for-fluent-bit-vl796                                    ●   1/1    0         Running  990   27   1980    n/a     55
kube-system        aws-node-fdjdh                                              ●   1/1    0         Running  4     37   16      n/a     /a
kube-system        aws-node-lp5sb                                              ●   1/1    0         Running  3     37   12      n/a     /a
kube-system        cluster-autoscaler-aws-cluster-autoscaler-5ccd8ccddd-q24j8  ●   1/1    0         Running  2     30   1       1       5
kube-system        coredns-79df7fff65-79c6n                                    ●   1/1    0         Running  1     13   1       n/a     18
kube-system        coredns-79df7fff65-g9r6t                                    ●   1/1    0         Running  2     13   2       n/a     18
kube-system        ebs-csi-controller-5cfc7dd9c8-kntfg                         ●   6/6    0         Running  2     47   3       n/a     19
kube-system        ebs-csi-controller-5cfc7dd9c8-phv5q                         ●   6/6    0         Running  3     54   5       n/a     22
kube-system        ebs-csi-node-ld4bm                                          ●   3/3    0         Running  1     20   3       n/a     16
kube-system        ebs-csi-node-vpk24                                          ●   3/3    0         Running  1     20   3       n/a     16
kube-system        kube-proxy-k5npb                                            ●   1/1    0         Running  1     11   1       n/a     /a
kube-system        kube-proxy-pwgmg                                            ●   1/1    0         Running  1     11   1       n/a     /a
kube-system        metrics-server-675ff9f75d-nlpcg                             ●   1/1    0         Running  4     17   n/a     n/a     /a
Hi @aospinaLW!
I didn't get one of the behaviors. If you do a fresh install, it should create all the log groups; but if you're changing an existing install and replacing the configMap with the one provided by the module, you may need to recycle your aws-for-fluent-bit pods to pick up the new configuration.
The log flooding I could see in earlier scenarios, when the aws-for-fluent-bit configMap was not aligned with my cluster configuration.
Give me some time to investigate this last one, and let me know if the log group creation really isn't working on a fresh install; it should be.
Thanks for the quick reply @rodrigobersa!
Here's my process:
Since the cluster already exists, I'm removing the install by commenting out fluent-bit, then uncommenting it to do a fresh install with the enable_containerinsights = true flag. Here's the full config:
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.8.0"
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_version = module.eks.cluster_version
oidc_provider_arn = module.eks.oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 7
}
aws_for_fluentbit = {
set = [{
name = "cloudWatchLogs.autoCreateGroup"
value = true
}]
enable_containerinsights = true
chart_version = var.fluentbit_chart_version ### Visit https://artifacthub.io/packages/helm/aws/aws-for-fluent-bit to see latest chart version available when deploying
}
}
If I deploy this, the log groups are NOT created. The above deployment does not result in the CPU spike though.
Once I kill the pods and have them recreated, the log groups are immediately created, but then I start seeing the errors I mentioned above and the CPU spikes to the levels shown.
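For clarity, "killing the pods" here just means deleting them and letting the DaemonSet recreate them:

kubectl -n kube-system delete pod -l app.kubernetes.io/name=aws-for-fluent-bit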
Hi @alexo1088 and @aospinaLW!
Can you share more details of your environment? I couldn't reproduce the Log Groups not being created. As you can see below, I deployed a new EKS Cluster with this example, and it created the Log Groups around 10 seconds after the aws-for-fluent-bit pods were up. No restarts.
# kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-for-fluent-bit
NAME READY STATUS RESTARTS AGE
aws-for-fluent-bit-56zs7 1/1 Running 0 5m15s
aws-for-fluent-bit-5m7wb 1/1 Running 0 5m18s
aws-for-fluent-bit-9hzck 1/1 Running 0 3m39s
aws-for-fluent-bit-cmbxm 1/1 Running 0 5m18s
aws-for-fluent-bit-jpmtt 1/1 Running 0 3m42s
aws-for-fluent-bit-l75lh 1/1 Running 0 3m19s
# kubectl -n kube-system describe pod -l app.kubernetes.io/name=aws-for-fluent-bit | grep 'Start Time'
Start Time: Fri, 15 Sep 2023 15:59:46 -0400
Start Time: Fri, 15 Sep 2023 15:59:43 -0400
Start Time: Fri, 15 Sep 2023 16:01:22 -0400
Start Time: Fri, 15 Sep 2023 15:59:43 -0400
Start Time: Fri, 15 Sep 2023 16:01:19 -0400
Start Time: Fri, 15 Sep 2023 16:01:42 -0400
# aws logs describe-log-groups --query 'logGroups[].[logGroupName,creationTime]'
[
[
"/aws/containerinsights/complete/application",
1694807994414
],
[
"/aws/containerinsights/complete/dataplane",
1694807994416
],
[
"/aws/containerinsights/complete/host",
1694807994416
],
[
"/aws/eks/complete/cluster",
1694807368371
]
]
# date +"%c" -d @1694807994
Fri 15 Sep 2023 07:59:54 PM UTC
If we consider that I'm at UTC-0400 (EDT), that's 03:59:54 PM local time.
I also couldn't reproduce the flooding scenario. I left the pods running and simulated some load; memory consumption didn't exceed 20% of the limits, and CPU stayed around 1200-1500m. As you can see below, no customization was made to the aws-for-fluent-bit pods.
I could see some kubelet upstream connection error messages that I'm investigating, but nothing to be concerned about.
# kubectl -n kube-system top pod -l app.kubernetes.io/name=aws-for-fluent-bit
NAME CPU(cores) MEMORY(bytes)
aws-for-fluent-bit-56zs7 1267m 29Mi
aws-for-fluent-bit-5m7wb 1254m 29Mi
aws-for-fluent-bit-9hzck 961m 27Mi
aws-for-fluent-bit-cmbxm 1270m 28Mi
aws-for-fluent-bit-jpmtt 988m 27Mi
aws-for-fluent-bit-l75lh 968m 27Mi
# kubectl get ds -n kube-system aws-for-fluent-bit -o yaml | yq '.spec.template.spec.containers[].resources'
limits:
memory: 250Mi
requests:
cpu: 50m
memory: 50Mi
Logs seem to be generated as expected as well.
# aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/application --query 'logStreams[].logStreamName' | head -n10
[
"aws-for-fluent-bit-56zs7-application.var.log.containers.aws-for-fluent-bit-56zs7_kube-system_aws-for-fluent-bit-81eb3c6215a7c144fa5feb359cf26252979b0682d43b82f2f893ad718891f36b.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.aws-guardduty-agent-dc5z8_amazon-guardduty_aws-guardduty-agent-a8dcd25a72ab0700176ebfd1dd15e3c3b74d5fed716c517133f6075b8da30a10.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.coredns-7f8587b949-xhwr2_kube-system_coredns-d54d649c4a6cbe7dae1124261e4db1dfd1d84e546389c2c480b8ed767782e201.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-attacher-b079f4121abf981da1a9704cdf6f5e100b10676e0cec985e11a1bcee6da8e0ca.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-provisioner-22a618a4f52b059d07dc56b478346556a4f064508602a6ad25dc006c77f0b374.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-resizer-7b8bb7f351ab256adf1347aba89b976caaecadc753137ef3acc7736c019bec3f.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-snapshotter-b7db45dee9e23e1363b96438a243aa40f046ba3f476640cbb5b7e13bce7ae66b.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_ebs-plugin-9167e1768b9a14daa3b5a80e275230c1687b9d430c079c5e7fdf17fbad9b4a5b.log",
"aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_liveness-probe-110b3f7842736d9eb7dedc5a8a42651719b08d2135bb04ec3872e0a0316fdef5.log",
# aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/host --query 'logStreams[].logStreamName' | head -n10
[
"aws-for-fluent-bit-56zs7.host.messages",
"aws-for-fluent-bit-5m7wb.host.messages",
"aws-for-fluent-bit-9hzck.host.messages",
"aws-for-fluent-bit-cmbxm.host.messages",
"aws-for-fluent-bit-jpmtt.host.messages",
"aws-for-fluent-bit-l75lh.host.messages"
]
# aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/dataplane --query 'logStreams[].logStreamName' | head -n10
[
"aws-for-fluent-bit-56zs7-dataplane.systemd.containerd.service",
"aws-for-fluent-bit-56zs7-dataplane.systemd.kubelet.service",
"aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-eks-nodeagent-b42903c593896412fedf67272c4b7e29bde11cec24804169cd0c0d363c7087de.log",
"aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-node-bfa0dea8c5d2503245ac4cb5a8bfd3ab88d579e768270aed07f7b78d7aea2c16.log",
"aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-vpc-cni-init-5a7503b8bbb2610e8ddbe27a30d19dab0ac685fbd18b262b40c5b71a350e9828.log",
"aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.kube-proxy-d795v_kube-system_kube-proxy-1bfefe55272af690b7b511b74081eb540e8bd48d08caeba7e0640e64ccb07d9d.log",
"aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.kube-proxy-gfwkn_kube-system_kube-proxy-5098463fbb0f192a598d20a582330d1311f333b69c6d606aff08e59c0c67dde9.log",
"aws-for-fluent-bit-5m7wb-dataplane.systemd.containerd.service",
"aws-for-fluent-bit-5m7wb-dataplane.systemd.kubelet.service",
@rodrigobersa
Hmm, interesting. Thanks for taking the time to provide detailed testing information here.
I decided to recreate a completely fresh cluster from scratch, using the same example I provided above. This time, the log groups did get created with no manual intervention, but unfortunately, the CPU spike was immediately evident within the cluster as well:
NAMESPACE          NAME                           PF  READY  RESTARTS  STATUS   CPU   MEM  %CPU/R  %CPU/L  /R
amazon-cloudwatch  aws-cloudwatch-metrics-nkw2b   ●   1/1    0         Running  9     23   4       4       11
amazon-cloudwatch  aws-cloudwatch-metrics-rmp4m   ●   1/1    0         Running  6     20   3       3       10
kube-system        aws-for-fluent-bit-glgmt       ●   1/1    0         Running  843   27   1686    n/a     55
kube-system        aws-for-fluent-bit-q4xtp       ●   1/1    0         Running  1016  27   2032    n/a     55
kube-system        aws-node-6zsq7                 ●   1/1    0         Running  5     36   20      n/a     /a
kube-system        aws-node-55v99                 ●   1/1    0         Running  4     37   16      n/a     /a
╰─$ kubectl -n kube-system describe pod -l app.kubernetes.io/name=aws-for-fluent-bit | grep 'Start Time'
Start Time: Mon, 18 Sep 2023 10:28:16 -0400
Start Time: Mon, 18 Sep 2023 10:28:16 -0400
My cluster is pretty standard; since it's completely fresh, it doesn't have any workloads running on it other than core cluster components and additional addons. I'm running fluent-bit in conjunction with container insights, metrics server, and cluster autoscaler, all installed via the same module. Here's the complete configuration:
provider "aws" {
region = var.region
assume_role {
role_arn = "arn:aws:iam::xxxxx:role/terraform-execute"
}
}
data "aws_eks_cluster_auth" "cluster" {
name = module.eks.cluster_name
}
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.cluster.token
}
provider "helm" {
kubernetes {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.cluster.token
}
}
data "aws_caller_identity" "current" {}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.12"
cluster_addons = {
coredns = {
resolve_conflicts_on_create = "OVERWRITE"
addon_version = var.coredns_version
}
kube-proxy = {
resolve_conflicts_on_create = "OVERWRITE"
addon_version = var.kube_proxy_version
}
vpc-cni = {
resolve_conflicts_on_create = "OVERWRITE"
before_compute = true
addon_version = var.vpc_cni_version
}
aws-ebs-csi-driver = {
resolve_conflicts_on_create = "OVERWRITE"
addon_version = var.aws_ebs_csi_driver
}
}
vpc_id = local.vpc.vpc_id
subnet_ids = local.private_subnets
kms_key_owners = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/xxxx/xxxx/xxxxx"]
cluster_version = var.eks_cluster_version
cluster_name = var.eks_cluster_name
cluster_endpoint_private_access = true
cluster_endpoint_public_access = false
cluster_enabled_log_types = ["audit"]
manage_aws_auth_configmap = true
# EKS MANAGED NODE GROUPS
eks_managed_node_groups = {
eks_mng_lin = {
name = var.eks_nodegroup_name_lin_mng
min_size = var.min_ng_nodes_lin_mng
max_size = var.max_ng_nodes_lin_mng
desired_size = var.desire_ng_nodes_lin_mng
iam_role_additional_policies = {
CloudWatchAgentServerPolicy = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
AmazonEBSCSIDriverPolicy = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}
instance_types = [var.instance_type_lin_mng]
}
}
aws_auth_roles= [
{
rolearn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/xxxx"
username = "xxxxx"
groups = ["system:masters"]
},
]
}
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.8.0"
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_version = module.eks.cluster_version
oidc_provider_arn = module.eks.oidc_provider_arn
enable_aws_for_fluentbit = true
aws_for_fluentbit_cw_log_group = {
create = true
use_name_prefix = true # Set this to true to enable name prefix
name_prefix = "eks-cluster-logs-"
retention = 7
}
aws_for_fluentbit = {
set = [{
name = "cloudWatchLogs.autoCreateGroup"
value = true
}]
enable_containerinsights = true
chart_version = var.fluentbit_chart_version ### Visit https://artifacthub.io/packages/helm/aws/aws-for-fluent-bit to see latest chart version available when deploying
}
enable_aws_cloudwatch_metrics = true
aws_cloudwatch_metrics = {
version = var.cloudwatch_chart_version ### Visit https://artifacthub.io/packages/helm/aws/aws-cloudwatch-metrics to see latest chart version available when deploying
}
enable_metrics_server = true
metrics_server = {
version = var.metrics_server_chart_version ### Visit https://artifacthub.io/packages/helm/metrics-server/metrics-server to see latest chart version available when deploying
}
enable_cluster_autoscaler = true
cluster_autoscaler = {
version = var.autoscaler_chart_version ### Visit https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler to see latest chart version available when deploying
}
depends_on = [
module.eks
]
}
The instance type I'm using is a t3.medium, and I'm running cluster version 1.27. I don't think it's an instance-type issue, since the output clearly shows that only the fluent-bit pods are spiking. Any other ideas, given the above config?
Any chance you tested this with a private cluster? I should also mention that my cluster is private.
EDIT: I've now tried this deployment with both a public and a private cluster, and both have resulted in flooding of the same error:
kubelet upstream connection error
This error is not intermittent; it's flooding the pod logs with entries every second. Occasionally, I also see this error:
[tls] error: unexpected EOF
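In case it helps you reproduce, this is how I'm gauging the flood rate (assuming the default DaemonSet name):

kubectl -n kube-system logs ds/aws-for-fluent-bit --tail=100 | grep -c 'kubelet upstream connection error'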
I also changed my instance type from t3.medium to m6i.large, with no change.
Can you share what version of fluent-bit you're running? I'm running:
app.kubernetes.io/version: 2.31.11
Hi @alexo1088 and @aospinaLW!
Thanks for sharing all this info! Did you ever set up Container Insights successfully in any of your environments?
Asking because I found the reason for the kubelet upstream connection error. The FluentBit monitoring for the kubelet doesn't work by default; there are a few steps that need to be set up for it to work correctly with Container Insights, like RBAC and changing the DaemonSet configuration to hostNetwork: true. Here's the documentation.
Can you try removing the kubelet configuration from your aws-for-fluentbit configMap and validate whether that works for you? If so, I'll port the change to the repo.
Remove this block:
Use_Kubelet On
Kubelet_Port 10250
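A minimal sketch of applying that in place, same flow as the earlier workaround (assuming the chart's default names):

kubectl -n kube-system edit cm aws-for-fluent-bit    # delete the Use_Kubelet and Kubelet_Port lines from the kubernetes [FILTER]
kubectl -n kube-system rollout restart ds/aws-for-fluent-bit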
Hey @rodrigobersa
Looks like you were right. As soon as I removed those two lines from the CM, and recreated the pods, the CPU spike was gone. Just to confirm, this shouldn't impact our ability to collect kubelet logs from the nodes, right? As long as we're able to collect those, then I think we're good to go! Thank you so much for all of your help on this!
Hi @alexo1088!
Thanks for getting back!
Not really sure; by removing these, FluentBit will not access the kubelet directly and will instead get all the logs through the cluster API. In the docs I mentioned above, they suggest enabling this feature only for large clusters, and since the requirements affect the cluster RBAC and the resources created by the FluentBit Helm chart, I don't see a way right now to add those to the repository patterns.
What I would suggest is to check the logs that were generated and see if the specific info you need is there. Otherwise, you can add the resources mentioned in the docs to your Terraform code so you can enable the kubelet monitoring feature. I'll add some information related to that in the docs.
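For reference, a rough sketch of the manual steps the AWS docs describe for keeping Use_Kubelet On (the ClusterRole name is an assumption based on the chart defaults; verify against your cluster before applying):

# the ClusterRole bound to the fluent-bit service account must allow nodes/proxy
kubectl get clusterrole aws-for-fluent-bit -o yaml    # check for a rule covering nodes/proxy
# the DaemonSet must use the host network so pods can reach the kubelet on port 10250
kubectl -n kube-system patch ds aws-for-fluent-bit --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'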
Hey all. I'm not quite sure if this is a bug, but opting to post this as a question first. Any other time that I've deployed fluent-bit outside of using this addon (https://aws-ia.github.io/terraform-aws-eks-blueprints/v4.32.1/add-ons/aws-for-fluent-bit/), additional log groups have been created. If following AWS documentation, you would expect the following three log groups to appear:
/aws/containerinsights/Cluster_Name/dataplane
/aws/containerinsights/Cluster_Name/host
/aws/containerinsights/Cluster_Name/application
Using the 'aws-for-fluentbit' addon, there are different log groups that get created (/aws/eks/fluentbit-cloudwatch/logs, /aws/eks/fluentbit-cloudwatch/workloads/*), but I do not see the above three created. I do see /aws/containerinsights/Cluster_Name/performance, but that doesn't seem to be capturing the logs I need.
Specifically, I'd like to obtain the kubelet logs from the nodes themselves, which was captured in the /aws/containerinsights/Cluster_Name/dataplane log group previously. Is this a misconfiguration/bug? I see some logs that are stating errors are occurring:
The example that is relevant to this is the aws-for-fluentbit addon: https://aws-ia.github.io/terraform-aws-eks-blueprints/v4.32.1/add-ons/aws-for-fluent-bit/
My current configuration looks like this:
The fact that some log groups are created, but others are not, leads me to think this is not a permissions issue, and perhaps I'm misunderstanding this new method of deployment, which is unlike what I was expecting and used to seeing. However, if there's a way to obtain things like kubelet logs using this default deployment, I'd appreciate any guidance. Thank you!