aws-ia / terraform-aws-eks-blueprints-addons

Terraform module which provisions addons on Amazon EKS clusters
https://aws-ia.github.io/terraform-aws-eks-blueprints-addons/main/
Apache License 2.0
aws-for-fluentbit creates the '/aws/containerinsights/Cluster_Name/performance' log group, but does not create application/dataplane/host log groups. #231

Closed alexo1088 closed 1 year ago

alexo1088 commented 1 year ago

Hey all. I'm not sure whether this is a bug, so I'm posting it as a question first. Whenever I've deployed fluent-bit without this addon (https://aws-ia.github.io/terraform-aws-eks-blueprints/v4.32.1/add-ons/aws-for-fluent-bit/), additional log groups were created. Following the AWS documentation, you would expect these three log groups to appear:

/aws/containerinsights/Cluster_Name/dataplane
/aws/containerinsights/Cluster_Name/host
/aws/containerinsights/Cluster_Name/application

With the 'aws-for-fluentbit' addon, different log groups get created (/aws/eks/fluentbit-cloudwatch/logs, /aws/eks/fluentbit-cloudwatch/workloads/*), but not the three above. I do see /aws/containerinsights/Cluster_Name/performance, but it doesn't seem to capture the logs I need.

Specifically, I'd like to obtain the kubelet logs from the nodes themselves, which were previously captured in the /aws/containerinsights/Cluster_Name/dataplane log group. Is this a misconfiguration or a bug? The pod logs show these errors:


aws-for-fluent-bit-s8rd9 [2023/08/16 19:08:47] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443 ?
aws-for-fluent-bit-s8rd9 [2023/08/16 19:08:47] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443 ?
aws-for-fluent-bit-s8rd9 [2023/08/16 19:08:47] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
aws-for-fluent-bit-s8rd9 [2023/08/16 19:08:47] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
aws-for-fluent-bit-s8rd9 [2023/08/16 19:08:47] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events

The example that is relevant to this is the aws-for-fluentbit addon: https://aws-ia.github.io/terraform-aws-eks-blueprints/v4.32.1/add-ons/aws-for-fluent-bit/

My current configuration looks like this:

  enable_aws_for_fluentbit      = true
  aws_for_fluentbit_helm_config = {
    version = "0.1.28"
  }

The fact that some log groups are created but others are not leads me to think this is not a permissions issue. Perhaps I'm misunderstanding this new deployment method, which behaves differently from what I'm used to. However, if there's a way to obtain things like kubelet logs using this default deployment, I'd appreciate any guidance. Thank you!
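
For anyone reproducing this, the comparison between the expected and the existing log groups can be scripted. The sketch below is only an illustration: the function names are hypothetical, `my-cluster` is a placeholder, and the commented usage assumes the AWS CLI is installed with credentials that allow `logs:DescribeLogGroups`.

```shell
#!/bin/sh
# Print the three Container Insights log group names listed above
# for the given cluster name.
expected_log_groups() {
  cluster_name="$1"
  for suffix in application dataplane host; do
    printf '/aws/containerinsights/%s/%s\n' "$cluster_name" "$suffix"
  done
}

# Read existing log group names from stdin (one per line) and print
# every expected log group that is not among them.
missing_log_groups() {
  cluster_name="$1"
  existing="$(cat)"
  expected_log_groups "$cluster_name" | while read -r group; do
    printf '%s\n' "$existing" | grep -qxF -- "$group" || printf '%s\n' "$group"
  done
}

# Usage (placeholder cluster name; --output text is tab-separated):
#   aws logs describe-log-groups \
#     --log-group-name-prefix "/aws/containerinsights/my-cluster/" \
#     --query 'logGroups[].logGroupName' --output text \
#     | tr '\t' '\n' | missing_log_groups my-cluster
```

An empty result would mean all three groups exist; in the situation described here it would presumably print all three names.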

alexo1088 commented 1 year ago

After moving to the newer addon modules, I'm no longer sure whether this is a misconfiguration or an actual bug. I'm following the documentation here. I recently found it and deployed the fluent-bit addon following the process outlined in the docs. According to those docs, fluent-bit should indeed create the log groups I'm missing, since finding those log groups is part of the validation process. I'm also no longer seeing any errors in the logs after updating. Here's my code:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.7.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn
  enable_aws_for_fluentbit = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 7
  }
  aws_for_fluentbit = {
    chart_version = "0.1.28"
  }
}

Since the docs for this project indicate that those log groups should be generated upon deployment, I'm now more inclined to think this is a bug. Can I please get some guidance on this?

alexo1088 commented 1 year ago

I've since tried a different chart_version, reverting from 0.1.28 to 0.1.24 with the exact same config as in the documentation. While the configmap is different, it still gives no indication that the /application, /host, or /dataplane log groups should exist. This is the CM I'm seeing:

apiVersion: v1
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File /fluent-bit/parsers/parsers.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Docker_Mode       On
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Merge_Log_Key       data
        Keep_Log            On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Buffer_Size         32k
    [OUTPUT]
        Name                  cloudwatch_logs
        Match                 *
        region                us-east-1
        log_group_name        /aws/eks/xxx-xxx-xxx-xx/aws-fluentbit-logs
        log_stream_prefix     fluentbit-
        log_stream_template   $kubernetes['pod_name'].$kubernetes['container_name']
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: aws-for-fluent-bit
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-08-25T17:09:45Z"
  labels:
    app.kubernetes.io/instance: aws-for-fluent-bit
    app.kubernetes.io/managed-by: Helm

I've also updated my deployment code to match exactly what is in the examples:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.7.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn
  enable_aws_for_fluentbit = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 7
  }
  aws_for_fluentbit = {
    name          = "aws-for-fluent-bit"
    namespace     = "kube-system"
    repository    = "https://aws.github.io/eks-charts"
    chart_version = "0.1.24"
  }
}

It really just seems like the configmaps are missing the necessary configuration to generate and export the desired logs.

EDIT: Attempts to replace the deployed configmap with the AWS provided configmap that we can get from here:

https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml

do provide the expected results. In other words, if I were to replace the configmap that this chart defaults to with the below, all of the expected log groups are created:

data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/fluent-bit*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Key                 message
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host
        log_stream_prefix   ${HOST_NAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ
---

I think this proves that this is a matter of the configmap not being configured correctly, as the documentation clearly indicates that we should be expecting those log groups to exist. Can we please get that CM updated?

thekozak commented 1 year ago

I'm seeing the same behavior (first with the "old way" of addons on 4.32.1, then with the "new way" of addons on 1.7.0).

thekozak commented 1 year ago

Fluent bit was unable to create log groups despite the create=true parameter.

Here's how I installed aws-for-fluent-bit:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "1.7.0"

  cluster_name      = var.name
  cluster_endpoint  = module.eks_blueprints.eks_cluster_endpoint
  cluster_version   = module.eks_blueprints.eks_cluster_version
  oidc_provider_arn = module.eks_blueprints.eks_oidc_provider_arn

  enable_aws_for_fluentbit       = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 90
  }

}

When I applied this workaround things started working as expected:

kubectl edit cm/aws-for-fluent-bit -n kube-system
# add `    auto_create_group    \  true\n` (including whitespace but not backticks) to the data block
kubectl rollout restart ds/aws-for-fluent-bit -n kube-system

Please advise if there's a better solution to this, or if this issue can be replicated.
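
A quick way to confirm whether the setting actually landed in the running configuration is to scan the rendered Fluent Bit config for it. This is a minimal sketch, not part of the chart or the module: `has_auto_create_group` is a hypothetical helper, and the commented usage assumes the ConfigMap name and namespace used in the commands above.

```shell
#!/bin/sh
# Succeed (exit 0) when the Fluent Bit config on stdin contains a line
# that enables auto_create_group; fail (exit 1) otherwise.
has_auto_create_group() {
  grep -Eiq '^[[:space:]]*auto_create_group[[:space:]]+(true|on)[[:space:]]*$'
}

# Usage:
#   kubectl get configmap aws-for-fluent-bit -n kube-system \
#     -o jsonpath='{.data.fluent-bit\.conf}' \
#     | has_auto_create_group && echo "auto_create_group is enabled"
```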

alexo1088 commented 1 year ago

any update on this?

rodrigobersa commented 1 year ago

Hi @alexo1088 and @thekozak!

Have you tried creating the add-on with cloudWatchLogs.autoCreateGroup set to true?

With the configuration shown in main.tf, this is set to false by default. Can you set it to true and see if it behaves as you expect?

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.7.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn
  enable_aws_for_fluentbit = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 7
  }
  aws_for_fluentbit = {
    chart_version = "0.1.28"
    # `set` is written as a list of name/value objects
    set = [
      {
        name  = "cloudWatchLogs.autoCreateGroup"
        value = true
      }
    ]
  }
}

Let me know so we can work on this.

aospinaLW commented 1 year ago

Hi @rodrigobersa,

So, I attempted to redeploy using the example you gave above, along with the same chart version, and had no luck. Even after using the additional parameters suggested in your example, the log groups are still not being created.

aospinaLW commented 1 year ago

Just an FYI, this is what the CM looks like using 0.1.28 as the chart version:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  fluent-bit.conf: "[SERVICE]\n    HTTP_Server  On\n    HTTP_Listen  0.0.0.0\n    HTTP_PORT
    \   2020\n    Health_Check On \n    HC_Errors_Count 5 \n    HC_Retry_Failure_Count
    5 \n    HC_Period 5 \n    \n    Parsers_File /fluent-bit/parsers/parsers.conf\n[INPUT]\n
    \   Name              tail\n    Tag               kube.*\n    Path              /var/log/containers/*.log\n
    \   DB                /var/log/flb_kube.db\n    Parser            docker\n    Docker_Mode
    \      On\n    Mem_Buf_Limit     5MB\n    Skip_Long_Lines   On\n    Refresh_Interval
    \ 10\n[FILTER]\n    Name                kubernetes\n    Match               kube.*\n
    \   Kube_URL            https://kubernetes.default.svc.cluster.local:443\n    Merge_Log
    \          On\n    Merge_Log_Key       data\n    Keep_Log            On\n    K8S-Logging.Parser
    \ On\n    K8S-Logging.Exclude On\n    Buffer_Size         32k\n[OUTPUT]\n    Name
    \                 cloudwatch_logs\n    Match                 *\n    region                us-east-1\n
    \   log_group_name        /aws/eks/xxxxxxxx/aws-fluentbit-logs\n
    \   log_stream_prefix     fluentbit-\n    log_stream_template   $kubernetes['pod_name'].$kubernetes['container_name']\n
    \   auto_create_group     true\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: aws-for-fluent-bit
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-09-12T15:03:03Z"
  labels:
    app.kubernetes.io/instance: aws-for-fluent-bit
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-for-fluent-bit
    app.kubernetes.io/version: 2.31.11
    helm.sh/chart: aws-for-fluent-bit-0.1.28
  name: aws-for-fluent-bit
  namespace: kube-system
  resourceVersion: "1013"
  uid: 0200c4c3-4016-4b3e-acbc-20e0d674aa9b

I'm not sure why it's being formatted that way, as other versions of the chart seem to produce a cleaner CM.
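
One likely explanation, which is an assumption and not something confirmed in this thread: YAML serializers generally fall back to a double-quoted scalar when a multi-line string has lines ending in whitespace, and this version of the config has lines such as `Health_Check On ` that end in a space. The content itself would then be unchanged. To read it with real newlines, the key can be extracted directly. A minimal sketch follows; the function name and the `KUBECTL` override (for dry runs) are hypothetical.

```shell
#!/bin/sh
# Print the fluent-bit.conf key of the ConfigMap as plain text,
# bypassing the YAML quoting applied by `kubectl edit` / `-o yaml`.
# KUBECTL is a hypothetical override, e.g. KUBECTL=echo for a dry run.
show_fluentbit_conf() {
  "${KUBECTL:-kubectl}" get configmap aws-for-fluent-bit \
    --namespace kube-system \
    --output 'jsonpath={.data.fluent-bit\.conf}'
}

# Usage:
#   show_fluentbit_conf
```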

rodrigobersa commented 1 year ago

Hi @alexo1088, @thekozak and @aospinaLW.

I understand the issue better now; I had misunderstood what was going on.

To get the following log groups, Container Insights needs to be enabled through fluent-bit, which is not covered by this module.

/aws/containerinsights/Cluster_Name/dataplane
/aws/containerinsights/Cluster_Name/host
/aws/containerinsights/Cluster_Name/application

I'll work on something and try to make it available here, but I can't guarantee it, since it's not the main goal of terraform-aws-eks-blueprints-addons. I will also submit a PR to fix the documentation.

aospinaLW commented 1 year ago

Thanks @rodrigobersa. In our particular situation, we are also enabling Container Insights through this module. Is the creation of those log groups something that needs to be done via the addon config for Container Insights?

rodrigobersa commented 1 year ago

Hi @aospinaLW!

Yes. I just added the option to enable Container Insights through this module, with the following configuration.

  aws_for_fluentbit = {
    enable_containerinsights = true
  }

This will use this template as base configuration and will create the requested log groups.

aospinaLW commented 1 year ago

Thanks so much @rodrigobersa. It does look like the log groups are now being created, but I'm not quite sure this is working as intended.

Interestingly, if I deploy fluent-bit using the example you provided as a fresh install, it does not create the log groups. However, if I kill the fluent-bit pods and have them recreated, it will create the log groups. Unfortunately, this also results in errors flooding the pod logs that say the below:

kubelet upstream connection error
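
To gauge how often this error appears after a restart, the pod logs can be counted rather than eyeballed. This is a minimal sketch: `count_kubelet_errors` is a hypothetical helper, and the label selector in the commented usage is an assumption based on the `app.kubernetes.io/name` label shown in the ConfigMap metadata earlier in this thread.

```shell
#!/bin/sh
# Count the log lines on stdin that contain the kubelet connection error.
# `|| true` keeps the exit status at 0 when grep finds no matches.
count_kubelet_errors() {
  grep -c 'kubelet upstream connection error' || true
}

# Usage:
#   kubectl rollout restart ds/aws-for-fluent-bit -n kube-system
#   kubectl logs -n kube-system \
#     -l app.kubernetes.io/name=aws-for-fluent-bit --tail=-1 \
#     | count_kubelet_errors
```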

My CM looks like this now:

apiVersion: v1
data:
  application-log.conf: |
    [INPUT]
        Name tail
        Tag application.*
        Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        DB /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On
        Refresh_Interval 10
        Rotate_Wait 30
        storage.type filesystem
        Read_from_Head Off

    [INPUT]
        Name tail
        Tag application.*
        Path /var/log/containers/fluent-bit*
        multiline.parser docker, cri
        DB /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        Refresh_Interval 10
        Read_from_Head Off

    [INPUT]
        Name tail
        Tag application.*
│         Path /var/log/containers/cloudwatch-agent*                                                                                  │
│         multiline.parser docker, cri                                                                                                │
│         DB /var/fluent-bit/state/flb_cwagent.db                                                                                     │
│         Mem_Buf_Limit 5MB                                                                                                           │
│         Skip_Long_Lines On                                                                                                          │
│         Refresh_Interval 10                                                                                                         │
│         Read_from_Head Off                                                                                                          │
│                                                                                                                                     │
│     [FILTER]                                                                                                                        │
│         Name kubernetes                                                                                                             │
│         Match application.*                                                                                                         │
│         Kube_URL https://kubernetes.default.svc:443                                                                                 │
│         Kube_Tag_Prefix application.var.log.containers.                                                                             │
│         Merge_Log On                                                                                                                │
│         Merge_Log_Key log_processed                                                                                                 │
│         K8S-Logging.Parser On                                                                                                       │
│         K8S-Logging.Exclude Off                                                                                                     │
│         Labels Off                                                                                                                  │
│         Annotations Off                                                                                                             │
│         Use_Kubelet On                                                                                                              │
│         Kubelet_Port 10250                                                                                                          │
│         Buffer_Size 0                                                                                                               │
│                                                                                                                                     │
│     [OUTPUT]                                                                                                                        │
│         Name cloudwatch_logs                                                                                                        │
│         Match application.*                                                                                                         │
│         region us-east-1                                                                                                            │
│         log_group_name /aws/containerinsights/xxxxx-xxxx-xxt/application                                                     │
│         log_stream_prefix ${HOSTNAME}-                                                                                              │
│         auto_create_group true                                                                                                      │
│         extra_user_agent container-insights                                                                                         │
│         workers 1                                                                                                                   │
│   dataplane-log.conf: |                                                                                                             │
│     [INPUT]                                                                                                                         │
│         Name systemd                                                                                                                │
│         Tag dataplane.systemd.*                                                                                                     │
│         Systemd_Filter _SYSTEMD_UNIT=docker.service                                                                                 │
│         Systemd_Filter _SYSTEMD_UNIT=containerd.service                                                                             │
│         Systemd_Filter _SYSTEMD_UNIT=kubelet.service                                                                                │
│         DB /var/fluent-bit/state/systemd.db                                                                                         │
│         Path /var/log/journal                                                                                                       │
│         Read_From_Tail On                                                                                                           │
│                                                                                                                                     │
│     [INPUT]                                                                                                                         │
│         Name tail                                                                                                                   │
│         Tag dataplane.tail.*                                                                                                        │
│         Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*                                                         │
│         multiline.parser docker, cri                                                                                                │
│         DB /var/fluent-bit/state/flb_dataplane_tail.db                                                                              │
│         Mem_Buf_Limit 50MB                                                                                                          │
│         Skip_Long_Lines On                                                                                                          │
│         Refresh_Interval 10                                                                                                         │
│         Rotate_Wait 30                                                                                                              │
│         storage.type filesystem                                                                                                     │
│         Read_from_Head Off                                                                                                          │
│                                                                                                                                     │
│     [FILTER]                                                                                                                        │
│         Name modify                                                                                                                 │
│         Match dataplane.systemd.*                                                                                                   │
│         Rename _HOSTNAME hostname                                                                                                   │
│         Rename _SYSTEMD_UNIT systemd_unit                                                                                           │
│         Rename MESSAGE message                                                                                                      │
│         Remove_regex ^((?!hostname|systemd_unit|message).)*$                                                                        │
│                                                                                                                                     │
│     [FILTER]                                                                                                                        │
│         Name aws                                                                                                                    │
│         Match dataplane.*                                                                                                           │
│         imds_version v2                                                                                                             │
│                                                                                                                                     │
│     [OUTPUT]                                                                                                                        │
│         Name cloudwatch_logs                                                                                                        │
│         Match dataplane.*                                                                                                           │
│         region us-east-1                                                                                                            │
│         log_group_name /aws/containerinsights/xxx-xx-xx/dataplane                                                       │
│         log_stream_prefix ${HOSTNAME}-                                                                                              │
│         auto_create_group true                                                                                                      │
│         extra_user_agent container-insights                                                                                         │
│   fluent-bit.conf: |                                                                                                                │
│     [SERVICE]                                                                                                                       │
│       Flush 5                                                                                                                       │
│       Grace 30                                                                                                                      │
│       Log_Level info                                                                                                                │
│       Daemon off                                                                                                                    │
│       Parsers_File parsers.conf                                                                                                     │
│       HTTP_Server On                                                                                                                │
│       HTTP_Listen 0.0.0.0                                                                                                           │
│       HTTP_Port 2020                                                                                                                │
│       storage.path /var/fluent-bit/state/flb-storage/                                                                               │
│       storage.sync normal                                                                                                           │
│       storage.checksum off                                                                                                          │
│       storage.backlog.mem_limit 5M                                                                                                  │
│                                                                                                                                     │
│     @INCLUDE application-log.conf                                                                                                   │
│     @INCLUDE dataplane-log.conf                                                                                                     │
│     @INCLUDE host-log.conf                                                                                                          │
│   host-log.conf: |                                                                                                                  │
│     [INPUT]                                                                                                                         │
│         Name tail                                                                                                                   │
│         Tag host.dmesg                                                                                                              │
│         Path /var/log/dmesg                                                                                                         │
│         Key message                                                                                                                 │
│         DB /var/fluent-bit/state/flb_dmesg.db                                                                                       │
│         Mem_Buf_Limit 5MB                                                                                                           │
│         Skip_Long_Lines On                                                                                                          │
│         Refresh_Interval 10                                                                                                         │
│         Read_from_Head Off                                                                                                          │
│     [INPUT]                                                                                                                         │
│         Name tail                                                                                                                   │
│         Tag host.messages                                                                                                           │
│         Path /var/log/messages                                                                                                      │
│         Parser syslog                                                                                                               │
│         DB /var/fluent-bit/state/flb_messages.db                                                                                    │
│         Mem_Buf_Limit 5MB                                                                                                           │
│         Skip_Long_Lines On                                                                                                          │
│         Refresh_Interval 10                                                                                                         │
│         Read_from_Head Off                                                                                                          │
│                                                                                                                                     │
│     [INPUT]                                                                                                                         │
│         Name tail                                                                                                                   │
│         Tag host.secure                                                                                                             │
│         Path /var/log/secure                                                                                                        │
│         Parser syslog                                                                                                               │
│         DB /var/fluent-bit/state/flb_secure.db                                                                                      │
│         Mem_Buf_Limit 5MB                                                                                                           │
│         Skip_Long_Lines On                                                                                                          │
│         Refresh_Interval 10                                                                                                         │
│         Read_from_Head Off                                                                                                          │
│                                                                                                                                     │
│     [FILTER]                                                                                                                        │
│         Name aws                                                                                                                    │
│         Match host.*                                                                                                                │
│         imds_version v2                                                                                                             │
│                                                                                                                                     │
│     [OUTPUT]                                                                                                                        │
│         Name cloudwatch_logs                                                                                                        │
│         Match host.*                                                                                                                │
│         region us-east-1                                                                                                            │
│         log_group_name /aws/containerinsights/xxx-xxx/host                                                            │
│         log_stream_prefix ${HOSTNAME}.                                                                                              │
│         auto_create_group true                                                                                                      │
│         extra_user_agent container-insights                                                                                         │
│   parsers.conf: |                                                                                                                   │
│     [PARSER]                                                                                                                        │
│         Name syslog                                                                                                                 │
│         Format regex                                                                                                                │
│         Regex ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(? │
│ <message>.*)$                                                                                                                       │
│         Time_Key time                                                                                                               │
│         Time_Format %b %d %H:%M:%S                                                                                                  │
│                                                                                                                                     │
│     [PARSER]                                                                                                                        │
│         Name container_firstline                                                                                                    │
│         Format regex                                                                                                                │
│         Regex (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d │
│ {2}\.\w*).*(?=})                                                                                                                    │
│         Time_Key time                                                                                                               │
│         Time_Format %Y-%m-%dT%H:%M:%S.%LZ                                                                                           │
│                                                                                                                                     │
│     [PARSER]                                                                                                                        │
│         Name cwagent_firstline                                                                                                      │
│         Format regex                                                                                                                │
│         Regex (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*? │
│ )".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})                                                                    │
│         Time_Key time                                                                                                               │
│         Time_Format %Y-%m-%dT%H:%M:%S.%LZ                                                                                           │
│ kind: ConfigMap                                                                                                                     │
│ metadata:                                                                                                                           │
│   annotations:                                                                                                                      │
│     meta.helm.sh/release-name: aws-for-fluent-bit                                                                                   │
│     meta.helm.sh/release-namespace: kube-system                                                                                     │
│   creationTimestamp: "2023-09-15T16:35:44Z"                                                                                         │
│   labels:                                                                                                                           │
│     app.kubernetes.io/instance: aws-for-fluent-bit                                                                                  │
│     app.kubernetes.io/managed-by: Helm                                                                                              │
│     app.kubernetes.io/name: aws-for-fluent-bit                                                                                      │
│     app.kubernetes.io/version: 2.31.11                                                                                              │
│     helm.sh/chart: aws-for-fluent-bit-0.1.28                                                                                        │
│   name: aws-for-fluent-bit                                                                                                          │
│   namespace: kube-system                                                                                                            │
│   resourceVersion: "11286"                                                                                                          │
│   uid: 5ba54edc-39ae-4d53-80d8-afc584a18939    

Unfortunately, this latest configuration seems to cause a huge CPU spike that pushes the cluster to its limits. The pod listing below shows a very large CPU spike for the fluent bit pods. Could you try to replicate this on your end?

 NAMESPACE↑         NAME                                                        PF READY RESTARTS STATUS    CPU MEM %CPU/R %CPU/L /R │
│ amazon-cloudwatch  aws-cloudwatch-metrics-j6slb                                ●  1/1          0 Running     7  20      3      3 10 │
│ amazon-cloudwatch  aws-cloudwatch-metrics-rr2w8                                ●  1/1          0 Running     5  23      2      2 11 │
│ default            nginx-deployment-cbdccf466-fnhqd                            ●  1/1          0 Running     0   1    n/a    n/a /a │
│ default            nginx-deployment-cbdccf466-s9dtq                            ●  1/1          0 Running     0   1    n/a    n/a /a │
│ kube-system        aws-for-fluent-bit-4rflv                                    ●  1/1          0 Running  1042  27   2084    n/a 55 │
│ kube-system        aws-for-fluent-bit-vl796                                    ●  1/1          0 Running   990  27   1980    n/a 55 │
│ kube-system        aws-node-fdjdh                                              ●  1/1          0 Running     4  37     16    n/a /a │
│ kube-system        aws-node-lp5sb                                              ●  1/1          0 Running     3  37     12    n/a /a │
│ kube-system        cluster-autoscaler-aws-cluster-autoscaler-5ccd8ccddd-q24j8  ●  1/1          0 Running     2  30      1      1  5 │
│ kube-system        coredns-79df7fff65-79c6n                                    ●  1/1          0 Running     1  13      1    n/a 18 │
│ kube-system        coredns-79df7fff65-g9r6t                                    ●  1/1          0 Running     2  13      2    n/a 18 │
│ kube-system        ebs-csi-controller-5cfc7dd9c8-kntfg                         ●  6/6          0 Running     2  47      3    n/a 19 │
│ kube-system        ebs-csi-controller-5cfc7dd9c8-phv5q                         ●  6/6          0 Running     3  54      5    n/a 22 │
│ kube-system        ebs-csi-node-ld4bm                                          ●  3/3          0 Running     1  20      3    n/a 16 │
│ kube-system        ebs-csi-node-vpk24                                          ●  3/3          0 Running     1  20      3    n/a 16 │
│ kube-system        kube-proxy-k5npb                                            ●  1/1          0 Running     1  11      1    n/a /a │
│ kube-system        kube-proxy-pwgmg                                            ●  1/1          0 Running     1  11      1    n/a /a │
│ kube-system        metrics-server-675ff9f75d-nlpcg                             ●  1/1          0 Running     4  17    n/a    n/a /a │

rodrigobersa commented 1 year ago

Hi @aospinaLW!

I didn't get one of the behaviors. On a fresh install, it should create all the log groups. If you're changing an existing install and replacing the configMap with the one provided by the module, you may need to recycle your aws-for-fluent-bit- PODs so that they use the new configuration.

I have seen the log flooding in earlier scenarios, when the aws-for-fluent-bit configMap was not aligned with my cluster configuration.

Give me some time to investigate this last one, and let me know if the log group creation really isn't working on a fresh install. It should.

alexo1088 commented 1 year ago

Thanks for the quick reply @rodrigobersa !

Here's my process:

Since the clusters already exist, I'm removing the install by commenting out fluent-bit and then uncommenting it to do a fresh install with the enable_containerinsights = true flag. Here's the full config:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.8.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  enable_aws_for_fluentbit = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 7
  }
  aws_for_fluentbit = {
    set = [{
      name  = "cloudWatchLogs.autoCreateGroup"
      value = true
    }]
    enable_containerinsights = true
    # Visit https://artifacthub.io/packages/helm/aws/aws-for-fluent-bit to see the latest chart version available when deploying
    chart_version = var.fluentbit_chart_version
  }
}

If I deploy this, the log groups are NOT created. The above deployment does not result in the CPU spike though.

Once I kill the pods and have them recreated, the log groups are immediately created, but then I start seeing the errors I mentioned above and the CPU spikes to the levels shown.
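
One way to make the "created / not created" check repeatable is to compare the log group names that exist against the three Container Insights groups expected for the cluster. The Python sketch below is only an illustration: the helper names and the cluster name "my-cluster" are hypothetical, and the list of existing names is assumed to come from something like the output of aws logs describe-log-groups.

```python
from typing import Iterable, List

# Suffixes of the three Container Insights log groups discussed in this issue.
SUFFIXES = ("application", "dataplane", "host")


def expected_log_groups(cluster_name: str) -> List[str]:
    """Build the expected log group names for a cluster (hypothetical helper)."""
    return [f"/aws/containerinsights/{cluster_name}/{s}" for s in SUFFIXES]


def missing_log_groups(cluster_name: str, existing: Iterable[str]) -> List[str]:
    """Return the expected log groups that are absent from `existing`."""
    present = set(existing)
    return [g for g in expected_log_groups(cluster_name) if g not in present]


# Illustrative input: only the performance group and the add-on's own group exist.
existing = [
    "/aws/containerinsights/my-cluster/performance",
    "/aws/eks/fluentbit-cloudwatch/logs",
]
print(missing_log_groups("my-cluster", existing))
```

With the sample input, all three expected groups are reported as missing, which matches the situation described before the pods are recreated.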

rodrigobersa commented 1 year ago

Hi @alexo1088 and @aospinaLW!

Can you share more details of your environment? I couldn't reproduce the Log Groups not being created. As you can see below, I deployed a new EKS Cluster with this example, and it created the Log Groups around 10 seconds after the aws-for-fluent-bit- PODs were up. No restarts.

# kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-for-fluent-bit   
NAME                       READY   STATUS    RESTARTS   AGE
aws-for-fluent-bit-56zs7   1/1     Running   0          5m15s
aws-for-fluent-bit-5m7wb   1/1     Running   0          5m18s
aws-for-fluent-bit-9hzck   1/1     Running   0          3m39s
aws-for-fluent-bit-cmbxm   1/1     Running   0          5m18s
aws-for-fluent-bit-jpmtt   1/1     Running   0          3m42s
aws-for-fluent-bit-l75lh   1/1     Running   0          3m19s
# kubectl -n kube-system describe pod -l app.kubernetes.io/name=aws-for-fluent-bit | grep 'Start Time'
Start Time:       Fri, 15 Sep 2023 15:59:46 -0400
Start Time:       Fri, 15 Sep 2023 15:59:43 -0400
Start Time:       Fri, 15 Sep 2023 16:01:22 -0400
Start Time:       Fri, 15 Sep 2023 15:59:43 -0400
Start Time:       Fri, 15 Sep 2023 16:01:19 -0400
Start Time:       Fri, 15 Sep 2023 16:01:42 -0400
# aws logs describe-log-groups --query 'logGroups[].[logGroupName,creationTime]'
[
    [
        "/aws/containerinsights/complete/application",
        1694807994414
    ],
    [
        "/aws/containerinsights/complete/dataplane",
        1694807994416
    ],
    [
        "/aws/containerinsights/complete/host",
        1694807994416
    ],
    [
        "/aws/eks/complete/cluster",
        1694807368371
    ]
]
# date +"%c" -d @1694807994
Fri 15 Sep 2023 07:59:54 PM UTC

Since I'm in US Eastern time (EDT, UTC-4, shown as -0400 above), that is 03:59:54 PM local time, which lines up with the pod start times above.
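
The same conversion can be double-checked directly from the `creationTime` value, which is in epoch milliseconds. This is only a minimal sketch, assuming GNU `date`; `epoch_ms_to_time` is a made-up helper name, and `EDT4` is a POSIX TZ string meaning four hours west of UTC:

```shell
# Convert a CloudWatch creationTime (epoch milliseconds) to a readable time.
# Assumes GNU date (the "-d @SECONDS" form); the helper name is illustrative.
epoch_ms_to_time() {
  ms="$1"
  tz="${2:-UTC0}"
  # Integer division drops the millisecond part.
  TZ="$tz" date -d "@$((ms / 1000))" '+%Y-%m-%d %H:%M:%S'
}

epoch_ms_to_time 1694807994414        # 2023-09-15 19:59:54 (UTC)
epoch_ms_to_time 1694807994414 EDT4   # 2023-09-15 15:59:54 (UTC-4)
```

The second call gives the time to compare against the `-0400` pod start times.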


I also couldn't reproduce the flooding scenario. I left the pods running and simulated some load. Memory consumption didn't exceed 20% of the limits, and CPU stayed around 1200-1500m. As you can see below, no customization was done to the aws-for-fluent-bit pods.

I did see some kubelet upstream connection errors, which I'm investigating, but nothing to be concerned about.

# kubectl -n kube-system top pod -l app.kubernetes.io/name=aws-for-fluent-bit
NAME                       CPU(cores)   MEMORY(bytes)   
aws-for-fluent-bit-56zs7   1267m        29Mi            
aws-for-fluent-bit-5m7wb   1254m        29Mi            
aws-for-fluent-bit-9hzck   961m         27Mi            
aws-for-fluent-bit-cmbxm   1270m        28Mi            
aws-for-fluent-bit-jpmtt   988m         27Mi            
aws-for-fluent-bit-l75lh   968m         27Mi 

# kubectl get ds -n kube-system aws-for-fluent-bit -o yaml | yq '.spec.template.spec.containers[].resources'
limits:
  memory: 250Mi
requests:
  cpu: 50m
  memory: 50Mi

Logs seem to be generated as expected as well.

# aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/application --query 'logStreams[].logStreamName' | head -n10
[
    "aws-for-fluent-bit-56zs7-application.var.log.containers.aws-for-fluent-bit-56zs7_kube-system_aws-for-fluent-bit-81eb3c6215a7c144fa5feb359cf26252979b0682d43b82f2f893ad718891f36b.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.aws-guardduty-agent-dc5z8_amazon-guardduty_aws-guardduty-agent-a8dcd25a72ab0700176ebfd1dd15e3c3b74d5fed716c517133f6075b8da30a10.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.coredns-7f8587b949-xhwr2_kube-system_coredns-d54d649c4a6cbe7dae1124261e4db1dfd1d84e546389c2c480b8ed767782e201.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-attacher-b079f4121abf981da1a9704cdf6f5e100b10676e0cec985e11a1bcee6da8e0ca.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-provisioner-22a618a4f52b059d07dc56b478346556a4f064508602a6ad25dc006c77f0b374.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-resizer-7b8bb7f351ab256adf1347aba89b976caaecadc753137ef3acc7736c019bec3f.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_csi-snapshotter-b7db45dee9e23e1363b96438a243aa40f046ba3f476640cbb5b7e13bce7ae66b.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_ebs-plugin-9167e1768b9a14daa3b5a80e275230c1687b9d430c079c5e7fdf17fbad9b4a5b.log",
    "aws-for-fluent-bit-56zs7-application.var.log.containers.ebs-csi-controller-755bb8bf7d-h8wtk_kube-system_liveness-probe-110b3f7842736d9eb7dedc5a8a42651719b08d2135bb04ec3872e0a0316fdef5.log",

#  aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/host --query 'logStreams[].logStreamName' | head -n10
[
    "aws-for-fluent-bit-56zs7.host.messages",
    "aws-for-fluent-bit-5m7wb.host.messages",
    "aws-for-fluent-bit-9hzck.host.messages",
    "aws-for-fluent-bit-cmbxm.host.messages",
    "aws-for-fluent-bit-jpmtt.host.messages",
    "aws-for-fluent-bit-l75lh.host.messages"
]

# aws logs describe-log-streams --log-group-name /aws/containerinsights/complete/dataplane --query 'logStreams[].logStreamName' | head -n10
[
    "aws-for-fluent-bit-56zs7-dataplane.systemd.containerd.service",
    "aws-for-fluent-bit-56zs7-dataplane.systemd.kubelet.service",
    "aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-eks-nodeagent-b42903c593896412fedf67272c4b7e29bde11cec24804169cd0c0d363c7087de.log",
    "aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-node-bfa0dea8c5d2503245ac4cb5a8bfd3ab88d579e768270aed07f7b78d7aea2c16.log",
    "aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.aws-node-5mtfw_kube-system_aws-vpc-cni-init-5a7503b8bbb2610e8ddbe27a30d19dab0ac685fbd18b262b40c5b71a350e9828.log",
    "aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.kube-proxy-d795v_kube-system_kube-proxy-1bfefe55272af690b7b511b74081eb540e8bd48d08caeba7e0640e64ccb07d9d.log",
    "aws-for-fluent-bit-56zs7-dataplane.tail.var.log.containers.kube-proxy-gfwkn_kube-system_kube-proxy-5098463fbb0f192a598d20a582330d1311f333b69c6d606aff08e59c0c67dde9.log",
    "aws-for-fluent-bit-5m7wb-dataplane.systemd.containerd.service",
    "aws-for-fluent-bit-5m7wb-dataplane.systemd.kubelet.service",

alexo1088 commented 1 year ago

@rodrigobersa

Hmm, interesting. Thanks for taking the time to provide detailed testing information here.

I decided to recreate a completely fresh cluster from scratch, using the same example I provided above. This time, the log groups did get created with no manual intervention, but unfortunately, the CPU spike was immediately evident within the cluster as well:

NAMESPACE↑         NAME                          PF  READY  RESTARTS  STATUS   CPU   MEM  %CPU/R  %CPU/L  /R
amazon-cloudwatch  aws-cloudwatch-metrics-nkw2b  ●   1/1    0         Running  9     23   4       4       11
amazon-cloudwatch  aws-cloudwatch-metrics-rmp4m  ●   1/1    0         Running  6     20   3       3       10
kube-system        aws-for-fluent-bit-glgmt      ●   1/1    0         Running  843   27   1686    n/a     55
kube-system        aws-for-fluent-bit-q4xtp      ●   1/1    0         Running  1016  27   2032    n/a     55
kube-system        aws-node-6zsq7                ●   1/1    0         Running  5     36   20      n/a     /a
kube-system        aws-node-55v99                ●   1/1    0         Running  4     37   16      n/a     /a

$ kubectl -n kube-system describe pod -l app.kubernetes.io/name=aws-for-fluent-bit | grep 'Start Time'
Start Time:   Mon, 18 Sep 2023 10:28:16 -0400
Start Time:   Mon, 18 Sep 2023 10:28:16 -0400
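
For scale, the `%CPU/R` column appears to be CPU usage as a percentage of the pod's CPU request. A quick arithmetic sketch, assuming the 50m CPU request shown in the DaemonSet output earlier in the thread also applies to this cluster (`cpu_pct_of_request` is a made-up helper name):

```shell
# CPU usage as a percentage of the CPU request, both given in millicores.
# The 50m request is an assumption taken from the DaemonSet output above.
cpu_pct_of_request() {
  used_m="$1"
  request_m="$2"
  echo $(( used_m * 100 / request_m ))
}

cpu_pct_of_request 843 50    # 1686
cpu_pct_of_request 1016 50   # 2032
```

Both results match the `%CPU/R` values in the table, so each fluent-bit pod is using roughly 17 to 20 times what it requested.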

My cluster is pretty standard. Since it is completely fresh, it doesn't have any workloads running on it other than core cluster components and additional addons. I am running fluent-bit in conjunction with Container Insights, Metrics Server, and Cluster Autoscaler, all installed via the same module. Here's the complete configuration:

provider "aws" {
  region = var.region
  assume_role {
    role_arn = "arn:aws:iam::xxxxx:role/terraform-execute"
  }
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_name
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
    token                  = data.aws_eks_cluster_auth.cluster.token
  }
}

data "aws_caller_identity" "current" {}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.12"

  cluster_addons = {
    coredns = {
      resolve_conflicts_on_create = "OVERWRITE"
      addon_version               = var.coredns_version
    }
    kube-proxy = {
      resolve_conflicts_on_create = "OVERWRITE"
      addon_version               = var.kube_proxy_version
    }
    vpc-cni = {
      resolve_conflicts_on_create = "OVERWRITE"
      before_compute              = true
      addon_version               = var.vpc_cni_version
    }
    aws-ebs-csi-driver = {
      resolve_conflicts_on_create = "OVERWRITE"
      addon_version               = var.aws_ebs_csi_driver
    }
  }
  vpc_id                = local.vpc.vpc_id
  subnet_ids            = local.private_subnets
  kms_key_owners        = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/xxxx/xxxx/xxxxx"]

  cluster_version       = var.eks_cluster_version
  cluster_name          = var.eks_cluster_name

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false

  cluster_enabled_log_types       = ["audit"]
  manage_aws_auth_configmap       = true

  # EKS MANAGED NODE GROUPS
  eks_managed_node_groups = {
    eks_mng_lin = {
      name                         = var.eks_nodegroup_name_lin_mng
      min_size                     = var.min_ng_nodes_lin_mng
      max_size                     = var.max_ng_nodes_lin_mng
      desired_size                 = var.desire_ng_nodes_lin_mng
      iam_role_additional_policies = {
        CloudWatchAgentServerPolicy  = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
        AmazonEBSCSIDriverPolicy     = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
      }
      instance_types = [var.instance_type_lin_mng]
    }
  }
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/xxxx"
      username = "xxxxx"
      groups   = ["system:masters"]
    },
  ]
}

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.8.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  enable_aws_for_fluentbit = true
  aws_for_fluentbit_cw_log_group = {
    create          = true
    use_name_prefix = true # Set this to true to enable name prefix
    name_prefix     = "eks-cluster-logs-"
    retention       = 7
  }
  aws_for_fluentbit = {
    set = [{
      name  = "cloudWatchLogs.autoCreateGroup"
      value = true
    }]
    enable_containerinsights = true
    # Visit https://artifacthub.io/packages/helm/aws/aws-for-fluent-bit to see the latest chart version available when deploying
    chart_version = var.fluentbit_chart_version
  }

  enable_aws_cloudwatch_metrics = true
  aws_cloudwatch_metrics = {
    # Visit https://artifacthub.io/packages/helm/aws/aws-cloudwatch-metrics to see the latest chart version available when deploying
    version = var.cloudwatch_chart_version
  }

  enable_metrics_server = true
  metrics_server = {
    # Visit https://artifacthub.io/packages/helm/metrics-server/metrics-server to see the latest chart version available when deploying
    version = var.metrics_server_chart_version
  }

  enable_cluster_autoscaler = true
  cluster_autoscaler = {
    # Visit https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler to see the latest chart version available when deploying
    version = var.autoscaler_chart_version
  }

  depends_on = [
    module.eks
  ]
}

The instance type I'm using is a t3.medium, and I'm running cluster version 1.27. I don't think it's an instance type issue, since the output above clearly shows that only the fluent-bit pods are running with a huge CPU spike. Do you have any other ideas after seeing the above config?

alexo1088 commented 1 year ago

Any chance you tested this with a private cluster? I should also mention that my cluster is private.

EDIT: I've now tried this deployment with both a public and private cluster, and both versions have resulted in the same flooding of the same error:

kubelet upstream connection error

This error is not intermittent - it's flooding the pod logs with entries every second. Occasionally, I also see this error:

[tls] error: unexpected EOF

I also changed my instance type from t3.medium to m6i.large with no change.

Can you share what version of fluent-bit you're running? I'm running:

app.kubernetes.io/version: 2.31.11

rodrigobersa commented 1 year ago

Hi @alexo1088 and @aospinaLW!

Thanks for sharing all this info! Have you ever set up Container Insights successfully in any of your environments?

I'm asking because I found the reason for the kubelet upstream connection error. The Fluent Bit monitoring for kubelet doesn't work by default. A few extra steps are needed for it to work correctly with Container Insights, such as adding RBAC permissions and changing the DaemonSet configuration to hostNetwork: true. Here's the documentation.
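
One way to see whether those two prerequisites are in place is sketched below. It assumes the RBAC requirement refers to read access on the `nodes/proxy` subresource, and that the ServiceAccount has the same name as the DaemonSet; both are assumptions to verify against the documentation and your release. `check_kubelet_prereqs` is a made-up helper name.

```shell
# Check the two prerequisites mentioned above for the kubelet integration.
# Assumptions: the ServiceAccount is named like the DaemonSet, and the
# RBAC requirement is "get" on nodes/proxy. Adjust both to your setup.
check_kubelet_prereqs() {
  ns="${1:-kube-system}"
  name="${2:-aws-for-fluent-bit}"

  # RBAC: prints "yes" or "no" for read access to the kubelet API via nodes/proxy.
  kubectl auth can-i get nodes --subresource=proxy \
    --as="system:serviceaccount:${ns}:${name}"

  # DaemonSet: prints "true" only when hostNetwork is enabled.
  kubectl -n "$ns" get daemonset "$name" \
    -o jsonpath='{.spec.template.spec.hostNetwork}{"\n"}'
}
```

Running `check_kubelet_prereqs` with no arguments checks the defaults above. A "no" or an empty second line would be consistent with the connection errors described in this thread.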

Can you try removing the kubelet configuration from your aws-for-fluent-bit ConfigMap and validate whether that works for you? If so, I'll port the change to the repo.

Remove this block:

        Use_Kubelet On
        Kubelet_Port 10250
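
As a rough sketch of that removal, the two keys can be filtered out of the config text with `grep`. `strip_kubelet_options` is a made-up helper name, and the sample `[FILTER]` lines are illustrative rather than copied from the chart:

```shell
# Drop the two kubelet-related keys from Fluent Bit config text on stdin,
# leaving every other line untouched. Matching is case-insensitive.
strip_kubelet_options() {
  grep -Eiv '^[[:space:]]*(Use_Kubelet|Kubelet_Port)[[:space:]]'
}

# Illustrative input only; the real ConfigMap contains more settings.
printf '%s\n' \
  '[FILTER]' \
  '    Name kubernetes' \
  '    Use_Kubelet On' \
  '    Kubelet_Port 10250' \
  '    Merge_Log On' \
  | strip_kubelet_options
```

This only transforms text. How the result gets back into the ConfigMap, and whether a later chart upgrade restores the original values, depends on your setup.
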

alexo1088 commented 1 year ago

Hey @rodrigobersa

Looks like you were right. As soon as I removed those two lines from the CM, and recreated the pods, the CPU spike was gone. Just to confirm, this shouldn't impact our ability to collect kubelet logs from the nodes, right? As long as we're able to collect those, then I think we're good to go! Thank you so much for all of your help on this!

rodrigobersa commented 1 year ago

Hi @alexo1088!

Thanks for getting back!

I'm not really sure. With these removed, Fluent Bit no longer queries the kubelet directly and instead gets pod metadata through the Kubernetes API server. The docs I mentioned above suggest enabling this feature only for large clusters, and since the requirements affect the cluster RBAC and the resources created by the Fluent Bit Helm chart, I don't currently see a way to add them to the repository patterns.

What I'd suggest is to check the logs that were generated and see whether the specific info you need is there. Otherwise, you can try adding the resources mentioned in the docs to the Terraform code so that you can enable the kubelet monitoring feature. I'll add some information about that to the docs.
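
A quick way to do that check for kubelet logs is to look for the kubelet systemd stream among the dataplane log streams. This is a minimal sketch; `kubelet_streams` is a made-up helper name, and the sample names are copied from the dataplane listing earlier in this thread:

```shell
# Keep only log stream names that carry kubelet systemd logs.
# Reads one stream name per line on stdin.
kubelet_streams() {
  grep -E '[.]systemd[.]kubelet[.]service$'
}

# Sample stream names from the dataplane listing above.
printf '%s\n' \
  'aws-for-fluent-bit-56zs7-dataplane.systemd.containerd.service' \
  'aws-for-fluent-bit-56zs7-dataplane.systemd.kubelet.service' \
  'aws-for-fluent-bit-5m7wb-dataplane.systemd.kubelet.service' \
  | kubelet_streams
```

Against a live account, the input could come from `aws logs describe-log-streams --log-group-name <dataplane log group> --query 'logStreams[].logStreamName' --output text | tr '\t' '\n'`. If every node has a `kubelet.service` stream with recent events, the kubelet logs are still being collected.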