ben851 opened this issue 7 months ago
From Slack: There are actually two different problems (yay). The original problem is the TLS errors. The good news here is that this is pretty much benign: it occurs when a new node is spun up, and only for about 1 second. It seems to be a race condition where the pod spins up before K8s has fully accepted the node (or something like that). After that 1 second, everything loads properly. I'm going to try introducing a very short delay in the fluent-bit startup to confirm this.
Possibly related: https://github.com/fluent/fluent-bit/issues/3544
Implemented a 2-second wait on fluent-bit startup to ensure that the node is fully initialized. This seems to have corrected the issue in staging.
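For reference, the change amounts to a command override in the fluent-bit DaemonSet along these lines. This is a minimal sketch, not the actual PR: the image, binary path, and config path are illustrative and depend on our manifest, and the image has to ship a shell for the override to work:

# Hypothetical excerpt of the fluent-bit DaemonSet pod spec.
# The sleep gives K8s a moment to finish accepting the node before
# fluent-bit starts opening TLS connections.
containers:
  - name: fluent-bit
    image: amazon/aws-for-fluent-bit:stable   # illustrative; use the pinned image from the manifest
    command: ["/bin/sh", "-c"]
    args:
      - sleep 2 && exec /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf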
PR to go to prod ready to merge. Will do so today.
Deployed to production, will evaluate over the coming days.
There was a config error in the fluent-bit DaemonSet in prod, so this part did not get deployed. A new PR with the fix has been created; aiming to release today.
Config fixed; no TLS errors in prod since.
Still have the error; have to tweak. Might have it wait as we do for the CWAgent.
Have a PR that skips TLS cert verification (this is fine). Will test...
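For the record, the skip amounts to setting tls.verify Off on the affected outputs. An illustrative stanza: the log group and region are taken from the logs in this thread, the match pattern and stream prefix are assumptions, and this relies on the in-core cloudwatch_logs plugin honouring the generic tls.verify property:

# Illustrative [OUTPUT] stanza. tls.verify Off disables server certificate
# verification only; the connection is still TLS-encrypted.
[OUTPUT]
    Name              cloudwatch_logs
    Match             application.*
    region            ca-central-1
    log_group_name    /aws/containerinsights/notification-canada-ca-production-eks-cluster/application
    log_stream_prefix fluent-bit-
    auto_create_group true
    tls.verify        Off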
The TLS cert verification skip didn't have the desired effect. Reverted it and added a 10s delay again to test.
Ben to verify whether the latest changes worked.
No errors in staging with node scale up/down. No errors in prod, but there's no Karpenter scaling there.
Moving to QA. Will verify prod when #232 is in production.
@ben851 We'd need either @sastels or me to do the QA check. Can you provide the QA steps, please?
QA Steps added
Steve will QA
Steve might QA
DEFINITELY will QA today.
See the logs for fluent-bit-6jwsb:
[2023/12/07 18:25:43] [ info] [input:tail:tail.7] inotify_fs_add(): inode=3158017 watch_fd=1 name=/var/log/messages
[2023/12/07 18:25:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=50571462 watch_fd=9 name=/var/log/containers/celery-email-send-primary-74fdfcd47b-t86mn_notification-canada-ca_celery-email-send-primary-6ded4b645235a466889c82540718010d32d30f03e55acc56f4bb18527bcdf56a.log
[2023/12/07 18:25:44] [ info] [input:tail:tail.1] inotify_fs_add(): inode=109509709 watch_fd=8 name=/var/log/containers/documentation-5484458b47-549nx_notification-canada-ca_documentation-c6e4e3962850d21bf1c7695b90fc947c27486447440c05181b1e5ffdcb3afdab.log
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.kubelet.service in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal.host.messages in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/host
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.kubelet.service already exists
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Log Stream ip-10-0-0-236.ca-central-1.compute.internal.host.messages already exists
[2023/12/07 18:25:44] [error] [tls] error: unexpected EOF
[2023/12/07 18:25:44] [error] [aws_client] connection initialization error
time="2023-12-07T18:25:44Z" level=info msg="[cloudwatch 1] Log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application already exists\n"
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.containerd.service in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.containerd.service already exists
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.tail.var.log.containers.kube-proxy-vgdkz_kube-system_kube-proxy-4c95faae1c63070c4ed18a7b5eb1592fb777152a8d3a9e925c1a67ab2bd60e75.log in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.tail.var.log.containers.kube-proxy-vgdkz_kube-system_kube-proxy-4c95faae1c63070c4ed18a7b5eb1592fb777152a8d3a9e925c1a67ab2bd60e75.log already exists
time="2023-12-07T18:25:46Z" level=info msg="[cloudwatch 1] Log stream fluent-bit-6jwsb_amazon-cloudwatch_fluent-bit-ebbe7e548be6fbbbc1a2c4d83ee3486f13fd40b05418886d1bd9e447615ff59d does not exist in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application"
time="2023-12-07T18:25:46Z" level=info msg="[cloudwatch 1] Created log stream fluent-bit-6jwsb_amazon-cloudwatch_fluent-bit-ebbe7e548be6fbbbc1a2c4d83ee3486f13fd40b05418886d1bd9e447615ff59d in group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application"
OK, after investigating this I think it's benign and we should just leave it as-is. Reasons:
1. This error actually occurred on an on-demand node, not a spot node, so it's not a question of autoscaling, and it has likely existed for a while.
2. It doesn't prevent logs from going through. I just verified that there are logs from today.
sounds good to me!
Describe the bug
When a new node is spun up, fluent-bit emits TLS errors for its first 1-2 seconds of running:

[2023/12/07 18:25:44] [error] [tls] error: unexpected EOF
[2023/12/07 18:25:44] [error] [aws_client] connection initialization error

After ~1s the errors disappear, once the node appears to be fully initialized.
To Reproduce
Delete a spot node in staging, wait for Karpenter to spin up a new one, and view the fluent-bit logs.
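A hedged sketch of those steps as commands. The capacity-type label is the one Karpenter applies to its nodes, and the namespace matches the logs above; the placeholder names are illustrative:

# Pick a spot node to delete
kubectl get nodes -l karpenter.sh/capacity-type=spot
kubectl delete node <spot-node-name>

# Once Karpenter provisions a replacement, find the fluent-bit pod on it
kubectl get pods -n amazon-cloudwatch -o wide

# Look for TLS errors in its first seconds of logs
kubectl logs -n amazon-cloudwatch <fluent-bit-pod> | grep -i error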
Expected behavior
Fluent-bit should wait until the node is fully initialized before sending logs.
Impact
These errors create noise in our environments and could potentially lead to log loss, although given that fluent-bit typically starts before Celery, this is less of a concern.
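On the log-loss point, fluent-bit's retry scheduler is the main safety net: a failed flush is retried rather than dropped, up to Retry_Limit attempts per output. A sketch of loosening it, reusing the same illustrative output as above (Retry_Limit False means retry indefinitely):

# Retrying indefinitely won't silence the startup error lines, but it
# should stop early failed flushes from being discarded.
[OUTPUT]
    Name              cloudwatch_logs
    Match             application.*
    region            ca-central-1
    log_group_name    /aws/containerinsights/notification-canada-ca-production-eks-cluster/application
    log_stream_prefix fluent-bit-
    auto_create_group true
    Retry_Limit       False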
QA
Once #232 is complete and has been running, check for scale up/scale down issues by grepping the fluent-bit logs for errors:
# Assumes the fluent-bit pods carry the k8s-app=fluent-bit label; adjust to match the DaemonSet.
PODS=$(kubectl get pods -n amazon-cloudwatch -l k8s-app=fluent-bit -o name)
for pod in $PODS; do
  echo "$pod"
  kubectl logs "$pod" -n amazon-cloudwatch | grep error
done