cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Fluent Bit TLS Errors On Startup of New Node #220

Open · ben851 opened this issue 7 months ago

ben851 commented 7 months ago

Describe the bug

When a new node is spun up, Fluent Bit emits TLS errors for its first 1-2 seconds of running:

[2023/11/09 00:48:34] [error] [tls] error: unexpected EOF
[2023/11/09 00:48:34] [error] [filter:kubernetes:notify-celery-k8s-filter] kubelet upstream connection error 

After ~1s the errors disappear, seemingly once the node is fully initialized.

To Reproduce

Delete a spot node in staging, wait for Karpenter to spin up a new one, then view the fluent-bit logs.
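
For reference, a rough sketch of the reproduction with kubectl (the spot-node label selector and the placeholder names are assumptions, not exact steps from the team's runbook):

kubectl get nodes -l karpenter.sh/capacity-type=spot          # pick a spot node (label selector is an assumption)
kubectl delete node <node-name>                               # Karpenter should provision a replacement
kubectl get pods -n amazon-cloudwatch -o wide                 # find the fluent-bit pod on the new node
kubectl logs -n amazon-cloudwatch <fluent-bit-pod> -f | grep -i tls   # watch for TLS errors during startup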

Expected behavior

Fluent-bit should wait until the node is fully initialized before sending logs

Impact

These errors create noise in our environments, and could potentially lead to log loss - although given that fluent bit is typically running before celery, this is less of a concern.

QA

Once #232 is complete and has been running, check the fluent-bit logs for errors during scale up/scale down:

for pod in $PODS; do
  echo "$pod"
  kubectl logs "$pod" -n amazon-cloudwatch | grep error
done

- [ ] Verify that there are no TLS EOF errors
ben851 commented 7 months ago

From Slack: there are actually two different problems (yay). The first is the original problem with the TLS errors. The good news here is that this is pretty much benign: it occurs when a new node is spun up, and only for about 1 second. It seems to be a race condition where the pod spins up so quickly that K8s hasn't fully accepted the node yet (or something like that). After that 1 second, everything loads properly. I'm going to try introducing a very short delay in the fluent-bit startup to confirm this.
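
For context, a minimal sketch of what such a startup delay could look like, assuming the DaemonSet container entrypoint can be wrapped in a shell command (the binary and config paths follow the stock fluent/fluent-bit image layout and are assumptions, not the exact change in the PR):

# Hypothetical entrypoint wrapper for the fluent-bit container:
# give the kubelet a moment to become ready before fluent-bit's
# kubernetes filter connects to it.
sleep 2 && exec /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf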

ben851 commented 7 months ago

Possibly related: https://github.com/fluent/fluent-bit/issues/3544

ben851 commented 7 months ago

Implemented a 2 second wait on fluent bit startup to ensure that the node is fully initialized. Seems to have corrected the issue in staging.

ben851 commented 7 months ago

PR to go to prod ready to merge. Will do so today.

ben851 commented 7 months ago

Deployed to production, will evaluate over the coming days.

ben851 commented 7 months ago

There was a config error in the daemonset for fluentbit in prod, so this part did not get deployed. New PR to fix created, aiming to release today.

ben851 commented 7 months ago

Config fixed, no TLS errors in prod since

sastels commented 7 months ago

Still have the error; have to tweak. Might have it wait as we do for CWAgent.

sastels commented 7 months ago

Have a PR that skips TLS cert verification (this is fine). Will test...
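
For context, a sketch of the kind of config this implies (fluent-bit classic config; the match pattern is an assumption, the alias is taken from the filter name in the error logs, and Use_Kubelet is assumed to be on given the "kubelet upstream connection error" - this is not the actual PR):

[FILTER]
    Name          kubernetes
    Alias         notify-celery-k8s-filter
    Match         kube.*
    Use_Kubelet   On
    tls.verify    Off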

ben851 commented 7 months ago

TLS cert verification skip didn't have the desired effect. Reverted and added 10s delay again to test.

jimleroyer commented 7 months ago

Ben to verify whether the latest changes worked.

ben851 commented 7 months ago

No errors in staging with node scale up/down. No errors in prod either, but there's no Karpenter scaling there.

Moving to QA. Will verify prod when #232 is in production

jimleroyer commented 7 months ago

@ben851 We'd need either @sastels or myself to make the QA check. Can you provide the QA steps please?

ben851 commented 7 months ago

QA Steps added

ben851 commented 7 months ago

#232 has been merged... Ready for QA

sastels commented 7 months ago

Steve will QA

sastels commented 7 months ago

Steve might QA

sastels commented 6 months ago

DEFINITELY will QA today.

sastels commented 6 months ago

See the logs for fluent-bit-6jwsb:

[2023/12/07 18:25:43] [ info] [input:tail:tail.7] inotify_fs_add(): inode=3158017 watch_fd=1 name=/var/log/messages
[2023/12/07 18:25:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=50571462 watch_fd=9 name=/var/log/containers/celery-email-send-primary-74fdfcd47b-t86mn_notification-canada-ca_celery-email-send-primary-6ded4b645235a466889c82540718010d32d30f03e55acc56f4bb18527bcdf56a.log
[2023/12/07 18:25:44] [ info] [input:tail:tail.1] inotify_fs_add(): inode=109509709 watch_fd=8 name=/var/log/containers/documentation-5484458b47-549nx_notification-canada-ca_documentation-c6e4e3962850d21bf1c7695b90fc947c27486447440c05181b1e5ffdcb3afdab.log
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.kubelet.service in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal.host.messages in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/host
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.kubelet.service already exists
[2023/12/07 18:25:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Log Stream ip-10-0-0-236.ca-central-1.compute.internal.host.messages already exists
[2023/12/07 18:25:44] [error] [tls] error: unexpected EOF
[2023/12/07 18:25:44] [error] [aws_client] connection initialization error
time="2023-12-07T18:25:44Z" level=info msg="[cloudwatch 1] Log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application already exists\n"
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.containerd.service in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.systemd.containerd.service already exists
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Creating log stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.tail.var.log.containers.kube-proxy-vgdkz_kube-system_kube-proxy-4c95faae1c63070c4ed18a7b5eb1592fb777152a8d3a9e925c1a67ab2bd60e75.log in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/dataplane
[2023/12/07 18:25:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] Log Stream ip-10-0-0-236.ca-central-1.compute.internal-dataplane.tail.var.log.containers.kube-proxy-vgdkz_kube-system_kube-proxy-4c95faae1c63070c4ed18a7b5eb1592fb777152a8d3a9e925c1a67ab2bd60e75.log already exists
time="2023-12-07T18:25:46Z" level=info msg="[cloudwatch 1] Log stream fluent-bit-6jwsb_amazon-cloudwatch_fluent-bit-ebbe7e548be6fbbbc1a2c4d83ee3486f13fd40b05418886d1bd9e447615ff59d does not exist in log group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application"
time="2023-12-07T18:25:46Z" level=info msg="[cloudwatch 1] Created log stream fluent-bit-6jwsb_amazon-cloudwatch_fluent-bit-ebbe7e548be6fbbbc1a2c4d83ee3486f13fd40b05418886d1bd9e447615ff59d in group /aws/containerinsights/notification-canada-ca-production-eks-cluster/application"
ben851 commented 6 months ago

OK, after investigating this, I think it's benign and we should just leave it as is. Reasons:

1. This error actually occurred on an on-demand node, not a spot node, so it's not a question of autoscaling, and it has likely existed for a while.
2. It doesn't prevent logs from going through. I just verified that there are logs from today.

sastels commented 6 months ago

sounds good to me!