aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: InvalidDiskCapacity warning on EKS Fargate #1403

Open mickael-ange opened 3 years ago

mickael-ange commented 3 years ago

Tell us about your request What do you want us to build?

Since https://github.com/aws/containers-roadmap/issues/625 shipped, we have started scheduling workloads on EKS Fargate. Every time a new Fargate node is created, the node reports the following warning:

invalid capacity 0 on image filesystem.

kubectl describe node fargate-ip-172-17-8-64.ap-northeast-1.compute.internal

Events:
  Type     Reason                   Age    From                                                             Message
  ----     ------                   ----   ----                                                             -------
  Normal   Starting                 9m25s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Starting kubelet.
  Warning  InvalidDiskCapacity      9m25s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  9m25s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Node fargate-ip-172-17-8-64.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    9m25s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Node fargate-ip-172-17-8-64.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     9m25s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Node fargate-ip-172-17-8-64.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  9m24s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Updated Node Allocatable limit across pods
  Normal   NodeReady                9m15s  kubelet, fargate-ip-172-17-8-64.ap-northeast-1.compute.internal  Node fargate-ip-172-17-8-64.ap-northeast-1.compute.internal status is now: NodeReady

We use BotKube to monitor our EKS clusters. Warnings and errors are sent to our Slack channels. The InvalidDiskCapacity warning above now "spams" us for every pod scheduled on EKS Fargate.

I'm wondering whether we are the only ones affected, whether this is a temporary issue with the EKS Fargate scheduler, and whether AWS plans to address this warning in the near future.

Which service(s) is this request for?

EKS Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

I'm trying to avoid false-positive alarms on EKS clusters with workloads scheduled on Fargate.

Are you currently working around this issue? How are you currently solving this problem?

I have implemented a custom BotKube filter that ignores the "invalid capacity 0 on image filesystem" Node event on Fargate.

Here is the custom filter for those who want to have a look: botkube/pkg/filterengine/filters/custom_node_event_checker.go

// CustomNodeEventsChecker filter to send notifications on critical node events

package filters

import (
    "strings"

    "github.com/infracloudio/botkube/pkg/events"
    "github.com/infracloudio/botkube/pkg/filterengine"
    "github.com/infracloudio/botkube/pkg/log"
)

const (
    // InvalidDiskCapacity EventReason when Node has InvalidDiskCapacity
    InvalidDiskCapacity string = "InvalidDiskCapacity"
)

// CustomNodeEventsChecker inspects Node events and marks known-noisy ones to be skipped
type CustomNodeEventsChecker struct {
    Description string
}

// Register filter
func init() {
    filterengine.DefaultFilterEngine.Register(CustomNodeEventsChecker{
        Description: "Sends notifications on node level critical events.",
    })
}

// Run filters Node events and modifies the event struct
func (f CustomNodeEventsChecker) Run(object interface{}, event *events.Event) {

    // Run filter only on Node events
    if event.Kind != "Node" {
        return
    }

    log.Debugf("CustomNodeEventsChecker, object: %+v\n------------", object)
    log.Debugf("CustomNodeEventsChecker, event: %+v\n------------", event)

    // Update event details
    // Mark the known-noisy InvalidDiskCapacity warning from Fargate nodes as skipped
    switch event.Reason {
    case InvalidDiskCapacity:
        log.Debug("Node has InvalidDiskCapacity, checking whether it can be ignored")
        if strings.Contains(event.Name, "fargate-ip-") {
            for _, m := range event.Messages {
                // As of 2021/06/17 skip warning events due to invalid capacity 0 on image filesystem during Fargate node creation
                // See https://github.com/aws/containers-roadmap/issues/1403
                if strings.Contains(m, "invalid capacity 0 on image filesystem") {
                    log.Debug("Skipping Node event with InvalidDiskCapacity for EKS Fargate")
                    event.Skip = true
                    // One matching message is enough; stop scanning the rest
                    break
                }
            }
        }
    }

    log.Debug("Node Critical Event filter successful!")
}

// Describe filter
func (f CustomNodeEventsChecker) Describe() string {
    return f.Description
}
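For anyone who wants to try the matching rule without building BotKube, here is a minimal standalone sketch of the same check. The nodeEvent type and the shouldSkipNodeEvent function are hypothetical names made up for this example; they only mirror the four fields of events.Event that the filter above reads, and are not part of BotKube's API.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeEvent is a hypothetical stand-in for the fields of BotKube's
// events.Event that the filter above reads. It is not a BotKube type.
type nodeEvent struct {
	Kind     string
	Name     string
	Reason   string
	Messages []string
}

const (
	invalidDiskCapacity = "InvalidDiskCapacity"
	fargateNodeMarker   = "fargate-ip-"
	imageFsMessage      = "invalid capacity 0 on image filesystem"
)

// shouldSkipNodeEvent reports whether an event is the InvalidDiskCapacity
// warning emitted for a Fargate node, using the same substring checks as
// the filter above.
func shouldSkipNodeEvent(e nodeEvent) bool {
	if e.Kind != "Node" || e.Reason != invalidDiskCapacity {
		return false
	}
	if !strings.Contains(e.Name, fargateNodeMarker) {
		return false
	}
	for _, m := range e.Messages {
		if strings.Contains(m, imageFsMessage) {
			return true
		}
	}
	return false
}

func main() {
	fargate := nodeEvent{
		Kind:     "Node",
		Name:     "fargate-ip-172-17-8-64.ap-northeast-1.compute.internal",
		Reason:   invalidDiskCapacity,
		Messages: []string{imageFsMessage},
	}
	// Same warning on a node whose name does not contain the Fargate marker
	// (the node name here is invented for illustration).
	other := fargate
	other.Name = "ip-172-17-8-64.ap-northeast-1.compute.internal"

	fmt.Println(shouldSkipNodeEvent(fargate))
	fmt.Println(shouldSkipNodeEvent(other))
}
```

Keeping the rule in a pure function like this makes it easy to unit test separately from the filter engine; the real filter would simply set event.Skip when the function returns true.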

Additional context Anything else we should know?

We don't see this issue with self-managed EKS nodes or AWS-managed EKS nodes.

Thanks in advance for your time.

Hunter-Thompson commented 3 years ago

We have the same issue. It has been spamming our BotKube channel since well before #625 shipped.

herod2k commented 3 years ago

Same here, same error message:

invalid capacity 0 on image filesystem

EKS on Fargate too.

FireballDWF commented 1 year ago

I see the same Warning-type event when I run "kubectl describe node instance_name", where instance_name is the DNS name of a control-plane,master node in an EKS Local Cluster.

Narsilion commented 1 year ago

Same warning here, but I can't see any ill effects from it.