JacobAmar opened 2 months ago
Hi @JacobAmar, If Karpenter sees an InsufficientCapacityError, it will immediately delete the NodeClaim and start spinning up a new node. It will also update the cache to mark that instance type as unavailable. Can you share logs from when this happened?
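For context, a minimal sketch of that cache behavior might look like the following, assuming a TTL cache keyed by instance type, zone, and capacity type. All names here (unavailableOfferings, MarkUnavailable, IsUnavailable) are illustrative rather than Karpenter's actual API:

```go
package main

import (
	"fmt"
	"time"

	"github.com/patrickmn/go-cache"
)

// unavailableOfferings tracks offerings that recently returned an
// InsufficientCapacityError so the scheduler can skip them until the
// cache entries expire. Illustrative only; not Karpenter's actual code.
type unavailableOfferings struct {
	cache *cache.Cache
}

func newUnavailableOfferings(ttl time.Duration) *unavailableOfferings {
	// Entries expire after ttl; expired entries are purged on the same interval.
	return &unavailableOfferings{cache: cache.New(ttl, ttl)}
}

// MarkUnavailable records an offering after a failed launch attempt.
func (u *unavailableOfferings) MarkUnavailable(instanceType, zone, capacityType string) {
	u.cache.SetDefault(fmt.Sprintf("%s:%s:%s", instanceType, zone, capacityType), struct{}{})
}

// IsUnavailable reports whether an offering should currently be skipped.
func (u *unavailableOfferings) IsUnavailable(instanceType, zone, capacityType string) bool {
	_, found := u.cache.Get(fmt.Sprintf("%s:%s:%s", instanceType, zone, capacityType))
	return found
}

func main() {
	u := newUnavailableOfferings(3 * time.Minute)
	u.MarkUnavailable("c6i.large", "us-east-1f", "on-demand")
	fmt.Println(u.IsUnavailable("c6i.large", "us-east-1f", "on-demand")) // true until the TTL expires
}
```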
Hi, we encountered an issue where AWS had problems launching c6 instances in us-east-1f (and sometimes us-east-1b). This caused Karpenter nodes to remain in an unready state for an extended period (15 minutes) before being terminated. While AWS is addressing the underlying infrastructure issue, this highlights some limitations in Karpenter's current node provisioning behavior.
Currently, Karpenter has a hardcoded 15-minute timeout before terminating a node that fails to launch, as seen here
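For illustration, the shape of such a hardcoded timeout might look like the sketch below. The 15-minute value matches the behavior described above, but the identifiers (registrationTTL, shouldTerminate) are ours, not Karpenter's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// registrationTTL mirrors the hardcoded 15-minute window described above:
// a node that has not registered within this window is terminated.
const registrationTTL = 15 * time.Minute

// shouldTerminate reports whether an unregistered node has exceeded the TTL
// and should be deleted so a replacement can be provisioned.
func shouldTerminate(launchedAt time.Time, registered bool) bool {
	return !registered && time.Since(launchedAt) >= registrationTTL
}

func main() {
	launchedAt := time.Now().Add(-16 * time.Minute)
	fmt.Println(shouldTerminate(launchedAt, false)) // true: past the 15m window
}
```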
We believe this could be improved with three key enhancements:
1. Allow users to configure the timeout duration before Karpenter terminates a node that fails to launch. This provides greater control over node provisioning behavior and allows faster response to transient infrastructure issues.
2. Implement a mechanism for Karpenter to automatically try different instance types after a configurable number of unsuccessful launch attempts for a specific instance type. This would improve provisioning success rates when specific instance types are unavailable or experiencing problems.
3. Extend the fallback mechanism to include trying different Availability Zones after encountering launch failures within a single AZ for a given period. This would help mitigate issues localized to specific AZs (see the sketch below this list for how (2) and (3) might work).
These enhancements would make Karpenter more resilient to infrastructure disruptions and provide users with more granular control over node provisioning.
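To make enhancements (2) and (3) concrete, here is a hypothetical sketch of a per-offering failure tracker: launch failures are counted per (instance type, zone), and once a configurable threshold is reached within a window, the offering is excluded so other instance types or AZs are tried instead. This is a proposal illustration only; none of these names exist in Karpenter today:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// failureTracker counts launch failures per (instance type, zone) offering.
type failureTracker struct {
	mu        sync.Mutex
	threshold int           // failures within the window before exclusion
	window    time.Duration // how long a failure counts against an offering
	failures  map[string][]time.Time
}

func newFailureTracker(threshold int, window time.Duration) *failureTracker {
	return &failureTracker{threshold: threshold, window: window, failures: map[string][]time.Time{}}
}

// RecordFailure notes a failed launch attempt for an offering.
func (t *failureTracker) RecordFailure(instanceType, zone string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := instanceType + "/" + zone
	t.failures[key] = append(t.failures[key], time.Now())
}

// Excluded reports whether the offering has hit the failure threshold within
// the window and should be skipped in favor of other instance types or AZs.
func (t *failureTracker) Excluded(instanceType, zone string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := instanceType + "/" + zone
	recent := t.failures[key][:0]
	for _, ts := range t.failures[key] {
		if time.Since(ts) < t.window {
			recent = append(recent, ts) // keep only failures inside the window
		}
	}
	t.failures[key] = recent
	return len(recent) >= t.threshold
}

func main() {
	t := newFailureTracker(2, 10*time.Minute)
	t.RecordFailure("c6i.large", "us-east-1f")
	t.RecordFailure("c6i.large", "us-east-1f")
	fmt.Println(t.Excluded("c6i.large", "us-east-1f")) // true: try another type or zone
}
```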
Karpenter will fall back to a different instance type if it fails to launch a NodeClaim, just like you mentioned in your second enhancement suggestion. However, if an issue occurs during registration, we don't expect the instance type itself to be unavailable. While we can consider adding a configurable timeout for node termination, I was wondering if you could share your logs so we can check whether Karpenter was able to launch the NodeClaims successfully.
Description
Observed Behavior:
We experienced prolonged (~15 minute) delays in node provisioning due to AWS experiencing internal issues launching c6 instances in us-east-1f (and intermittently us-east-1b). The problem was visible in the EC2 console and not related to capacity constraints.
Currently, Karpenter takes an extended time to react and recover from such failures.
Expected Behavior:
Reproduction Steps:
nodepool.yaml
Deploy.yaml
Explanation:
Versions:
Chart Version: 0.37.0
Kubernetes Version (kubectl version):