-
### Description
**Observed Behavior**:
The `karpenter.cluster_state.synced` metric drops to 0 for ~30 minutes and no new nodes can come up, leaving many pods pending. Worse, we are undergoing sp…
-
@vitillo: Analysis jobs have different requirements in terms of hardware resources; some might benefit from more cores while others might benefit from more memory. Our users would like to select the i…
-
The inference system currently runs on an AKS cluster with 3 Standard B4ms (4 vCPUs, 16 GiB memory) VMs. Optimize the usage:
1. Adjust the pod resource requests if not used
2. Adjust the VM SKU to match …
micya updated 1 month ago
-
Currently, we start benchmark clusters with the `spot_with_fallback` policy. While this makes sense from a cost perspective, spot replacement will mess up the results. When running benchmarks, we shou…
-
GCP recently announced an evolution of preemptible VMs. They are now called Spot VMs.
More info here: https://cloud.google.com/compute/docs/instances/spot
-
We should support running spot instances on AWS.
Things to consider:
- should it be per-instance or global per-cluster?
- how do we get notified of an impending termination?
- how should we hand…
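
On the notification question, one option is to poll the EC2 instance metadata service, which publishes a Spot interruption notice at `/latest/meta-data/spot/instance-action` roughly two minutes before the instance is reclaimed and returns 404 otherwise. Below is a minimal sketch, assuming IMDSv2 is enabled and that polling from the instance itself is acceptable; the function names `fetch_interruption_notice` and `seconds_until_interruption` are hypothetical and not part of any existing codebase.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Link-local address of the EC2 instance metadata service (IMDS).
IMDS = "http://169.254.169.254"


def fetch_interruption_notice(timeout: float = 1.0) -> Optional[str]:
    """Return the raw interruption notice, or None if none is scheduled.

    Only works when run on an EC2 instance (hypothetical helper).
    """
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    notice_req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(notice_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption currently scheduled
            return None
        raise


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Parse a notice body and return the seconds left before the action."""
    notice = json.loads(body)
    if notice.get("action") not in ("terminate", "stop", "hibernate"):
        return None
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()
```

A caller would poll `fetch_interruption_notice()` every few seconds and, once it returns a body, use `seconds_until_interruption(body, datetime.now(timezone.utc))` to decide how much time is left to drain work. Whether this should run per instance or be aggregated per cluster is left open, as in the list above.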
-
We should probably review our k8s setup with respect to spot instances and instance types. Over the last few weeks, the cluster has been very unstable, with machines starting/stopping as soon as I start cre…
-
Hello Vault Secrets Webhook Team,
I am currently using the Vault Secrets Webhook Helm chart version 1.19.0 for secret injection into pods. My setup, including the values.yaml, works well most of th…
-
I'd like to propose a feature for implementing fail-safe mechanisms and partial redundancy in FSDP2 (possibly not FSDP already, more like HSDP) to allow for more robust training on unreliable compute …
-
### Description
*Note that I am cross-posting this from https://github.com/aws/karpenter-provider-aws/issues/7254 as the more I look into the issue, the more it seems to be related to core Karpenter …