Closed jlevydev closed 2 months ago
Can you provide an example pod manifest where this problem occurs?
@jlevydev I believe we need to look into the particulars of your setup as I am unable to reproduce.
I have created an instance of kube_deployment with the following temporary directory:
```hcl
tmp_directories = {
  "tmp" = {
    mount_path = "/tmp"
    node_local = false
    size_mb    = 1024
  }
}
```
The resulting pod has the following resource config:
```yaml
resources:
  limits:
    ephemeral-storage: 100Mi
    memory: "136314880"
  requests:
    cpu: 10m
    ephemeral-storage: 100Mi
    memory: "104857600"
```
I am then able to write a 1GB file to the /tmp directory (dd if=/dev/zero of=/tmp/test bs=1M count=1024) successfully and without disruption.
This aligns with the Kubernetes specification as described here, which explicitly lists what is controlled by the ephemeral-storage limits / requests. Mounted PVCs (created when node_local = false) are not included in that limit; this makes sense because ephemeral-storage is meant to allocate resources provided directly by the node, and PVCs are not node-scoped.
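For reference, here is a rough sketch of how the two cases typically show up in a pod spec. The volume and claim names are invented for the example, and the assumption that node_local = true is backed by an emptyDir is mine; the sketch is only meant to show why one kind of usage is charged to ephemeral-storage and the other is not.

```yaml
# Illustrative pod spec fragment, not generated by the module.
volumes:
  # Node-provided scratch space (e.g. node_local = true): data written here
  # counts toward the pod's local ephemeral-storage usage and can trigger eviction.
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi
  # PVC-backed directory (node_local = false): usage is accounted to the
  # PersistentVolume, not to the node's ephemeral storage.
  - name: worker-tmp
    persistentVolumeClaim:
      claimName: worker-tmp-pvc
```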
There are a couple of possibilities for what you may be experiencing:

1. You have not set readonly to true and are actually writing files outside of your temporary directories. This creates a "writable container layer", and the size of this layer is applied against the ephemeral-storage limit (see the sketch after this list).
2. You may have a runaway process that is writing a huge amount of logs, overwhelming the node's ability to process and rotate the log files. As on-disk log size is also counted against the ephemeral-storage limit, this could also result in the problem you are seeing.
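To illustrate the first possibility: a read-only root filesystem makes stray writes fail immediately instead of silently growing the writable layer. This is a generic, minimal Kubernetes sketch; whether the module's readonly flag maps exactly to readOnlyRootFilesystem is an assumption on my part, and the container name, image, and volume name are placeholders.

```yaml
# Generic illustration only, not the module's generated manifest.
containers:
  - name: worker            # placeholder name
    image: example/worker   # placeholder image
    securityContext:
      # With this set, writes anywhere outside mounted volumes fail (EROFS),
      # so nothing accumulates in the writable container layer.
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: scratch       # writes have to go to an explicit volume like this
        mountPath: /tmp
```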
Here's my kubernetes_deployment object:
module "worker_deployment" {
source = "github.com/Panfactum/stack.git//packages/infrastructure/kube_deployment?ref=dfa23313b6748c957f95af6eb1a322e0fe170f12" # pf-update
containers = [{
name = local.worker_name
image = var.image_repo
version = var.image_version
command = var.is_local ? ["gow", "-r=false", "run", "/src/."] : ["/main"]
minimum_memory = var.worker_minimum_memory
env = merge(local.shared_env, { MODE = "WORKER" })
liveness_check_port = 3000
liveness_check_type = "HTTP"
liveness_check_route = "/health"
}]
name = local.worker_name
namespace = module.namespace.namespace
arm_nodes_enabled = true
burstable_nodes_enabled = true
host_anti_affinity_required = false
ignore_replica_count = true // TODO
panfactum_scheduler_enabled = true
ports = {
primary = {
service_port = 3000
pod_port = 3000
}
}
secrets = var.secrets
spot_nodes_enabled = true
tmp_directories = {
"/tmp" = {
mount_path = "/tmp"
size_mb = 2000
node_local = true
},
"/.config" = {
mount_path = "/.config"
node_local = true
}
}
vpa_enabled = true
depends_on = [module.redis]
# pf-generate: pass_vars
pf_stack_version = var.pf_stack_version
pf_stack_commit = var.pf_stack_commit
environment = var.environment
region = var.region
pf_root_module = var.pf_root_module
is_local = var.is_local
extra_tags = var.extra_tags
# end-generate
}
I am using this package for file watching and rebuilding in my local-dev set up: https://github.com/mitranim/gow/tree/master
I am a little confused, as in that example you have node_local set to true? In the original post, you were referencing issues with temporary directories with node_local set to false (non-node_local).
Sorry, that's my current deployment; the one causing issues was the same except with node_local unset, since it's optional in the object. I didn't want to risk the issues I was seeing in my local dev environment in production, so this is just the current configuration.
Ok. Can you post an actual pod manifest for an example where you are experiencing the issue?
For example, in k9s, highlight the pod, press y, press c, then paste here.
@jlevydev I am going to close this issue as, given our attempts to reproduce, I do not think this is a bug in the stack.
Feel free to re-open / continue commenting here if you find any additional information that would indicate otherwise.
Prior Search
What happened?
I was attempting to deploy a workload for local dev that builds using a /tmp directory, however the pods kept crashing or finding their way into an Unknown state. My tmp_directories were all using the default of node_local = false, however on pod start up I saw the below:

Looking at the kube_pod module, it does not add the non-node_local storage to its ephemeral-storage limit on the container spec, and it seems like when usage goes past that limit, even without a mounted volume, the container crashes.

Steps to Reproduce
1. Create a non-node_local tmp_directory with capacity greater than that of the node_local tmp_directories + 100Mi (the default overhead over the node_local storage).
2. Write to the non-node_local tmp_directory until it exceeds that amount.
Relevant log output
No response