[Bug]: Non `node_local` ephemeral pod storage does not get accounted for correctly

jlevydev commented 3 months ago

Prior Search

[X] I have already searched this project's issues to determine if a bug report has already been made.

What happened?

I was attempting to deploy a workload for local dev that builds using a /tmp directory, but the pods kept crashing or ending up in the Unknown state. My tmp_directories all used the default of node_local = false. On pod startup I saw the below:

Looking at the kube_pod module, it does not add the non-node_local storage to the ephemeral-storage limit on the container spec, and it seems that once usage goes past that limit, even without a mounted volume, the container crashes.

Steps to Reproduce

Create a workload with a non-node_local tmp_directory with capacity greater than that of the node_local tmp_directories + 100Mi (the default overhead over the node_local storage)
Use the entirety of the non-node_local tmp_directory
Observe your container bomb

Relevant log output

No response

fullykubed commented 3 months ago

Can you provide an example pod manifest where this problem occurs?

fullykubed commented 3 months ago

@jlevydev I believe we need to look into the particulars of your setup as I am unable to reproduce.

I have created an instance of kube_deployment with the following temporary directory:

  tmp_directories = {
    "tmp" = {
      mount_path = "/tmp"
      node_local = false
      size_mb = 1024
    }
  }

The resulting pod has the following resource config:

    resources:
      limits:
        ephemeral-storage: 100Mi
        memory: "136314880"
      requests:
        cpu: 10m
        ephemeral-storage: 100Mi
        memory: "104857600"

I am then able to write a 1GB file to the /tmp directory (dd if=/dev/zero of=/tmp/test bs=1M count=1024) successfully and without disruption.

This aligns with the Kubernetes specification as described here which explicitly lists what is controlled by the ephemeral-storage limits / requests. Mounted PVCs (created when node_local = false) are not included in that limit; this makes sense b/c ephemeral-storage is meant to allocate resources provided directly by the node and PVCs are not node-scoped.
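As a rough illustration of that accounting, here is a minimal Python sketch. It is not the kube_pod module's actual code: the `TmpDirectory` class and the `ephemeral_storage_limit_mi` function are hypothetical names, the 100Mi overhead is taken from the figure quoted earlier in this thread, and treating `size_mb` as Mi is an assumption made only to keep the arithmetic simple.

```python
from dataclasses import dataclass

# Assumed default overhead, taken from the "100Mi" figure quoted in this thread.
OVERHEAD_MI = 100


@dataclass
class TmpDirectory:
    """Hypothetical stand-in for one tmp_directories entry."""
    mount_path: str
    size_mb: int
    node_local: bool = False  # the thread states that false is the default


def ephemeral_storage_limit_mi(dirs):
    """Return the overhead plus the sizes of node_local directories only.

    PVC-backed (node_local = false) directories are excluded because they
    are not provided by the node's own disk.
    """
    node_local_total = sum(d.size_mb for d in dirs if d.node_local)
    return OVERHEAD_MI + node_local_total


# The example from this comment: one 1024 MB PVC-backed /tmp directory.
pvc_backed = [TmpDirectory("/tmp", 1024, node_local=False)]
print(f"{ephemeral_storage_limit_mi(pvc_backed)}Mi")  # prints "100Mi"
```

Under these assumptions the PVC-backed directory contributes nothing to the limit, which matches the 100Mi value in the resource config above.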

There are a couple possibilities for what you may be experiencing:

You have not set readonly to true and are actually writing files outside of your temporary directories. This creates a "writeable container layer" and the size of this layer is applied against the ephemeral-storage limit.
You may have a runaway process that is writing a huge amount of logs and this is overwhelming the node's abilities to process and rotate the log files. As on-disk log size is also counted against the ephemeral-storage limit, this could also result in the problem you are seeing.
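To make the first possibility concrete, here is a small Python sketch of which writes would count against the ephemeral-storage limit. The function name and the shape of the `mounts` mapping are hypothetical, it assumes the root filesystem is writable, and it does not handle nested mounts or log output.

```python
from pathlib import PurePosixPath


def counts_against_limit(write_path, mounts):
    """Return True if a write to write_path would use node-provided storage.

    mounts maps a mount path to its node_local flag (hypothetical shape).
    """
    target = PurePosixPath(write_path)
    for mount_path, node_local in mounts.items():
        mount = PurePosixPath(mount_path)
        if target == mount or mount in target.parents:
            # node_local directories live on the node's disk; PVC-backed ones do not.
            return node_local
    # No mount matched, so the write lands in the writable container layer.
    return True


mounts = {"/tmp": False}  # a single PVC-backed temporary directory
print(counts_against_limit("/tmp/test", mounts))        # prints "False"
print(counts_against_limit("/var/cache/build", mounts))  # prints "True"
```

In this sketch, any path outside the declared temporary directories is charged to the limit, which is why a build tool that caches files elsewhere on the filesystem could exhaust a 100Mi limit quickly.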

jlevydev commented 3 months ago

Here's my kubernetes_deployment object:

module "worker_deployment" {
  source = "github.com/Panfactum/stack.git//packages/infrastructure/kube_deployment?ref=dfa23313b6748c957f95af6eb1a322e0fe170f12" # pf-update

  containers = [{
    name = local.worker_name
    image = var.image_repo
    version = var.image_version
    command = var.is_local ? ["gow", "-r=false", "run", "/src/."] : ["/main"]
    minimum_memory = var.worker_minimum_memory
    env = merge(local.shared_env, { MODE = "WORKER" })
    liveness_check_port = 3000
    liveness_check_type = "HTTP"
    liveness_check_route = "/health"
  }]

  name = local.worker_name
  namespace = module.namespace.namespace
  arm_nodes_enabled = true
  burstable_nodes_enabled = true
  host_anti_affinity_required = false
  ignore_replica_count = true // TODO
  panfactum_scheduler_enabled = true
  ports = {
    primary = {
      service_port = 3000
      pod_port     = 3000
    }
  }
  secrets = var.secrets
  spot_nodes_enabled = true
  tmp_directories = { 
    "/tmp" = {
      mount_path = "/tmp"
      size_mb = 2000
      node_local = true
    }, 
    "/.config" = {
      mount_path = "/.config"
      node_local = true
    } 
  }
  vpa_enabled = true

  depends_on = [module.redis]

  # pf-generate: pass_vars
  pf_stack_version = var.pf_stack_version
  pf_stack_commit  = var.pf_stack_commit
  environment      = var.environment
  region           = var.region
  pf_root_module   = var.pf_root_module
  is_local         = var.is_local
  extra_tags       = var.extra_tags
  # end-generate
}

I am using this package for file watching and rebuilding in my local-dev set up: https://github.com/mitranim/gow/tree/master

fullykubed commented 3 months ago

I am a little confused, as in that example you have node_local set to true. In the original post, you were referencing issues with temporary directories with node_local set to false (non-node_local).

jlevydev commented 3 months ago

Sorry, that's my current deployment. The one causing issues was the same except with node_local unset, since it's optional in the object. I didn't want to risk the issues I was seeing in my local dev environment in production, so this is just the current configuration.

fullykubed commented 3 months ago

Ok. Can you post an actual pod manifest for an example where you are experiencing the issue?

For example, in k9s, highlight the pod, press y, press c, then paste here.

fullykubed commented 2 months ago

@jlevydev I am going to close this issue because, given our attempts to reproduce, I do not think this is a bug in the stack.

Feel free to re-open / continue commenting here if you find any additional information that would indicate otherwise.
