estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev

Constrain connector storage space #952

Open mdibaiee opened 1 year ago

mdibaiee commented 1 year ago

Currently our pods don't have a limit on the ephemeral-storage provided to them, which means they have access to all the storage of the node (~100G).
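As a rough illustration of what such a limit could look like, here is a minimal Python sketch that builds a Pod manifest with `ephemeral-storage` requests and limits set on its container. The `resources.requests` / `resources.limits` keys are the standard Kubernetes fields; the function name, pod name, image and sizes are hypothetical placeholders, not values from our deployment. Note that, per the caution quoted below, the limit only has an effect if the kubelet is measuring local ephemeral storage.

```python
import json


def pod_with_storage_limit(name, image, request="1Gi", limit="2Gi"):
    """Build a Pod manifest whose single container has ephemeral-storage bounds.

    All argument values are illustrative assumptions, not our real settings.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # Amount the scheduler reserves on the node.
                        "requests": {"ephemeral-storage": request},
                        # Usage above this makes the pod a candidate for eviction.
                        "limits": {"ephemeral-storage": limit},
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical connector pod; the JSON output can be saved to a file
    # and applied with `kubectl apply -f`.
    manifest = pod_with_storage_limit("connector", "example.invalid/connector:dev")
    print(json.dumps(manifest, indent=2))
```

With a limit like this in place, a single misbehaving pod would be evicted on its own rather than dragging down every pod on the node.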

Likewise, our connector containers have no disk size limits, so in effect they can fill up the node's disk, exhausting its space and causing the eviction of all pods on that node.

Unfortunately, limiting the disk size of a Docker container is not trivial, at least not with our current storage driver and filesystem (overlay2 with extfs, see here).

From Resource Management for Pods and Containers:

Caution:

If the kubelet is not measuring local ephemeral storage, then a Pod that exceeds its local storage limit will not be evicted for breaching local storage resource limits.

However, if the filesystem space for writeable container layers, node-level logs, or emptyDir volumes falls low, the node taints itself as short on local storage and this taint triggers eviction for any Pods that don't specifically tolerate the taint.

See the supported configurations for ephemeral local storage.

My understanding is that, because the connectors currently write directly to the node's volume, the node may run low on storage, taint itself, and evict all of its pods (unless they tolerate the taint, which ours don't). By "kill a node" I meant killing all the pods on the node, which means those pods have to be rescheduled.
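To make the "falls low" condition concrete, here is a simplified Python sketch of the kind of check involved. It is a model, not the kubelet's actual logic: the 10% default threshold is an assumption based on my recollection of the kubelet's default hard eviction setting for node filesystem availability, and the function names are hypothetical.

```python
import shutil


def available_fraction(total_bytes, free_bytes):
    """Fraction of the filesystem that is still free."""
    return free_bytes / total_bytes


def is_low_on_storage(total_bytes, free_bytes, threshold=0.10):
    """Return True when free space drops below the threshold.

    threshold=0.10 is an assumed default; the real value depends on
    how the kubelet is configured.
    """
    return available_fraction(total_bytes, free_bytes) < threshold


if __name__ == "__main__":
    # Inspect the filesystem that holds "/" on the machine running this script.
    usage = shutil.disk_usage("/")
    print("low on storage:", is_low_on_storage(usage.total, usage.free))
```

On a ~100G node, this model would flag the node once a connector has left less than roughly 10G free, at which point every pod on the node is at risk, not just the one that wrote the data.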

mdibaiee commented 1 year ago

See https://github.com/estuary/ops/pull/285