SUSE / doc-cap

SUSE Cloud Application Platform Documentation
https://documentation.suse.com/suse-cap/

CaaSP 4.x settings for a better tuned CAP #1031

Closed gaktive closed 3 years ago

gaktive commented 3 years ago

We need to ensure CaaSP 4.x has these settings in place for a more stable CAP:

The setting that affects nested containers in /etc/crio/crio.conf:

# Maximum number of processes allowed in a container.
pids_limit = 1024

After raising that number on a running system, you will see containers with more than 1024 PIDs, which is expected with CAP. For running CATs on a single diego-cell, pids_limit=1024 is not enough. Due to the nature of nested containers, this may also be true for production systems running Diego. At a minimum, this should be documented for "Running CAP on CaaSP".
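For illustration, a raised limit in /etc/crio/crio.conf could look like this (the value 3072 is only an example from this discussion, not a vetted recommendation):

```toml
[crio.runtime]
# Maximum number of processes allowed in a container.
# 3072 is an illustrative value; tune it for your workload.
pids_limit = 3072
```

A restart of crio (e.g. `systemctl restart crio`) would be needed for the change to take effect.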

Additional limits need to be set for crio in /usr/lib/systemd/system/crio.service, and these need to be raised too: LimitNOFILE and LimitNPROC.

btat commented 3 years ago

Hi @svollath, for pids_limit in /etc/crio/crio.conf.d/00-default.conf, which value should the docs use as a guideline? Should it be -1 (from your Confluence notes) or 32768 (from your more recent response in Bugzilla 1179109)? Thanks

Martin-Weiss commented 3 years ago

Not sure if we should modify /etc/crio/crio.conf.d/00-default.conf in CaaSP 4.5 -> I believe we have to adjust ./addons/cri/conf.d/99-custom.conf for skuba-based deployments, which I believe ends up as /etc/crio/crio.conf.d/99-custom.conf.
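As a sketch, such a skuba drop-in could carry only the override (assuming drop-ins in crio.conf.d are merged over the defaults; the value is an example only):

```toml
# ./addons/cri/conf.d/99-custom.conf
# (deployed by skuba as /etc/crio/crio.conf.d/99-custom.conf)
[crio.runtime]
pids_limit = 3072
```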

For CaaSP 4.2 we might have to adjust crio.conf directly.

Adjusting /usr/lib/systemd/system/crio.service is also not the best way to go - AFAIK custom adjustments have to go into a systemd drop-in (overlay) file under /etc/systemd....
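For illustration, such a drop-in could look like this (the path and values are assumptions, not tested recommendations):

```ini
# /etc/systemd/system/crio.service.d/99-limits.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
```

After creating the file, `systemctl daemon-reload` followed by `systemctl restart crio` would apply it, and the override survives package updates because it lives in /etc rather than /usr/lib.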

svollath commented 3 years ago

For now, the conclusion is to document only the adjustments related to pids_limit.

For CaaSP-4.2.X:

For CaaSP-4.5.X:

For all of these settings, we need to be sure that the changes take effect (e.g. by restarting crio) and persist across boots and updates (i.e. that they aren't lost or overwritten by skuba). => To be answered by the CaaSP team.

In addition, some systemd tunings made CaaSP (CAP/CATs) more performant and stable, although they are still experimental. The common tunings for 4.2.X and 4.5.X would be:

"sudo bash -c \"echo 'DefaultTimeoutStartSec=300s' >> /etc/systemd/system.conf\""
"sudo bash -c \"echo 'DefaultTimeoutStopSec=5s' >> /etc/systemd/system.conf\""
"sudo bash -c \"echo 'DefaultStartLimitIntervalSec=1s' >> /etc/systemd/system.conf\""
"sudo bash -c \"echo 'DefaultTasksMax=infinity' >> /etc/systemd/system.conf\""
"sudo bash -c \"echo 'DefaultTasksAccounting=no' >> /etc/systemd/system.conf\""

I would not suggest documenting those for now.

saschagrunert commented 3 years ago

Hey, raising pids_limit just a bit seems fine, as long as you can guarantee that the machine does not run out of available process IDs. Generally I can only recommend keeping it as low as possible. Raising the PIDs limit decreases security, because fork bombs could attack a worker node. This applies to the Kubernetes pod and node PID limits as well as the CRI-O configuration.

We have already increased the default pids_limit multiple times in CRI-O; having several thousand PIDs inside a container seems like an application configuration or delivery issue. Is it possible to configure the number of processes created inside the container so that the workload can scale the Kubernetes way?

/cc @rhafer

svollath commented 3 years ago

When I set pids_limit to "max" and run CATs, several containers peak at more than 2500 PIDs. Ultimately this is related to the design of our diego-cell pod, which runs containers inside containers. The diego-cell is expected to start containers, currently limited (more or less) only by available memory and disk space. Calculating the "needed" PIDs is hard: it depends on the possible number of apps and on the code of the apps themselves, and also on the trade-off between a fixed hard limit on PIDs and the actual sum of PIDs in use at any moment (which limits throughput and causes timeouts) - it may be too hard to calculate all of that for a diego-cell.

saschagrunert commented 3 years ago

I see. I was wondering whether we can set the PIDs limit only for that specific pod, but this seems not possible with the current set of features. Maybe raising the limit to 3072 would be enough?

satadruroy commented 3 years ago

Thanks for your inputs @saschagrunert - are you saying it's not currently possible to limit the PIDs for a specific pod? The customer doesn't want to increase this across the board (rightly so, it seems), and if we cannot do it at the pod level, would it be advisable for them to run these pods on dedicated nodes with higher PID limits?

Finally, in terms of how we increase this limit in a boot-persistent way, would the kubelet pid limit setting override the cri-o pid limit settings? I was looking at https://github.com/cri-o/cri-o/issues/1921 and there was a suggestion at the end to configure this at the kubelet level for the node.

saschagrunert commented 3 years ago

Thanks for your inputs @saschagrunert - are you saying it's not currently possible to limit the PIDs for a specific pod? The customer doesn't want to increase this across the board (rightly so, it seems), and if we cannot do it at the pod level, would it be advisable for them to run these pods on dedicated nodes with higher PID limits?

Yes, there is right now no support to assign a PID limit to a specific workload, only to all of them.

Finally, in terms of how we increase this limit in a boot-persistent way, would the kubelet pid limit setting override the cri-o pid limit settings? I was looking at cri-o/cri-o#1921 and there was a suggestion at the end to configure this at the kubelet level for the node.

Technically, the PID limit in CRI-O gets passed down to the OCI runtime (runc), which assigns the limit via the TasksMax systemd option; that in turn maps to the pids.max cgroup attribute. The kubelet enforces the PID limit directly via the PID cgroup; it does not pass it to the container runtime in any way. I expect the two settings to overwrite each other, so I recommend keeping them in sync.
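To illustrate keeping the two in sync, a node could pair the CRI-O setting with the kubelet's podPidsLimit field (the value 3072 is only an example taken from this thread):

```yaml
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 3072
```

together with the matching `pids_limit = 3072` in the CRI-O configuration, so neither enforcement path silently undercuts the other.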

Martin-Weiss commented 3 years ago

Just for a general understanding - how many PIDs can be used on SLES in general? As we support 110 pods per worker, 32k PIDs would allow running 3,604,480 PIDs within containers, plus the PIDs on the OS. Would that be a problem?
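For reference, that number is just the per-pod limit multiplied out (a quick sanity check of the worst case, not a capacity claim):

```python
# Hypothetical worst case: every pod on a worker exhausts its PID limit.
pods_per_worker = 110
pids_per_pod = 32768  # 32k limit under discussion

print(pods_per_worker * pids_per_pod)  # 3604480
```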

saschagrunert commented 3 years ago

Just for a general understanding - how many PIDs can be used on SLES in general?

Most Linux systems have the maximum PID set to 32768.

As we support 110 pods per worker, 32k PIDs would allow running 3,604,480 PIDs within containers, plus the PIDs on the OS. Would that be a problem?

We have to take into account the other processes required to run a Kubernetes node, plus leave some room for other applications users might run.

rhafer commented 3 years ago

Just for a general understanding - how many PIDs can be used on SLES in general?

Most Linux systems have the maximum PID set to 32768

It's possible to increase that limit via the kernel.pid_max sysctl knob, though I am not sure what maximum value we support on SLES.
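As an illustration, such a change could be made boot-persistent via a sysctl drop-in (the value is an example; 4194304 is the kernel's compile-time ceiling on 64-bit, but check what SLES supports before raising it):

```
# /etc/sysctl.d/90-pid-max.conf
kernel.pid_max = 4194304
```

Applying it without a reboot would then be `sudo sysctl --system`.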

gaktive commented 3 years ago

At this time, would it be safe to say we can advise CAP customers to set pids_limit (and the corresponding cgroup limits) to 3072? There would be caveats indicating that, depending on the number of apps deployed, that number may need to go higher, but that setting it to the maximum would introduce a potential vulnerability.

Otherwise, is there a way to set this on CaaSP so that it persists across node or other restarts?

mjura commented 3 years ago

Regarding the CaaSP v4.2 crio configuration, we will prepare a fix for crio.conf that includes a higher pids_limit.

@gaktive In my opinion, comment https://github.com/SUSE/doc-cap/issues/1031#issuecomment-734385508 explains this change for CaaSP quite well. We will also prepare a CaaSP update that persists this change.

gaktive commented 3 years ago

Thanks @mjura, especially for the feature request for the persistent change.

@btat we should have enough to write up something here in the docs. Let me know if you need help with wording.