Looks like K8s has an internal way of handling this, but it might not be safe for what we need to change here. I can also change the settings on all the node VMs outside of K8s.
https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
When we were running LXD containers, all our hosts were configured with these settings: https://documentation.ubuntu.com/lxd/en/latest/reference/server_settings/
I'm going to change these settings on all the K8s nodes, selectively taken from the LXD docs:
Add /etc/security/limits.d/k8s-limits.conf:
* soft nofile 1048576
* hard nofile 1048576
root soft nofile 1048576
root hard nofile 1048576
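Worth noting: limits.d is applied by PAM at login, so it covers interactive shells but not systemd services such as kubelet or containerd, which get their limits from systemd rather than from limits.d. A quick way to check what a running service actually got (a sketch; assumes kubelet is the process of interest):

sudo cat /proc/$(pidof kubelet)/limits | grep -i 'open files'   # effective soft/hard limits of the running kubelet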
Add /etc/sysctl.d/20-k8s.conf:
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
The docs recommend rebooting the VMs after making the changes.
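If a full reboot isn't convenient, the sysctl half can usually be reloaded in place (a sketch; the nofile limits from limits.d still only take effect for new login sessions):

sudo sysctl --system                          # re-reads /etc/sysctl.d/*.conf, including 20-k8s.conf
sudo sysctl fs.inotify.max_user_instances     # spot-check one value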
Here are the current settings on k8s-dev-node-1:
outin@k8s-dev-node-1:~$ sudo sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 1004077
outin@k8s-dev-node-1:~$ ulimit -n
1024
root@k8s-dev-node-1:~# ulimit -n
1024
I deployed the two config files to all the K8s VMs via Ansible: https://github.nceas.ucsb.edu/NCEAS/ansible-nceas/blob/master/k8s.yml
Looks like we might need to reboot all the nodes to apply the settings.
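To check which nodes have picked up the new values without logging into each one, an ad-hoc run along these lines should work (a sketch; the inventory group name k8s is an assumption):

ansible k8s -b -m command -a 'sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches'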
Rebooted k8s-dev-node-1 and verified the changes have been applied:
outin@k8s-dev-node-1:~$ sudo sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
outin@k8s-dev-node-1:~$ ulimit -n
1048576
root@k8s-dev-node-1:~# ulimit -n
1048576
Thank you, @nickatnceas!
Hi @nickatnceas -- would you mind please repeating this process for the prod cluster, as soon as you have a spare moment? While trying to get the ADC instance working, I'm running into "failed to create fsnotify watcher: too many open files" when trying to view logs.
Thanks!
I deployed the settings to all K8s nodes, but have not yet rebooted everything to make the settings apply. I will attempt to reboot all the nodes later today.
outin@k8s-node-1:~$ uptime -s
2024-04-15 20:28:40
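To spot nodes that still need a reboot, comparing boot times across the cluster works the same way (again, the inventory group name is an assumption):

ansible k8s -m command -a 'uptime -s'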
I finished rebooting all the K8s nodes.
On the dev cluster, I'm unable to see logs in times of heavy cluster usage (specifically 100 metadig workers and 20 indexer workers). The log is truncated with the message
failed to create fsnotify watcher: too many open files
See this discussion: https://serverfault.com/questions/1137211/failed-to-create-fsnotify-watcher-too-many-open-files
(I don't think we should max them out - see discussion - but increasing them "sensibly" would be good)
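For context, this error usually comes from hitting the per-user fs.inotify.max_user_instances limit (128 by default, as shown on the node above) rather than the plain nofile limit. A rough sketch for seeing which processes on a node hold the most inotify instances:

# count open inotify instances per PID; each matching fd symlink points at anon_inode:inotify
sudo find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn | head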