DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Dev Cluster: "failed to create fsnotify watcher: too many open files" #46

Closed: artntek closed this issue 1 month ago

artntek commented 7 months ago

On the dev cluster, I'm unable to see logs in times of heavy cluster usage (specifically 100 metadig workers and 20 indexer workers). The log output is truncated with the message failed to create fsnotify watcher: too many open files

brooke@magnum-pi.local:~ $ kubectl logs -f metacatknb-0
# [...some log lines...]
128.111.85.143 - - [16/Apr/2024:18:16:52 +0000] "GET /metacat/ HTTP/1.1" 200 82
128.111.85.143 - - [16/Apr/2024:18:16:52 +0000] "GET /metacat/admin HTTP/1.1" 200 2314
failed to create fsnotify watcher: too many open files%
brooke@magnum-pi.local:~ $ 

See this discussion: https://serverfault.com/questions/1137211/failed-to-create-fsnotify-watcher-too-many-open-files

(I don't think we should max them out - see discussion - but increasing them "sensibly" would be good)
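
A quick way to see what is actually consuming inotify instances on a node (the limit this error points at) is to count the inotify file descriptors per process under /proc. This is only a diagnostic sketch, run as root on the affected node rather than inside a pod:

# count open inotify instances per process, busiest first
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head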

nickatnceas commented 7 months ago

Looks like K8s has an internal way of handling this, but the sysctls we need to change here may not be among the ones it treats as safe. I can also change the settings on all the node VMs outside of K8s.

https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
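
If we go the node-VM route, the values could be bumped transiently first to confirm they fix the log streaming before anything is made persistent. Sketch only; the numbers below are placeholders, not a recommendation:

# raise the per-user inotify limits on a node until its next reboot
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288

# confirm the live values
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches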

nickatnceas commented 7 months ago

When we were running LXD containers, all our hosts were configured with these settings: https://documentation.ubuntu.com/lxd/en/latest/reference/server_settings/

nickatnceas commented 7 months ago

I'm going to change these settings on all the K8s nodes, selectively taken from the LXD docs:

Add /etc/security/limits.d/k8s-limits.conf

*               soft    nofile          1048576
*               hard    nofile          1048576
root            soft    nofile          1048576
root            hard    nofile          1048576

Add /etc/sysctl.d/20-k8s.conf

fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576

The docs recommend rebooting the VMs after making the changes.
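
For reference, the sysctl file can also be loaded without a reboot, although the limits.conf change only affects new login sessions (and services only after they restart), which is why the reboot is the cleaner option. A sketch only:

# reload all /etc/sysctl.d/*.conf files, including the new 20-k8s.conf
sudo sysctl --system

# spot-check one of the values
sysctl fs.inotify.max_user_instances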

nickatnceas commented 7 months ago

Here are the current settings on k8s-dev-node-1:

outin@k8s-dev-node-1:~$ sudo sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 1004077

outin@k8s-dev-node-1:~$ ulimit -n
1024

root@k8s-dev-node-1:~# ulimit -n
1024

nickatnceas commented 7 months ago

I deployed the two config files to all the K8s VMs via Ansible: https://github.nceas.ucsb.edu/NCEAS/ansible-nceas/blob/master/k8s.yml

Looks like we might need to reboot all the nodes to apply the settings.
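
A quick ad-hoc check that the files landed and what the live values are could look like the following; this is just a sketch, and the inventory group name k8sdev is a placeholder rather than the group actually defined in ansible-nceas:

# show the live inotify instance limit on every node (needs sudo on the targets)
ansible k8sdev -b -m command -a "sysctl fs.inotify.max_user_instances"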

nickatnceas commented 7 months ago

Rebooted k8s-dev-node-1 and verified the changes have been applied:

outin@k8s-dev-node-1:~$ sudo sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576

outin@k8s-dev-node-1:~$ ulimit -n
1048576

root@k8s-dev-node-1:~# ulimit -n
1048576
artntek commented 7 months ago

Thank you, @nickatnceas!

artntek commented 1 month ago

Hi @nickatnceas -- would you mind repeating this process for the prod cluster as soon as you have a spare moment? While trying to get the ADC instance working, I'm running into failed to create fsnotify watcher: too many open files when viewing logs.

Thanks!

nickatnceas commented 1 month ago

I deployed the settings to all K8s nodes, but did not reboot them yet, so the new settings have not taken effect everywhere. I will attempt to reboot all the nodes later today.

outin@k8s-node-1:~$ uptime -s
2024-04-15 20:28:40
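
For the reboots, the usual per-node sequence is roughly the following; a sketch using k8s-node-1 as the example node, not the exact commands being run:

# move workloads off the node, reboot it, then let it schedule pods again
kubectl drain k8s-node-1 --ignore-daemonsets --delete-emptydir-data
ssh k8s-node-1 sudo reboot
# once the node is back and Ready:
kubectl uncordon k8s-node-1
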
nickatnceas commented 1 month ago

I finished rebooting all the K8s nodes.