contiv / ansible

ansible scripts for contiv cluster

Need to increase the number of max files that can be open. #166

Closed vvb closed 8 years ago

vvb commented 8 years ago

The current limits are too low: we can't even launch 2k containers before netplugin and volplugin run out of fds.

vvb commented 8 years ago
[admin@ucsb-blade2 compose]$ sudo systemctl status etcd
● etcd.service - Etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2016-04-18 09:56:41 EDT; 24min ago
  Process: 688257 ExecStopPost=/usr/bin/etcd.sh post-stop (code=exited, status=0/SUCCESS)
  Process: 688238 ExecStop=/usr/bin/etcd.sh stop (code=exited, status=0/SUCCESS)
 Main PID: 135370 (code=exited, status=1/FAILURE)

Apr 18 09:56:13 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:13 etcdserver: 80% of the file descriptor limit is used [used = 825, limit = 1024]
Apr 18 09:56:18 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:18 etcdserver: 80% of the file descriptor limit is used [used = 883, limit = 1024]
Apr 18 09:56:23 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:23 etcdserver: 80% of the file descriptor limit is used [used = 928, limit = 1024]
Apr 18 09:56:28 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:28 etcdserver: 80% of the file descriptor limit is used [used = 982, limit = 1024]
Apr 18 09:56:33 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:33 etcdserver: cannot monitor file descriptor usage (open /proc/self/fd: too many open files)
Apr 18 09:56:40 ucsb-blade2 etcd.sh[135370]: 2016/04/18 09:56:40 etcdserver: failed to purge wal file open /var/lib/etcd/member/wal: too many open files
mapuri commented 8 years ago

@vvb

Interesting issue. Just curious: how does container scale tie into the FD limit in etcd? I hope we are not leaking FDs.

We can perhaps do something as suggested here to set up ulimit for the systemd units.
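
For systemd units, one way to raise the open-file limit is a drop-in file that sets `LimitNOFILE`. This is a minimal sketch, assuming the unit is `etcd.service` as in the status output above. The helper name `write_nofile_dropin`, the file name `limits.conf`, and the value 65536 are illustrative assumptions, not something the playbooks are known to ship.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that raises the open-file
# limit for one unit.
# Arguments: unit name, limit, optional systemd config root.
write_nofile_dropin() {
    unit="$1"
    limit="$2"
    root="${3:-/etc/systemd/system}"
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # LimitNOFILE sets RLIMIT_NOFILE for processes started by the unit.
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
}

# Example (needs root; 65536 is an assumed value, tune it for the workload):
#   write_nofile_dropin etcd.service 65536
#   systemctl daemon-reload && systemctl restart etcd
```

Raising the limit only buys headroom. If descriptors are being leaked, the higher ceiling will eventually be reached as well.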

vvb commented 8 years ago

@mapuri I am trying to figure it out myself, but there definitely seems to be something. Look at the results below: 690752 is the pid of etcd. compose.sh launches 20 new containers, and the fd counts for netplugin and etcd each jump by about 40.

[admin@ucsb-blade2 compose]$ sudo lsof -p `pidof netplugin` | wc -l
32
[admin@ucsb-blade2 compose]$ sudo lsof -p 690752 | wc -l
28
[admin@ucsb-blade2 compose]$ ./compose.sh 10
Creating and starting 1 ... done
Creating and starting 2 ... done
Creating and starting 3 ... done
Creating and starting 4 ... done
Creating and starting 5 ... done
Creating and starting 6 ... done
Creating and starting 7 ... done
Creating and starting 8 ... done
Creating and starting 9 ... done
Creating and starting 10 ... done
Creating and starting 1 ... done
Creating and starting 2 ... done
Creating and starting 3 ... done
Creating and starting 4 ... done
Creating and starting 5 ... done
Creating and starting 6 ... done
Creating and starting 7 ... done
Creating and starting 8 ... done
Creating and starting 9 ... done
Creating and starting 10 ... done
[admin@ucsb-blade2 compose]$ sudo lsof -p 690752 | wc -l
70
[admin@ucsb-blade2 compose]$ sudo lsof -p `pidof netplugin` | wc -l
74
[admin@ucsb-blade2 compose]$

All of these are new:

etcd    690752 root   19u     IPv6           16197181      0t0       TCP localhost:newoak->localhost:53102 (ESTABLISHED)
etcd    690752 root   20u     IPv6           16707539      0t0       TCP localhost:newoak->localhost:53065 (ESTABLISHED)
etcd    690752 root   21u     IPv6           16204524      0t0       TCP localhost:newoak->localhost:53068 (ESTABLISHED)
etcd    690752 root   22u     IPv6           16303736      0t0       TCP localhost:newoak->localhost:53108 (ESTABLISHED)
etcd    690752 root   23u     IPv6           16487921      0t0       TCP localhost:newoak->localhost:53091 (ESTABLISHED)
etcd    690752 root   24u     IPv6           16717349      0t0       TCP localhost:newoak->localhost:53095 (ESTABLISHED)
etcd    690752 root   25u     IPv6           16487934      0t0       TCP localhost:newoak->localhost:53096 (ESTABLISHED)
etcd    690752 root   26u     IPv6           16487936      0t0       TCP localhost:newoak->localhost:53097 (ESTABLISHED)
etcd    690752 root   27u     IPv6           16724060      0t0       TCP localhost:newoak->localhost:53118 (ESTABLISHED)
etcd    690752 root   28u     IPv6           16717353      0t0       TCP localhost:newoak->localhost:53100 (ESTABLISHED)
etcd    690752 root   29u     IPv6           16717355      0t0       TCP localhost:newoak->localhost:53101 (ESTABLISHED)
etcd    690752 root   30u     IPv6           16197183      0t0       TCP localhost:newoak->localhost:53103 (ESTABLISHED)
etcd    690752 root   31u     IPv6           16200260      0t0       TCP localhost:newoak->localhost:53111 (ESTABLISHED)
etcd    690752 root   32u     IPv6           16717359      0t0       TCP localhost:newoak->localhost:53106 (ESTABLISHED)
etcd    690752 root   33u     IPv6           16717361      0t0       TCP localhost:newoak->localhost:53107 (ESTABLISHED)
etcd    690752 root   34u     IPv6           16551575      0t0       TCP ucs-blade2.cisco.com:2379->ucs-blade2.cisco.com:33866 (ESTABLISHED)
etcd    690752 root   35u     IPv6           16681257      0t0       TCP localhost:newoak->localhost:53110 (ESTABLISHED)
etcd    690752 root   36u     IPv6           16200262      0t0       TCP localhost:newoak->localhost:53112 (ESTABLISHED)
etcd    690752 root   37u     IPv6           16204665      0t0       TCP localhost:newoak->localhost:53113 (ESTABLISHED)
etcd    690752 root   38u     IPv6           16720687      0t0       TCP localhost:newoak->localhost:53114 (ESTABLISHED)
etcd    690752 root   39u     IPv6           16720689      0t0       TCP localhost:newoak->localhost:53115 (ESTABLISHED)
etcd    690752 root   40u     IPv6           16738318      0t0       TCP localhost:newoak->localhost:53130 (ESTABLISHED)
etcd    690752 root   41u     IPv6           16720693      0t0       TCP localhost:newoak->localhost:53117 (ESTABLISHED)
etcd    690752 root   42u     IPv6           16724062      0t0       TCP localhost:newoak->localhost:53119 (ESTABLISHED)
etcd    690752 root   43u     IPv6           16738320      0t0       TCP localhost:newoak->localhost:53131 (ESTABLISHED)
etcd    690752 root   44u     IPv6           16199305      0t0       TCP localhost:newoak->localhost:53122 (ESTABLISHED)
etcd    690752 root   45u     IPv6           16197262      0t0       TCP localhost:newoak->localhost:53136 (ESTABLISHED)
etcd    690752 root   46u     IPv6           16738322      0t0       TCP localhost:newoak->localhost:53132 (ESTABLISHED)
etcd    690752 root   47u     IPv6           16740486      0t0       TCP localhost:newoak->localhost:53158 (ESTABLISHED)
etcd    690752 root   48u     IPv6           16726381      0t0       TCP localhost:newoak->localhost:53148 (ESTABLISHED)
etcd    690752 root   49u     IPv6           16738326      0t0       TCP localhost:newoak->localhost:53134 (ESTABLISHED)
etcd    690752 root   50u     IPv6           16725053      0t0       TCP localhost:newoak->localhost:53135 (ESTABLISHED)
etcd    690752 root   51u     IPv6           16197264      0t0       TCP localhost:newoak->localhost:53137 (ESTABLISHED)
etcd    690752 root   52u     IPv6           16736533      0t0       TCP localhost:newoak->localhost:53143 (ESTABLISHED)
etcd    690752 root   53u     IPv6           16197268      0t0       TCP localhost:newoak->localhost:53140 (ESTABLISHED)
etcd    690752 root   54u     IPv6           16303802      0t0       TCP localhost:newoak->localhost:53142 (ESTABLISHED)
etcd    690752 root   55u     IPv6           16735486      0t0       TCP localhost:newoak->localhost:53144 (ESTABLISHED)
etcd    690752 root   56u     IPv6           16638713      0t0       TCP localhost:newoak->localhost:53154 (ESTABLISHED)
etcd    690752 root   57u     IPv6           16736535      0t0       TCP localhost:newoak->localhost:53146 (ESTABLISHED)
etcd    690752 root   58u     IPv6           16199350      0t0       TCP localhost:newoak->localhost:53147 (ESTABLISHED)
etcd    690752 root   59u     IPv6           16740480      0t0       TCP localhost:newoak->localhost:53149 (ESTABLISHED)
etcd    690752 root   61u     IPv6           16740484      0t0       TCP localhost:newoak->localhost:53151 (ESTABLISHED)
etcd    690752 root   62u     IPv6           16638711      0t0       TCP localhost:newoak->localhost:53153 (ESTABLISHED)
etcd    690752 root   63u     IPv6           16638715      0t0       TCP localhost:newoak->localhost:53155 (ESTABLISHED)
etcd    690752 root   64u     IPv6           16638717      0t0       TCP localhost:newoak->localhost:53156 (ESTABLISHED)
etcd    690752 root   65u     IPv6           16638719      0t0       TCP localhost:newoak->localhost:53157 (ESTABLISHED)
etcd    690752 root   67u     IPv6           16740490      0t0       TCP localhost:newoak->localhost:53160 (ESTABLISHED)
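
A note on the measurement: `lsof -p PID | wc -l` also counts the header line and non-descriptor rows such as `cwd`, `txt`, and `mem`, so the absolute numbers overstate the fd count, although the before/after deltas remain meaningful. A sketch of a tighter count on Linux via `/proc` follows. The helper names `fd_count` and `socket_fd_count` are made up for illustration.

```shell
#!/bin/sh
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
# Linux only; reading another user's process requires root.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Count only the descriptors that are sockets.
socket_fd_count() {
    n=0
    for fd in /proc/"$1"/fd/*; do
        case "$(readlink "$fd" 2>/dev/null)" in
            socket:*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Example: compare before and after launching containers.
#   socket_fd_count "$(pidof netplugin)"
```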
mapuri commented 8 years ago

I see, thanks for checking @vvb.

What are these new fds, new TCP sockets? Maybe we are leaking etcd client connections.

vvb commented 8 years ago

The leaks seem to be between netplugin and etcd:

[admin@ucsb-blade2 compose]$ sudo ls -l /proc/48483/fd/28
lrwx------ 1 root root 64 Apr 18 17:09 /proc/48483/fd/28 -> socket:[213809]
[admin@ucsb-blade2 compose]$
[admin@ucsb-blade2 compose]$ sudo lsof | grep 213809
etcd      48483                    root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48485              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48486              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48487              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48488              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48566              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
etcd      48483 48567              root   28u     IPv6             213809       0t0        TCP localhost:newoak->localhost:36302 (ESTABLISHED)
[admin@ucsb-blade2 compose]$
[admin@ucsb-blade2 compose]$ sudo netstat -tnp | grep 36302
tcp        0      0 127.0.0.1:36302         127.0.0.1:4001          ESTABLISHED 51710/netplugin
tcp6       0      0 127.0.0.1:4001          127.0.0.1:36302         ESTABLISHED 48483/etcd
[admin@ucsb-blade2 compose]$
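
The first step of the trace above (resolving an fd to its socket inode) can be scripted for every descriptor of a process. This is a sketch for Linux only; the function names `socket_inode` and `list_socket_inodes` are hypothetical.

```shell
#!/bin/sh
# Extract the inode from a /proc fd link target such as "socket:[213809]".
# Returns non-zero if the target is not a socket.
socket_inode() {
    case "$1" in
        socket:\[*\])
            inode="${1#socket:\[}"
            echo "${inode%\]}"
            ;;
        *) return 1 ;;
    esac
}

# Print "<fd> <inode>" for every socket descriptor of a process.
list_socket_inodes() {
    for path in /proc/"$1"/fd/*; do
        target=$(readlink "$path" 2>/dev/null) || continue
        inode=$(socket_inode "$target") || continue
        echo "${path##*/} $inode"
    done
}
```

Each inode printed can then be looked up in the `lsof` output, and the port it shows in `netstat -tnp`, exactly as in the session above, to find which process holds the other end.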

A potential fix is in https://github.com/contiv/netplugin/pull/325; it still needs to be verified. MasterPostReq creates a new client every time and doesn't seem to close it.

mapuri commented 8 years ago

awesome, thanks for digging in @vvb .

yekaifeng commented 8 years ago

The same happens in my test environment. The number of open sockets between netplugin and etcd keeps increasing as containers are created and deleted. The count increases by 2 after creating a container and by 1 after deleting one, and it stays the same even after one day.

[root@baymax-2 ~]# lsof -p 27857 |wc -l
148
[root@baymax-2 ~]# docker run -d --name centos-c8 --net vlan3096net centos /bin/bash -c 'while true;do echo test;sleep 5;done'
069bb202deb2de6397cc585ec6398a0bf9f6ea764e97ab0cdc0c03086b33ab23
[root@baymax-2 ~]# lsof -p 27857 |wc -l
150
[root@baymax-2 ~]# docker rm -f centos-c8
centos-c8
[root@baymax-2 ~]# lsof -p 27857 |wc -l
151
[root@baymax-2 ~]# ps 27857
PID TTY STAT TIME COMMAND
27857 ? Sl 10:39 /opt/netplugin-v0.1-04-09-2016.02-54-52.UTC/netplugin -plugin-mode docker -vlan-if bond1

mapuri commented 8 years ago

Closing this on the assumption that https://github.com/contiv/netplugin/pull/325 has addressed it.

Please reopen if that is not the case.