During interpretation (currently stage 20, flatMapToPair at GroupNonMergingWindowsFunctions.java:116..., but I don't know if that's relevant) there are a lot of connections from the running tasks to Zookeeper.
Overnight, there were so many connection attempts that ZK began rejecting connections from some nodes, until ZK's limit was increased. The processes then seemed to continue.
Seeing connections per node (run on each C5 master):
netstat -nt | grep 2181 | awk '{ print $5 }' | cut -d : -f 1 | sort | uniq -c
Listing them for one node (output put in a file on each master):
netstat -nt | grep 130.225.43.59 | awk '{print $5}'
From that node, seeing the process with the connections (z1, z2 and z3 being the files saved in the previous step):
netstat -npt | grep -f z1 -f z2 -f z3 | awk '{ print $7 }' | grep java | sort | uniq -c
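A possible tightening of the first command, as a sketch only: `grep 2181` also matches any line where 2181 appears in an ephemeral port or an IP address, so the counts can be slightly inflated. The helper below (the name `count_by_remote_host` is made up for illustration) reads `netstat -nt` output on stdin and counts only lines whose local port is exactly the given port. It assumes the usual Linux net-tools column layout (local address in column 4, foreign address in column 5) and does not filter by TCP state.

```shell
# Hypothetical helper: count connections to a local port, grouped by remote host.
# Reads `netstat -nt` style output on stdin; the port defaults to 2181.
count_by_remote_host() {
  port="${1:-2181}"
  awk -v port="$port" '
    # Only tcp/tcp6 lines whose local address ($4) ends in ":<port>"
    $1 ~ /^tcp/ && $4 ~ (":" port "$") {
      host = $5
      sub(/:[0-9]+$/, "", host)   # strip the remote port, keep the address
      n[host]++
    }
    END { for (h in n) print n[h], h }
  ' | sort -rn
}
```

On a master this would be run as `netstat -nt | count_by_remote_host 2181`, giving one "count host" line per client node, busiest first.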
"ps" shows that processes 35902-35905 are/were interpretation for https://registry.gbif.org/dataset/67fabcac-a638-40a6-9bea-aeca8aced9f1. Overnight it was iNaturalist, with the same result. ZK clients see "Too many connections" exceptions.
This needs reviewing to see how many connections are expected. Is something (KVS and its HBase connection?) not being closed or reused correctly?
Process 72194 is the HDFS data node, which also has more connections than I would expect. These seem to end when the interpretation step completes.
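To compare the interpretation JVMs and the data node directly on a worker node, without first collecting the z1..z3 files, something like the following might be enough. It is a sketch under the same assumptions as before (Linux net-tools layout, with the PID/program name in column 7 of `netstat -npt`), and `count_by_pid` is a made-up name. It groups connections by owning process, keeping only those whose foreign port is exactly the given port.

```shell
# Hypothetical helper: count outgoing connections to a remote port, grouped by PID/program.
# Reads `netstat -npt` style output on stdin; the port defaults to 2181.
count_by_pid() {
  port="${1:-2181}"
  awk -v port="$port" '
    # Only tcp/tcp6 lines whose foreign address ($5) ends in ":<port>"
    $1 ~ /^tcp/ && $5 ~ (":" port "$") { n[$7]++ }
    END { for (p in n) print n[p], p }
  ' | sort -rn
}
```

Run as `netstat -npt | count_by_pid 2181` (as root, so that PIDs of other users' processes are shown). Repeating it a few times during the interpretation stage would show whether the per-process count keeps climbing, which would point at connections not being closed or reused, or levels off.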