gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Many Zookeeper connections during interpretation #932

Open MattBlissett opened 1 year ago

MattBlissett commented 1 year ago

During interpretation (currently stage 20, flatMapToPair at GroupNonMergingWindowsFunctions.java:116..., but I don't know if that's relevant) there are a lot of connections from the running tasks to Zookeeper.

Overnight, there were so many connection attempts that ZK began rejecting connections from some nodes, until ZK's limit was increased. The processes then seemed to continue.

To see the number of connections per node (run on each C5 master): netstat -nt | grep 2181 | awk '{ print $5 }' | cut -d : -f 1 | sort | uniq -c
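For anyone who wants to repeat this count, here is a rough Python sketch of the same tally. It assumes the usual Linux `netstat -nt` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and that ZK listens on its default client port 2181. The function name and the sample addresses are made up for illustration. Unlike the plain `grep 2181`, it only counts lines whose local port is 2181, so an unrelated ephemeral port such as 52181 is not counted.

```python
from collections import Counter

ZK_PORT = "2181"  # assumed: ZooKeeper's default client port


def connections_per_client(netstat_output: str, port: str = ZK_PORT) -> Counter:
    """Count TCP connections to the local `port`, grouped by remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header lines and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign = fields[3], fields[4]
        # rpartition splits on the last colon, so the port is always the tail.
        if local.rpartition(":")[2] == port:
            counts[foreign.rpartition(":")[0]] += 1
    return counts


# Hypothetical sample, not real output from the cluster.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:2181           10.0.0.5:40000          ESTABLISHED
tcp        0      0 10.0.0.1:2181           10.0.0.5:40001          ESTABLISHED
tcp        0      0 10.0.0.1:2181           10.0.0.6:40002          ESTABLISHED
tcp        0      0 10.0.0.1:22             10.0.0.7:52181          ESTABLISHED
"""

for host, n in connections_per_client(sample).most_common():
    print(n, host)
```

On a master this could be fed with the output of `netstat -nt`, for example via `subprocess.run(["netstat", "-nt"], capture_output=True, text=True).stdout`.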

To list them for one node (save the output to a file): netstat -nt | grep 130.225.43.59 | awk '{print $5}'

On that node, to see which processes hold the connections: netstat -npt | grep -f z1 -f z2 -f z3 | awk '{ print $7 }' | grep java | sort | uniq -c

      4 34743/java
     18 35902/java
     20 35903/java
     27 35904/java
     24 35905/java
      3 38503/java
      5 38506/java
      5 60123/java
      8 60184/java
      5 61331/java
      6 62295/java
      8 69584/java
      1 71466/java
     28 72194/java
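A per-process tally like the one above could also be produced with a small Python sketch. Note that it swaps in a slightly different technique: instead of grepping for the endpoint lists saved from the masters, it matches the foreign address against a set of ZK hosts and port 2181. It assumes the Linux `netstat -npt` layout, where the seventh column is "PID/Program name". The function name, host addresses, and sample lines are hypothetical.

```python
from collections import Counter


def connections_per_process(netstat_output, zk_hosts, port="2181", program="java"):
    """Count connections to any of `zk_hosts` on `port`, grouped by PID/program."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # `netstat -npt` socket lines have at least 7 columns.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        host, _, remote_port = fields[4].rpartition(":")
        owner = fields[6]  # e.g. "35902/java", or "-" without privileges
        if host in zk_hosts and remote_port == port and owner.endswith("/" + program):
            counts[owner] += 1
    return counts


# Hypothetical sample, not real output from the cluster.
sample = """\
tcp        0      0 10.0.0.5:40000   10.0.0.1:2181   ESTABLISHED 35902/java
tcp        0      0 10.0.0.5:40001   10.0.0.1:2181   ESTABLISHED 35902/java
tcp        0      0 10.0.0.5:40002   10.0.0.2:2181   ESTABLISHED 35903/java
tcp        0      0 10.0.0.5:40003   10.0.0.9:2181   ESTABLISHED 40000/java
tcp        0      0 10.0.0.5:40004   10.0.0.1:22     ESTABLISHED 41000/ssh
"""

zk = {"10.0.0.1", "10.0.0.2"}  # assumed ZK server addresses
for owner, n in sorted(connections_per_process(sample, zk).items()):
    print(n, owner)
```

Running it periodically during interpretation would show whether the count per PID keeps growing or levels off, which is the distinction that matters for the question of whether connections are being reused.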

"ps" shows that processes 35902-35905 are (or were) running interpretation for https://registry.gbif.org/dataset/67fabcac-a638-40a6-9bea-aeca8aced9f1. Overnight it was iNaturalist, with the same result. ZK clients see "Too many connections" exceptions.

This needs reviewing to see how many connections are expected. Is something (KVS and its HBase connection?) not being closed or reused correctly?

Process 72194 is the HDFS data node, which also has more connections than I would expect. These seem to end when the interpretation step completes.