dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
285 stars 136 forks source link

[prometheus] ZK connection problem leads to domain disappearing #3790

Open paulmillar opened 6 years ago

paulmillar commented 6 years ago

Prometheus Build 1421 failed due to the inability to upload data. The failed uploads where logged as:

31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iLm9g] Internal server error: org.dcache.webdav.WebDavException: Unable to reserve space: no linkgroups configured.
31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iSz5g] Unable to reserve space: no linkgroups configured.

This, in turn, was cause by SpaceManager believing there are no pools contributing to the linkgroup:

[prometheus] (local) admin > \s SpaceManager ls link groups
FLAGS   CNT RESVD AVAIL FREE          UPDATED NAME
-orc:no 0   0 +   0 =   0 2018-01-31 10:33:39 default-linkgroup

[prometheus] (local) admin > 

On prometheus, there are two pools (pool1 and pool2) both in the pools domain. PoolManager does not know about these pools:

[prometheus] (local) admin > \sp psu ls pool
[prometheus] (local) admin > 

The pools domain is running:

[root@prometheus ~]# dcache status
DOMAIN       STATUS                PID   USER   LOG                              
dCacheDomain running (for 4 hours) 32324 dcache /var/log/dcache/dCacheDomain.log 
nfsServer    running (for 4 hours) 32384 dcache /var/log/dcache/nfsServer.log    
pools        running (for 4 hours) 32439 dcache /var/log/dcache/pools.log        
[root@prometheus ~]# 

However, the two pools are no available through the admin interface, and topo does not show pools domain.

[root@prometheus ~]# ssh -p 22224 admin@localhost
dCache (4.1.0-SNAPSHOT)
Type "\?" for help.

[prometheus] (local) admin > \c pool1
(1) Cell does not exist.
[prometheus] (local) admin > \c pool2
(1) Cell does not exist.
[prometheus] (local) admin > \c System@pools
(1) Cell does not exist.
[prometheus] (local) admin > \s topo ls
dCacheDomain

[prometheus] (local) admin > 

The topo should show the dCacheDomain, pools and nfsServer domains.

The dCacheDomain log file shows (edited for brevity):

2018-01-31 06:01:05 Launching /bin/java -[...]BootLoader start dCacheDomain
[...]
31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
[... similar connection errors ...]
31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished
31 Jan 2018 06:02:30 (lm) [] 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na]
        at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
        at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

These initial ZooKeeper connection to localhost/127.0.0.1:2181 failed messages are annoying but (I believe) not the cause of the problem. They are present when dCache is working correctly and are caused by the System cell attempts to connect to ZK before the embedded ZK is started. This connection blocks for ~1 minute before returning, with the ZK client retrying in the background, as mentioned in the message

31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished

Later on, the message version of ZooKeeper being used indicates that the client was eventually able to connect to ZK:

31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

The stack-trace is interesting and new. This indicates that the PathChildrenCache is no longer maintained; for example, if a domain suffers this problem and a core domain restarts (or starts up) then the satellite would never learn of the core domain appearing, so would never initiate a fresh connection to that domain.

Both the nfsServer and pools domains contain a similar stack-trace:

For pools domain:

2018-01-31 06:01:07 Launching /bin/java -server [...] BootLoader start pools
[...]
31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
[...]
31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na]
        at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
        at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]

and for nfsServer domain:

2018-01-28 06:01:02 Launching /bin/java -server [...]BootLoader start nfsServer
[...]
31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
[...]
31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect.
31 Jan 2018 06:02:31 (lm) [] 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1]
        at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na]
        at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
        at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
kofemann commented 6 years ago

It will be useful next time to make a thread dump to see is it the missing extra executor trigger this issue.

-kofemann /* caffeinated mutations of the core personality /

On Wed, Jan 31, 2018 at 10:39 AM, Paul Millar notifications@github.com wrote:

Prometheus Build 1421 https://ci.dcache.org/view/dCache%20master/job/Deploy%20on%20prometheus/1421/ failed due to the inability to upload data. The failed uploads where logged as:

31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iLm9g] Internal server error: org.dcache.webdav.WebDavException: Unable to reserve space: no linkgroups configured. 31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iSz5g] Unable to reserve space: no linkgroups configured.

This, in turn, was cause by SpaceManager believing there are no pools contributing to the linkgroup:

[prometheus] (local) admin > \s SpaceManager ls link groups FLAGS CNT RESVD AVAIL FREE UPDATED NAME -orc:no 0 0 + 0 = 0 2018-01-31 10:33:39 default-linkgroup

[prometheus] (local) admin >

On prometheus, there are two pools (pool1 and pool2) both in the pools domain. PoolManager does not know about these pools:

[prometheus] (local) admin > \sp psu ls pool [prometheus] (local) admin >

The pools domain is running:

[root@prometheus ~]# dcache status DOMAIN STATUS PID USER LOG dCacheDomain running (for 4 hours) 32324 dcache /var/log/dcache/dCacheDomain.log nfsServer running (for 4 hours) 32384 dcache /var/log/dcache/nfsServer.log pools running (for 4 hours) 32439 dcache /var/log/dcache/pools.log [root@prometheus ~]#

However, the two pools are no available through the admin interface, and topo does not show pools domain.

[root@prometheus ~]# ssh -p 22224 admin@localhost dCache (4.1.0-SNAPSHOT) Type "\?" for help.

[prometheus] (local) admin > \c pool1 (1) Cell does not exist. [prometheus] (local) admin > \c pool2 (1) Cell does not exist. [prometheus] (local) admin > \c System@pools (1) Cell does not exist. [prometheus] (local) admin > \s topo ls dCacheDomain

[prometheus] (local) admin >

The topo should show the dCacheDomain, pools and nfsServer domains.

The dCacheDomain log file shows (edited for brevity):

2018-01-31 06:01:05 Launching /bin/java -[...]BootLoader start dCacheDomain [...] 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. [... similar connection errors ...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished 31 Jan 2018 06:02:30 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161] 31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

These initial ZooKeeper connection to localhost/127.0.0.1:2181 failed messages are annoying but (I believe) not the cause of the problem. They are present when dCache is working correctly and are caused by the System cell attempts to connect to ZK before the embedded ZK is started. This connection blocks for ~1 minute before returning, with the ZK client retrying in the background, as mentioned in the message

31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished

Later on, the message version of ZooKeeper being used indicates that the client was eventually able to connect to ZK:

31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

The stack-trace is interesting and new. This indicates that the PathChildrenCache is no longer maintained; for example, if a domain suffers this problem and a core domain restarts (or starts up) then the satellite would never learn of the core domain appearing, so would never initiate a fresh connection to that domain.

Both the nfsServer and pools domains contain a similar stack-trace:

For pools domain:

2018-01-31 06:01:07 Launching /bin/java -server [...] BootLoader start pools [...] 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. [...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]

and for nfsServer domain:

2018-01-28 06:01:02 Launching /bin/java -server [...]BootLoader start nfsServer [...] 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. [...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dCache/dcache/issues/3790, or mute the thread https://github.com/notifications/unsubscribe-auth/AAjJ3ep5jEcT6kbeeVMVzQ-PbWk0dP_Iks5tQDTYgaJpZM4RzrSf .

paulmillar commented 6 years ago

It crashed again this morning, so I was able to capture the thread-dumps.

Thread dump of dCacheDomain: the core domain Thread dump of nfsServer: a satellite domain Thread dump of pools: a satellite domain