Open paulmillar opened 6 years ago
It will be useful next time to make a thread dump to see is it the missing extra executor trigger this issue.
-kofemann /* caffeinated mutations of the core personality /
On Wed, Jan 31, 2018 at 10:39 AM, Paul Millar notifications@github.com wrote:
Prometheus Build 1421 https://ci.dcache.org/view/dCache%20master/job/Deploy%20on%20prometheus/1421/ failed due to the inability to upload data. The failed uploads where logged as:
31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iLm9g] Internal server error: org.dcache.webdav.WebDavException: Unable to reserve space: no linkgroups configured. 31 Jan 2018 06:04:05 (webdav-secure-web) [door:webdav-secure-web@dCacheDomain:AAVkC2iSz5g] Unable to reserve space: no linkgroups configured.
This, in turn, was cause by SpaceManager believing there are no pools contributing to the linkgroup:
[prometheus] (local) admin > \s SpaceManager ls link groups FLAGS CNT RESVD AVAIL FREE UPDATED NAME -orc:no 0 0 + 0 = 0 2018-01-31 10:33:39 default-linkgroup
[prometheus] (local) admin >
On prometheus, there are two pools (pool1 and pool2) both in the pools domain. PoolManager does not know about these pools:
[prometheus] (local) admin > \sp psu ls pool [prometheus] (local) admin >
The pools domain is running:
[root@prometheus ~]# dcache status DOMAIN STATUS PID USER LOG dCacheDomain running (for 4 hours) 32324 dcache /var/log/dcache/dCacheDomain.log nfsServer running (for 4 hours) 32384 dcache /var/log/dcache/nfsServer.log pools running (for 4 hours) 32439 dcache /var/log/dcache/pools.log [root@prometheus ~]#
However, the two pools are no available through the admin interface, and topo does not show pools domain.
[root@prometheus ~]# ssh -p 22224 admin@localhost dCache (4.1.0-SNAPSHOT) Type "\?" for help.
[prometheus] (local) admin > \c pool1 (1) Cell does not exist. [prometheus] (local) admin > \c pool2 (1) Cell does not exist. [prometheus] (local) admin > \c System@pools (1) Cell does not exist. [prometheus] (local) admin > \s topo ls dCacheDomain
[prometheus] (local) admin >
The topo should show the dCacheDomain, pools and nfsServer domains.
The dCacheDomain log file shows (edited for brevity):
2018-01-31 06:01:05 Launching /bin/java -[...]BootLoader start dCacheDomain [...] 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. [... similar connection errors ...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished 31 Jan 2018 06:02:30 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161] 31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
These initial ZooKeeper connection to localhost/127.0.0.1:2181 failed messages are annoying but (I believe) not the cause of the problem. They are present when dCache is working correctly and are caused by the System cell attempts to connect to ZK before the embedded ZK is started. This connection blocks for ~1 minute before returning, with the ZK client retrying in the background, as mentioned in the message
31 Jan 2018 06:02:30 (lm) [] Failed to connect to zookeeper, using mode none until connection reestablished
Later on, the message version of ZooKeeper being used indicates that the client was eventually able to connect to ZK:
31 Jan 2018 06:02:33 (System) [] The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
The stack-trace is interesting and new. This indicates that the PathChildrenCache is no longer maintained; for example, if a domain suffers this problem and a core domain restarts (or starts up) then the satellite would never learn of the core domain appearing, so would never initiate a fresh connection to that domain.
Both the nfsServer and pools domains contain a similar stack-trace:
For pools domain:
2018-01-31 06:01:07 Launching /bin/java -server [...] BootLoader start pools [...] 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. [...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
and for nfsServer domain:
2018-01-28 06:01:02 Launching /bin/java -server [...]BootLoader start nfsServer [...] 31 Jan 2018 06:01:23 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:01:24 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. [...] 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:29 (System) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/0:0:0:0:0:0:0:1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] ZooKeeper connection to localhost/127.0.0.1:2181 failed (Connection refused), attempting reconnect. 31 Jan 2018 06:02:31 (lm) [] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dcache at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1130) ~[zookeeper-3.4.8.jar:3.4.8--1] at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[curator-client-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194) ~[curator-framework-2.11.1.jar:na] at dmg.cells.zookeeper.CellCuratorFramework.createContainers(CellCuratorFramework.java:265) ~[cells-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT] at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53) ~[curator-framework-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35) ~[curator-recipes-2.11.1.jar:na] at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) ~[curator-recipes-2.11.1.jar:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_161] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dCache/dcache/issues/3790, or mute the thread https://github.com/notifications/unsubscribe-auth/AAjJ3ep5jEcT6kbeeVMVzQ-PbWk0dP_Iks5tQDTYgaJpZM4RzrSf .
It crashed again this morning, so I was able to capture the thread-dumps.
Thread dump of dCacheDomain: the core domain Thread dump of nfsServer: a satellite domain Thread dump of pools: a satellite domain
Prometheus Build 1421 failed due to the inability to upload data. The failed uploads where logged as:
This, in turn, was cause by SpaceManager believing there are no pools contributing to the linkgroup:
On prometheus, there are two pools (
pool1
andpool2
) both in thepools
domain. PoolManager does not know about these pools:The
pools
domain is running:However, the two pools are no available through the admin interface, and topo does not show
pools
domain.The topo should show the
dCacheDomain
,pools
andnfsServer
domains.The dCacheDomain log file shows (edited for brevity):
These initial
ZooKeeper connection to localhost/127.0.0.1:2181 failed
messages are annoying but (I believe) not the cause of the problem. They are present when dCache is working correctly and are caused by theSystem
cell attempts to connect to ZK before the embedded ZK is started. This connection blocks for ~1 minute before returning, with the ZK client retrying in the background, as mentioned in the messageLater on, the message
version of ZooKeeper being used
indicates that the client was eventually able to connect to ZK:The stack-trace is interesting and new. This indicates that the PathChildrenCache is no longer maintained; for example, if a domain suffers this problem and a core domain restarts (or starts up) then the satellite would never learn of the core domain appearing, so would never initiate a fresh connection to that domain.
Both the
nfsServer
andpools
domains contain a similar stack-trace:For
pools
domain:and for
nfsServer
domain: