Open calestyo opened 4 years ago
Anything new on this...? This makes dCache pretty fragile, because every time domains start up, some may not be registered correctly.
This still happens with 7.2.10, and it actually seems to have gotten worse:
CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
DOMAIN *@srm_lcg-lrz-srm
QUEUE srm_lcg-lrz-srm@* *@srm_lcg-lrz-srm
QUEUE SrmManager@* *@srm_lcg-lrz-srm
TOPIC LoginBrokerRequestTopic@* *@srm_lcg-lrz-srm
TOPIC WatchedServiceTopic@* *@srm_lcg-lrz-srm
TOPIC LoginBrokerTopic@* *@srm_lcg-lrz-srm
TOPIC PoolMonitorTopic@* *@srm_lcg-lrz-srm
between our three core domains (core0, core1, core2). Even after all pools had been restarted often enough for each of them to show up correctly in the routing tables, the entries above were still left over.
Why do they show up in core0 and core2, but not in core1?
That SRM door was then also basically broken; it produced an endless stream of these:
Feb 22 20:43:16 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:16+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:37 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:37+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:46 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:46+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:01 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:01+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:07 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:07+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
and it did not recover the broken routing table entry by itself.
After restarting the SRM door, the routing tables were back to what I'd consider OK.
Before today's upgrade (6.0.18 -> 7.2.10), this had only ever happened with pools, never with anything else. Since at least 7.2.10 it also happens to other components.
:-(
I restarted ZK (one node at a time, so that at least two were always running), but still one pool fell out:
Feb 23 01:59:15 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 01:59:15+01:00 (c-core1-AAXYoMmquGg-AAXYoMmtb7g) [core1,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:04 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:04+01:00 (System) [] Invalid config event received: {server.2=lcg-lrz-dcache1.grid.lrz.de:2182:2183:participant, server.1=lcg-lrz-dcache0.grid.lrz.de:2182:2183:participant, server.3=lcg-lrz-dcache2.grid.lrz.de:2182:2183:participant, version=0}
Feb 23 02:00:06 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:06+01:00 (c-core2-AAXYoMmpGlg-AAXYoMms7tA) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:17 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:17+01:00 (c-core2-AAXYpPxn8eA-AAXYpPxoAYE) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
which also led to inconsistencies in the routing tables:
core0 ↔ core1:
DOMAIN *@pool_lcg-lrz-dc27_1
QUEUE lcg-lrz-dc27_1@* *@pool_lcg-lrz-dc27_1
core0 ↔ core2:
DOMAIN *@pool_lcg-lrz-dc27_1
QUEUE lcg-lrz-dc27_1@* *@pool_lcg-lrz-dc27_1
btw: This is the check I'm using to compare the routing tables: https://github.com/dCache/monitoring/tree/master/monitoring-plugins/check_compare_core_domains_routing_tables
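The idea is roughly the following sketch (not the actual plugin code; in particular, dumping a routing table via ssh to the admin interface and the System cell's "route" command is an assumption here and may differ from what the check really does):
#!/bin/bash
# Compare the routing table of the first core domain given on the command
# line against those of all further core domains, nagios-style.
# Hypothetical usage: compare_routes core0 core1 core2
ADMIN="admin@ADMIN_HOST"    # placeholder for the dCache admin door endpoint
dump_routes() {
    # Assumed admin command for listing a domain's routing table.
    ssh -p 22224 "$ADMIN" "\\s System@$1 route" | sort
}
ref="$1"; shift
dump_routes "$ref" > "/tmp/routes.$ref"
rc=0
for core in "$@"; do
    dump_routes "$core" > "/tmp/routes.$core"
    if ! diff -q "/tmp/routes.$ref" "/tmp/routes.$core" > /dev/null; then
        echo "CRITICAL: Routing tables of $ref and $core differ."
        diff "/tmp/routes.$ref" "/tmp/routes.$core"
        rc=2
    fi
done
[ $rc -eq 0 ] && echo "OK: Routing tables of all core domains match."
exit $rc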
Hi Chris.
Which ZooKeeper version are you running?
Thanks, Lea
Could you verify that each of your ZK servers has the same information about the core nodes?
To do this, you can use a command like:
bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri
I'm assuming you have your ZK nodes configured to use the standard port (2181). Also, NODE should be replaced with the hostname of a ZK node.
The output from this command should list the core domains:
[root@prometheus zookeeper]# bin/zkCli.sh -server localhost:2181 ls /dcache/lm/cores-uri
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[dCacheDomain]
[root@prometheus zookeeper]#
Here, there is just one core domain: dCacheDomain.
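If you run several ZooKeeper servers, you can ask each of them directly and compare the answers, e.g. with a small loop like this (the hostnames are placeholders):
for node in zk1.example.org zk2.example.org zk3.example.org; do
    echo "== $node =="
    bin/zkCli.sh -server "$node:2181" ls /dcache/lm/cores-uri
done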
@lemora
Which ZooKeeper version are you running?
It's 3.5.8-3, i.e. a Debianification of Tigran’s RPM.
@paulmillar
bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri
gives the following on all of them (when querying each of the ZK servers directly):
2022-02-23 13:50:23,560 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1394] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x30004fb1d67009c, negotiated timeout = 30000
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[core0, core1, core2]
(The first INFO log line differs between them, but that shouldn't matter.) This was right after everything in dCache had been stopped and only the core domains (i.e. no pools, no doors, no SRM) were running.
Then I started the rest of dCache, after which my check reported the following differences in routing:
CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
DOMAIN *@pool_lcg-lrz-dc24_6
DOMAIN *@pool_lcg-lrz-dc36_5
QUEUE lcg-lrz-dc24_6@* *@pool_lcg-lrz-dc24_6
QUEUE lcg-lrz-dc36_5@* *@pool_lcg-lrz-dc36_5
core0 ↔ core2:
DOMAIN *@pool_lcg-lrz-dc24_6
DOMAIN *@pool_lcg-lrz-dc52_10
DOMAIN *@pool_lcg-lrz-dc53_3
QUEUE lcg-lrz-dc52_10@* *@pool_lcg-lrz-dc52_10
QUEUE lcg-lrz-dc24_6@* *@pool_lcg-lrz-dc24_6
QUEUE lcg-lrz-dc53_3@* *@pool_lcg-lrz-dc53_3
which then, at least in my experience, never resolve automatically but require restarts of those pools.
However, even with the issue present, the above zkCli command still shows all three core nodes.
You should find a .zookeeper file on the SRM node. The file should have a name like /var/log/dcache/srm_lcg-lrz-srm.zookeeper.
Could you somehow attach an extract from that file?
The extract should start when that domain last started and end at the time you noticed the problem above (core1 missing the SRM domain), i.e. it should cover the period from the domain start until you noticed the problem.
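Since each line of that file starts with a timestamp, one way to cut out such a window is a simple lexicographic comparison (the two timestamps below are placeholders for the domain start and the time the problem was noticed):
awk '$0 >= "2022-02-22 19:00" && $0 <= "2022-02-22 21:00"' \
    /var/log/dcache/srm_lcg-lrz-srm.zookeeper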
If you mean the one from https://github.com/dCache/dcache/issues/5326#issuecomment-1048149959 ... that's unfortunately already gone; I had cleared the files after completely restarting everything.
It now contains just:
2022-02-23 13:47:06,266 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:06,269 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
This is output from a ZooKeeper log of one of the domains that was affected by this earlier today:
# cat pool_lcg-lrz-dc24_6.log
2022-02-23 13:47:13,090 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:13,175 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-23 15:03:20,393 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-23 15:03:21,501 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-23 15:03:21,534 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-23 15:03:24,016 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 15:03:24,016 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
(this already includes the restart of the domain)
Thanks for sharing. It doesn't look like there were any problems, at least at the Curator level.
Is there anything else I can do for debugging?
Reproducing the problem is rather annoying (since it means breaking transfers), but it's also pretty easy (simply restart a bunch of pools at the same time and some will surely break), so I should be able to do it if necessary.
BTW: is it foreseen in the code that whatever the domains do when they initially register is retried when it fails? Or is there some check that registration succeeded with all core domains?
Thanks for the info so far.
I have another question.
All domains should establish a TCP connection to all core domains. This uses port 11111 by default.
If a domain is missing from one of the core domains' routing tables, is the corresponding messaging TCP connection from that domain to that core domain also absent?
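For example, something like this (a rough sketch; the domain name is a placeholder, "dcache status" and port 11111 are as described above, and root privileges may be needed to see the process IDs):
DOMAIN="pool_XYZ"   # placeholder: the domain missing from a routing table
PID=$(dcache status | awk -v d="$DOMAIN" '$1 == d {print $(NF-2)}')
ss -tnp | grep ':11111' | grep "pid=$PID,"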
Cheers, Paul.
I've repeatedly restarted some currently unused pools until the issue was triggered (needed only two restarts ^^):
# check_compare_core_domains_routing_tables core0 core1 core2
CRITICAL: Routing tables of core domains differ.
core0 ↔ core2:
DOMAIN *@pool_lcg-lrz-dc51_5
QUEUE lcg-lrz-dc51_5@* *@pool_lcg-lrz-dc51_5
In general:
grep dcache-cells-plain /etc/services
dcache-cells-plain 11111/tcp # dCache Cells Messaging
dcache-cells-plain 11111/udp # dCache Cells Messaging
And specifically on that node:
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:41548 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 336185/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:51670 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 336185/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33050 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 336185/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42100 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678165/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52224 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678165/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33600 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678165/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42116 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678166/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52232 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678166/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33620 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678166/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42072 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678167/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52194 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678167/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33572 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678167/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42064 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678168/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52188 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678168/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33566 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678168/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42088 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678169/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52210 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678169/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33588 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678169/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42086 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678170/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52206 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678170/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33584 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678170/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42076 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678171/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52200 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678171/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42118 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678172/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52236 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678172/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33608 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678172/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42112 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678173/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52234 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678173/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33610 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678173/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42124 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678174/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52248 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678174/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33626 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678174/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:42096 lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain ESTABLISHED 678175/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:52218 lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain ESTABLISHED 678175/java
tcp6 0 0 lcg-lrz-dc51.grid.lrz.de:33596 lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain ESTABLISHED 678175/java
# dcache status | grep 678171
pool_lcg-lrz-dc51_5 running (for 11 minutes) 678171 dcache /var/log/dcache/pool_lcg-lrz-dc51_5.log
One can see that PID 678171 has connections only to core0 and core1, which matches what my check shows (it always diffs the first specified core domain against each of the others, not e.g. the 2nd against the 3rd).
# cat /var/log/dcache/pool_lcg-lrz-dc51_5.log
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: An illegal reflective access operation has occurred
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: All illegal access operations will be denied in a future release
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:08:23 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-24 17:08:23+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:09:39 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:39+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
There you can see my last two restarts, and that the pool thinks it has been correctly enabled (in the PoolManager).
# cat /var/log/dcache/zookeeper/pool_lcg-lrz-dc51_5.log
2022-02-24 17:08:23,951 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:08:26,640 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:04,565 [CURATOR: STARTED] connection state now CONNECTED
2022-02-24 17:09:04,566 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:39,747 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:09:42,121 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:51,071 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:51,089 [CURATOR: STARTED] connection state now CONNECTED
Cheers, Chris.
OK, good.
So, the most likely explanation is that the problematic domain (the pool domain, in the latest example) believes there are only two core domains. It simply doesn't connect to the third domain (core2).
We just need to find out why.
Anything further I can do here?
BTW: it's really great that we can look into this issue, but there is a good workaround for it. Much more pressing is the problem reported in the other ticket (and on support@dcache.org): space reservation reporting in the info service is broken, and thus so is the SRR data. We already got a ticket because of that.
Since our upgrade from 8.2.11 to 9.1.2 (plus a Debian upgrade from 11 to 12 and PostgreSQL 13 -> 15), this whole issue seems to have gotten considerably worse:
Previously, only pool nodes were ever affected, and when I restarted the whole dCache cluster, perhaps ~4 of the pools were hit.
That number has roughly doubled, and core domains and doors are now also affected and need to be restarted to get their routing fixed.
Hey.
I've reported on this several times before, at least in RT issue 9641.
A major problem (which persists up to and including 6.0.9) is that, quite often, after restarting either the whole dCache instance or parts of it, the routing tables of the core domains end up in an inconsistent state.
When this happens, e.g. the PoolManager thinks the pools are there and fine, but they are not: transfers fail with crude error messages, and there doesn't seem to be any sort of fallback to the core domains where the routes are correct. The whole ZK setup seems pretty fragile while providing no real benefit (i.e. despite running three ZK instances and three core domains, we're actually less stable).
The following is what I just got after restarting our whole dCache cluster:
And:
dCache itself never seems to recover automatically from such a state; one really has to restart the affected domains. This is hard to understand: the domains should simply retry registering themselves, and dCache should perhaps also notice that the routing tables differ.
In 2019 I wrote an Icinga check that tests this automatically, but the problem had already existed for longer: git://git.srv.scientia.net/calestyo/lmu/tools/dcache/check_compare_core_domains_routing_tables
There is most likely a similar problem with the topo map, i.e. the output of \s topo@core0 gettopomap, which can also differ between core domains. However, given that the output format is ... uhm... strange and in no way sorted, I wasn't able to write a check for this (without considerable effort).
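A rough sketch of what such a check could look like, though (how the admin interface is reached, via ssh here, is an assumption, and a plain line sort may well not be enough to normalise that output):
for core in core0 core1 core2; do
    ssh -p 22224 admin@ADMIN_HOST "\\s topo@$core gettopomap" | sort > "/tmp/topo.$core"
done
diff /tmp/topo.core0 /tmp/topo.core1 && diff /tmp/topo.core0 /tmp/topo.core2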
Cheers, Chris.