dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

routing tables inconsistent #5326

Open calestyo opened 4 years ago

calestyo commented 4 years ago

Hey.

I've reported this several times before, at least in RT issue 9641.

A major problem (which persists up to and including 6.0.9) is that, quite often, after restarting either the whole dCache or parts of it, the routing tables of the core domains end up in an inconsistent state.

When this happens, e.g. the PoolManager thinks the pools are there and fine, but they are not... transfers fail with cryptic messages, and there doesn't seem to be any sort of fallback to the core domains where the routes are correct... the whole ZK thing seems to be pretty fragile while providing no real benefit (i.e. despite running three instances and three core domains, we're actually less stable).

The following is what I just got after restarting our whole dCache cluster:

# diff <( dcache_admin "\s System@core0 route" | sort | sed "s/l-.*//" ) <( dcache_admin "\s System@core1 route" | sort | sed "s/l-.*//"  )
6c6
< DOMAIN                         *@core1               
---
> DOMAIN                         *@core0                   c-core0-AAWfXEK-Vfg-AAWfXELBQBA@core1              
120d119
< DOMAIN                         *@pool_lcg-lrz-dc15_7 
319d317
< DOMAIN                         *@pool_lcg-lrz-dc64_6 
355,356c353,354
< QUEUE                        acm@*                                                   acm@core0              
< QUEUE                        acm@*                                                     *@core1              
---
> QUEUE                        acm@*                                                   acm@core1              
> QUEUE                        acm@*                                                     *@core0              
457d454
< QUEUE             lcg-lrz-dc15_7@*                                                     *@pool_lcg-lrz-dc15_7
656d652
< QUEUE             lcg-lrz-dc64_6@*                                                     *@pool_lcg-lrz-dc64_6
694c690
< QUEUE                       topo@*                                                     *@core1              
---
> QUEUE                       topo@*                                                     *@core0              
696c692
< QUEUE                       topo@*                                                  topo@core0              
---
> QUEUE                       topo@*                                                  topo@core1              

And:

# diff <( dcache_admin "\s System@core0 route" | sort | sed "s/l-.*//" ) <( dcache_admin "\s System@core2 route" | sort | sed "s/l-.*//"  )
6,7c6,7
< DOMAIN                         *@core1               
< DOMAIN                         *@core2               
---
> DOMAIN                         *@core0                   c-core0-AAWfXELYUlA-AAWfXELZmnA@core2              
> DOMAIN                         *@core1                   c-core1-AAWfXELXhzA-AAWfXELZtcg@core2              
47d46
< DOMAIN                         *@pool_lcg-lrz-dc08_4 
95d93
< DOMAIN                         *@pool_lcg-lrz-dc13_2 
110d107
< DOMAIN                         *@pool_lcg-lrz-dc14_7 
116d112
< DOMAIN                         *@pool_lcg-lrz-dc15_3 
144d139
< DOMAIN                         *@pool_lcg-lrz-dc18_1 
192d186
< DOMAIN                         *@pool_lcg-lrz-dc22_9 
246d239
< DOMAIN                         *@pool_lcg-lrz-dc28_3 
355c348,349
< QUEUE                        acm@*                                                   acm@core0              
---
> QUEUE                        acm@*                                                   acm@core2              
> QUEUE                        acm@*                                                     *@core0              
357d350
< QUEUE                        acm@*                                                     *@core2              
384d376
< QUEUE             lcg-lrz-dc08_4@*                                                     *@pool_lcg-lrz-dc08_4
432d423
< QUEUE             lcg-lrz-dc13_2@*                                                     *@pool_lcg-lrz-dc13_2
447d437
< QUEUE             lcg-lrz-dc14_7@*                                                     *@pool_lcg-lrz-dc14_7
453d442
< QUEUE             lcg-lrz-dc15_3@*                                                     *@pool_lcg-lrz-dc15_3
481d469
< QUEUE             lcg-lrz-dc18_1@*                                                     *@pool_lcg-lrz-dc18_1
529d516
< QUEUE             lcg-lrz-dc22_9@*                                                     *@pool_lcg-lrz-dc22_9
583d569
< QUEUE             lcg-lrz-dc28_3@*                                                     *@pool_lcg-lrz-dc28_3
693a680
> QUEUE                       topo@*                                                     *@core0              
695,696c682
< QUEUE                       topo@*                                                     *@core2              
< QUEUE                       topo@*                                                  topo@core0              
---
> QUEUE                       topo@*                                                  topo@core2              

dCache itself never seems to recover automatically from such a state; one really has to restart the affected domains... which is hard to understand... the domains should simply retry to register themselves... and dCache should perhaps also notice that the routing tables differ.

In 2019 I wrote an Icinga check that tests for this automatically, but the problem had already existed for longer: git://git.srv.scientia.net/calestyo/lmu/tools/dcache/check_compare_core_domains_routing_tables

There is most likely a similar problem with the topo map, i.e. the output of \s topo@core0 gettopomap, which can also differ between core domains. However, given that the output format is ... uhm... strange and in no way sorted, I wasn't able to write a check for this (without considerable effort).
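(If someone wanted a crude starting point nevertheless, something like this might do - just a sketch reusing the dcache_admin wrapper from above; splitting the output on whitespace and sorting throws away all structure, but at least makes the unordered maps comparable:)

# diff <( dcache_admin "\s topo@core0 gettopomap" | tr -s '[:space:]' '\n' | sort ) \
       <( dcache_admin "\s topo@core1 gettopomap" | tr -s '[:space:]' '\n' | sort )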

Cheers, Chris.

calestyo commented 2 years ago

Anything new on this...? This makes dCache pretty fragile, because every time domains start up, some may not be registered correctly.

It still happens with 7.2.10... and actually seems to have gotten worse:

CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
    DOMAIN  *@srm_lcg-lrz-srm          
    QUEUE   srm_lcg-lrz-srm@*          *@srm_lcg-lrz-srm
    QUEUE   SrmManager@*               *@srm_lcg-lrz-srm
    TOPIC   LoginBrokerRequestTopic@*  *@srm_lcg-lrz-srm
    TOPIC   WatchedServiceTopic@*      *@srm_lcg-lrz-srm
    TOPIC   LoginBrokerTopic@*         *@srm_lcg-lrz-srm
    TOPIC   PoolMonitorTopic@*         *@srm_lcg-lrz-srm

These differences are between our 3 core domains (core0, core1, core2)... even after all pools were restarted often enough that each of them correctly showed up in the routing tables, the above entries were still left.

Why do they show up in core0 and core2, but not core1?

calestyo commented 2 years ago

That SRM door was then also basically broken... it gave an endless stream of these:

Feb 22 20:43:16 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:16+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:37 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:37+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:46 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:46+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:01 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:01+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:07 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:07+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.

and didn't recover that broken routing table entry by itself.

After restarting the SRM door, the routing tables were back to (what I'd consider) OK.

Before today's upgrade (from 6.0.18 -> 7.2.10), this had only ever happened with pools... never anything else... but since at least 7.2.10, it also happens to other components.

:-(

calestyo commented 2 years ago

I restarted ZK (one by one, so that at least 2 were always running)... but still one pool fell out:

Feb 23 01:59:15 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 01:59:15+01:00 (c-core1-AAXYoMmquGg-AAXYoMmtb7g) [core1,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:04 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:04+01:00 (System) [] Invalid config event received: {server.2=lcg-lrz-dcache1.grid.lrz.de:2182:2183:participant, server.1=lcg-lrz-dcache0.grid.lrz.de:2182:2183:participant, server.3=lcg-lrz-dcache2.grid.lrz.de:2182:2183:participant, version=0}
Feb 23 02:00:06 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:06+01:00 (c-core2-AAXYoMmpGlg-AAXYoMms7tA) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:17 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:17+01:00 (c-core2-AAXYpPxn8eA-AAXYpPxoAYE) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed

and also leading to inconsistencies in the routing table:

core0 ↔ core1:
DOMAIN  *@pool_lcg-lrz-dc27_1  
QUEUE   lcg-lrz-dc27_1@*       *@pool_lcg-lrz-dc27_1

core0 ↔ core2:
DOMAIN  *@pool_lcg-lrz-dc27_1  
QUEUE   lcg-lrz-dc27_1@*       *@pool_lcg-lrz-dc27_1

btw: This is the check I'm using to compare the routing tables: https://github.com/dCache/monitoring/tree/master/monitoring-plugins/check_compare_core_domains_routing_tables

lemora commented 2 years ago

Hi Chris.

Which ZooKeeper version are you running?

Thanks, Lea

paulmillar commented 2 years ago

Could you verify that each of your ZK servers has the same information about the core nodes?

To do this, you can use a command like:

bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri

I'm assuming you have your ZK nodes configured to use the standard port (2181). Also NODE should be replaced with the hostname of a ZK node.

The output from this command should list the core domains:

[root@prometheus zookeeper]# bin/zkCli.sh -server localhost:2181 ls /dcache/lm/cores-uri
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[dCacheDomain]
[root@prometheus zookeeper]# 

Here, there is just one core domain, dCacheDomain.
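To check all of your ZK servers in one go, you could loop over them (a sketch; zk0, zk1 and zk2 stand for your actual ZooKeeper hostnames):

# for n in zk0 zk1 zk2; do
    bin/zkCli.sh -server "$n:2181" ls /dcache/lm/cores-uri
done

All three should report the same list of core domains.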

calestyo commented 2 years ago

@lemora

Which ZooKeeper version are you running?

It's 3.5.8-3, i.e. a Debianification of Tigran’s RPM.

@paulmillar

bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri

When querying each of the ZKs directly, this gives on all of them:

2022-02-23 13:50:23,560 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1394] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x30004fb1d67009c, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[core0, core1, core2]

(The first INFO log line differs, but that shouldn't matter.) This was right after everything in dCache had been stopped and just the core nodes (i.e. no pools, no doors, no SRM) were running.

Then I started the rest of dCache, after which my check reported the following differences in routing:

CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
    DOMAIN  *@pool_lcg-lrz-dc24_6   
    DOMAIN  *@pool_lcg-lrz-dc36_5   
    QUEUE   lcg-lrz-dc24_6@*        *@pool_lcg-lrz-dc24_6
    QUEUE   lcg-lrz-dc36_5@*        *@pool_lcg-lrz-dc36_5

core0 ↔ core2:
    DOMAIN  *@pool_lcg-lrz-dc24_6   
    DOMAIN  *@pool_lcg-lrz-dc52_10  
    DOMAIN  *@pool_lcg-lrz-dc53_3   
    QUEUE   lcg-lrz-dc52_10@*       *@pool_lcg-lrz-dc52_10
    QUEUE   lcg-lrz-dc24_6@*        *@pool_lcg-lrz-dc24_6
    QUEUE   lcg-lrz-dc53_3@*        *@pool_lcg-lrz-dc53_3

which then, at least in my experience, never resolve automatically, but require restarts of these pools.

However, even with the issue present, the above zkCli command still shows all three core nodes.

paulmillar commented 2 years ago

You should find a .zookeeper file on the SRM node. The file should have a name like /var/log/dcache/srm_lcg-lrz-srm.zookeeper.

Could you somehow attach an extract from that file?

The extract should end at the time you noticed the problem above (core1 missing the SRM domain) and should start when that domain most recently started, just before you noticed the problem (i.e., from the domain start until you noticed the problem).

calestyo commented 2 years ago

If you mean the one from https://github.com/dCache/dcache/issues/5326#issuecomment-1048149959 ... then that's unfortunately already gone; I had cleared the files after completely restarting everything.

It now contains just:

2022-02-23 13:47:06,266 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:06,269 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null

calestyo commented 2 years ago

That's the output of the ZooKeeper log from one of the domains that was affected by this earlier today:

# cat pool_lcg-lrz-dc24_6.log
2022-02-23 13:47:13,090 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:13,175 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-23 15:03:20,393 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-23 15:03:21,501 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-23 15:03:21,534 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-23 15:03:24,016 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 15:03:24,016 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null

(already including the restart of the domain)

paulmillar commented 2 years ago

Thanks for sharing. It doesn't look like there were any problems, at least at the Curator level.

calestyo commented 2 years ago

Is there anything else I can do for debugging?

I mean, it's rather annoying to reproduce the problem (since that means breaking transfers) - but it's also pretty easy (simply restart a bunch of pools at the same time... and surely some will break)... so I should be able to do it if necessary.
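For the record, reproducing it is essentially just (a sketch; pool_a, pool_b, ... being placeholders for some of our actual pool domain names on one node):

# dcache restart pool_a pool_b pool_c
# check_compare_core_domains_routing_tables core0 core1 core2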

calestyo commented 2 years ago

btw: Is it foreseen in the code that... whatever the domains do when they initially register... they retry this when it fails? Or is there something like a test of whether registration succeeded with all core domains?

paulmillar commented 2 years ago

Thanks for the info so far.

I have another question.

All domains should establish a TCP connection to all core domains. This uses port 11111 by default.

If a domain is missing from one of the core domain's routing table, is there also a corresponding absence of this messaging TCP connection from the missing domain to that core domain?
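One way to check this on the node of the missing domain (just a sketch; it assumes the default port):

# ss -tnp state established '( dport = :11111 )'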

Cheers, Paul.

calestyo commented 2 years ago

I've repeatedly restarted some currently unused pools until the issue was triggered (needed only two restarts ^^):

# check_compare_core_domains_routing_tables core0 core1 core2
CRITICAL: Routing tables of core domains differ.
core0 ↔ core2:
    DOMAIN  *@pool_lcg-lrz-dc51_5   
    QUEUE   lcg-lrz-dc51_5@*        *@pool_lcg-lrz-dc51_5

In general:

# grep dcache-cells-plain /etc/services
dcache-cells-plain  11111/tcp       # dCache Cells Messaging
dcache-cells-plain  11111/udp       # dCache Cells Messaging

And on that specific node:

tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:41548  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 336185/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:51670  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 336185/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33050  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 336185/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42100  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678165/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52224  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678165/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33600  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678165/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42116  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678166/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52232  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678166/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33620  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678166/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42072  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678167/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52194  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678167/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33572  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678167/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42064  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678168/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52188  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678168/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33566  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678168/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42088  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678169/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52210  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678169/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33588  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678169/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42086  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678170/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52206  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678170/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33584  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678170/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42076  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678171/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52200  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678171/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42118  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678172/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52236  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678172/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33608  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678172/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42112  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678173/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52234  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678173/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33610  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678173/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42124  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678174/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52248  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678174/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33626  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678174/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:42096  lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678175/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:52218  lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678175/java
tcp6    0   0   lcg-lrz-dc51.grid.lrz.de:33596  lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain  ESTABLISHED 678175/java
# dcache status | grep 678171
pool_lcg-lrz-dc51_5  running (for 11 minutes) 678171 dcache /var/log/dcache/pool_lcg-lrz-dc51_5.log  

One sees that PID 678171 only has connections to core0 and core1, which matches what's shown by my check (which always diffs the first specified core domain against each of the others - not e.g. the 2nd against the 3rd).
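(To extract just that, something like the following works - a sketch with the PID hard-coded:)

# netstat -tp | awk '/dcache-cells-plain/ && $7 ~ /^678171\// {print $5}'

which prints only lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain and lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain, i.e. the connection to core2's host is indeed missing.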

# cat  /var/log/dcache/pool_lcg-lrz-dc51_5.log
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: An illegal reflective access operation has occurred
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: All illegal access operations will be denied in a future release
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:08:23 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-24 17:08:23+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:09:39 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:39+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled

There you see my last two restarts... and that the pool thinks it has been correctly enabled (in the PoolManager).

# cat  /var/log/dcache/zookeeper/pool_lcg-lrz-dc51_5.log
2022-02-24 17:08:23,951 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:08:26,640 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:04,565 [CURATOR: STARTED] connection state now CONNECTED
2022-02-24 17:09:04,566 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:39,747 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:09:42,121 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:51,071 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:51,089 [CURATOR: STARTED] connection state now CONNECTED

Cheers, Chris.

paulmillar commented 2 years ago

OK, good.

So, the most likely explanation is that the problematic domain (the pool domain, in the latest example) believes there are only two core domains. It simply doesn't connect to the third one (core2).

We just need to find out why.

calestyo commented 2 years ago

Anything further I can do here?

btw: It's really great that we can look into this issue... but for this one there is a decent workaround... much more pressing is the problem reported in the other ticket, as well as on support@dcache.org: space reservation reporting in the info service is broken, and thus so is the SRR data... we already got ticketed because of that.

calestyo commented 1 year ago

Since our upgrade from 8.2.11 to 9.1.2 (plus a Debian upgrade from 11 to 12 and PostgreSQL 13 -> 15), this whole issue seems to have gotten considerably worse:

Previously, only pool nodes were ever affected... and when I restarted the whole dCache cluster, perhaps ~4 of the pools were affected.

That number has doubled, and core domains and doors are now also affected and need to be restarted to get their routing fixed.