dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
285 stars 136 forks source link

xrootd proxy mode Uncaught exception in thread xrootd-net-17 #6912

Closed cfgamboa closed 1 year ago

cfgamboa commented 1 year ago

Hello Al

Another issue here. While the fix seemed to address the issue described on the ticket #6875, I have now observed that if that a file hosted in a pool at the same host where the door runs the transfer does not work for read.

This works:

[cgamboa@lxplus733 ~]$ xrdcp -f root://dcqos002.usatlas.bnl.gov:1094/pnfs/usatlas.bnl.gov/diskdata/gamboa/test_carlos_diskonly.8.2.4 test.1
[4.738MB/4.738MB][100%][==================================================][808.6kB/s]  

The pool hosting the file is, (dcqos005.usatlas.bnl.gov) is internal on IPV6/IPV4:

[dcqos001] (dcqos005_9@dcqos005nineDomain) admin > \sn cacheinfoof /pnfs/usatlas.bnl.gov/diskdata/gamboa/test_carlos_diskonly.8.2.4
 dcqos005_9

On this case the pool hosting the file is located on the door server. The door server has external/internal connectivity on IPV6/IPV4.

[dcqos001] (dcqos005_9@dcqos005nineDomain) admin > \sn cacheinfoof /pnfs/usatlas.bnl.gov/diskdata/new_test_hsi_20
 dcqos001_8

xrootd log

Dec  6 14:30:14 dcqos001 dcache@xrootd-dcqos001Domain: 06 Dec 2022 14:30:14 (Xrootd-dcqos001-proxy) [] proxy 7cb23dcb-c6ca-467b-8130-00719a220449, channel inactive event received on [id: 0x6392b5a5, L:/10.42.38.45:55642 ! R:/10.42.38.45:33313].
Dec  6 14:30:19 dcqos001 dcache@xrootd-dcqos001Domain: 06 Dec 2022 14:30:19 (Xrootd-dcqos001-proxy) [] proxy 7cb23dcb-c6ca-467b-8130-00719a220449, channel inactive event received on [id: 0x9fdc2d0d, L:::/0:0:0:0:0:0:0:0:32399].

Pool log shows

Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: 06 Dec 2022 14:30:14 (dcqos001_8) [] Uncaught exception in thread xrootd-net-17
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: java.lang.NoClassDefFoundError: org/dcache/xrootd/pool/ChunkedFileDescriptorReadResponse
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.xrootd.pool.XrootdPoolRequestHandler.doOnRead(XrootdPoolRequestHandler.java:618)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.xrootd.core.XrootdRequestHandler.getResponse(XrootdRequestHandler.java:200)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.xrootd.core.XrootdRequestHandler.requestReceived(XrootdRequestHandler.java:166)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.xrootd.core.XrootdRequestHandler.channelRead(XrootdRequestHandler.java:115)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.util.ChannelCdcSessionHandlerWrapper.channelRead(ChannelCdcSessionHandlerWrapper.java:158)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:42)
Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: at java.base/java.lang.Thread.run(Thread.java:829)

Thank you, Carlos

alrossi commented 1 year ago

I'll have another look at this tomorrow, Carlos, but at first blush this doesn't look like it might seem.

If the pool error is the only error you are seeing,

Dec  6 14:30:14 dcqos001 dcache@dcqos001eightDomain: java.lang.NoClassDefFoundError: org/dcache/xrootd/pool/ChunkedFileDescriptorReadResponse

Seems to indicate that you have conflicting .jars. I don't know how that might have happened.

The door and the pool share the same libraries, right?

ls -l  /usr/share/dcache/classes/*xroot*.jar

shows what?

My analysis may be wrong here. But if it is not, then I will probably need more logging information from the door.

Thanks ... Al

cfgamboa commented 1 year ago

Hi Al,

I am overwriting the dcache-xrootd-8.2.4.jar with your latest code updates.

[root@dcqos001 ~]# ls -l  /usr/share/dcache/classes/*xroot*.jar
-rw-r--r-- 1 root root 176095 Dec  4 15:40 /usr/share/dcache/classes/dcache-xrootd-8.2.4.jar
-rw-r--r-- 1 root root 269037 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-4.5.2.jar
-rw-r--r-- 1 root root  30944 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-authz-plugin-alice-1.2.0.jar
-rw-r--r-- 1 root root  92163 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-gsi-4.5.2.jar
-rw-r--r-- 1 root root   6787 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-scitokens-4.5.2.jar
-rw-r--r-- 1 root root   8175 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-unix-4.5.2.jar
-rw-r--r-- 1 root root  15507 Nov  9 09:33 /usr/share/dcache/classes/xrootd4j-ztn-4.5.2.jar

All the best, Carlos

alrossi commented 1 year ago

When you overwrote the .jars, did you restart the pool as well as the door?

alrossi commented 1 year ago

(P.S., I'm kind of hamstrung for the moment because I am lacking new host certs for my testbed pools, but beyond that, I don't have a dual-stack node to be able to try out your scenario ... at least partially; so we're going to have to crawl through this issue together again.)

cfgamboa commented 1 year ago

This the log from the proxy door

06 Dec 2022 15:20:09 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Time since last server cert refresh 458061
06 Dec 2022 15:20:09 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Loading server certificates. Current refresh interval: 60000 ms
06 Dec 2022 15:20:09 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Verifying host certificate
06 Dec 2022 15:20:12 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Client protocol version was 10600, using GSIPost49ServerRequestHandler.
06 Dec 2022 15:20:12 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Time since last server cert refresh 2477
06 Dec 2022 15:20:12 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Using padded DH secret generation.
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Opening pnfs/usatlas.bnl.gov/diskdata/new_test_hsi_20 for read
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] OPAQUE : {}
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] current local address for proxied door: dcqos001.usatlas.bnl.gov/2620:0:210:8803:0:0:0:45:1096.
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] matching local address for proxied door: /10.42.38.45:1096.
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Proxy d2d5096a-dd8b-493d-b98a-5c1062b14160 started, listening on /2620:0:210:1:0:0:0:b:41052.
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Redirecting to dcqos001.usatlas.bnl.gov, 41052
06 Dec 2022 15:20:12 [pool-7-thread-20] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Sending the following host information to the client: dcqos001.usatlas.bnl.gov
06 Dec 2022 15:20:12 [xrootd-net-27] [] Proxy d2d5096a-dd8b-493d-b98a-5c1062b14160, connected to /10.42.38.45:33203, pool channel [id: 0xa2e87017, L:/10.42.38.45:45226 - R:/10.42.38.45:33203].
06 Dec 2022 15:20:13 [xrootd-proxy-client-11] [] proxy d2d5096a-dd8b-493d-b98a-5c1062b14160, channel inactive event received on [id: 0xa2e87017, L:/10.42.38.45:45226 ! R:/10.42.38.45:33203].
06 Dec 2022 15:21:13 [xrootd-net-27] [] proxy d2d5096a-dd8b-493d-b98a-5c1062b14160, channel inactive event received on [id: 0xbd1cba52, L:/2620:0:210:1:0:0:0:b:41052 ! R:/2001:1458:d00:16:0:0:0:327:55636].
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Opening pnfs/usatlas.bnl.gov/diskdata/new_test_hsi_20 for read
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] OPAQUE : {}
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] current local address for proxied door: dcqos001.usatlas.bnl.gov/2620:0:210:8803:0:0:0:45:1096.
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] matching local address for proxied door: /10.42.38.45:1096.
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Proxy f93a0600-af14-4cb0-84fe-73ae23ebe8df started, listening on /2620:0:210:1:0:0:0:b:28774.
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Redirecting to dcqos001.usatlas.bnl.gov, 28774
06 Dec 2022 15:21:13 [pool-7-thread-21] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] Sending the following host information to the client: dcqos001.usatlas.bnl.gov
06 Dec 2022 15:21:13 [xrootd-net-28] [] Proxy f93a0600-af14-4cb0-84fe-73ae23ebe8df, connected to /10.42.38.45:33203, pool channel [id: 0x08e41be5, L:/10.42.38.45:45246 - R:/10.42.38.45:33203].
06 Dec 2022 15:21:13 [xrootd-proxy-client-12] [] proxy f93a0600-af14-4cb0-84fe-73ae23ebe8df, channel inactive event received on [id: 0x08e41be5, L:/10.42.38.45:45246 ! R:/10.42.38.45:33203].
06 Dec 2022 15:21:22 [xrootd-net-28] [] proxy f93a0600-af14-4cb0-84fe-73ae23ebe8df, channel inactive event received on [id: 0xf55deadb, L:/2620:0:210:1:0:0:0:b:28774 ! R:/2001:1458:d00:16:0:0:0:327:55982].
06 Dec 2022 15:21:22 [xrootd-net-26] [door:Xrootd-dcqos001-proxy@xrootd-dcqos001Domain:AAXvLofP/BA] channel inactive event received on [id: 0x389db5c1, L:/2620:0:210:1:0:0:0:b:1096 ! R:lxplus733.cern.ch/2001:1458:d00:16:0:0:0:327:57010].
cfgamboa commented 1 year ago

Well Al, you nailed it: After restarting the pool containing the file

[root@dcqos001 ~]# systemctl restart dcache@dcqos001eightDomain.service

This worked

[cgamboa@lxplus733 ~]$ xrdcp -f root://dcqos002.usatlas.bnl.gov:1096/pnfs/usatlas.bnl.gov/diskdata/new_test_hsi_20 test.1
[63.37kB/63.37kB][100%][==================================================][21.12kB/s]  

And

[cgamboa@lxplus733 ~]$ xrdcp -f root://dcqos001.usatlas.bnl.gov:1096/pnfs/usatlas.bnl.gov/diskdata/new_test_hsi_20 test.1
[63.37kB/63.37kB][100%][==================================================][31.68kB/s]  

Thank you, Carlos

alrossi commented 1 year ago

Too easy. I'm skeptical (::smile::).

But glad it worked!

Cheers, Al

(can we close this?)

cfgamboa commented 1 year ago

yes, please. until next time... ;)